tree-sitter / java-tree-sitter

Java bindings to the Tree-sitter parsing library
https://tree-sitter.github.io/java-tree-sitter/
MIT License

`Node.getText()` is not encoding aware #27

Closed: stackoverflow closed this issue 3 weeks ago

stackoverflow commented 2 months ago

I ran into a problem where the getText method of Node returned the wrong result, offset by some amount.

The cause seems to be that my file contains multi-byte UTF-8 characters. getText works by calling substring with getStartByte and getEndByte as offsets, but substring works on characters, not bytes. So if the text contains UTF-8 characters that take more than one byte, getText returns the wrong range.
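A minimal sketch of the mismatch, using plain Java strings rather than the real bindings. `naiveText` is a hypothetical stand-in for a substring call that is fed byte offsets, and the byte range 9..14 is what a parser would report for `total` here, assuming the source was handed over as UTF-8.

```java
import java.nio.charset.StandardCharsets;

final class ByteOffsetMismatch {
    // Hypothetical stand-in: byte offsets passed straight to substring,
    // which interprets them as UTF-16 char indices.
    static String naiveText(String source, int startByte, int endByte) {
        return source.substring(startByte, endByte);
    }

    public static void main(String[] args) {
        // "\u00A9" is the copyright sign: 1 char in UTF-16, 2 bytes in UTF-8.
        String source = "/* \u00A9 */ total;";

        // Byte offset of "total" in the UTF-8 encoding of the source.
        int startByte = "/* \u00A9 */ ".getBytes(StandardCharsets.UTF_8).length; // 9
        int endByte = startByte + "total".length();                              // 14

        // The char index of "total" is 8, so the byte range lands one char late.
        System.out.println(naiveText(source, startByte, endByte)); // prints "otal;"
    }
}
```

Every multi-byte character before the node shifts the result further, so the error grows with the amount of non-ASCII text earlier in the file.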

I solved this on my side by calling String.getBytes(Charset) to get the encoded bytes of the source, and then using the String constructor that takes a byte array, a byte offset, and a length (String(byte[] bytes, int offset, int length, Charset charset)).

I don't like this solution because it allocates a byte array the size of the whole source every time I get a node's text.
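The workaround could look roughly like the sketch below. `textByBytes` is a hypothetical helper name, not part of the bindings, and UTF-8 is assumed as the encoding the parser saw. It also makes the cost visible: `getBytes` re-encodes the entire source on each call.

```java
import java.nio.charset.StandardCharsets;

final class ByteSliceWorkaround {
    // Hypothetical helper: decode the node's byte range from the encoded source.
    // Note that getBytes allocates a new array for the whole source on every call.
    static String textByBytes(String source, int startByte, int endByte) {
        byte[] bytes = source.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, startByte, endByte - startByte, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String source = "/* \u00A9 */ total;";
        // 9..14 is the UTF-8 byte range of "total" in this source.
        System.out.println(textByBytes(source, 9, 14)); // prints "total"
    }
}
```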

Sigma42 commented 2 months ago

According to https://docs.oracle.com/en/java/javase/22/docs/api/java.base/java/lang/String.html

A String represents a string in the UTF-16 format in which supplementary characters are represented by surrogate pairs (see the section Unicode Character Representations in the Character class for more information). Index values refer to char code units, so a supplementary character uses two positions in a String.

substring works on chars (UTF-16 code units).

So a UTF-8 byte offset is only the correct UTF-16 index as long as every char before the current position is a single UTF-8 code unit (that is, ASCII).
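To illustrate the relationship between the two kinds of offsets, here is a sketch that maps a UTF-8 byte offset to a UTF-16 char index by walking the code points. This is not something proposed in this thread; `charIndexForUtf8Offset` is a hypothetical name, and the sketch assumes well-formed text (no unpaired surrogates) and a byte offset that falls on a character boundary.

```java
final class Utf8OffsetToCharIndex {
    // Walks the string code point by code point, counting how many bytes each
    // one takes in UTF-8, until byteOffset bytes have been consumed.
    static int charIndexForUtf8Offset(String s, int byteOffset) {
        int bytes = 0;
        int i = 0;
        while (i < s.length() && bytes < byteOffset) {
            int cp = s.codePointAt(i);
            // UTF-8 length: 1 byte up to U+007F, 2 up to U+07FF, 3 up to U+FFFF, else 4.
            bytes += cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
            // A supplementary code point occupies two chars (a surrogate pair).
            i += Character.charCount(cp);
        }
        return i;
    }

    public static void main(String[] args) {
        String source = "/* \u00A9 */ total;";
        int start = charIndexForUtf8Offset(source, 9);  // 8
        int end = charIndexForUtf8Offset(source, 14);   // 13
        System.out.println(source.substring(start, end)); // prints "total"
    }
}
```

The walk avoids allocating a byte array, but it is linear in the offset for every lookup, so it trades memory for time.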

I think Tree needs to keep track of the encoded byte[] that was passed to Tree-sitter. If the InputEncoding that was used is also known, Tree would not need to store the String.

Maybe something like this could be used in Tree instead of the String for the source, so we can reuse the same byte[] that Tree-sitter uses:

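// Holds the encoded source bytes together with the encoding used to produce them.
public record Source(byte[] content, InputEncoding inputEncoding) {

    // Encodes the string once, using the charset of the given encoding.
    public Source(String content, InputEncoding inputEncoding) {
        this(content.getBytes(inputEncoding.charset()), inputEncoding);
    }

    // Defaults to UTF-8.
    public Source(String content) {
        this(content, InputEncoding.UTF_8);
    }

    // Decodes the byte range [startByte, endByte), clamping the end to the content length.
    public String subStringByte(int startByte, int endByte) {
        int length = Math.min(endByte, content.length) - startByte;

        return new String(content, startByte, length, inputEncoding.charset());
    }

    // Decodes the whole source back into a String.
    @Override
    public final String toString() {
        return new String(content, inputEncoding.charset());
    }
}

ObserverOfTime commented 4 weeks ago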

@stackoverflow Do you have a minimal reproducible example?

stackoverflow commented 3 weeks ago

> @stackoverflow Do you have a minimal reproducible example?

Not in Java. We have one in our tests that makes sure my fix works. You can use the same idea: use a multi-byte Unicode character such as © to throw the index off, then check the node's text.