Closed stackoverflow closed 3 weeks ago
According to https://docs.oracle.com/en/java/javase/22/docs/api/java.base/java/lang/String.html
A String represents a string in the UTF-16 format in which supplementary characters are represented by surrogate pairs (see the section Unicode Character Representations in the Character class for more information). Index values refer to char code units, so a supplementary character uses two positions in a String.
`substring` works on `char`s (UTF-16 code units). So a UTF-8 byte offset is only the correct UTF-16 index as long as every character before that position encodes to a single UTF-8 code unit, i.e. is plain ASCII.
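To make the mismatch concrete, here is a small sketch that compares the `char` index of a character with the number of UTF-8 bytes in front of it. The class and method names (`OffsetDemo`, `utf8ByteOffset`) are made up for illustration and are not part of any library.

```java
import java.nio.charset.StandardCharsets;

public class OffsetDemo {
    // Hypothetical helper: number of UTF-8 bytes that precede the given char index.
    static int utf8ByteOffset(String s, int charIndex) {
        return s.substring(0, charIndex).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String ascii = "int x";
        String mixed = "\u00A9 x";        // U+00A9 (copyright sign): 1 char, 2 UTF-8 bytes
        String emoji = "\uD83D\uDE00 x";  // U+1F600: 2 chars (surrogate pair), 4 UTF-8 bytes

        // ASCII only: char index and byte offset agree (4 and 4).
        System.out.println(ascii.indexOf('x') + " vs " + utf8ByteOffset(ascii, ascii.indexOf('x')));
        // One 2-byte character in front: char index 2, byte offset 3.
        System.out.println(mixed.indexOf('x') + " vs " + utf8ByteOffset(mixed, mixed.indexOf('x')));
        // One supplementary character in front: char index 3, byte offset 5.
        System.out.println(emoji.indexOf('x') + " vs " + utf8ByteOffset(emoji, emoji.indexOf('x')));
    }
}
```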
I think `Tree` needs to keep track of the encoded `byte[]` that was passed to tree-sitter. If the `InputEncoding` that was used is also known, `Tree` would not need to store the `String`.
Maybe something like this could be used in `Tree` instead of the `String` for `source` (so we can reuse the same `byte[]` tree-sitter uses):
    // Holds the exact bytes handed to tree-sitter together with their encoding.
    public record Source(byte[] content, InputEncoding inputEncoding) {

        // Encode the String once, using the charset of the given encoding.
        public Source(String content, InputEncoding inputEncoding) {
            this(content.getBytes(inputEncoding.charset()), inputEncoding);
        }

        // Default to UTF-8.
        public Source(String content) {
            this(content, InputEncoding.UTF_8);
        }

        // Decode only the bytes in [startByte, endByte); endByte is clamped to the array length.
        public String subStringByte(int startByte, int endByte) {
            int length = Math.min(endByte, content.length) - startByte;
            return new String(content, startByte, length, inputEncoding.charset());
        }

        @Override
        public final String toString() {
            return new String(content, inputEncoding.charset());
        }
    }
@stackoverflow Do you have a minimal reproducible example?
Not in Java. We have one in our tests which makes sure my fix works. You can use the same idea, using a unicode char like © to throw the index off and check the node's text.
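For anyone who wants to try that idea in Java, here is a rough sketch that does not depend on the bindings at all. It stands in for a node by computing a byte range by hand, then compares a `substring`-based lookup with a byte-based one. `GetTextRepro`, `textBySubstring` and `textByBytes` are hypothetical names; the real `getText` may differ in detail.

```java
import java.nio.charset.StandardCharsets;

public class GetTextRepro {
    // Mimics the reported behaviour: byte offsets are used directly as char indices.
    static String textBySubstring(String source, int startByte, int endByte) {
        return source.substring(startByte, endByte);
    }

    // Byte-aware variant: encode the source and decode only the node's byte range.
    static String textByBytes(String source, int startByte, int endByte) {
        byte[] bytes = source.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, startByte, endByte - startByte, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The copyright sign (U+00A9) takes 2 bytes in UTF-8 but only 1 char.
        String source = "// \u00A9 2024\nint x;";

        // Stand-in for getStartByte/getEndByte of the identifier node "x".
        int charIndex = source.indexOf('x');
        int startByte = source.substring(0, charIndex).getBytes(StandardCharsets.UTF_8).length;
        int endByte = startByte + 1;

        System.out.println(textBySubstring(source, startByte, endByte)); // prints ";"
        System.out.println(textByBytes(source, startByte, endByte));     // prints "x"
    }
}
```

Each multi-byte character before the node shifts the `substring` result one or more positions to the right, which matches the "offset by some value" symptom.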
I ran into a problem where the `getText` method from `Node` was returning a wrong result, offset by some value.

The problem seems to be that I have some multi-byte UTF-8 characters in my file. `getText` works by calling `substring` with `getStartByte` and `getEndByte` as offsets, but `substring` works on characters, not on bytes, so if your text contains UTF-8 characters that take more than 1 byte, `getText` will be incorrect.

I solved this on my side by using `String.getBytes(Charset)` to get a unicode-aware byte array and then using the String constructor that takes a byte array and a byte offset (`String(byte[] bytes, int offset, int length, Charset charset)`).

I don't like this solution because it allocates an array the size of the whole source code every time I get the node's text.
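One possible way to avoid the per-call array allocation, sketched here under assumptions, is to translate the UTF-8 byte offsets into `char` indices by walking the string's code points and then calling `substring` as usual. The names `ByteOffsets`, `charIndexForUtf8Offset`, `utf8Length` and `text` are hypothetical and not part of the library. The sketch assumes the source was parsed as UTF-8, contains no unpaired surrogates, and that the byte offsets fall on code point boundaries.

```java
public class ByteOffsets {
    // Number of bytes a code point occupies in UTF-8.
    static int utf8Length(int codePoint) {
        if (codePoint < 0x80) return 1;
        if (codePoint < 0x800) return 2;
        if (codePoint < 0x10000) return 3;
        return 4;
    }

    // Walks code points from the start, counting UTF-8 bytes until byteOffset is reached,
    // and returns the matching UTF-16 char index.
    static int charIndexForUtf8Offset(String s, int byteOffset) {
        int bytes = 0;
        int i = 0;
        while (i < s.length() && bytes < byteOffset) {
            int cp = s.codePointAt(i);
            bytes += utf8Length(cp);
            i += Character.charCount(cp);
        }
        return i;
    }

    // Node text for a UTF-8 byte range, without encoding the whole source.
    static String text(String source, int startByte, int endByte) {
        int start = charIndexForUtf8Offset(source, startByte);
        int end = charIndexForUtf8Offset(source, endByte);
        return source.substring(start, end);
    }

    public static void main(String[] args) {
        String source = "// \u00A9 2024\nint x;";
        System.out.println(text(source, 15, 16)); // prints "x"
    }
}
```

The trade-off is that each lookup scans from the start of the source, so it costs time proportional to the offset instead of memory. Encoding once and keeping the `byte[]` around is the other obvious option.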