rdfhdt / hdt-java

HDT Java library and tools.

The ByteStringUtil.longestCommonPrefix(...) method isn't working between a non-ASCII String and an internal CharSequence #165

Open ate47 opened 1 year ago

ate47 commented 1 year ago

The ByteStringUtil.longestCommonPrefix(...) method returns a wrong result when one of its parameters is a String and the other is a Compact or Replazable string. In the internal strings (Replazable/Compact), charAt(i) returns byte[i] of the UTF-8 data, whereas in a String it returns the character (UTF-16 code unit) at index i. A non-ASCII character takes more than one byte in UTF-8, so the two indexings no longer line up. For example (shortened value of a Wikidata literal of Q101213907):

String s1 = "\u00C2\u00A0normal";
CompactString s2 = new CompactString("\u00A0normal");
Assert.assertEquals(0, ByteStringUtil.longestCommonPrefix(s1, s2));
// java.lang.AssertionError: 
// Expected :0
// Actual   :8

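The string value is "\u00C2\u00A0" = char[] {0xC2, 0xA0}. The internal value is utf8("\u00A0") = byte[] {0xC2, 0xA0}. Compared index by index through charAt(i), the two look identical even though the strings differ.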
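The encoding side of this can be checked with the JDK alone. The snippet below is a minimal sketch that does not touch hdt-java; the class name Utf8IndexDemo and the helper utf8Hex are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class Utf8IndexDemo {
    /** Formats each UTF-8 byte of s as two uppercase hex digits, separated by spaces. */
    public static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so bytes >= 0x80 are not printed as negative values.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // One char in the String, two bytes once encoded as UTF-8.
        System.out.println("\u00A0".length());        // 1
        System.out.println(utf8Hex("\u00A0"));        // C2 A0
        // The two-char String has the same char values as those two bytes,
        // but its own UTF-8 encoding is four bytes long.
        System.out.println(utf8Hex("\u00C2\u00A0"));  // C3 82 C2 A0
    }
}
```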

cf. UTF-8

Inside the library, the method is only called with two internal strings, but because it is public, it might be better to fix it in case someone is calling it directly with a String.
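One possible direction is to bring both arguments to the same representation before comparing. The sketch below is not the hdt-java implementation: ByteBackedSeq is a hypothetical stand-in for an internal string whose charAt(i) returns the i-th UTF-8 byte, and naivePrefix / bytePrefix are illustrative names. It also assumes callers want the prefix length counted in bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class PrefixSketch {
    /** Hypothetical stand-in for an internal string: indexed by UTF-8 byte. */
    public static final class ByteBackedSeq implements CharSequence {
        private final byte[] data;

        public ByteBackedSeq(String s) {
            this.data = s.getBytes(StandardCharsets.UTF_8);
        }

        private ByteBackedSeq(byte[] data) {
            this.data = data;
        }

        @Override public int length() { return data.length; }

        // Returns the unsigned byte at index i, not a decoded character.
        @Override public char charAt(int i) { return (char) (data[i] & 0xFF); }

        @Override public CharSequence subSequence(int start, int end) {
            return new ByteBackedSeq(Arrays.copyOfRange(data, start, end));
        }

        @Override public String toString() {
            return new String(data, StandardCharsets.UTF_8);
        }
    }

    /** Index-by-index comparison; mixes chars and bytes if the types differ. */
    public static int naivePrefix(CharSequence a, CharSequence b) {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) {
            i++;
        }
        return i;
    }

    /** Encodes the String as UTF-8 first, so both sides are indexed by byte. */
    public static int bytePrefix(String a, ByteBackedSeq b) {
        return naivePrefix(new ByteBackedSeq(a), b);
    }

    public static void main(String[] args) {
        String s1 = "\u00C2\u00A0normal";
        ByteBackedSeq s2 = new ByteBackedSeq("\u00A0normal");
        System.out.println(naivePrefix(s1, s2)); // 8: chars compared against bytes
        System.out.println(bytePrefix(s1, s2));  // 0: first bytes are C3 vs C2
    }
}
```

Encoding the String on every call costs an allocation, so a real fix might instead decode or encode on the fly; that trade-off is left open here.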