rdfhdt / hdt-java

HDT Java library and tools.

Byte strings aren't able to compare UTF32 strings #177

Open ate47 opened 1 year ago

ate47 commented 1 year ago

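I've noticed that if we compare a character that needs a surrogate pair in UTF-16 (a supplementary code point above U+FFFF), for example 𦳣 (U+26CE3), with a character that doesn't, for example the private-use character U+F4D1, the byte strings and the Java strings don't give the same ordering.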

The code to reproduce it is below; I've built the strings from their code points.

String ss1 = new String(Character.toChars(0x26ce3)); // 𦳣
String ss2 = new String(Character.toChars(0xf4d1)); // U+F4D1 (private use area, may not render)

CompactString b1 = new CompactString(ss1);
CompactString b2 = new CompactString(ss2);

assertEquals(ss1, b1.toString());
assertEquals(ss2, b2.toString());

// Clamp both results to [-1, 1] so that only the sign of the comparison is checked
int cmpByte = Math.max(-1, Math.min(1, b1.compareTo(b2)));
int cmpStr = Math.max(-1, Math.min(1, b1.toString().compareTo(b2.toString())));

assertEquals(cmpStr, cmpByte);
// java.lang.AssertionError: 
// Expected :-1
// Actual   :1
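
For reference, the same mismatch can be reproduced with the JDK alone. This is only a sketch: it assumes the byte strings compare their UTF-8 bytes as unsigned values, and the class and method names (SurrogateOrderDemo, compareUtf8Bytes, compareUtf16, compareCodePoints) are made up for the illustration and are not part of hdt-java. It also shows that comparing by code point, rather than by UTF-16 code unit, agrees with the byte order for well-formed strings.

```java
import java.nio.charset.StandardCharsets;

class SurrogateOrderDemo {
    // Sign of the unsigned, byte-by-byte comparison of the UTF-8 encodings
    // (assumed here to be what a byte string comparison does).
    static int compareUtf8Bytes(String a, String b) {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int bx = x[i] & 0xFF; // treat bytes as unsigned
            int by = y[i] & 0xFF;
            if (bx != by) {
                return bx < by ? -1 : 1;
            }
        }
        return Integer.signum(x.length - y.length);
    }

    // Sign of String.compareTo, which compares UTF-16 code units.
    static int compareUtf16(String a, String b) {
        return Integer.signum(a.compareTo(b));
    }

    // Sign of a comparison by Unicode code point instead of by code unit.
    static int compareCodePoints(String a, String b) {
        int i = 0;
        int j = 0;
        while (i < a.length() && j < b.length()) {
            int ca = a.codePointAt(i);
            int cb = b.codePointAt(j);
            if (ca != cb) {
                return ca < cb ? -1 : 1;
            }
            i += Character.charCount(ca);
            j += Character.charCount(cb);
        }
        // At least one string is exhausted; the one with characters left is greater.
        return Integer.signum((a.length() - i) - (b.length() - j));
    }

    public static void main(String[] args) {
        String ss1 = new String(Character.toChars(0x26ce3)); // surrogate pair in UTF-16
        String ss2 = new String(Character.toChars(0xf4d1));  // single UTF-16 code unit

        System.out.println("UTF-8 bytes:  " + compareUtf8Bytes(ss1, ss2));  // 1
        System.out.println("UTF-16 units: " + compareUtf16(ss1, ss2));      // -1
        System.out.println("code points:  " + compareCodePoints(ss1, ss2)); // 1
    }
}
```

The signs differ because the UTF-8 encoding of U+26CE3 starts with the byte 0xF0, which is greater than 0xEF (the first byte of U+F4D1), while its UTF-16 encoding starts with a high surrogate in the range 0xD800 to 0xDBFF, which is lower than 0xF4D1.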

This causes a bug when generating an HDT from a chunk of Wikidata: the file is created, but hdtVerify then reports errors on the object entries.

> .\rdf2hdt.bat .\chunk.nt.gz test.hdt
...
File converted in: 2 min 30 sec 463 ms 185 us
Total Triples: 49996305
Different subjects: 1206364
Different predicates: 3655
Different objects: 9917883
Common Subject/Object:603515
HDT saved to file in: 1 sec 242 ms 73 us

> .\hdtVerify.bat .\test.hdt
Checking subject entries
Checking predicate entries
Checking object entries
ERRA: "????"@zh-hant / "??"@lzh
ERRB: "????"@zh-hant / "??"@lzh
ERRA: "????"@zh-hant / "???"@lzh
ERRB: "????"@zh-hant / "???"@lzh
ERRA: "???????"@zh-hant / "?????"@got
ERRB: "???????"@zh-hant / "?????"@got
Checking shared entries