I've noticed that comparing a character that needs a surrogate pair in UTF-16 (a code point above U+FFFF), for example 𦳣 (U+26CE3), with one that doesn't, for example the private-use character U+F4D1, gives different results with CompactString and with String.
Here is the code to reproduce it; I build the strings from their code points:
String ss1 = new String(Character.toChars(0x26ce3)); // 𦳣
String ss2 = new String(Character.toChars(0xf4d1)); // U+F4D1 (private use area)
CompactString b1 = new CompactString(ss1);
CompactString b2 = new CompactString(ss2);
assertEquals(ss1, b1.toString());
assertEquals(ss2, b2.toString());
// Clamp both results to [-1, 1] so that only the sign is compared
int cmpByte = Math.max(-1, Math.min(1, b1.compareTo(b2)));
int cmpStr = Math.max(-1, Math.min(1, b1.toString().compareTo(b2.toString())));
assertEquals(cmpStr, cmpByte);
// java.lang.AssertionError:
// Expected :-1
// Actual :1
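A possible explanation, as a minimal sketch: I'm assuming here that CompactString.compareTo compares the underlying UTF-8 bytes, which I haven't confirmed. String.compareTo compares UTF-16 code units, so the high surrogate of 𦳣 (0xD85B) sorts before 0xF4D1, while in UTF-8 the first byte of 𦳣 (0xF0) sorts after the first byte of U+F4D1 (0xEF). The class SurrogateOrder and its methods are hypothetical helpers written for this illustration and use only the JDK:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the HDT code base.
class SurrogateOrder {
    // Sign of String.compareTo, which compares UTF-16 code units.
    static int utf16Order(String a, String b) {
        return Integer.signum(a.compareTo(b));
    }

    // Sign of an unsigned, byte-by-byte comparison of the UTF-8 encodings
    // (assumed to be what a byte-based string comparison would do).
    static int utf8Order(String a, String b) {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        int n = Math.min(x.length, y.length);
        for (int k = 0; k < n; k++) {
            int d = (x[k] & 0xFF) - (y[k] & 0xFF);
            if (d != 0) {
                return Integer.signum(d);
            }
        }
        return Integer.signum(x.length - y.length);
    }

    // Sign of a comparison by Unicode code point; this agrees with the
    // UTF-8 byte order.
    static int codePointOrder(String a, String b) {
        int i = 0;
        int j = 0;
        while (i < a.length() && j < b.length()) {
            int ca = a.codePointAt(i);
            int cb = b.codePointAt(j);
            if (ca != cb) {
                return ca < cb ? -1 : 1;
            }
            i += Character.charCount(ca);
            j += Character.charCount(cb);
        }
        // One string is a prefix of the other: the longer one sorts last.
        return Integer.signum((a.length() - i) - (b.length() - j));
    }

    public static void main(String[] args) {
        String ss1 = new String(Character.toChars(0x26ce3)); // needs a surrogate pair
        String ss2 = new String(Character.toChars(0xf4d1));  // single UTF-16 code unit
        System.out.println("UTF-16 order:     " + utf16Order(ss1, ss2));
        System.out.println("UTF-8 order:      " + utf8Order(ss1, ss2));
        System.out.println("Code point order: " + codePointOrder(ss1, ss2));
    }
}
```

If that assumption holds, the two orders can only disagree when a code point above U+FFFF is compared with one in the range U+E000 to U+FFFF, so either the verifier or the dictionary would have to use the same order consistently.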
This causes a bug when generating an HDT from a chunk of Wikidata:
> .\rdf2hdt.bat .\chunk.nt.gz test.hdt
...
File converted in: 2 min 30 sec 463 ms 185 us
Total Triples: 49996305
Different subjects: 1206364
Different predicates: 3655
Different objects: 9917883
Common Subject/Object:603515
HDT saved to file in: 1 sec 242 ms 73 us
> .\hdtVerify.bat .\test.hdt
Checking subject entries
Checking predicate entries
Checking object entries
ERRA: "????"@zh-hant / "??"@lzh
ERRB: "????"@zh-hant / "??"@lzh
ERRA: "????"@zh-hant / "???"@lzh
ERRB: "????"@zh-hant / "???"@lzh
ERRA: "???????"@zh-hant / "?????"@got
ERRB: "???????"@zh-hant / "?????"@got
Checking shared entries