rdfhdt / hdt-java

HDT Java library and tools.

hdtCat error in LongArrayDisk with large files #211

Open balhoff opened 2 weeks ago

balhoff commented 2 weeks ago

I'm trying to merge two HDT files using hdtCat.sh. Each file has more than 13 billion triples:

After about 25 hours I get this error:

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Index -4 out of bounds for length 29
    at org.rdfhdt.hdt.util.disk.LongArrayDisk.get(LongArrayDisk.java:116)
    at org.rdfhdt.hdt.dictionary.impl.utilCat.CatMappingBack.set(CatMappingBack.java:77)
    at org.rdfhdt.hdt.dictionary.impl.FourSectionDictionaryCat.cat(FourSectionDictionaryCat.java:244)
    at org.rdfhdt.hdt.hdt.impl.HDTImpl.cat(HDTImpl.java:486)
    at org.rdfhdt.hdt.hdt.HDTManagerImpl.doHDTCat(HDTManagerImpl.java:329)
    at org.rdfhdt.hdt.hdt.HDTManager.catHDT(HDTManager.java:642)
    at org.rdfhdt.hdt.tools.HDTCat.cat(HDTCat.java:82)
    at org.rdfhdt.hdt.tools.HDTCat.execute(HDTCat.java:116)
    at org.rdfhdt.hdt.tools.HDTCat.main(HDTCat.java:184)

I tried both v3.0.10 and v3.0.9 with the same result. I can provide these files, but each is about 170 GB. I haven't run into this issue with any smaller files.
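
A negative index into a 29-element array is consistent with a large position being narrowed to a 32-bit `int` somewhere before it is turned into a chunk number. That is only one possible reading of the trace and is not confirmed anywhere in this thread. The sketch below is a standalone illustration, not the library's code: the class name, the 8-byte element size, and the 1 GiB chunk size are all assumptions chosen for the example.

```java
// Hypothetical illustration only; this is NOT code from hdt-java.
public class ChunkIndexOverflow {
    // Assumed layout: 8-byte longs stored in fixed-size 1 GiB chunks.
    static final long BYTES_PER_LONG = 8;
    static final long CHUNK_BYTES = 1L << 30;

    // Maps an element index to the number of the chunk that would hold it.
    static long chunkOf(long index) {
        return (index * BYTES_PER_LONG) / CHUNK_BYTES;
    }

    public static void main(String[] args) {
        long index = 3_700_000_000L;   // larger than Integer.MAX_VALUE
        int truncated = (int) index;   // narrowing cast wraps to a negative value

        System.out.println("chunk for long index:      " + chunkOf(index));      // 27
        System.out.println("index after int cast:      " + truncated);           // -594967296
        System.out.println("chunk for truncated index: " + chunkOf(truncated));  // -4
    }
}
```

With these assumed sizes, an index of 3.7 billion belongs in chunk 27, which fits in an array of 29 chunks, while the same index after an `int` cast maps to chunk -4. If something similar happens in the real code path, it would also explain why smaller files do not trigger the error.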

D063520 commented 2 weeks ago

Hi, could you try this out:

https://github.com/the-qa-company/qEndpoint/wiki/qEndpoint-CLI-commands#hdtdiffcat-qep-specific

It is an evolution of the hdtCat tool.

balhoff commented 2 weeks ago

@D063520 thank you for pointing that out, I hadn't come across it yet. I'm trying it now.

balhoff commented 2 weeks ago

@D063520 the qEndpoint tool worked! It seems a good bit faster as well, but it uses quite a bit more RAM. I had originally been using a max heap of 150 GB, and had to raise it three times before it finally worked with a 400 GB heap. Now I've got an HDT file containing 27.5 billion triples.

balhoff commented 2 weeks ago

@D063520 actually I used hdtCat.sh from your package, rather than hdtDiffCat. Are these different?

D063520 commented 1 week ago

@ate47

ate47 commented 1 week ago

If you pass -kcat, it's the same. Otherwise, by default the qep CLI uses the disk-optimized version and the rdfhdt CLI uses the in-memory version. The in-memory one is slow and inefficient.