rdfhdt / hdt-java

HDT Java library and tools.

Unsafe memory access #183

Closed Fukoros closed 1 year ago

Fukoros commented 1 year ago

Hi, first of all, many thanks for this great tool! We have a question about a Java error that occurs when using the HDT library to convert a large dataset to HDT, and we were wondering if you could help us with it. We are converting YAGO to HDT using multiple split files to reduce RAM usage (the pipeline is fully described in the file Create Data.txt). In the final concatenation of the HDT files, where we also ask for the index to be created, the HDT merge succeeded but the index creation failed with this error:

Exception in thread "main" java.lang.InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code
    at org.rdfhdt.hdt.util.disk.LongArrayDisk.<init>(LongArrayDisk.java:79)
    at org.rdfhdt.hdt.util.disk.LongArrayDisk.<init>(LongArrayDisk.java:51)
    at org.rdfhdt.hdt.util.disk.LongArrayDisk.<init>(LongArrayDisk.java:44)
    at org.rdfhdt.hdt.compact.sequence.SequenceLog64BigDisk.<init>(SequenceLog64BigDisk.java:60)
    at org.rdfhdt.hdt.triples.impl.BitmapTriples.createIndexObjectMemoryEfficient(BitmapTriples.java:448)
    at org.rdfhdt.hdt.triples.impl.BitmapTriples.generateIndex(BitmapTriples.java:661)
    at org.rdfhdt.hdt.hdt.impl.HDTImpl.loadOrCreateIndex(HDTImpl.java:538)
    at org.rdfhdt.hdt.hdt.HDTManagerImpl.doIndexedHDT(HDTManagerImpl.java:87)
    at org.rdfhdt.hdt.hdt.HDTManager.indexedHDT(HDTManager.java:261)
    at org.rdfhdt.hdt.tools.HDTCat.execute(HDTCat.java:109)
    at org.rdfhdt.hdt.tools.HDTCat.main(HDTCat.java:152)

The previous pipeline used the 3.0.5 release. We then tried the latest release to generate the index for this final file, but it failed with the same kind of error:

./bin/hdtSearch.sh ./../Yago/test/final_merge.hdt
Predicate Bitmap in 22 sec 863 ms 296 us
Count predicates in 1 min 22 sec 759 ms 978 us
Count Objects in 2 min 57 sec 972 ms 674 us Max was: 66908179
Bitmap in 9 sec 119 ms 620 us
Exception in thread "main" java.lang.InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code
    at org.rdfhdt.hdt.triples.impl.BitmapTriples.createIndexObjectMemoryEfficient(BitmapTriples.java:563)
    at org.rdfhdt.hdt.triples.impl.BitmapTriples.generateIndex(BitmapTriples.java:786)
    at org.rdfhdt.hdt.hdt.impl.HDTImpl.loadOrCreateIndex(HDTImpl.java:405)
    at org.rdfhdt.hdt.hdt.HDTManagerImpl.doMapIndexedHDT(HDTManagerImpl.java:107)
    at org.rdfhdt.hdt.hdt.HDTManager.mapIndexedHDT(HDTManager.java:212)
    at org.rdfhdt.hdt.tools.HdtSearch.execute(HdtSearch.java:142)
    at org.rdfhdt.hdt.tools.HdtSearch.main(HdtSearch.java:205)

We are using a server running Ubuntu 22.04.1 LTS with 128 GB of RAM, a 16 GB main disk, and a 1 TB SSD where all the data is stored.

ate47 commented 1 year ago

I'm not sure, but it might be because your main disk is too small. For now, you can't specify indexing options with the CLI. If you really need to, you have to add these options to your HDT creation script:

bitmaptriples.sequence.disk.location=indextmp
bitmaptriples.sequence.disk=true

This can be done in your last HDTCat call with the argument -options "bitmaptriples.sequence.disk.location=indextmp;bitmaptriples.sequence.disk=true"

Or via code: import the CORE+API modules and run this:

String location = "yourHdtFile.hdt";

HDTOptions spec = new HDTOptionsBase();
// indextmp is the location on disk to use for the on-disk index sequence
spec.set("bitmaptriples.sequence.disk.location", "indextmp");
spec.set("bitmaptriples.sequence.disk", "true");

HDTManager.mapIndexedHDT(location, spec, null).close();
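
The -options string and the spec.set(...) calls above carry the same two key=value settings, with the CLI form joining them by semicolons. The sketch below is only an illustration of that mapping using the standard library; OptionStringDemo and parse are hypothetical names, and this is not the parser hdt-java itself uses.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of hdt-java: splits "k1=v1;k2=v2" into pairs.
final class OptionStringDemo {

    static Map<String, String> parse(String options) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String pair : options.split(";")) {
            String trimmed = pair.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate a trailing or doubled semicolon
            }
            int eq = trimmed.indexOf('=');
            if (eq <= 0) {
                throw new IllegalArgumentException("Expected key=value, got: " + trimmed);
            }
            result.put(trimmed.substring(0, eq).trim(), trimmed.substring(eq + 1).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        String cli = "bitmaptriples.sequence.disk.location=indextmp;bitmaptriples.sequence.disk=true";
        // Each printed pair corresponds to one spec.set(key, value) call.
        parse(cli).forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```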

Edit for the devs: It is because of this line:

https://github.com/rdfhdt/hdt-java/blob/febc5677793217866297d9012b67d28181c7974c/hdt-java-core/src/main/java/org/rdfhdt/hdt/triples/impl/BitmapTriples.java#L548

The object array is forced to be on disk, but by default its temporary file is created in the /tmp directory, which is on the main disk.
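
To check whether this diagnosis applies on a given machine, one option is to compare the free space on the JVM's default temporary directory with the free space on the directory you intend to pass as bitmaptriples.sequence.disk.location. The following is a minimal sketch using only the Java standard library; TmpSpaceCheck, usableBytes and hasRoom are hypothetical names and nothing here is part of hdt-java.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical diagnostic helper, not part of hdt-java.
final class TmpSpaceCheck {

    /** Usable bytes on the file store that holds the given existing directory. */
    static long usableBytes(Path dir) {
        try {
            return Files.getFileStore(dir).getUsableSpace();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** True if the file store holding dir reports at least requiredBytes usable. */
    static boolean hasRoom(Path dir, long requiredBytes) {
        return usableBytes(dir) >= requiredBytes;
    }

    public static void main(String[] args) {
        // Default location for temporary files (usually /tmp on Linux).
        Path tmp = Paths.get(System.getProperty("java.io.tmpdir"));
        // Candidate index directory; pass it as the first argument, defaults to ".".
        Path candidate = Paths.get(args.length > 0 ? args[0] : ".");

        System.out.println(tmp + " usable bytes: " + usableBytes(tmp));
        System.out.println(candidate + " usable bytes: " + usableBytes(candidate));
    }
}
```

If the first number is far smaller than the second, pointing the index at a directory on the larger disk, as described above, is the likely fix.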

Fukoros commented 1 year ago

Thanks for your help, it solved the issue.