rdfhdt / hdt-java

HDT Java library and tools.

Disk co-indexing with KWay merge #187

Closed ate47 closed 1 year ago

ate47 commented 1 year ago

This PR contains a version of the first generateIndex method for the co-index (.hdt.index* file), implemented with the k-way merge sort method of #162. I was trying to index the Wikidata NT file (17.5B triples) on my machine with only 16GB of RAM and noticed that even the optimized implementation of #140, combined with the disk version of #178, could not handle such a large number of triples (too many random accesses).

With this version I was able to create the co-index in 5h17min (the other implementation was stopped after 8h).

A lot of changes were made; I also tried to optimize the previous implementation before working on the new one.

API changes

Add default implementations for HDTOptions and add options to configure the indexing.
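
As a hedged illustration only (the exact option keys and HDTManager overloads added by this PR are not listed here, so the keys below are hypothetical placeholders), configuring the indexing through HDTOptions could look roughly like this:

import org.rdfhdt.hdt.hdt.HDT;
import org.rdfhdt.hdt.hdt.HDTManager;
import org.rdfhdt.hdt.options.HDTOptions;
import org.rdfhdt.hdt.options.HDTSpecification;

public class IndexWithOptions {
    public static void main(String[] args) throws Exception {
        // HDTSpecification is the existing HDTOptions implementation;
        // the option keys below are placeholders, not the real keys added by this PR.
        HDTOptions spec = new HDTSpecification();
        spec.set("bitmaptriples.indexmethod", "disk");                    // hypothetical key
        spec.set("bitmaptriples.indexmethod.disk.location", "/tmp/work"); // hypothetical key

        // Map the HDT and build (or load) its co-index, passing the options;
        // the HDTOptions-aware overload is assumed to be part of these changes.
        try (HDT hdt = HDTManager.mapIndexedHDT("wikidata.hdt", spec, null)) {
            System.out.println(hdt.getTriples().getNumberOfElements() + " triples");
        }
    }
}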

CLI changes

Add -options [opt], -config [file] and -color to hdtSearch.

CORE changes

BitmapXBig

Implementation of Bitmap375Big and Bitmap64Big, which work with both disk and memory bitmaps and use LargeArrays if required, removing the 128B-element limit and allowing the Bitmap375 variant to have a disk co-index. The previous implementations are now deprecated and have been replaced in the library by the new ones.
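
As a minimal sketch of the rank/select usage these bitmaps provide (the Bitmap375Big.memory factory name is an assumption based on this PR; rank1/select1 come from the existing Bitmap interface):

import org.rdfhdt.hdt.compact.bitmap.Bitmap375Big;

public class BitmapBigExample {
    public static void main(String[] args) {
        // In-memory variant; a disk-backed variant also exists per this PR.
        // The factory method name is an assumption, not confirmed from the diff.
        Bitmap375Big bitmap = Bitmap375Big.memory(1_000_000L);

        bitmap.set(10, true);
        bitmap.set(42, true);
        bitmap.set(999_999L, true);

        // rank1(p): number of 1-bits in [0, p]; select1(n): position of the n-th 1-bit.
        System.out.println(bitmap.rank1(42));  // 2
        System.out.println(bitmap.select1(3)); // 999999
    }
}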

LongLargeArray bug

A bug was found in LongLargeArray preventing a fast zero set of the arrays; since that library is no longer actively maintained, a workaround was added to IOUtil.
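
The actual workaround lives in IOUtil; as a hedged sketch only (the helper below and its name are hypothetical, not the method added by the PR), a manual fill that sidesteps the library looks like this:

import pl.edu.icm.jlargearrays.LongLargeArray;

public final class LargeArrayFillSketch {
    /**
     * Hypothetical helper: fill a LongLargeArray element by element instead of
     * relying on the library's own fill. The real fix in this PR is in IOUtil.
     */
    public static void fill(LongLargeArray array, long value) {
        long length = array.length();
        for (long i = 0; i < length; i++) {
            array.setLong(i, value);
        }
    }

    public static void main(String[] args) {
        LongLargeArray array = new LongLargeArray(10_000_000L);
        fill(array, 0L); // zero set of the whole array
        System.out.println(array.getLong(123)); // 0
    }
}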

LongArrayDisk bug

Another bug was found in LongArrayDisk; the fix is the one-line change shown below.

// LongArrayDisk#192
mapping.position(startBuffer); // before
mapping.position(startBuffer + shift); // after
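
For context, a minimal sketch of the addressing involved (an illustration, not the actual LongArrayDisk code): the array is split across several mapped buffers, and an access has to add the in-buffer shift of the requested word, which is what the fixed line does.

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MappedLongSketch {
    // Illustrative mapping size; the real LongArrayDisk chooses its own.
    private static final long LONGS_PER_BUFFER = 1 << 20; // 2^20 longs = 8 MiB per mapping

    /** Read the long at index from an array split across several mappings. */
    public static long get(MappedByteBuffer[] mappings, long index) {
        int buffer = (int) (index / LONGS_PER_BUFFER);               // which mapping holds the word
        int shift = (int) ((index % LONGS_PER_BUFFER) * Long.BYTES); // byte offset inside that mapping
        // Forgetting the shift (as in the fixed bug) would address the start of the
        // mapping instead of the requested word.
        return mappings[buffer].getLong(shift);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("longs", ".bin");
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            MappedByteBuffer mapping = channel.map(FileChannel.MapMode.READ_WRITE, 0, LONGS_PER_BUFFER * Long.BYTES);
            mapping.putLong(42 * Long.BYTES, 123456789L);
            System.out.println(get(new MappedByteBuffer[]{mapping}, 42)); // 123456789
        } finally {
            Files.deleteIfExists(file);
        }
    }
}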

KWay sort to create index

The legacy code used ArrayLists to sort the seqZ/bitmapZ/seqY ids when creating the object index, but this took a lot of memory. The optimized implementation used Bitmap375 rank/select operations to reduce the memory usage, but it still needed a lot of memory and its accesses were more random, so a disk version couldn't work. I therefore reused the k-way merger of the disk generation method to sort the ids, as in the legacy version.
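
As a general illustration of the k-way merge idea (not the PR's actual merger classes), sorted chunks can be merged through a priority queue that always exposes the smallest remaining head:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

public class KWayMergeSketch {
    /** Merge k already-sorted iterators into one sorted stream (illustrative only). */
    public static List<Long> merge(List<Iterator<Long>> sortedChunks) {
        // The queue is ordered by each chunk's current head value.
        PriorityQueue<long[]> heads = new PriorityQueue<>(Comparator.comparingLong((long[] h) -> h[0]));
        for (int i = 0; i < sortedChunks.size(); i++) {
            Iterator<Long> it = sortedChunks.get(i);
            if (it.hasNext()) {
                heads.add(new long[]{it.next(), i}); // {value, chunk id}
            }
        }
        List<Long> out = new ArrayList<>();
        while (!heads.isEmpty()) {
            long[] head = heads.poll();
            out.add(head[0]);
            // Refill from the chunk that produced the smallest value.
            Iterator<Long> it = sortedChunks.get((int) head[1]);
            if (it.hasNext()) {
                heads.add(new long[]{it.next(), head[1]});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Iterator<Long>> chunks = List.of(
                List.of(1L, 4L, 9L).iterator(),
                List.of(2L, 3L, 10L).iterator(),
                List.of(5L, 6L, 7L).iterator());
        System.out.println(merge(chunks)); // [1, 2, 3, 4, 5, 6, 7, 9, 10]
    }
}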

To store the chunks during the sort, I applied a basic compression: the ids are stored as deltas instead of plain values. This saved 120GB for a 17.5B-element seqZ, leading to a maximum chunk size of 80GB, but it could probably be optimized further to reduce the time/disk usage.
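
As a hedged sketch of the delta idea (hdt-java ships a VByte codec in org.rdfhdt.hdt.compact.integer.VByte; whether the PR stores its deltas exactly this way is an assumption), writing sorted ids as variable-length deltas looks roughly like this:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.rdfhdt.hdt.compact.integer.VByte;

public class DeltaChunkSketch {
    /** Write sorted ids as VByte-encoded deltas: small gaps take only one or two bytes. */
    public static byte[] writeDeltas(long[] sortedIds) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long previous = 0;
        for (long id : sortedIds) {
            VByte.encode(out, id - previous); // store the gap, not the absolute value
            previous = id;
        }
        return out.toByteArray();
    }

    /** Read the deltas back and rebuild the absolute ids. */
    public static long[] readDeltas(byte[] data, int count) throws IOException {
        ByteArrayInputStream in = new ByteArrayInputStream(data);
        long[] ids = new long[count];
        long previous = 0;
        for (int i = 0; i < count; i++) {
            previous += VByte.decode(in);
            ids[i] = previous;
        }
        return ids;
    }

    public static void main(String[] args) throws IOException {
        long[] ids = {1_000_000L, 1_000_004L, 1_000_010L, 1_000_011L};
        byte[] compressed = writeDeltas(ids);
        System.out.println(compressed.length + " bytes"); // far fewer than 4 * 8 bytes
        System.out.println(java.util.Arrays.toString(readDeltas(compressed, ids.length)));
    }
}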

Tests

Obviously everything is tested with unit tests.

EDIT

After reindexing the Wikidata HDT, I have these results, which are more accurate than the ~5h figure described previously:

2023-01-18 19:33:43.583  INFO 3308 --- [nio-1234-exec-1] o.rdfhdt.hdt.triples.impl.BitmapTriples  : Pair sorted in 2 hour 36 min 13 sec 277 ms 753 us
2023-01-18 20:18:30.249  INFO 3308 --- [nio-1234-exec-1] o.rdfhdt.hdt.triples.impl.BitmapTriples  : indexZ/bitmapIndexZ completed in 44 min 46 sec 656 ms 818 us
2023-01-18 20:24:59.594  INFO 3308 --- [nio-1234-exec-1] o.rdfhdt.hdt.triples.impl.BitmapTriples  : Predicate count completed in 6 min 29 sec 327 ms 128 us
2023-01-18 20:24:59.604  INFO 3308 --- [nio-1234-exec-1] o.rdfhdt.hdt.triples.impl.BitmapTriples  : Index generated in 6 min 29 sec 340 ms 405 us
2023-01-18 20:52:32.084  INFO 3308 --- [nio-1234-exec-1] o.r.h.triples.impl.PredicateIndexArray   : Predicate Bitmap in 27 min 32 sec 364 ms 128 us
2023-01-18 21:49:38.413  INFO 3308 --- [nio-1234-exec-1] o.r.h.triples.impl.PredicateIndexArray   : Count predicates in 57 min 6 sec 304 ms 620 us
Index generated and saved in 5 hour 30 min 34 sec 753 ms 598 us
D063520 commented 1 year ago

great, thank you for this contribution!!!!