This PR contains a version of the first generateIndex method for the co-index (.hdt.index* file), implemented using the k-way merge sort method of #162. I was trying to index the Wikidata NT file (17.5B triples) on my machine with only 16GB of RAM and noticed that even the optimized implementation of #140, with the disk version of #178, couldn't handle that amount of triples (too many random accesses).
With this version I was able to create the co-index in 5h17min (the other implementation was stopped after 8h).
A lot of changes were made; I tried to optimize the previous implementation before working on the new one.
API changes
Add default implementations for HDTOptions and add options to configure the indexing:
bitmaptriples.indexmethod - Indexing method for the bitmap triples; possible values:
recommended - Recommended implementation (default value)
optimized - Memory-optimized option (current default for recommended)
legacy - Legacy implementation; fast, but memory inefficient
disk - Disk option; handles the indexing on disk to reduce memory usage
bitmaptriples.indexmethod.disk.compressWorker - Number of cores used to index the HDT with the disk index method
bitmaptriples.indexmethod.disk.chunkSize - Maximum size of a chunk for the disk index method
bitmaptriples.indexmethod.disk.fileBufferSize - Size of the file buffers for the disk index method
bitmaptriples.indexmethod.disk.maxFileOpen - Maximum number of files the disk index method can open at the same time
bitmaptriples.indexmethod.disk.kway - Log of the number of ways the system can merge with the disk index method
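As a sketch, an options file using the new keys could look like this (the tuning values below are illustrative, not recommendations):

```properties
# select the disk indexing method
bitmaptriples.indexmethod=disk
# illustrative tuning values
bitmaptriples.indexmethod.disk.compressWorker=4
bitmaptriples.indexmethod.disk.kway=3
```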
CLI changes
Add -options [opt], -config [file] and -color to hdtSearch.
CORE changes
BitmapXBig
Implementation of Bitmap375Big and Bitmap64Big, working with both disk and memory bitmaps. They use LargeArrays if required, removing the limit of 128B elements and allowing the Bitmap375 to have a disk co-index. The previous implementations are now deprecated and were replaced in the library with the new ones.
LongLargeArray bug
A bug was found in LongLargeArray preventing a fast 0-set of the arrays; since that library is not actively updated, a fix was added to IOUtil.
LongArrayDisk bug
Another bug was found in LongArrayDisk; the fix is included in this PR.
```java
// LongArrayDisk#192
mapping.position(startBuffer);         // before
mapping.position(startBuffer + shift); // after
```
KWay sort to create index
The legacy code used ArrayLists to sort the seqZ/bitmapZ/seqY IDs to create the object index, but that took a lot of memory. The optimized implementation used a Bitmap375 with rank/select operations to reduce memory usage, but it still needed a lot of memory and the accesses were more random, so a disk version couldn't work. So I used the k-way merger of the disk generation method to sort like the legacy version does.
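The k-way idea itself is simple: merge k pre-sorted chunks by repeatedly popping the smallest head from a min-heap. A minimal in-memory sketch (plain Java, not the PR's code, which streams chunks from disk):

```java
import java.util.*;

public class KWayMerge {
    // Merge k pre-sorted iterators into one sorted list using a min-heap.
    public static List<Long> merge(List<Iterator<Long>> chunks) {
        // heap entries: {current value, index of the chunk it came from}
        PriorityQueue<long[]> heap =
                new PriorityQueue<>(Comparator.comparingLong((long[] e) -> e[0]));
        for (int i = 0; i < chunks.size(); i++) {
            if (chunks.get(i).hasNext()) {
                heap.add(new long[]{chunks.get(i).next(), i});
            }
        }
        List<Long> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            long[] e = heap.poll();
            out.add(e[0]);
            // refill the heap from the chunk we just consumed from
            Iterator<Long> it = chunks.get((int) e[1]);
            if (it.hasNext()) {
                heap.add(new long[]{it.next(), e[1]});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Iterator<Long>> chunks = Arrays.asList(
                Arrays.asList(1L, 4L, 7L).iterator(),
                Arrays.asList(2L, 5L).iterator(),
                Arrays.asList(3L, 6L).iterator());
        System.out.println(merge(chunks)); // sorted union of the chunks
    }
}
```

With the chunks on disk, only k heads need to live in memory at once, which is what makes the approach work within 16GB.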
To store the chunks during the sort, I applied a basic compression, storing the IDs as deltas instead of plain values. It saved 120GB for a 17.5B-element seqZ, leading to a maximum chunk of 80GB, but maybe it can be optimized further to reduce the time/disk usage.
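A minimal sketch of the delta idea (plain Java, using a VByte coding for illustration; the PR's actual chunk format may differ): sorted IDs give small gaps, and small gaps fit in fewer bytes with a variable-length encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;

public class DeltaChunk {
    // VByte: 7 data bits per byte, high bit set on the final byte of a value
    static void writeVByte(ByteArrayOutputStream out, long v) {
        while (v > 0x7F) {
            out.write((int) (v & 0x7F));
            v >>>= 7;
        }
        out.write((int) (v | 0x80));
    }

    static long readVByte(ByteArrayInputStream in) {
        long v = 0;
        int shift = 0;
        while (true) {
            int b = in.read();
            v |= ((long) (b & 0x7F)) << shift;
            if ((b & 0x80) != 0) {
                return v;
            }
            shift += 7;
        }
    }

    // Store a sorted id sequence as VByte-coded deltas between consecutive ids.
    public static byte[] encode(long[] sortedIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long prev = 0;
        for (long id : sortedIds) {
            writeVByte(out, id - prev);
            prev = id;
        }
        return out.toByteArray();
    }

    public static long[] decode(byte[] data, int count) {
        ByteArrayInputStream in = new ByteArrayInputStream(data);
        long[] ids = new long[count];
        long prev = 0;
        for (int i = 0; i < count; i++) {
            prev += readVByte(in);
            ids[i] = prev;
        }
        return ids;
    }

    public static void main(String[] args) {
        long[] ids = {1_000_000L, 1_000_003L, 1_000_010L, 1_000_500L};
        byte[] packed = encode(ids);
        // close ids give small deltas, so far fewer than 8 bytes per id
        System.out.println(packed.length + " bytes for " + ids.length + " ids");
    }
}
```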
Tests
Obviously everything is tested with unit tests.
EDIT
After reindexing the Wikidata HDT, I have these results, more accurate than the 5h17min figure described previously:
2023-01-18 19:33:43.583 INFO 3308 --- [nio-1234-exec-1] o.rdfhdt.hdt.triples.impl.BitmapTriples : Pair sorted in 2 hour 36 min 13 sec 277 ms 753 us
2023-01-18 20:18:30.249 INFO 3308 --- [nio-1234-exec-1] o.rdfhdt.hdt.triples.impl.BitmapTriples : indexZ/bitmapIndexZ completed in 44 min 46 sec 656 ms 818 us
2023-01-18 20:24:59.594 INFO 3308 --- [nio-1234-exec-1] o.rdfhdt.hdt.triples.impl.BitmapTriples : Predicate count completed in 6 min 29 sec 327 ms 128 us
2023-01-18 20:24:59.604 INFO 3308 --- [nio-1234-exec-1] o.rdfhdt.hdt.triples.impl.BitmapTriples : Index generated in 6 min 29 sec 340 ms 405 us
2023-01-18 20:52:32.084 INFO 3308 --- [nio-1234-exec-1] o.r.h.triples.impl.PredicateIndexArray : Predicate Bitmap in 27 min 32 sec 364 ms 128 us
2023-01-18 21:49:38.413 INFO 3308 --- [nio-1234-exec-1] o.r.h.triples.impl.PredicateIndexArray : Count predicates in 57 min 6 sec 304 ms 620 us
Index generated and saved in 5 hour 30 min 34 sec 753 ms 598 us