rdfhdt / hdt-java

HDT Java library and tools.

generateHDTDisk : a merge sort on disk to create HDTs #162

Closed ate47 closed 1 year ago

ate47 commented 2 years ago

This pull request proposes a disk-based HDT generation method: generateHDTDisk.

This method uses a merge sort on disk to build the sections and the triples. It is only available for FourSectionDictionary-based HDTs. It allows creating an HDT without having enough memory to load the whole dataset into memory.

API Changes

HDTManager

In the HDTManager abstract class, we now have 4 new methods to implement:

protected abstract HDT doGenerateHDT(InputStream fileStream, String baseURI, RDFNotation rdfNotation, CompressionType compressionType, HDTOptions hdtFormat, ProgressListener listener) throws IOException, ParserException;
protected abstract HDT doGenerateHDTDisk(String rdfFileName, String baseURI, RDFNotation rdfNotation, CompressionType compressionType, HDTOptions hdtFormat, ProgressListener listener) throws IOException, ParserException;
protected abstract HDT doGenerateHDTDisk(InputStream fileStream, String baseURI, RDFNotation rdfNotation, CompressionType compressionType, HDTOptions hdtFormat, ProgressListener listener) throws IOException, ParserException;
protected abstract HDT doGenerateHDTDisk(Iterator<TripleString> iterator, String baseURI, HDTOptions hdtFormat, ProgressListener listener) throws IOException, ParserException;

A method consuming an InputStream was requested in #86, so I've added it for both the disk and the memory generation.

These methods also allow the use of compressed files via the new org.rdfhdt.hdt.enums.CompressionType enum, to decompress a stream without relying on the file name.

On top of that, multiple public overloads expose these implementations through HDTManager.generateHDT(...) and HDTManager.generateHDTDisk(...):

public static HDT generateHDT(InputStream fileStream, String baseURI, RDFNotation rdfNotation, HDTOptions hdtFormat, ProgressListener listener) throws IOException, ParserException;
public static HDT generateHDTDisk(String rdfFileName, String baseURI, RDFNotation rdfNotation, CompressionType compressionType, HDTOptions hdtFormat, ProgressListener listener) throws IOException, ParserException;
public static HDT generateHDTDisk(String rdfFileName, String baseURI, RDFNotation rdfNotation, HDTOptions hdtFormat, ProgressListener listener) throws IOException, ParserException;
public static HDT generateHDTDisk(String rdfFileName, String baseURI, HDTOptions hdtFormat, ProgressListener listener) throws IOException, ParserException;
public static HDT generateHDTDisk(InputStream fileStream, String baseURI, String filename, HDTOptions hdtFormat, ProgressListener listener) throws IOException, ParserException;
public static HDT generateHDTDisk(InputStream fileStream, String baseURI, RDFNotation rdfNotation, CompressionType compressionType, HDTOptions hdtFormat, ProgressListener listener) throws IOException, ParserException;
public static HDT generateHDTDisk(InputStream fileStream, String baseURI, RDFNotation rdfNotation, HDTOptions hdtFormat, ProgressListener listener) throws IOException, ParserException;
public static HDT generateHDTDisk(Iterator<TripleString> iterator, String baseURI, HDTOptions hdtFormat, ProgressListener listener) throws IOException, ParserException;
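
As an illustration, here is a minimal usage sketch of the file-based overload; the input path, base URI and output file are placeholders, and the listener lambda simply prints the progress:

import org.rdfhdt.hdt.hdt.HDT;
import org.rdfhdt.hdt.hdt.HDTManager;
import org.rdfhdt.hdt.options.HDTOptions;
import org.rdfhdt.hdt.options.HDTSpecification;

public class GenerateHDTDiskExample {
    public static void main(String[] args) throws Exception {
        HDTOptions spec = new HDTSpecification();
        // the generation can be tuned with the keys described below
        try (HDT hdt = HDTManager.generateHDTDisk(
                "dataset.nt.gz",        // placeholder input file
                "http://example.org/#", // base URI
                spec,
                (level, message) -> System.out.println(level + " - " + message))) {
            hdt.saveToHDT("dataset.hdt", null);
        }
    }
}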

org.rdfhdt.hdt.options.HDTOptionsKeys

To configure HDTOptions, users were previously asked to copy plain string keys from the documentation; instead, I've created a utility class exposing the key names as constants. I've also added some keys/values used by the generateHDT method.
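
For example (the raw string in the comment is an assumption about the constant's content, not taken from the class):

HDTOptions spec = new HDTSpecification();
// instead of copying a raw key such as "loader.disk.location" from the doc
spec.set(HDTOptionsKeys.LOADER_DISK_LOCATION_KEY, "/tmp/hdt-gen");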

UnicodeEscape fix

A small fix was made in this commit to the UnicodeEscape#escapeString(String, Appendable) method for the case where the value delimiter isn't specified (no "" or <>).
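
A minimal sketch of the now-supported case, assuming escapeString is static and lives in org.rdfhdt.hdt.util:

import java.io.IOException;
import org.rdfhdt.hdt.util.UnicodeEscape;

static String escapeBare(String value) throws IOException {
    StringBuilder out = new StringBuilder();
    // a bare value without "" or <> delimiters, the case fixed by this commit
    UnicodeEscape.escapeString(value, out);
    return out.toString();
}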

org.rdfhdt.hdt.listener.MultiThreadListener

The current ProgressListener implementation didn't take multi-threaded computation into account. To fix this, I've added a new ProgressListener type, MultiThreadListener: it works like a regular progress listener, but also carries the name of the originating thread.
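
A hedged sketch of the listener type (the exact method names may differ in the PR):

import org.rdfhdt.hdt.listener.ProgressListener;

public interface MultiThreadListener extends ProgressListener {
    // progress notification carrying the name of the originating thread
    void notifyProgress(String thread, float level, String message);

    // plain notifications are forwarded with the current thread's name
    @Override
    default void notifyProgress(float level, String message) {
        notifyProgress(Thread.currentThread().getName(), level, message);
    }
}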

An implementation was created in the HDT Java Command line Tools module.

Core Changes

PlainHeader

This pull request also contains a fix for loaded/mapped HDTs: the header didn't contain the baseUri.

Generate Disk

This method is split into multiple phases. The parser reads the RDF input only once, so the implementation is the same whether the source is a file name (String), an InputStream or an Iterator of TripleString.

Write triples/merge sections

For each triple, we assign a new id to each node ((s, sid), (p, pid), (o, oid)). These ids are attached to the components while the components are sorted into 3 section files with a merge sort. The ids are simply the index of the triple, so we don't need to store the triples themselves.

At the end, we have 3 sorted compressed section files with an id attached to each string (node, node_id): one for subjects, one for predicates and one for objects.
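
A hedged sketch of this pass; the class and helper names are illustrative, not the PR's, and the writing of the compressed run files is elided:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;

final class FirstPassSketch {
    static final class IndexedNode {
        final String node;
        final long tripleId; // the triple's index, so the triple itself isn't stored
        IndexedNode(String node, long tripleId) { this.node = node; this.tripleId = tripleId; }
    }

    static void flushSortedChunk(List<IndexedNode> chunk, String role) {
        chunk.sort(Comparator.comparing(n -> n.node));
        // here the real method writes the chunk to a compressed section file;
        // the run files are later combined with a merge sort
        chunk.clear();
    }

    static void firstPass(Iterator<String[]> triples, int chunkSize) {
        List<IndexedNode> s = new ArrayList<>(), p = new ArrayList<>(), o = new ArrayList<>();
        long tripleId = 1; // ids run from 1 to numTriples
        while (triples.hasNext()) {
            String[] t = triples.next();
            s.add(new IndexedNode(t[0], tripleId));
            p.add(new IndexedNode(t[1], tripleId));
            o.add(new IndexedNode(t[2], tripleId));
            tripleId++;
            if (s.size() >= chunkSize) {
                flushSortedChunk(s, "subjects");
                flushSortedChunk(p, "predicates");
                flushSortedChunk(o, "objects");
            }
        }
        flushSortedChunk(s, "subjects");
        flushSortedChunk(p, "predicates");
        flushSortedChunk(o, "objects");
    }
}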

Create sections/id map files

From these sorted files, we create the 4 dictionary sections, removing duplicates and extracting the shared subject/object elements.

At the same time, we fill 3 map files (SequenceLog64BigDisk) to be able to map an initial node id (sid, pid or oid) to its position in one of the 4 sections; using Sequences reduces the disk usage of the maps.

We mark duplicates with the 1st bit and shared elements with the 2nd bit; the remaining bits hold the id in the section for non-duplicates, or the id of the original node for duplicates.

So for example, if we have:

- 0b1100 → non-shared element with index 3 (0b11)
- 0b1110 → shared element with index 3 (0b11)
- 0b1101 → duplicate element; the section index is in the map at index 3 (0b11)
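
A small sketch of this encoding, reproducing the example values (the helper name is illustrative):

final class SectionIdCodec {
    static final long DUPLICATE_BIT = 0b01; // 1st bit: duplicate marker
    static final long SHARED_BIT    = 0b10; // 2nd bit: shared marker

    static long encode(long index, boolean shared, boolean duplicate) {
        return (index << 2) | (shared ? SHARED_BIT : 0) | (duplicate ? DUPLICATE_BIT : 0);
    }

    static long index(long value)          { return value >>> 2; }
    static boolean isShared(long value)    { return (value & SHARED_BIT) != 0; }
    static boolean isDuplicate(long value) { return (value & DUPLICATE_BIT) != 0; }
}
// encode(3, false, false) == 0b1100, encode(3, true, false) == 0b1110,
// encode(3, false, true) == 0b1101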

The dictionary is completed.

Map triples with section IDs/merge triples

During the first step, we created the section nodes with incremental ids 1..numTriples, so we simply translate these ids with the maps created during the second step, and then sort the triples with a merge sort.
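
A hedged sketch of the translation, reusing the SectionIdCodec sketch above and assuming Sequence#get(long); the shared-section offset handling is elided for brevity:

import org.rdfhdt.hdt.compact.sequence.Sequence;

static long sectionPosition(Sequence map, long nodeId) {
    long value = map.get(nodeId);
    if (SectionIdCodec.isDuplicate(value)) {
        // a duplicate stores the id of the original node: follow it
        return sectionPosition(map, SectionIdCodec.index(value));
    }
    return SectionIdCodec.index(value); // position in the section
}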

Create triples

With the triples sorted, we can create the bitmap representation of the triples.

The triples are completed.

Create header

We simply create the header from the Dictionary/Triples parts. The original size isn't computed the same way as in the in-memory generateHDT method, so the value can differ.

The header is completed, and with it the HDT.

Options

Discoverable through HDTOptionsKeys, the generate method can be configured with multiple option keys:

LOADER_DISK_COMPRESSION_MODE_KEY

Changes the sort method; it can take 2 values: a complete sort (sort all the chunks before merging them) or a partial sort (sort and merge the chunks while reading).

LOADER_DISK_COMPRESSION_WORKER_KEY

(For complete sort only)

The maximum number of workers used to merge the files.

LOADER_DISK_CHUNK_SIZE_KEY

The maximum size of a chunk for the merge sort; by default it is 85% of a third of the allocated RAM.

LOADER_DISK_LOCATION_KEY

Sets the working directory; by default a temporary folder is used. The directory is created (mkdirs) before use and deleted afterwards.

LOADER_DISK_FUTURE_HDT_LOCATION_KEY

Sets the future HDT location. If this value is set, the method writes the HDT to this file and maps it, which reduces RAM usage; by default, the method loads the HDT into memory without creating a file.
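
A sketch combining several of these keys; the paths are placeholders and the chunk-size line just mirrors the stated default:

HDTOptions spec = new HDTSpecification();
spec.set(HDTOptionsKeys.LOADER_DISK_LOCATION_KEY, "hdt-work-dir");
// write the HDT to disk and map it instead of loading it into memory
spec.set(HDTOptionsKeys.LOADER_DISK_FUTURE_HDT_LOCATION_KEY, "out/dataset.hdt");
// the stated default: 85% of a third of the allocated RAM
long chunk = (long) (Runtime.getRuntime().maxMemory() / 3 * 0.85);
spec.set(HDTOptionsKeys.LOADER_DISK_CHUNK_SIZE_KEY, String.valueOf(chunk));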

Tests

To test this method, I generate 2 HDTs with generateHDT and generateHDTDisk (with map/load and partial/complete sort) and check the equality of the 2 HDTs.
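
The equality check boils down to iterating both HDTs in order and comparing the triples pairwise; a hedged sketch (the helper is illustrative, using HDT#search, where empty patterns match everything):

import org.rdfhdt.hdt.hdt.HDT;
import org.rdfhdt.hdt.triples.IteratorTripleString;
import org.rdfhdt.hdt.triples.TripleString;

static void assertSameTriples(HDT expected, HDT actual) throws Exception {
    IteratorTripleString it1 = expected.search("", "", "");
    IteratorTripleString it2 = actual.search("", "", "");
    while (it1.hasNext() && it2.hasNext()) {
        TripleString t1 = it1.next();
        TripleString t2 = it2.next();
        // compare by string form to avoid CharSequence equality pitfalls
        if (!t1.toString().equals(t2.toString())) {
            throw new AssertionError(t1 + " != " + t2);
        }
    }
    if (it1.hasNext() || it2.hasNext()) {
        throw new AssertionError("different number of triples");
    }
}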

Some other tests are also present for the reader/writer of the compression files and for the mapping.

HDT Java Command line Tools Changes

Two new parameters were added to the rdf2hdt tool to enable and configure the disk-based generation.

For the disk generation, the new MultiThreadListener is used.

HDTVerify now also checks for duplicated elements and prints the current section.

mielvds commented 2 years ago

Hi @ate47 , looks neat! Question though: how does this compare to simply using an on-disk key-value store like RocksDB as a drop-in replacement for the in-memory data structures?

ate47 commented 2 years ago

> Hi @ate47 , looks neat! Question though: how does this compare to simply using an on-disk key-value store like RocksDB as a drop-in replacement for the in-memory data structures?

I think some tests are needed to answer this question, but my guess is that it would use more space on disk, because I compress the dictionaries/triples while parsing. I've never tried that technology, but from what I've read the section creation would probably be faster. For the triple compression, I don't know whether it could sort the triples without using too much memory (disk/RAM), so I would probably end up using the same algorithm as in this pull request, which would make it equivalent.