rdfhdt / hdt-java

HDT Java library and tools.

HDTCatTree + HDTGenDisk #179

Closed ate47 closed 1 year ago

ate47 commented 1 year ago

I've merged PRs #172 and #162 because the 2 algorithms work better together.

This pull request creates 2 new methods to generate HDTs, catTree and disk (plus some debug tools).

catTree creates small HDTs using the generateHDT method and HDTCats them together, reducing memory usage and making it possible to create HDTs that don't fit in memory.

disk uses merge sort to merge the sections and the triples. It is only available for FourSectionDictionary and MultiSectionDictionary based HDTs. It allows creating an HDT without having enough memory to hold it in memory.

API Changes

It adds 10 new methods to HDTManager and 7 abstract methods to implement:


public static HDT generateHDT(InputStream fileStream, String baseURI, RDFNotation rdfNotation, HDTOptions hdtFormat, ProgressListener listener) throws IOException, ParserException;
public static HDT generateHDTDisk(String rdfFileName, String baseURI, RDFNotation rdfNotation, CompressionType compressionType, HDTOptions hdtFormat, ProgressListener listener) throws IOException, ParserException;
public static HDT generateHDTDisk(String rdfFileName, String baseURI, RDFNotation rdfNotation, HDTOptions hdtFormat, ProgressListener listener) throws IOException, ParserException;
public static HDT generateHDTDisk(String rdfFileName, String baseURI, HDTOptions hdtFormat, ProgressListener listener) throws IOException, ParserException;
public static HDT generateHDTDisk(InputStream fileStream, String baseURI, String filename, HDTOptions hdtFormat, ProgressListener listener) throws IOException, ParserException;
public static HDT generateHDTDisk(InputStream fileStream, String baseURI, RDFNotation rdfNotation, CompressionType compressionType, HDTOptions hdtFormat, ProgressListener listener) throws IOException, ParserException;
public static HDT generateHDTDisk(InputStream fileStream, String baseURI, RDFNotation rdfNotation, HDTOptions hdtFormat, ProgressListener listener) throws IOException, ParserException;
public static HDT generateHDTDisk(Iterator<TripleString> iterator, String baseURI, HDTOptions hdtFormat, ProgressListener listener) throws IOException, ParserException;
public static HDT catTree(RDFFluxStop fluxStop, HDTSupplier supplier, String rdfFileName, String baseURI, RDFNotation rdfNotation, HDTOptions hdtFormat, ProgressListener listener) throws IOException, ParserException;
public static HDT catTree(RDFFluxStop fluxStop, HDTSupplier supplier, InputStream rdfStream, String baseURI, RDFNotation rdfNotation, HDTOptions hdtFormat, ProgressListener listener) throws IOException, ParserException;
public static HDT catTree(RDFFluxStop fluxStop, HDTSupplier supplier, Iterator<TripleString> iterator, String baseURI, HDTOptions hdtFormat, ProgressListener listener) throws IOException, ParserException;

protected abstract HDT doHDTCatTree(RDFFluxStop fluxStop, HDTSupplier supplier, String filename, String baseURI, RDFNotation rdfNotation, HDTOptions hdtFormat, ProgressListener listener) throws IOException, ParserException;
protected abstract HDT doHDTCatTree(RDFFluxStop fluxStop, HDTSupplier supplier, InputStream stream, String baseURI, RDFNotation rdfNotation, HDTOptions hdtFormat, ProgressListener listener) throws IOException, ParserException;
protected abstract HDT doHDTCatTree(RDFFluxStop fluxStop, HDTSupplier supplier, Iterator<TripleString> iterator, String baseURI, HDTOptions hdtFormat, ProgressListener listener) throws IOException, ParserException;

protected abstract HDT doGenerateHDT(InputStream fileStream, String baseURI, RDFNotation rdfNotation, CompressionType compressionType, HDTOptions hdtFormat, ProgressListener listener) throws IOException, ParserException;
protected abstract HDT doGenerateHDTDisk(String rdfFileName, String baseURI, RDFNotation rdfNotation, CompressionType compressionType, HDTOptions hdtFormat, ProgressListener listener) throws IOException, ParserException;
protected abstract HDT doGenerateHDTDisk(InputStream fileStream, String baseURI, RDFNotation rdfNotation, CompressionType compressionType, HDTOptions hdtFormat, ProgressListener listener) throws IOException, ParserException;
protected abstract HDT doGenerateHDTDisk(Iterator<TripleString> iterator, String baseURI, HDTOptions hdtFormat, ProgressListener listener) throws IOException, ParserException;
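For illustration, here is a minimal usage sketch of the disk-based generation using one of the overloads above (the file name, base URI and output path are placeholders):

import org.rdfhdt.hdt.enums.RDFNotation;
import org.rdfhdt.hdt.hdt.HDT;
import org.rdfhdt.hdt.hdt.HDTManager;
import org.rdfhdt.hdt.options.HDTSpecification;

public class GenerateHDTDiskExample {
    public static void main(String[] args) throws Exception {
        // default options; the disk-specific keys are described in the Options section below
        HDTSpecification spec = new HDTSpecification();
        HDT hdt = HDTManager.generateHDTDisk(
                "dataset.nt",            // placeholder RDF file
                "http://example.org/#",  // placeholder base URI
                RDFNotation.NTRIPLES,
                spec,
                null);                   // no progress listener
        try {
            hdt.saveToHDT("dataset.hdt", null); // write the generated HDT to disk
        } finally {
            hdt.close();
        }
    }
}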

It also adds 2 new classes: HDTSupplier to specify how to build the HDT and RDFFluxStop to specify when to stop reading the RDF stream.

Both HDTSupplier and RDFFluxStop have static methods to quickly create instances:

static RDFFluxStop noLimit();
static RDFFluxStop countLimit(long maxTriple);
static RDFFluxStop sizeLimit(long maxSize);
static HDTSupplier memory();

It's also possible to combine multiple limits with the

RDFFluxStop and(RDFFluxStop other);
RDFFluxStop or(RDFFluxStop other);

methods.
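As a usage sketch, the factory methods above can be combined and passed to one of the catTree overloads (the limits, file name and base URI are placeholder values):

// stop each flux after 1 000 000 triples or an estimated 100 MB, whichever comes first
RDFFluxStop fluxStop = RDFFluxStop.countLimit(1_000_000)
        .or(RDFFluxStop.sizeLimit(100_000_000L));

// build each intermediate HDT in memory, then HDTCat them together
HDT hdt = HDTManager.catTree(
        fluxStop,
        HDTSupplier.memory(),
        "dataset.nt",             // placeholder RDF file
        "http://example.org/#",   // placeholder base URI
        RDFNotation.NTRIPLES,
        new HDTSpecification(),
        null);                    // no progress listener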

The loader type can be set to disk, cat or cat-disk to use these new methods through the base generateHDT method.
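For example (a sketch only: the exact key and value constants should be checked in HDTOptionsKeys, the key name below is an assumption):

HDTSpecification spec = new HDTSpecification();
spec.set(HDTOptionsKeys.LOADER_TYPE_KEY, "disk"); // assumed key name; value can be "disk", "cat" or "cat-disk"
HDT hdt = HDTManager.generateHDT("dataset.nt", "http://example.org/#", RDFNotation.NTRIPLES, spec, null);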

org.rdfhdt.hdt.options.HDTOptionsKeys

To set HDTOptions values, one previously had to use plain string keys from the documentation; instead, I've created this utility class to provide the key names. I've also added the keys/values already used by the generateHDT method.

UnicodeEscape fix

A small fix was made in this commit to the UnicodeEscape#escapeString(String, Appendable) method for the case where the unicode delimiter isn't specified (no "" or <>).

org.rdfhdt.hdt.listener.MultiThreadListener

The current implementation of ProgressListener didn't take multi-threaded computations into account. To fix this, I've added a new ProgressListener type, the multi-thread listener. It works like a progress listener, but also reports the origin thread.

An implementation was created in the HDT Java Command line Tools module.
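A minimal sketch of what an implementation can look like, assuming the interface adds a notifyProgress overload that carries the origin thread's name (check the actual interface for the exact signature):

MultiThreadListener listener = new MultiThreadListener() {
    @Override
    public void notifyProgress(String thread, float level, String message) {
        // prefix each progress line with the reporting thread
        System.out.printf("[%s] %5.1f%% %s%n", thread, level, message);
    }

    @Override
    public void notifyProgress(float level, String message) {
        // single-thread notifications fall back to the current thread's name
        notifyProgress(Thread.currentThread().getName(), level, message);
    }
};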

Core changes

This pull request also contains a fix for loaded/mapped HDTs: the header wasn't containing the baseUri.

Generate Disk

This method is split into multiple phases. The parser reads the RDF input only once, so the implementation is the same for a file (String), an InputStream or an Iterator of TripleString.

Write triples/merge sections

For each triple, we assign a new id to each node ((s, sid), (p, pid), (o, oid)). These ids are attached to the components while the components are sorted into 3 section files with a merge sort. The ids are the triple numbers, so we don't need to store the triples themselves.

At the end, we have 3 sorted compressed section files with an id attached to each string (node, node_id), one each for subjects, predicates and objects.

Create sections/id map files

With the raw triples, we create the 4 sections, removing the duplicates and extracting the shared elements.

At the same time, we fill 3 map files (SequenceLog64BigDisk) to map the initial node id (sid, pid or oid) to its position in one of the 4 sections; we use a Sequence to reduce the disk usage of the maps.

We mark duplicates with the 1st bit and shared elements with the 2nd bit; the other bits hold the id in the section for non-duplicates, or the id of the original element for duplicates.

So for example, if we have:

0b1100 -> non-shared element with index 3 (0b11)
0b1110 -> shared element with index 3 (0b11)
0b1101 -> duplicate element, the section index is in the map at index 3 (0b11)
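A small decoding sketch of this layout (the masks and shift below follow the description in this section, not necessarily the actual implementation):

long value = 0b1110;                     // example: shared element
boolean duplicate = (value & 0b01) != 0; // false, 1st bit not set
boolean shared = (value & 0b10) != 0;    // true, 2nd bit set
long index = value >>> 2;                // remaining bits: 0b11 = 3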

The dictionary is completed.

Map triples with section IDs/merge triples

During the first step, we created the section nodes with incremental ids 1..numTriples, so we simply need to map them using the maps created during the second step and sort the triples again with a merge sort.

Create triples

With the triples sorted, we can create the bitmap representation of the triples.

The triples are completed.

Create header

Simply create the header from the Dictionary/Triples parts. The original size isn't computed the same way as in the in-memory generateHDT method, so the value can differ.

The header and the HDT are now complete.

Options

Findable in HDTOptionsKeys, the generate method can be configured with multiple option keys:

LOADER_DISK_COMPRESSION_MODE_KEY

Changes the sort method; it can take 2 values (complete or partial sort).

LOADER_DISK_COMPRESSION_WORKER_KEY

(For complete sort only)

The maximum number of workers used to merge the files.

LOADER_DISK_CHUNK_SIZE_KEY

The maximum size of a chunk to merge sort; by default it is 85% of one third of the allocated RAM.

LOADER_DISK_LOCATION_KEY

Sets the working directory; by default a temporary folder is used. The directory is created (mkdirs) before and deleted after usage.

LOADER_DISK_FUTURE_HDT_LOCATION_KEY

Sets the future HDT location. If this value is set, the method generates the HDT file and maps it, which reduces RAM usage; by default the method loads the HDT into memory without creating a file.
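Putting these together, a hedged configuration sketch for generateHDTDisk (paths and sizes are example values, not defaults):

HDTSpecification spec = new HDTSpecification();
spec.set(HDTOptionsKeys.LOADER_DISK_LOCATION_KEY, "/tmp/hdt-gen-work");            // working directory
spec.set(HDTOptionsKeys.LOADER_DISK_FUTURE_HDT_LOCATION_KEY, "/data/dataset.hdt"); // generate then map the HDT file
spec.set(HDTOptionsKeys.LOADER_DISK_CHUNK_SIZE_KEY, String.valueOf(256L * 1024 * 1024)); // 256 MiB chunks
spec.set(HDTOptionsKeys.LOADER_DISK_COMPRESSION_WORKER_KEY, "4");                  // up to 4 merge workers

HDT hdt = HDTManager.generateHDTDisk("dataset.nt", "http://example.org/#", spec, null);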

Tests

To test this method, I generate 2 HDTs with generateHDT and generateHDTDisk, with map/load and partial/complete sort, and check the equality of the 2 HDTs.

Some other tests are also present for the writer/reader of the compression files and for the mapping.

HDT Java Command line Tools Changes

Two new parameters were added to the rdf2hdt tool:

For the disk/tree generation, the new MultiThreadListener is used.

HDTVerify now works with MSC; it also checks for duplicated elements and prints the current section. I've added the hdtVerify.bat file to use hdtVerify on Windows.

D063520 commented 1 year ago

thank you very much for this contribution, it is great!