Support query of multiple HDT files from CLI

rdfhdt / hdt-java

HDT Java library and tools.

Support query of multiple HDT files from CLI #166

Open donpellegrino opened 2 years ago

donpellegrino commented 2 years ago

Querying HDT with SPARQL from the CLI only accepts a single HDT file at a time (https://github.com/rdfhdt/hdt-java/blob/master/hdt-jena/src/main/java/org/rdfhdt/hdtjena/cmd/HDTSparql.java). It would be a useful enhancement if multiple HDT files could be provided, and the query run over the aggregation.

One candidate implementation might use a Jena DatasetFactory for the aggregation, but I have not seen an example of how that might be used. If anyone can post an example of the correct use of Jena for this, then I should be able to implement the feature in HDTSparql.java.

ate47 commented 2 years ago

I think you can achieve that with the ModelFactory#createUnion(Model, Model) method, which returns a Model. Datasets are usually meant for named graphs.

Keep in mind, though, that the Union implementation works with only 2 models and uses a HashSet to store the triples it has already seen. Chaining multiple Unions, with one Union per HDT file, might consume a lot of memory.
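
To make the memory concern concrete, here is a minimal stdlib-only sketch of a two-way, duplicate-filtering union and what happens when such unions are chained. It is a toy stand-in, not Jena's code: the class `DedupUnion`, its methods, and the use of plain strings in place of triples are all hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Set;

/** Toy two-way union that filters duplicates with a HashSet (illustration only). */
final class DedupUnion implements Iterable<String> {
    private final Iterable<String> left;
    private final Iterable<String> right;
    // Number of entries held in the seen-set of the most recent iteration.
    private int seenSize;

    DedupUnion(Iterable<String> left, Iterable<String> right) {
        this.left = left;
        this.right = right;
    }

    int seenSize() {
        return seenSize;
    }

    @Override
    public Iterator<String> iterator() {
        Set<String> seen = new HashSet<>();
        Iterator<String> l = left.iterator();
        Iterator<String> r = right.iterator();
        return new Iterator<String>() {
            private String next = advance();

            private String advance() {
                // Every element from the left side is emitted and remembered.
                if (l.hasNext()) {
                    String t = l.next();
                    seen.add(t);
                    seenSize = seen.size();
                    return t;
                }
                // Elements from the right side are emitted only if the left side did not produce them.
                while (r.hasNext()) {
                    String t = r.next();
                    if (!seen.contains(t)) {
                        return t;
                    }
                }
                return null; // both sides exhausted
            }

            @Override
            public boolean hasNext() {
                return next != null;
            }

            @Override
            public String next() {
                if (next == null) {
                    throw new NoSuchElementException();
                }
                String current = next;
                next = advance();
                return current;
            }
        };
    }

    /** Drains an iterable into a list. */
    static List<String> toList(Iterable<String> source) {
        List<String> out = new ArrayList<>();
        for (String s : source) {
            out.add(s);
        }
        return out;
    }
}
```

Chaining `new DedupUnion(new DedupUnion(a, b), c)` keeps one seen-set per level: the inner set remembers everything from `a`, and the outer set remembers everything the inner union produced. With N sources chained this way, the earliest sources end up recorded in up to N-1 sets during a full scan, which is the cost being warned about here.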

Edit: A method internal to HDT-CORE would be better (and harder) to implement, if you can :)

donpellegrino commented 2 years ago

It looks like the Apache Jena API DatasetFactory, Dataset.addNamedModel, and [Dataset.getUnionModel](https://jena.apache.org/documentation/javadoc/arq/org/apache/jena/query/Dataset.html#getUnionModel()) could be combined as another approach. @ate47 - Do you have any thoughts on what the consequences or efficiency of ModelFactory.createUnion would be versus Dataset.getUnionModel?

I can take a look at HDT-CORE as well. @ate47 - do you have a class or function point you could suggest for me to use as a starting point?

ate47 commented 2 years ago

I'm not sure, but from what I remember, you need to run store updates on the main dataset to merge the union model. That means you need a Jena model, because the HDT model can't handle updates, and it will take a long time to load and be costly to manage in memory. I'm not an expert on this part, though, so you can try it if you want.

To learn how HDT works internally, I would suggest reading this submission about it. Then you can start with the dictionaries: the default implementation (org.rdfhdt.hdt.dictionary.impl.FourSectionDictionary) is the easiest to understand. After that, move on to the org.rdfhdt.hdt.compact packages and the way the bitmaps are used in the org.rdfhdt.hdt.triples.impl.BitmapTriples class, and finally read org.rdfhdt.hdt.hdt.impl.HDTImpl to see how everything is linked together.
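
As a warm-up before reading FourSectionDictionary, the idea behind a four-section dictionary can be sketched in a few lines of plain Java. This is a toy model based on the general HDT design (shared subject/object terms, subject-only terms, object-only terms, and predicates, each sorted and mapped to integer IDs), not the library's code. The class `ToyFourSectionDictionary`, its method names, and the `-1` "not found" value are assumptions made for illustration, and the sketch sorts with Java's natural string order rather than whatever ordering the library uses.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeSet;

/** Toy four-section dictionary: maps RDF terms to per-role integer IDs (illustration only). */
final class ToyFourSectionDictionary {
    private final List<String> shared;     // terms that appear as both subject and object
    private final List<String> subjects;   // terms that appear only as subject
    private final List<String> objects;    // terms that appear only as object
    private final List<String> predicates; // all predicate terms

    /** Each triple is a String[3] of {subject, predicate, object}. */
    ToyFourSectionDictionary(List<String[]> triples) {
        TreeSet<String> s = new TreeSet<>();
        TreeSet<String> p = new TreeSet<>();
        TreeSet<String> o = new TreeSet<>();
        for (String[] t : triples) {
            s.add(t[0]);
            p.add(t[1]);
            o.add(t[2]);
        }
        TreeSet<String> sh = new TreeSet<>(s);
        sh.retainAll(o);   // shared = subjects that are also objects
        s.removeAll(sh);   // leave subject-only terms
        o.removeAll(sh);   // leave object-only terms
        shared = new ArrayList<>(sh);
        subjects = new ArrayList<>(s);
        objects = new ArrayList<>(o);
        predicates = new ArrayList<>(p);
    }

    long subjectId(String term) {
        return roleId(term, subjects);
    }

    long objectId(String term) {
        return roleId(term, objects);
    }

    long predicateId(String term) {
        int i = Collections.binarySearch(predicates, term);
        return i >= 0 ? i + 1 : -1;
    }

    // Shared terms take IDs 1..|shared| in both roles; role-only terms continue after them.
    private long roleId(String term, List<String> roleOnly) {
        int i = Collections.binarySearch(shared, term);
        if (i >= 0) {
            return i + 1;
        }
        i = Collections.binarySearch(roleOnly, term);
        return i >= 0 ? shared.size() + i + 1 : -1;
    }
}
```

The point of the shared section is that a term used as both subject and object gets the same ID in either position, so joins between the subject and object columns can compare integers directly. Once the dictionary has turned terms into IDs, the triples component only has to store integers, which is where BitmapTriples comes in.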