Open donpellegrino opened 2 years ago
I think you can achieve that with the ModelFactory#createUnion(Model, Model) method; datasets are usually meant for named graphs.
However, the Union implementation works with only two models and uses a HashSet to track the triples it has already seen. Chaining multiple Unions, one per HDT file, might consume a lot of memory.
Edit: A method implemented inside HDT-CORE would be better (though harder to implement), if you can :)
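For reference, here is a minimal sketch of the chained-union idea, not a tested patch for HDTSparql.java. It assumes hdt-jena's HDTGraph wrapper and HDTManager.mapIndexedHDT (the calls the single-file CLI relies on) and a Jena 3.x or later classpath. The class name, the unionOf helper, and the argument layout are hypothetical, and only SELECT queries are handled.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.jena.query.Query;
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QueryFactory;
import org.apache.jena.query.ResultSetFormatter;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.rdfhdt.hdt.hdt.HDT;
import org.rdfhdt.hdt.hdt.HDTManager;
import org.rdfhdt.hdtjena.HDTGraph;

public class MultiHdtUnionSketch {

    // Fold any number of models into one view by nesting binary unions.
    // Each nesting level adds one more Union wrapper, which is where the
    // memory concern about chained unions comes from.
    static Model unionOf(List<Model> models) {
        Model union = ModelFactory.createDefaultModel(); // empty starting point
        for (Model m : models) {
            union = ModelFactory.createUnion(union, m);
        }
        return union;
    }

    // Hypothetical argument layout: <sparql query> <file1.hdt> [<file2.hdt> ...]
    public static void main(String[] args) throws IOException {
        String sparql = args[0];
        List<HDT> hdts = new ArrayList<>();
        List<Model> models = new ArrayList<>();
        try {
            for (int i = 1; i < args.length; i++) {
                // Map each HDT file (no progress listener) and expose it as a Jena model.
                HDT hdt = HDTManager.mapIndexedHDT(args[i], null);
                hdts.add(hdt);
                models.add(ModelFactory.createModelForGraph(new HDTGraph(hdt)));
            }
            Query query = QueryFactory.create(sparql);
            try (QueryExecution qe = QueryExecutionFactory.create(query, unionOf(models))) {
                // Only SELECT queries are handled in this sketch.
                ResultSetFormatter.out(System.out, qe.execSelect(), query);
            }
        } finally {
            for (HDT hdt : hdts) {
                hdt.close();
            }
        }
    }
}
```

The union is only read from here, so the fact that HDT-backed models cannot accept updates should not matter for plain queries.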
It looks like the Apache Jena API DatasetFactory, Dataset.addNamedModel, and [Dataset.getUnionModel](https://jena.apache.org/documentation/javadoc/arq/org/apache/jena/query/Dataset.html#getUnionModel()) could be combined as another approach. @ate47 - Do you have any thoughts on what the consequences or efficiency of ModelFactory.createUnion would be versus Dataset.getUnionModel?
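A minimal sketch of that Dataset-based combination, assuming a Jena version that provides Dataset.getUnionModel. In-memory models stand in for the HDT-backed ones, and the class name, helper name, and graph URIs are hypothetical. DatasetFactory.createGeneral is used because, per its Javadoc, models added with addNamedModel are linked rather than copied, which matters for large read-only sources; I have not measured how it compares with ModelFactory.createUnion.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.jena.query.Dataset;
import org.apache.jena.query.DatasetFactory;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;

public class DatasetUnionSketch {

    // Register every model as a named graph and return the union of all named graphs.
    static Model unionViaDataset(List<Model> models) {
        // General-purpose dataset: added models are held as links, not copied.
        Dataset ds = DatasetFactory.createGeneral();
        int i = 0;
        for (Model m : models) {
            // Hypothetical graph names; any distinct URIs would do.
            ds.addNamedModel("urn:x-example:source:" + i++, m);
        }
        return ds.getUnionModel();
    }

    public static void main(String[] args) {
        // In-memory stand-ins for two HDT-backed models.
        Model a = ModelFactory.createDefaultModel();
        Model b = ModelFactory.createDefaultModel();
        a.add(a.createResource("http://example.org/s1"), a.createProperty("http://example.org/p"), "x");
        b.add(b.createResource("http://example.org/s2"), b.createProperty("http://example.org/p"), "y");

        Model union = unionViaDataset(Arrays.asList(a, b));
        // Statements from both sources are visible through the union view.
        System.out.println(union.contains(
                union.createResource("http://example.org/s1"),
                union.createProperty("http://example.org/p"), "x"));
        System.out.println(union.contains(
                union.createResource("http://example.org/s2"),
                union.createProperty("http://example.org/p"), "y"));
    }
}
```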
I can take a look at HDT-CORE as well. @ate47 - do you have a class or function point you could suggest for me to use as a starting point?
I'm not sure, but as far as I remember, you need to run store updates on the main dataset to merge the union model. That means you need a Jena model, because the HDT model can't handle updates, and it would take a long time to load and be costly to keep in memory. I'm not an expert on this part, though, so feel free to try it.
To learn the internals of HDT, I would suggest reading this submission about it. Then you can start with the dictionaries: the default implementation (org.rdfhdt.hdt.dictionary.impl.FourSectionDictionary) is the easiest to understand. After that, move on to the org.rdfhdt.hdt.compact packages, see how the bitmaps are used in the org.rdfhdt.hdt.triples.impl.BitmapTriples class, and read org.rdfhdt.hdt.hdt.impl.HDTImpl to see how everything is linked together.
Querying HDT with SPARQL from the CLI only accepts a single HDT file at a time (https://github.com/rdfhdt/hdt-java/blob/master/hdt-jena/src/main/java/org/rdfhdt/hdtjena/cmd/HDTSparql.java). It would be a useful enhancement if multiple HDT files could be provided, and the query run over the aggregation.
One candidate implementation might use a Jena DatasetFactory for the aggregation, but I have not seen an example of how that might be used. If anyone can post an example of the correct use of Jena for this, then I should be able to implement the feature in HDTSparql.java.