spaziocodice / SolRDF

An RDF plugin for Solr
Apache License 2.0

Document count when indexing the same triples over and over again #103

Closed hansidm closed 8 years ago

hansidm commented 8 years ago

Document count is not consistent when indexing the same ontology over and over again. For instance, the oboe-core ontology (http://ecoinformatics.org/oboe/oboe.1.0/oboe-core.owl) loaded with the command:

curl -v http://localhost:7574/solr/store/update/bulk?commit=true -H "Content-Type: application/rdf+xml" --data-binary @oboe-core.owl

agazzarini commented 8 years ago

Hi @hansidm, it seems the whole question is about blank nodes. I don't have enough experience to assert that this is not a problem, so take that as a premise to what I'm going to illustrate. Using a simple main with an in-memory Jena Model (i.e. no SolRDF at all), assume you have a file /work/tmp/data.nt with the following content:

_:a_blank_node <:a_predicate> <:an_object> .

If you run this code

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;

// Each read() re-parses the same file into the same Model.
Model m1 = ModelFactory.createDefaultModel();

m1.read("file:///work/tmp/data.nt", "N-TRIPLES");
System.out.println(m1.size());

m1.read("file:///work/tmp/data.nt", "N-TRIPLES");
System.out.println(m1.size());

m1.read("file:///work/tmp/data.nt", "N-TRIPLES");
System.out.println(m1.size());

m1.read("file:///work/tmp/data.nt", "N-TRIPLES");
System.out.println(m1.size());

it will print

1
2
3
4

If, at the end, you print the Model using

m1.write(System.out, "N-TRIPLES");

the result will be something like this:

_:B1c659aefdfbc38db4b8a4b9b77b5428a <:a_predicate> <:an_object> .
_:B4f8bfeab0874829076de3e1a920ef1c0 <:a_predicate> <:an_object> .
_:Ba4e01cfd9afe0d188328cc7765642cd2 <:a_predicate> <:an_object> .
_:B6ad86e1f4f1a1d25e947df3eb4f33076 <:a_predicate> <:an_object> .

I said "something like" because you will see different identifiers for the blank nodes: a new identifier is generated on the fly each time the blank node is inserted into a Model.

So, while I'm still investigating this topic, at the moment I believe this is not a bug: since the triple's identity (s, p, o) is different each time it is inserted, it is never replaced, and that is exactly why the total size of your graph keeps growing.
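The effect can be reproduced without Jena at all: if each parse mints a fresh blank node label, the resulting (s, p, o) triples never compare equal, so a set of triples keeps growing. A minimal stdlib-only sketch (the Triple record, parse() helper, and label scheme are hypothetical stand-ins for what an RDF parser does, not Jena internals):

```java
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

public class BlankNodeGrowth {

    // A triple identified by its three components; records have value-based equality.
    record Triple(String s, String p, String o) {}

    // Simulates parsing "_:a_blank_node <:a_predicate> <:an_object> ." -
    // each call mints a fresh label for the blank node, as an RDF parser does.
    static Triple parse() {
        return new Triple("_:B" + UUID.randomUUID(), ":a_predicate", ":an_object");
    }

    public static void main(String[] args) {
        Set<Triple> model = new HashSet<>();
        for (int i = 1; i <= 4; i++) {
            // The subject label differs on every parse, so no triple is a duplicate.
            model.add(parse());
            System.out.println(model.size()); // prints 1, 2, 3, 4
        }
    }
}
```

If parse() returned a fixed subject instead, every add after the first would be a no-op and the size would stay at 1, which is the behaviour the reporter expected.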

Maybe moving the question to the mailing list could be useful, so we could ask how others are dealing with this scenario. What do you think?

agazzarini commented 8 years ago

After a few days I don't have any further information about this topic, so at the moment I have to assume this is related to the underlying RDF framework (i.e. Apache Jena), or simply "normal" behaviour within the RDF data model.

@hansidm as an alternative, you could load that ontology (or, in general, your dataset) into a specific named graph; that way, if you need to reload the whole dataset, you can simply clear the graph and reload it from scratch.
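The clear-and-reload idea can be sketched with a toy dataset keyed by graph name (a stdlib-only model; the graph URI and helper names are hypothetical, and SolRDF's actual named-graph support may differ):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.UUID;

public class NamedGraphReload {

    record Triple(String s, String p, String o) {}

    // Dataset: graph name -> set of triples in that graph.
    static final Map<String, Set<Triple>> dataset = new HashMap<>();

    static void load(String graph, List<Triple> triples) {
        dataset.computeIfAbsent(graph, g -> new HashSet<>()).addAll(triples);
    }

    // Reloading = drop the whole graph first, then load from scratch.
    static void reload(String graph, List<Triple> triples) {
        dataset.remove(graph);
        load(graph, triples);
    }

    public static void main(String[] args) {
        String graph = "urn:graph:oboe-core";
        load(graph, List.of(
                new Triple("_:B" + UUID.randomUUID(), ":a_predicate", ":an_object")));

        // A later parse mints a different blank node label, but since the graph
        // is emptied before loading, the count stays stable instead of growing.
        reload(graph, List.of(
                new Triple("_:B" + UUID.randomUUID(), ":a_predicate", ":an_object")));

        System.out.println(dataset.get(graph).size()); // prints 1
    }
}
```

Without the remove() step, the second load would behave like the repeated m1.read() calls above and the graph would grow on every reload.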

Does that make sense? In the meantime, if you agree, I'll close this issue.