hansidm closed this issue 8 years ago
Hi @hansidm, it seems the whole question is about blank nodes. I don't have enough experience to assert that this is not a problem, so take that as a premise to what I'm going to illustrate. Using a simple main with an in-memory Jena Model (i.e. no SolRDF at all), assume you have a file /work/tmp/data.nt with the following content:
_:a_blank_node <:a_predicate> <:an_object> .
If you run this code

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;

Model m1 = ModelFactory.createDefaultModel();
m1.read("file:///work/tmp/data.nt", "N-TRIPLES");
System.out.println(m1.size());
m1.read("file:///work/tmp/data.nt", "N-TRIPLES");
System.out.println(m1.size());
m1.read("file:///work/tmp/data.nt", "N-TRIPLES");
System.out.println(m1.size());
m1.read("file:///work/tmp/data.nt", "N-TRIPLES");
System.out.println(m1.size());
It will print
1
2
3
4
If, at the end, you print the Model using
m1.write(System.out, "N-TRIPLES");
the result will be something like this:
_:B1c659aefdfbc38db4b8a4b9b77b5428a <:a_predicate> <:an_object> .
_:B4f8bfeab0874829076de3e1a920ef1c0 <:a_predicate> <:an_object> .
_:Ba4e01cfd9afe0d188328cc7765642cd2 <:a_predicate> <:an_object> .
_:B6ad86e1f4f1a1d25e947df3eb4f33076 <:a_predicate> <:an_object> .
I said "something like" because you will see different identifiers for the blank nodes: a fresh blank node identifier is generated on the fly each time the data is inserted into a Model.
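To make this concrete, here is a small self-contained sketch (it uses inline N-Triples with made-up absolute example.org IRIs, since N-Triples requires absolute IRIs, and a hypothetical class name): two Models read from the same data are structurally identical, yet their union still contains two triples because the blank nodes get distinct identifiers.

```java
import java.io.StringReader;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;

public class BlankNodeIdentity {
    public static void main(String[] args) {
        String nt = "_:a_blank_node <http://example.org/a_predicate> <http://example.org/an_object> .\n";

        Model m1 = ModelFactory.createDefaultModel();
        m1.read(new StringReader(nt), null, "N-TRIPLES");

        Model m2 = ModelFactory.createDefaultModel();
        m2.read(new StringReader(nt), null, "N-TRIPLES");

        // Structurally the two graphs are the same...
        System.out.println(m1.isIsomorphicWith(m2)); // true

        // ...but the blank nodes carry different internal identifiers,
        // so a plain union keeps both triples instead of deduplicating.
        System.out.println(m1.union(m2).size()); // 2
    }
}
```

This is the same effect as the repeated read(...) calls above: equality of triples is checked on (s, p, o), and the blank node in position s is never "the same" node twice.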
So, while I'm still investigating this topic, at the moment I believe this is not an issue: since the triple identity (s, p, o) is different each time it is inserted, the triple won't be replaced, and that's exactly why the total size of your graph keeps increasing.
Maybe moving the question to the mailing list could be useful, so we could ask how others are dealing with this scenario. What do you think?
After a few days I don't have any further information about this topic, so at the moment I have to assume this is related to the underlying RDF framework (i.e. Apache Jena) or something that is "normal" within the RDF data model.
@hansidm as an alternative, you could load that ontology (or, in general, your dataset) into a specific named graph; that way, if you need to reload the whole dataset, you can simply clear the graph and reload it from scratch.
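A minimal sketch of that workaround with an in-memory Jena Dataset (the graph URI, the inline data, and the class name are made up for illustration): dropping the named graph before re-adding the data keeps the triple count stable across reloads.

```java
import java.io.StringReader;
import org.apache.jena.query.Dataset;
import org.apache.jena.query.DatasetFactory;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;

public class ReloadNamedGraph {
    public static void main(String[] args) {
        // Hypothetical graph name; any absolute IRI works.
        String graphUri = "http://example.org/graphs/oboe-core";
        // Same shape as the data.nt example, but with absolute IRIs.
        String nt = "_:a_blank_node <http://example.org/a_predicate> <http://example.org/an_object> .\n";

        Dataset dataset = DatasetFactory.create();

        // First load.
        dataset.addNamedModel(graphUri, read(nt));
        System.out.println(dataset.getNamedModel(graphUri).size()); // 1

        // Reload from scratch: drop the whole graph, then load again.
        dataset.removeNamedModel(graphUri);
        dataset.addNamedModel(graphUri, read(nt));
        System.out.println(dataset.getNamedModel(graphUri).size()); // still 1, not 2
    }

    private static Model read(String data) {
        Model m = ModelFactory.createDefaultModel();
        m.read(new StringReader(data), null, "N-TRIPLES");
        return m;
    }
}
```

Against a running SolRDF instance the same idea would apply at the SPARQL level (clear the graph, then re-index), but the in-memory version above is enough to show why the count stays constant.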
Does that make sense? In the meantime, if you agree, I'm closing this issue.
The document count is not consistent when indexing the same ontology over and over again. For instance, the oboe-core ontology - http://ecoinformatics.org/oboe/oboe.1.0/oboe-core.owl - loaded with the command: curl -v http://localhost:7574/solr/store/update/bulk?commit=true -H "Content-Type: application/rdf+xml" --data-binary @oboe-core.owl