Joe-Heffer-Shef closed this issue 3 years ago
This also seems to happen if I use the prov.model.ProvDocument.update
method:
import pathlib
import prov.model
RDF_PATH = pathlib.Path('test.rdf')
JSON_PATH = pathlib.Path('test.json')
# Make sure files don't already exist
RDF_PATH.unlink(missing_ok=True)
JSON_PATH.unlink(missing_ok=True)
doc = prov.model.ProvDocument()
entity = doc.entity('prov:test')
doc.serialize(str(RDF_PATH), format='rdf')
doc.serialize(str(JSON_PATH), format='json', indent=2)
# Repeatedly modify documents
for _ in range(3):
    # RDF: read, merge in a record with the same identifier, write back
    rdf_doc = prov.read(str(RDF_PATH), format='rdf')
    new_doc = prov.model.ProvDocument()
    new_doc.entity('prov:test', {'prov:label': 'test 1'})
    rdf_doc.update(new_doc)
    rdf_doc.serialize(str(RDF_PATH), format='rdf')
    # JSON: same steps, using the JSON serialisation
    json_doc = prov.read(str(JSON_PATH), format='json')
    new_doc = prov.model.ProvDocument()
    new_doc.entity('prov:test', {'prov:label': 'test 1'})
    json_doc.update(new_doc)
    json_doc.serialize(str(JSON_PATH), format='json', indent=2)
The RDF serialisation automatically merges resources having the same identifiers.
When writing a PROV document, you may have cases where you assert, say, an activity started at time t0 and at a later point, when the activity completes, you then assert the same activity ended at time t1. That would create two activity records having the same identifier (as they refer to the same activity).
The PROV package has a simplistic function to unify records having the same identifiers if a programmer needs that. Otherwise, records are left untouched and are retained in the same order in which they are asserted.
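To illustrate what unifying records with the same identifier means, here is a minimal pure-Python sketch. It is not the package's implementation: the `unify_records` helper, the `(identifier, attributes)` tuple layout, and the `ex:activity1` identifier are all assumptions made for illustration. In the package itself I believe the equivalent helper is `ProvDocument.unified()`, but please check the documentation for your installed version.

```python
def unify_records(records):
    """Merge records that share an identifier (hypothetical helper).

    `records` is a list of (identifier, attributes) tuples, where
    `attributes` is a list of (name, value) pairs. Identifiers keep
    the order in which they were first asserted, and duplicate
    attribute pairs are dropped.
    """
    merged = {}
    for identifier, attributes in records:
        bucket = merged.setdefault(identifier, [])
        for attr in attributes:
            if attr not in bucket:
                bucket.append(attr)
    return list(merged.items())


# The same activity asserted three times: once when it starts, once
# when it ends, and once more as a redundant repeat.
records = [
    ("ex:activity1", [("prov:startTime", "t0")]),
    ("ex:activity1", [("prov:endTime", "t1")]),
    ("ex:activity1", [("prov:endTime", "t1")]),
]

unified = unify_records(records)
for identifier, attributes in unified:
    print(identifier, attributes)
```

Without a step like this, all three records stay in the document in the order they were asserted, which matches the repeated entries seen in the JSON output.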
There seems to be a different behaviour when repeatedly loading/saving PROV documents using RDF and JSON.
If I run this test script:
This is the RDF result:
and this is the JSON result:
The RDF file seems to "update" as expected, while the JSON document has multiple redundant entries.