trungdong / prov

A Python library for W3C Provenance Data Model (PROV)
http://prov.readthedocs.io/
MIT License

JSON deserialisation/serialisation: objects are duplicated #145

Closed: Joe-Heffer-Shef closed this issue 3 years ago

Joe-Heffer-Shef commented 3 years ago

The RDF and JSON serialisers seem to behave differently when a PROV document is repeatedly loaded, modified and saved.

If I run this test script:

import pathlib

import prov.model

RDF_PATH = pathlib.Path('test.rdf')
JSON_PATH = pathlib.Path('test.json')

# Make sure files don't already exist
RDF_PATH.unlink(missing_ok=True)
JSON_PATH.unlink(missing_ok=True)

doc = prov.model.ProvDocument()
doc.entity('prov:test')

# Write the initial document in both formats
doc.serialize(str(RDF_PATH), format='rdf')
doc.serialize(str(JSON_PATH), format='json', indent=2)

# Repeatedly modify documents
for _ in range(3):
    # RDF
    rdf_doc = prov.read(str(RDF_PATH), format='rdf')
    rdf_doc.entity('prov:test', {'prov:label': 'test 1'})
    rdf_doc.serialize(str(RDF_PATH), format='rdf')

    # JSON
    json_doc = prov.read(str(JSON_PATH), format='json')
    json_doc.entity('prov:test', {'prov:label': 'test 1'})
    json_doc.serialize(str(JSON_PATH), format='json', indent=2)

This is the RDF result:

@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

{
    prov:test a prov:Entity ;
        rdfs:label "test 1"^^xsd:string .
}

and this is the JSON result:

{
  "entity": {
    "prov:test": [
      {},
      {
        "prov:label": "test 1"
      },
      {
        "prov:label": "test 1"
      },
      {
        "prov:label": "test 1"
      }
    ]
  }
}

The RDF file seems to "update" as expected, while the JSON document has multiple redundant entries.

Joe-Heffer-Shef commented 3 years ago

This also seems to happen if I use the prov.model.ProvDocument.update method:

import pathlib

import prov.model

RDF_PATH = pathlib.Path('test.rdf')
JSON_PATH = pathlib.Path('test.json')

# Make sure files don't already exist
RDF_PATH.unlink(missing_ok=True)
JSON_PATH.unlink(missing_ok=True)

doc = prov.model.ProvDocument()
doc.entity('prov:test')

# Write the initial document in both formats
doc.serialize(str(RDF_PATH), format='rdf')
doc.serialize(str(JSON_PATH), format='json', indent=2)

# Repeatedly modify documents
for _ in range(3):
    # RDF
    rdf_doc = prov.read(str(RDF_PATH), format='rdf')
    new_doc = prov.model.ProvDocument()
    new_doc.entity('prov:test', {'prov:label': 'test 1'})
    rdf_doc.update(new_doc)
    rdf_doc.serialize(str(RDF_PATH), format='rdf')

    # JSON
    json_doc = prov.read(str(JSON_PATH), format='json')
    new_doc = prov.model.ProvDocument()
    new_doc.entity('prov:test', {'prov:label': 'test 1'})
    json_doc.update(new_doc)
    json_doc.serialize(str(JSON_PATH), format='json', indent=2)

trungdong commented 3 years ago

The RDF serialisation automatically merges resources having the same identifiers.

When writing a PROV document, you may have cases where you assert, say, that an activity started at time t0 and, at a later point when the activity completes, that the same activity ended at time t1. That creates two activity records with the same identifier (as they refer to the same activity).

The PROV package has a simplistic function to unify records having the same identifiers if a programmer needs that. Otherwise, records are left untouched and are retained in the same order in which they are asserted.