vliz-be-opsci / pysubyt

python module for Linked Data production (aka semantic uplifting) through Templating
MIT License

inject process metadata (provenance tracking) into produced turtle files #28

Open laurianvm opened 2 years ago

laurianvm commented 2 years ago

(draft) We should be able to trace a record back to its origin and track versions, e.g. a data point that is altered after QC. To do so, a set of metadata triples should be produced (e.g. date, time, version of pysubyt, arguments, ...)

marc-portier commented 2 years ago

gave this some (a little) thought...

call optionally

It should be optional, and kept separate from the real data-flow, so a command line switch should be added to point to the provenance report to be generated.

-p path-to-prov-report.ttl
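
A minimal sketch of how such an optional switch could be wired up with `argparse`. Only the `-p` flag comes from the proposal above; the long name `--prov`, the `prov_path` attribute and the `build_parser` helper are assumptions made for illustration, not the actual pysubyt CLI.

```python
import argparse


def build_parser():
    # Hypothetical parser: the real pysubyt CLI has more options than shown here.
    parser = argparse.ArgumentParser(prog="pysubyt")
    parser.add_argument(
        "-p", "--prov",      # "--prov" is an assumed long name
        dest="prov_path",    # assumed attribute name
        metavar="PATH",
        default=None,        # None means no provenance report was requested
        help="write a provenance report (turtle) to PATH",
    )
    return parser


args = build_parser().parse_args(["-p", "path-to-prov-report.ttl"])
prov_enabled = args.prov_path is not None
```

Leaving the default at `None` keeps provenance tracking entirely out of the normal data-flow unless the switch is given.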

under control of template writer

It should only add provenance statements about selected nodes, controlled by the template designer. So the template designer needs a mechanism to "add" selected URIs to the provenance set. Maybe by wrapping that URI in a pass-through function like this in the template:

<{{provit(uritexpand("https://example.org/id/{#id}",_))}}> a ex:something.

This calls a new function that follows this general structure:

def provit(uri):
    # actual code to register the uri, associated to the current runtime record-event-and-context
    return uri  # to achieve the pass-through effect
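
To make the pass-through idea a bit more concrete, here is a rough sketch of what the registration side might look like. The `ProvRegistry` class, its `start_event` method and the `data.csv` source name are all hypothetical. The sketch assumes the engine would signal each new record before the template is rendered for it.

```python
class ProvRegistry:
    """Collects URIs registered via provit(), grouped per record event."""

    def __init__(self):
        self.events = []
        self._current = None

    def start_event(self, source, location):
        # Assumed hook: called by the engine each time it moves to a new record.
        self._current = {"source": source, "location": location, "produced": []}
        self.events.append(self._current)

    def provit(self, uri):
        # Register the uri against the current event (once), then return it
        # unchanged so the template output is not affected.
        if self._current is not None and uri not in self._current["produced"]:
            self._current["produced"].append(uri)
        return uri


registry = ProvRegistry()
registry.start_event(source="data.csv", location=1)  # hypothetical input
out = registry.provit("https://example.org/id/1")
```

When no event is active, `provit` still returns the uri untouched, so templates using it would keep working when no provenance report is requested.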

follow the template

We should eat our own dogfood, so the prov.ttl should be produced by a pysubyt template itself: we should have a built-in prov-template.ttl file inside the Python package that holds the template producing the output, based on an internal Python dict holding the prov info assembled during the run.

@laurianvm - if you agree with this approach, you might want to use this issue to draft / suggest the outlines of such python-dict and an appropriate template (and thus useful vocabs) :)

first ideas:

prov = {
  'about': { 'code': 'pysubyt@0.0.0', 'exects': '2021-11-23T21:15:52', ...} ,
  'context': {  ... stuff from the context , like flags ... } ,
  'inputs': { ... describing the files making up the sources of sets and _ ...} ,
  'events': [
    { 'source': ref to input-source,
      'location': some ref to line and or item-number in the set,
      'produced': [  ... list of  uri's that were registered through provit into this "event" ...]
    }
  ]
}

direction of link

I don't like the idea that we would add this kind of provenance info as properties to nodes we add, i.e. let us not reuse those as ?subj nodes that get more structs added to their shapes.

Instead, I would prefer the prov-context to stand on its own and link to these registered nodes as ?obj members of an array listing all the outcomes of the described prov-action.

# rather not
:registeredNodeA :producedIn :someContext .
:registeredNodeB :producedIn :someContext .

# but rather
:someContext :producedItems [:registeredNodeA, :registeredNodeB, ...] .
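
As a rough illustration of the preferred direction, the sketch below builds a single statement with the context as subject and the registered nodes as objects. The function name and the `run-1` context URI are hypothetical. It uses Turtle's comma-separated object list rather than the bracket notation above, since `[ ... ]` in Turtle denotes a blank node, and that swap is an assumption about the intent.

```python
def produced_items_statement(context_uri, produced_uris, predicate=":producedItems"):
    """Build one Turtle statement: context as subject, registered nodes as objects."""
    if not produced_uris:
        return ""  # nothing was registered, so there is nothing to state
    objects = ", ".join(f"<{uri}>" for uri in produced_uris)
    return f"<{context_uri}> {predicate} {objects} ."


line = produced_items_statement(
    "https://example.org/prov/run-1",  # hypothetical context URI
    ["https://example.org/id/1", "https://example.org/id/2"],
)
print(line)
# <https://example.org/prov/run-1> :producedItems <https://example.org/id/1>, <https://example.org/id/2> .
```

The registered nodes only ever appear in object position, so their own shapes stay free of provenance properties.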

implementation thoughts