Open cedricdcc opened 3 weeks ago
many interesting thoughts to discuss in here, but I don't completely see yet how this works out
some upfront remarks
do we actually want a dict of lists managing the entities, activities, agents? could we not go full-board triples and just build an internal graph? following that line of thinking, get_provenance() should return an rdflib.Graph, or even be exposed as a @property prov_graph?
the recurring Prov.method_name(self) constructs look strange? what is the benefit over self.method_name()?
self.generate_id() does not (and never will) use self, so why not make it a @staticmethod and omit the self argument?
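For illustration, the `@staticmethod` variant being suggested would look like this (the `urn:uuid:` prefix is an assumption; any ID scheme works):

```python
import uuid


class Prov:
    @staticmethod
    def generate_id(prefix: str = "urn:uuid:") -> str:
        # needs no instance state, hence @staticmethod and no self argument
        return f"{prefix}{uuid.uuid4()}"


# callable on the class itself, or on any instance
print(Prov.generate_id())
```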
the instance-bound decorator looks a bit fishy? is that a common thing? are there guides / best practices for it?
also unclear: how are we to manage those as instance member variables inside our classes, rather than as globals to the source code file / module?
furthering this topic, I would like to advance this with some top-down thinking as well:
from output-side --> what provenance triples do we want from the various processes we have? (query, subyt, harvest, syncfs, ...) @laurianvm could you prepare some cases, examples for those? (@cedricdcc I guess this reflects your suggestion to 'discuss prov model for py-sema'?)
from programmer pov --> how do we see this kind of common prov package actually make the work easier in the modules that need it? what would the effect be on query, subyt, syncfs, harvest, ...
unclear: what is the relation to the 'required task' on the rdflib.Store() object?
do we actually want a dict of lists managing the entities, activities, agents? could we not go full-board triples and just build an internal graph? following that line of thinking, get_provenance() should return an rdflib.Graph, or even be exposed as a @property prov_graph?
We can go full triples from the start; my last remark on having the rdflib.Store() can then be dropped, since that would become the internal graph.
the recurring Prov.method_name(self) constructs look strange? what is the benefit over self.method_name() ?
Naming can be discussed; this was a rough first draft, but the Prov. prefix can be dropped in a final implementation.
self.generate_id() does not (and never will) use self, so why not make it a @staticmethod and omit the self argument?
Good remark; in the final version this can be the case.
the instance-bound decorator looks a bit fishy? is that a common thing? are there guides / best practices for it?
also unclear: how are we to manage those as instance member variables inside our classes, rather than as globals to the source code file / module?
Managing provenance data as member variables inside classes, rather than as globals, is the better practice: it ensures encapsulation and avoids potential conflicts or unintended side effects. Using instance variables also keeps the code cleaner and more modular.
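A small stand-in example of why instance-level state avoids cross-talk between components (the `Harvest` class and its `prov_records` attribute are hypothetical, just illustrating the point; no rdflib needed for this):

```python
class Harvest:
    """Hypothetical component that owns its own provenance recorder,
    so two instances never share or clobber each other's records."""

    def __init__(self):
        # instance-level state, not a module-level global
        self.prov_records = []

    def run(self, source: str):
        # record what this particular instance used
        self.prov_records.append(("used", source))


a, b = Harvest(), Harvest()
a.run("http://example.org/a")
b.run("http://example.org/b")
print(a.prov_records)  # holds only a's record, untouched by b
```

With a module-level global list instead, both runs would interleave into one shared record, which is exactly the side effect being warned about.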
from output-side --> what provenance triples do we want from the various processes we have? (query, subyt, harvest, syncfs, ...) @laurianvm could you prepare some cases, examples for those? (@cedricdcc I guess this reflects your suggestion to 'discuss prov model for py-sema'?)
@marc-portier yes I would like this to be a joint effort of the whole team to decide upon the prov model
from programmer pov --> how do we see this kind of common prov package actually make the work easier in the modules that need it? what would the effect be on query, subyt, syncfs, harvest, ...
Use the decorator in the main functions of all the top-level folders, like query, discovery, sema, and bench, to track functions that produce some resource or write one away, as in commons.store.
All practical use cases still need to be reviewed, but I think this is a good starting point, since the decorators are easily modified to our needs.
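As a sketch of what "use the decorator in the main functions" could mean in practice: a decorator factory wraps an entry point and appends a record for each call. Everything here is illustrative (`track_prov`, the `records` list, and the `harvest` stand-in are assumptions, not py-sema code):

```python
import functools


def track_prov(recorder):
    """Hypothetical decorator factory: records each call of the wrapped
    function (name + output) onto the given recorder list."""

    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            recorder.append({"activity": func.__name__, "output": result})
            return result

        return wrapper

    return decorate


records = []


@track_prov(records)
def harvest(url):
    # stand-in for e.g. a harvest module's main entry point
    return f"downloaded:{url}"


harvest("http://example.org/data")
print(records)
```

In the real package the recorder would be the provenance graph rather than a plain list, but the wrapping pattern stays the same, which is why the decorators are easy to adapt.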
I've taken the liberty to update my first comment on the issue and modified the code according to some of your suggestions @marc-portier
inspiration from the RO-Crate community: https://arxiv.org/pdf/2312.07852v2
services to consider + tracking:
We need to enhance our monorepo to include provenance tracking using a provenance ontology. This involves creating a Prov class that can track the provenance of function calls and class operations within our Python codebase. Additionally, we need a translation step to export the recorded provenance data to a TTL (Turtle) file format.
For the example below I've taken the liberty of using PROV-O. NOTE that there may be mistakes in the terms used, but it's the technical implementation that counts.
Python script example (braindump, not tested):
With the provenance data stored as an RDF graph, you can run SPARQL queries to analyze it:
Explanation of the code:
1. Output Entity Recording:
2. Provenance Data Structure:
Example Usage:
This setup ensures that entities generated as outputs of functions are included in the provenance records, capturing the complete data lineage as per the PROV-O ontology.
Additional required tasks
References