spacepy / dbprocessing

Automated processing controller for heliophysics data

Provide provenance information in formal provenance model #65

Open jtniehof opened 3 years ago

jtniehof commented 3 years ago

dbprocessing stores "provenance" in the form of the command line used to generate a file and the file IDs of both its input files and the code that was run on those files. This is somewhat different from the formal definition and implementation of "provenance." @BaptisteCecconi has suggested presenting this in a way that is compatible with the IVOA provenance model (this presentation might also help).

Proposed enhancement

Anything that helps connect dbprocessing to the broader formal provenance community. This may involve providing output that's directly useful in a formal provenance model, either in the description of products/processes (dbp definition) or in the specific file creation. Support for using dbprocessing information to ease our users' integration with provenance would also be helpful (e.g. helper tools whose output is not directly usable as provenance, but that do some of the "grunt work").
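As a rough illustration of the "grunt work" idea, here is a minimal sketch that turns the information dbprocessing already stores for a file (command line, input file IDs, code ID) into a dictionary laid out loosely along the lines of W3C PROV-JSON, the model on which the IVOA provenance model builds. Everything here is an assumption for illustration: the `record` layout, the `dbp:` identifiers, and the choice to treat the code as one more "used" entity are hypothetical, not dbprocessing's API, and the output has not been validated against either standard.

```python
import json


def record_to_prov(record):
    """Convert a hypothetical per-file record into a PROV-style dict.

    `record` is an assumed layout, not a dbprocessing structure:
    file_id, code_id, input_file_ids, command_line.
    """
    file_ent = "dbp:file/{}".format(record["file_id"])
    code_ent = "dbp:code/{}".format(record["code_id"])
    # One activity per created file: the run that produced it.
    activity = "dbp:run/{}".format(record["file_id"])
    doc = {
        "entity": {file_ent: {}, code_ent: {}},
        "activity": {activity: {"dbp:command_line": record["command_line"]}},
        "used": {},
        "wasGeneratedBy": {
            "_:g1": {"prov:entity": file_ent, "prov:activity": activity},
        },
    }
    # Assumption: the code is modeled as an entity used by the activity.
    used = [code_ent]
    for input_id in record["input_file_ids"]:
        input_ent = "dbp:file/{}".format(input_id)
        doc["entity"][input_ent] = {}
        used.append(input_ent)
    for n, ent in enumerate(used, 1):
        doc["used"]["_:u{}".format(n)] = {
            "prov:activity": activity,
            "prov:entity": ent,
        }
    return doc


# Made-up example values, for illustration only.
example = {
    "file_id": 30,
    "code_id": 7,
    "input_file_ids": [10, 11],
    "command_line": "l1_to_l2.py in1 in2 out",
}
prov_doc = record_to_prov(example)
print(json.dumps(prov_doc, indent=2, sort_keys=True))
```

Whether the code belongs under "used", or is better described as a plan or agent, is exactly the kind of question the scoping work would need to settle.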

It will be worth looking into where IVOA connects with SPASE, as the formal product definition in Heliophysics is usually defined in SPASE.

Alternatives

This enhancement is pretty wide open, so alternatives are yet to be established and evaluated. The do-nothing option is always an alternative. An intermediate approach would be to provide tools that help expose the provenance information within the database, for example by improving the traceback scripts or by providing a dependency graph based on the database configuration. These may also be reasonable intermediate steps toward full formal provenance support.
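To make the dependency-graph idea concrete, here is a small sketch that builds a product-to-product graph from a process configuration and renders it as Graphviz DOT text. The `processes` dictionary is a hypothetical stand-in for what would be read from the database configuration; none of the names below are dbprocessing functions or real product names.

```python
def build_product_graph(processes):
    """Map each product to the set of products made directly from it."""
    graph = {}
    for proc in processes.values():
        # Make sure outputs with no downstream products still appear.
        graph.setdefault(proc["output"], set())
        for product in proc["inputs"]:
            graph.setdefault(product, set()).add(proc["output"])
    return graph


def to_dot(graph):
    """Render the graph as Graphviz DOT text, in sorted order."""
    lines = ["digraph products {"]
    for src in sorted(graph):
        for dst in sorted(graph[src]):
            lines.append('    "{}" -> "{}";'.format(src, dst))
    lines.append("}")
    return "\n".join(lines)


# Hypothetical configuration: process name -> input products and output.
processes = {
    "l0_to_l1": {"inputs": ["L0"], "output": "L1"},
    "l1_to_l2": {"inputs": ["L1", "ephemeris"], "output": "L2"},
    "l2_to_l3": {"inputs": ["L2"], "output": "L3"},
}
graph = build_product_graph(processes)
dot_text = to_dot(graph)
print(dot_text)
```

The DOT text can be passed to Graphviz for rendering. The same adjacency structure could also serve as a starting point for a formal description of the processing chain.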

OS, Python version, and dependency version information:

Linux-4.4.0-98-generic-x86_64-with-Ubuntu-16.04-xenial
sys.version_info(major=2, minor=7, micro=12, releaselevel='final', serial=0)
sqlalchemy=1.0.11

Version of dbprocessing

Current master (11a417e89ac8dd2168982e12008504c66c13f8c0)

Closure condition

This enhancement might result in several other enhancement issues, pull requests, and perhaps a project to organize it. This issue should therefore be closed once the approach is fully scoped out and other enhancements, projects, etc. have been opened that will collectively capture the full design, implementation, and documentation of this provenance.

jtniehof commented 3 years ago

Another related feature that might be interesting is the ability to export a short script that recreates a file from its earliest input. For example, assuming that the codes are in place on the system, the script would run L0 to L1, L1 to L2, and L2 to L3 to make the L3.
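A minimal sketch of how such an export might work, assuming each file's stored command line and input files are available as a simple dictionary. The `records` layout, file names, and commands are hypothetical. The walk is depth-first from the requested file back to its earliest inputs, emitting each command only after the commands for its inputs.

```python
def replay_commands(target, records):
    """Return the commands needed to rebuild `target`, inputs first."""
    ordered = []
    seen = set()

    def visit(name):
        # Files with no record are treated as raw inputs already on disk.
        if name in seen or name not in records:
            return
        seen.add(name)
        for parent in records[name]["inputs"]:
            visit(parent)
        ordered.append(records[name]["command_line"])

    visit(target)
    return ordered


def to_shell_script(commands):
    """Wrap the commands in a shell script that stops on first failure."""
    return "\n".join(["#!/bin/sh", "set -e"] + commands) + "\n"


# Hypothetical records: file name -> its inputs and stored command line.
records = {
    "data_L1.cdf": {
        "inputs": ["data_L0.bin"],
        "command_line": "l0_to_l1 data_L0.bin data_L1.cdf",
    },
    "data_L2.cdf": {
        "inputs": ["data_L1.cdf"],
        "command_line": "l1_to_l2 data_L1.cdf data_L2.cdf",
    },
    "data_L3.cdf": {
        "inputs": ["data_L2.cdf"],
        "command_line": "l2_to_l3 data_L2.cdf data_L3.cdf",
    },
}
commands = replay_commands("data_L3.cdf", records)
script = to_shell_script(commands)
print(script)
```

This sketch assumes the stored command lines can be replayed verbatim. Paths that differ between systems, or codes that are not installed, would need handling in a real implementation.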