psychoinformatics-de / datalad-concepts

Extend ontology with components for datalad run records #3

Open mih opened 9 months ago

mih commented 9 months ago

A run record holds information on the process responsible for a particular commit.

mih commented 8 months ago

With the provenance ontology component the groundwork is laid. The main connection to standard PROV will be the notion of a SoftwareAgent (the software that ran to produce a dataset revision). This agent description should somehow capture datalad-run as well as the "payload" software inside.

Candidate properties for linking the immediate software Agent with the payload Agent are:

mih commented 5 months ago

Additional thoughts on how to map a run record onto PROV.

We need to distinguish the planning aspects from the actual activity. The cmd property of a run record can be considered the prov:Plan. Ideally, this plan makes any parameters it has explicit.

This plan is associated with an activity. A prov:SoftwareAgent (datalad-run) is also associated with that activity.
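To make the plan/activity/agent association concrete, here is a minimal sketch in plain Python that builds a JSON-LD-like structure from a run record. It is an illustration, not a proposed implementation: the function name run_record_to_prov, the blank-node identifiers, and the use of rdfs:label to carry the command string are all assumptions. Only the PROV-O terms (prov:Plan, prov:Activity, prov:SoftwareAgent, prov:Association, prov:hadPlan, prov:agent, prov:wasAssociatedWith) come from the standard.

```python
def run_record_to_prov(record, commit_sha):
    """Map a run record (a dict with at least a 'cmd' key) onto PROV terms.

    Hypothetical helper: identifiers and labels are illustrative only.
    """
    # The command of the run record plays the role of the plan.
    plan = {
        "@id": f"_:plan-{commit_sha}",
        "@type": "prov:Plan",
        "rdfs:label": record["cmd"],
    }
    # datalad-run is the software agent that executed the plan.
    agent = {
        "@id": "_:datalad-run",
        "@type": "prov:SoftwareAgent",
        "rdfs:label": "datalad-run",
    }
    # The activity links agent and plan via a qualified association.
    activity = {
        "@id": f"_:activity-{commit_sha}",
        "@type": "prov:Activity",
        "prov:wasAssociatedWith": {"@id": agent["@id"]},
        "prov:qualifiedAssociation": {
            "@type": "prov:Association",
            "prov:agent": {"@id": agent["@id"]},
            "prov:hadPlan": {"@id": plan["@id"]},
        },
    }
    return {
        "@context": {
            "prov": "http://www.w3.org/ns/prov#",
            "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
        },
        "@graph": [plan, agent, activity],
    }
```

The payload software inside the command is deliberately left out here, since how to link it to the datalad-run agent is still an open question in this thread.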

When the subject of the provenance is a dataset worktree, we can always consider the activity to be a prov:Derivation when there was a parent commit.

When the subject of provenance is a file, we can consider the activity a prov:Generation whenever there were no declared inputs, and a prov:Derivation otherwise.
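The two rules above can be written down as a small decision function. This is only a sketch: the name classify_provenance and its parameters are hypothetical, and the fallback to prov:Generation for a worktree without a parent commit is an assumption, since that case is not covered above.

```python
def classify_provenance(subject, has_parent_commit=False, inputs=()):
    """Return the PROV term suggested for a run-record activity.

    subject: "worktree" or "file" (hypothetical labels for this sketch).
    has_parent_commit: only relevant for a worktree subject.
    inputs: declared inputs of the run record, only relevant for a file subject.
    """
    if subject == "worktree":
        # A worktree with a parent commit is derived from that parent.
        # ASSUMPTION: without a parent commit we fall back to Generation.
        return "prov:Derivation" if has_parent_commit else "prov:Generation"
    if subject == "file":
        # No declared inputs: the file was generated from scratch.
        # Declared inputs: the file was derived from them.
        return "prov:Derivation" if len(inputs) > 0 else "prov:Generation"
    raise ValueError(f"unknown provenance subject: {subject!r}")
```

For example, classify_provenance("file", inputs=["raw.csv"]) yields prov:Derivation, while the same call with no inputs yields prov:Generation.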

For one commit with a run record, we can create a PROV report for the commit, and also for individual outputs. They would share the same prov:Plan. However, in practice it may be too much to distinguish individual activities for generating a full tree vs. an individual file. Likely we would link both to the same activity. Maybe that also implies that we should not distinguish between Generation and Derivation, but stick to a plain Activity.
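A rough sketch of the "same activity" option, in plain Python: the commit and each declared output become separate entities that all point at one shared activity, with no Generation/Derivation distinction. The function name entities_for_commit and the identifier scheme are hypothetical. prov:Entity and prov:wasGeneratedBy are standard PROV-O terms, and the outputs key is assumed to hold the declared outputs of the run record.

```python
def entities_for_commit(record, commit_sha):
    """Describe a commit and its declared outputs as PROV entities.

    Hypothetical helper: every entity links to the same activity, so the
    plan attached to that activity is shared implicitly.
    """
    activity_ref = {"@id": f"_:activity-{commit_sha}"}
    # One entity for the dataset revision as a whole.
    entities = [
        {
            "@id": f"_:commit-{commit_sha}",
            "@type": "prov:Entity",
            "prov:wasGeneratedBy": activity_ref,
        }
    ]
    # One entity per declared output, reusing the same activity reference.
    for path in record.get("outputs", []):
        entities.append(
            {
                "@id": f"_:file-{commit_sha}-{path}",
                "@type": "prov:Entity",
                "prov:wasGeneratedBy": activity_ref,
            }
        )
    return entities
```

The trade-off is that a consumer can no longer tell from the link type alone whether an output had declared inputs; that information would have to be read from the activity or the plan.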