Provenance annotation

mih commented 9 months ago

Provenance records are a key feature of DataLad, and hence must be reflected in the data model too. Two different approaches exist, and possibly have to be supported in parallel.

(1) DataLad's native provenance recording is centered on annotating a particular commit. This means that the provenance record is associated with the complete state (or version) of a dataset. This is true even if a particular commit represents only a partial change of a dataset. Attributing provenance to a difference (i.e. from one commit to another) is the result of a post-recording analysis, not an actual part of the record.

(2) It can be useful to have a "flattened", single-version metadata record of a dataset. In such a record, the provenance of individual files (rather than, or in addition to, the full dataset version) would be annotated. This could be realized by attaching the provenance record to a representation of one or more files. The effective record would be the one associated with the last recorded version of a file.

The main difference between (1) and (2) is the (hierarchical) representation of information.

(1) would follow the Git data model. Each version of a dataset is represented by a "commit", and each commit refers to a "tree" that uniquely describes the complete dataset (linking versioned records of subtrees and files). A single provenance record is associated with a single commit. The full version history is discoverable by traversing the commit history.

(2) Rather than centering on dataset versions, this representation is file-tree focused. There would be a single "dataset" record with a version attribute, rather than a list of dataset version records. A file tree is a primary property of this dataset record. Each file in this tree is linked to a provenance record that refers to its last modification. Each of these provenance records could be linked to additional records of prior modifications.
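To make the difference in hierarchy concrete, here is a minimal sketch of both representations as plain Python dataclasses. All class, field, and function names are hypothetical and chosen for illustration only; they are not taken from the datalad-concepts schema, and a plain string stands in for a real provenance record.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, Optional, Tuple


# --- (1) commit-centered: one provenance record per dataset version ---

@dataclass(frozen=True)
class Commit:
    id: str
    tree: Dict[str, str]              # path -> content identifier (complete dataset state)
    parents: Tuple[str, ...] = ()
    provenance: Optional[str] = None  # placeholder for a real provenance record


def walk_history(commits: Dict[str, Commit], head: str) -> Iterator[Commit]:
    """Yield every commit reachable from `head` exactly once."""
    seen, stack = set(), [head]
    while stack:
        commit_id = stack.pop()
        if commit_id in seen:
            continue
        seen.add(commit_id)
        yield commits[commit_id]
        stack.extend(commits[commit_id].parents)


# --- (2) file-tree-focused: one dataset record, provenance per file ---

@dataclass
class FileProvenance:
    description: str
    previous: Optional["FileProvenance"] = None  # prior modification, if recorded


@dataclass
class DatasetRecord:
    version: str
    files: Dict[str, FileProvenance] = field(default_factory=dict)


def file_history(record: Optional[FileProvenance]) -> Iterator[FileProvenance]:
    """Yield a file's provenance records from latest to oldest."""
    while record is not None:
        yield record
        record = record.previous
```

In (1) the history is found by walking commits, and files only appear inside each commit's tree. In (2) the file tree is the entry point, and history is an optional chain hanging off each file.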

Each representation would have different primary use cases:

(1) Ideally suited for serializing a DataLad dataset, due to the conceptual and practical alignment with Git's data model. The primary challenge for such a serialization would be dealing with a "linearization" of history (picking the "main" ancestor in a multi-way merge vs. all of them, etc.). It would be relatively straightforward to deserialize such a record into a multi-version Git repo again.

(2) This representation of the "latest" state of a dataset is best aligned with metadata standards such as RO-Crate (https://www.researchobject.org/ro-crate/1.1/provenance.html) and many others that focus on a dataset description that is placed "next to" the actual data (archive). Provenance is described in a more "anecdotal" fashion, useful for human consumption and documentation. With this representation, it would be harder to create an environment to perform an actual re-execution of a provenance record.
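The "linearization" challenge mentioned for (1) can be illustrated with a small sketch. It assumes the history is available as a hypothetical mapping from commit id to a tuple of parent ids, and it treats the first parent as the "main" ancestor. That is only one possible choice (it mirrors what `git log --first-parent` does); the alternative of keeping all ancestors is shown alongside.

```python
from typing import Dict, List, Tuple

ParentMap = Dict[str, Tuple[str, ...]]  # hypothetical: commit id -> parent ids


def first_parent_chain(parents: ParentMap, head: str) -> List[str]:
    """Linear history: follow only the first parent of every commit."""
    chain = [head]
    while parents.get(chain[-1]):
        chain.append(parents[chain[-1]][0])
    return chain


def all_ancestors(parents: ParentMap, head: str) -> List[str]:
    """Full history: every commit reachable from `head`, each listed once."""
    seen, order, stack = set(), [], [head]
    while stack:
        commit_id = stack.pop()
        if commit_id in seen:
            continue
        seen.add(commit_id)
        order.append(commit_id)
        stack.extend(parents.get(commit_id, ()))
    return order


# A merge commit "m" with two parents that share the root "r".
graph = {"m": ("a", "b"), "a": ("r",), "b": ("r",), "r": ()}
print(first_parent_chain(graph, "m"))  # ['m', 'a', 'r']
print(sorted(all_ancestors(graph, "m")))
```

The first-parent chain drops commit `b` and with it any provenance record attached there, which is the information loss a serialization format would have to decide about.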

Generating (2) from (1) is somewhat simpler than generating (1) from (2).
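A rough sketch of the (1)-to-(2) direction follows, assuming an already linearized history (oldest first) in which every version carries its full tree and a single provenance record. The `Version` type and the `flatten` function are hypothetical names used for illustration, and a string stands in for the provenance record. Each file in the latest tree is attributed to the version that last changed its content, which is the kind of post-recording diff analysis described above.

```python
from typing import Dict, List, NamedTuple


class Version(NamedTuple):
    id: str
    tree: Dict[str, str]  # path -> content identifier
    provenance: str       # placeholder for a real provenance record


def flatten(history: List[Version]) -> Dict[str, str]:
    """Map each file of the latest version to the provenance of its last change."""
    if not history:
        return {}
    last_change: Dict[str, str] = {}
    previous_tree: Dict[str, str] = {}
    for version in history:
        for path, content in version.tree.items():
            # A file counts as changed if it is new or its content id differs.
            if previous_tree.get(path) != content:
                last_change[path] = version.provenance
        previous_tree = version.tree
    # Only files that still exist in the latest version are reported.
    return {path: last_change[path] for path in history[-1].tree}


history = [
    Version("v1", {"a.txt": "h1", "b.txt": "h1"}, "initial import"),
    Version("v2", {"a.txt": "h2", "b.txt": "h1"}, "rerun analysis"),
]
print(flatten(history))  # {'a.txt': 'rerun analysis', 'b.txt': 'initial import'}
```

The flattened result no longer contains the intermediate trees, so going back from (2) to (1) would mean reconstructing whole-dataset states from per-file chains. That is presumably part of why the reverse direction is the harder one.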

mih commented 9 months ago

https://ceur-ws.org/Vol-1035/iswc2013_demo_32.pdf has a suitable mapping of the provenance attributes encoded by Git to PROV-DM. This allows us to use this standard model for (1) verbatim, and for (2) in an adjusted setup.
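As an illustration of what such a mapping could look like, the sketch below turns a single commit into PROV-DM style statements. It assumes a commit becomes an activity, each changed file version an entity, and the author an agent. The relation names are standard PROV-DM terms, but the identifier scheme and the function itself are hypothetical, and the exact set of relations should be checked against the linked paper.

```python
from typing import Iterable, List, Tuple

Statement = Tuple[str, str, str]  # (subject, PROV relation, object)


def commit_to_prov(
    commit_id: str,
    author: str,
    parents: Iterable[str],
    changed_files: Iterable[str],
) -> List[Statement]:
    """Express one commit as PROV-style statements (illustrative only)."""
    activity = f"commit:{commit_id}"
    agent = f"agent:{author}"
    statements: List[Statement] = [(activity, "prov:wasAssociatedWith", agent)]
    for parent in parents:
        # Link the commit activity to the activities it builds on.
        statements.append((activity, "prov:wasInformedBy", f"commit:{parent}"))
    for path in changed_files:
        # One entity per file version, identified by path and commit.
        entity = f"file:{path}@{commit_id}"
        statements.append((entity, "prov:wasGeneratedBy", activity))
        statements.append((entity, "prov:wasAttributedTo", agent))
    return statements


for statement in commit_to_prov("c2", "mih", ["c1"], ["a.txt"]):
    print(statement)
```

For (1) these statements could be emitted per commit as-is. For (2) the per-file entities would be the starting point, with the commit activity attached to each of them.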

mih commented 9 months ago

This is largely done in main now.

psychoinformatics-de / datalad-concepts

Provenance annotation #13