Closed mih closed 9 months ago
https://ceur-ws.org/Vol-1035/iswc2013_demo_32.pdf has a suitable mapping of the provenance attributes encoded by Git to PROV-DM. This allows us to use this standard model verbatim for (1), and in an adjusted setup for (2).
This is largely done in main now.
Provenance records are a key feature of DataLad, hence they must be reflected in the data model too. Two different approaches exist, and they possibly have to be supported in parallel.
The main difference between (1) and (2) is the (hierarchical) representation of information.
(1) would follow the Git data model. Each version of a dataset is represented by a "commit", and each commit refers to a "tree" that uniquely describes the complete dataset (linking versioned records of subtrees and files). A single provenance record is associated with a single commit. The full version history is discoverable by traversing the commit history.
(2) Rather than centering on dataset versions, this representation is file tree focused. There would be a single "dataset" record with a version attribute, rather than a list of dataset version records. A file tree is a primary property of this dataset record. Each file in this tree is linked to a provenance record that refers to its last modification. Each of these provenance records could be linked to additional records of prior modifications.
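To make the structural difference concrete, here is a minimal sketch of both representations as Python dataclasses. All class and field names (`Provenance`, `Commit`, `FileProvenance`, `DatasetRecord`, `history`) are hypothetical and chosen for illustration only; they are not taken from the actual data model.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Provenance:
    """Hypothetical provenance record: who did what, and when."""
    agent: str
    activity: str
    timestamp: str  # ISO 8601


# --- Representation (1): commit-centric, mirrors Git's data model ---
@dataclass(frozen=True)
class Commit:
    """One dataset version with exactly one provenance record."""
    id: str
    tree: dict[str, str]  # path -> content identifier
    provenance: Provenance
    parents: tuple[str, ...] = ()


def history(commits: dict[str, Commit], head: str) -> list[str]:
    """Discover the version history by traversing parent links from `head`."""
    seen: list[str] = []
    stack = [head]
    while stack:
        cid = stack.pop()
        if cid in seen:
            continue
        seen.append(cid)
        stack.extend(commits[cid].parents)
    return seen


# --- Representation (2): file-tree-centric, a single dataset record ---
@dataclass
class FileProvenance:
    """Provenance of a file's last modification, optionally chained to prior ones."""
    record: Provenance
    prior: FileProvenance | None = None


@dataclass
class DatasetRecord:
    """The one dataset record; the version is an attribute, not a list."""
    version: str
    files: dict[str, FileProvenance] = field(default_factory=dict)


# Tiny example: two versions in (1), and the matching "latest" record in (2).
p1 = Provenance("alice", "add a.txt", "2024-01-01T00:00:00Z")
p2 = Provenance("bob", "edit a.txt", "2024-01-02T00:00:00Z")
commits = {
    "c1": Commit("c1", {"a.txt": "blob1"}, p1),
    "c2": Commit("c2", {"a.txt": "blob2"}, p2, parents=("c1",)),
}
latest = DatasetRecord(
    version="c2",
    files={"a.txt": FileProvenance(p2, prior=FileProvenance(p1))},
)
```

In (1) the history lives in the links between commits, while in (2) it lives in the chain of `prior` records hanging off each file.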
Each representation would have different primary use cases:
(1) Ideally suited for serializing a DataLad dataset, due to the conceptual and practical alignment with Git's data model. The primary challenge for such a serialization would be dealing with a "linearization" of history (pick the "main" ancestor in a multi-way merge vs. all of them, etc.). It would be relatively straightforward to deserialize such a record to a multi-version Git repo again.
(2) This representation of the "latest" state of a dataset is best aligned with metadata standards such as RO-Crate (https://www.researchobject.org/ro-crate/1.1/provenance.html) and many others that focus on describing a dataset in a record that is placed "next to" the actual data (archive). Provenance is described in a more "anecdotal" fashion, useful for human consumption and documentation. It would be harder to create an environment to perform an actual re-execution of a provenance record.
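The "linearization" choice mentioned for (1) can be illustrated with a small sketch. It assumes the history is available as a plain mapping from a commit id to its ordered list of parent ids, and that the first parent counts as the "main" ancestor (the same convention as `git log --first-parent`). The function names and the example graph are hypothetical.

```python
def first_parent_history(parents: dict[str, list[str]], head: str) -> list[str]:
    """Linearize history by following only the first ("main") parent."""
    out: list[str] = []
    cur: str | None = head
    while cur is not None:
        out.append(cur)
        ps = parents.get(cur, [])
        cur = ps[0] if ps else None
    return out


def all_ancestors(parents: dict[str, list[str]], head: str) -> set[str]:
    """Collect every commit reachable from `head`, across all merge parents."""
    seen: set[str] = set()
    stack = [head]
    while stack:
        cid = stack.pop()
        if cid in seen:
            continue
        seen.add(cid)
        stack.extend(parents.get(cid, []))
    return seen


# Hypothetical history: "m" merges "b" (main line) and "c" (side branch).
graph = {"a": [], "b": ["a"], "c": ["a"], "m": ["b", "c"]}
linear = first_parent_history(graph, "m")  # drops the side branch "c"
full = all_ancestors(graph, "m")           # keeps all four commits
```

The first variant yields a simple ordered list at the cost of dropping side-branch commits; the second keeps everything, but the serialization then has to encode a graph rather than a sequence.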
Generating (2) from (1) is somewhat simpler than generating (1) from (2).
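As a rough illustration of the (1) to (2) direction, the sketch below derives a file-tree-focused "latest" record from an already linearized list of commits. The input layout (dicts with `id` and `tree` keys), the output keys (`version`, `files`, `last_modified_by`, `prior`), and the function name `latest_view` are all assumptions made for this example, not part of any existing implementation.

```python
def latest_view(commits: list[dict]) -> dict:
    """Build a (2)-style record from (1)-style commits, ordered oldest first.

    Each commit is assumed to look like {"id": str, "tree": {path: content_id}}.
    """
    modified_in: dict[str, list[str]] = {}  # path -> commits that changed it
    prev_tree: dict[str, str] = {}
    for commit in commits:
        for path, content_id in commit["tree"].items():
            # A file counts as modified when it is new or its content changed.
            if prev_tree.get(path) != content_id:
                modified_in.setdefault(path, []).append(commit["id"])
        prev_tree = commit["tree"]

    head = commits[-1]
    return {
        "version": head["id"],
        "files": {
            path: {
                "last_modified_by": modified_in[path][-1],
                # Earlier modifications, most recent first.
                "prior": list(reversed(modified_in[path][:-1])),
            }
            for path in head["tree"]
        },
    }


example = [
    {"id": "c1", "tree": {"a.txt": "1", "b.txt": "1"}},
    {"id": "c2", "tree": {"a.txt": "2", "b.txt": "1"}},
    {"id": "c3", "tree": {"a.txt": "3", "b.txt": "1", "c.txt": "1"}},
]
view = latest_view(example)
```

One plausible reason the opposite direction is harder: a record like `view` only says which commits touched each file, so the complete tree of every intermediate version would have to be reassembled from per-file histories.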