mlcommons / croissant

Croissant is a high-level format for machine learning datasets that brings together four rich layers.
https://mlcommons.org/croissant
Apache License 2.0

Lineage / provenance representation #738

Open benjelloun opened 1 month ago

benjelloun commented 1 month ago

Define a mechanism to describe lineage of data / provenance information.

This mechanism should support multiple levels of granularity:

We would ideally reuse existing vocabularies such as PROV-O.
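
As a rough illustration of what reusing PROV-O could look like at the dataset level, here is a minimal sketch that builds a JSON-LD description of one dataset derived from another. It is not a proposed Croissant syntax. The PROV-O terms (`prov:Entity`, `prov:Activity`, `prov:wasDerivedFrom`, `prov:wasGeneratedBy`, `prov:used`) are real, but the helper `derived_dataset`, the `example.org` identifiers, and the way the terms are attached to an `sc:Dataset` node are assumptions made only for this example.

```python
import json

# Namespaces: PROV-O, schema.org and RDFS are real vocabularies.
PROV = "http://www.w3.org/ns/prov#"
SCHEMA = "https://schema.org/"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"


def derived_dataset(dataset_id, name, source_id, activity_label):
    """Hypothetical helper: describe a dataset derived from a source dataset.

    Returns a JSON-LD document (as a dict) with two nodes: the derived
    dataset, and the activity that produced it from the source.
    """
    activity_id = f"{dataset_id}#derivation"
    return {
        "@context": {"prov": PROV, "sc": SCHEMA, "rdfs": RDFS},
        "@graph": [
            {
                "@id": dataset_id,
                "@type": ["sc:Dataset", "prov:Entity"],
                "sc:name": name,
                # Direct dataset-to-dataset lineage link.
                "prov:wasDerivedFrom": {"@id": source_id},
                # Link to the activity that produced this dataset.
                "prov:wasGeneratedBy": {"@id": activity_id},
            },
            {
                "@id": activity_id,
                "@type": "prov:Activity",
                "rdfs:label": activity_label,
                # The activity consumed the source dataset.
                "prov:used": {"@id": source_id},
            },
        ],
    }


# Illustrative identifiers only; these datasets do not exist.
doc = derived_dataset(
    "https://example.org/datasets/reviews-clean",
    "reviews-clean",
    "https://example.org/datasets/reviews-raw",
    "deduplicate and drop empty rows",
)
print(json.dumps(doc, indent=2))
```

This only covers whole-dataset lineage. Finer levels of granularity would need the same kind of links attached to smaller units, and how to do that is exactly what this issue leaves open.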

goeffthomas commented 3 weeks ago

We don't do this today, but it relates to a feature I'd love to see on Kaggle: datasets linking to each other (or to models), showing which datasets are cleaned or otherwise processed versions of others, or which models were trained on which datasets. If there is a separate work stream to address this issue, please add me to any meetings, docs, etc. I don't have the capacity to lead the effort right now, but I'm very interested in participating.