Open benjelloun opened 2 months ago
We don't do this today, but it is related to a feature I'd love to see on Kaggle: datasets that link to each other (or to models), showing which datasets are cleaned/etc. versions of others, or which models were trained on which datasets. If we have a separate work stream to address this issue, please add me to any meetings/docs/etc.! I don't have the capacity to lead the effort right now, but I'm very interested in participating.
Last Wednesday in the Croissant WG meeting I pitched exactly this idea: I want to start by letting a croissant refer to other croissants. For example, I would make croissants for all 105 Common Crawl crawls (see #762), and people publishing cleaned ML training sets would write croissants that point at the CC crawl croissants.
Apparently DDI-CDI has this kind of thing built into it. That was the topic of last week's presentation at the Croissant meeting. The talk was by Arofan Gregory, and the slides are at https://docs.google.com/presentation/d/1-9sg1X8siHZCa4Zh_7vIOzreEz0-3d-d/edit?usp=sharing&ouid=104300626440986451054&rtpof=true&sd=true
Define a mechanism to describe lineage of data / provenance information.
This mechanism should support multiple levels of granularity:
We would ideally reuse existing vocabularies such as PROV-O.
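To make the PROV-O suggestion concrete, here is a rough sketch of what dataset-level lineage could look like in JSON-LD. It only uses `prov:wasDerivedFrom`, which is a real PROV-O property; everything else is an assumption for illustration: the helper names `describe_derived_dataset` and `upstream_ids` are hypothetical, the `example.org` URLs are placeholders, and attaching the property directly to a `sc:Dataset` node is not something the Croissant spec defines today. It covers only the coarsest granularity (whole dataset to whole dataset).

```python
import json

# Namespace IRIs for PROV-O and schema.org.
PROV_NS = "http://www.w3.org/ns/prov#"
SCHEMA_NS = "https://schema.org/"


def describe_derived_dataset(name, url, source_urls):
    """Build a JSON-LD description of a dataset and the datasets it was derived from.

    Hypothetical helper: the layout is an assumption, not part of the Croissant spec.
    """
    return {
        "@context": {"sc": SCHEMA_NS, "prov": PROV_NS},
        "@type": "sc:Dataset",
        "@id": url,
        "sc:name": name,
        # One reference per upstream dataset, expressed with PROV-O.
        "prov:wasDerivedFrom": [{"@id": source} for source in source_urls],
    }


def upstream_ids(description):
    """Return the identifiers of the datasets this description points back to."""
    return [ref["@id"] for ref in description.get("prov:wasDerivedFrom", [])]


# Placeholder URLs: a cleaned training set that points at a raw crawl.
cleaned = describe_derived_dataset(
    name="cleaned-web-text",
    url="https://example.org/datasets/cleaned-web-text",
    source_urls=["https://example.org/datasets/raw-crawl"],
)

print(json.dumps(cleaned, indent=2))
```

Finer granularity (for example, per-file or per-field derivation) would need additional structure that this sketch does not attempt.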