mlcommons / croissant

Croissant is a high-level format for machine learning datasets that brings together four rich layers.
https://mlcommons.org/croissant
Apache License 2.0

Croissant vocabulary for crawled datasets #762

Open wumpus opened 3 weeks ago

wumpus commented 3 weeks ago

Related to #738: I would like to create any new controlled language necessary to describe a crawled dataset.

I propose:

I have other interested users -- the ARDC (Alliance for Responsible Data Collection) would like to mandate a machine-readable metadata format for its users. This will serve a role similar to Croissant-RAI.

benjelloun commented 2 weeks ago

Can some or all of these crawls be thought of as different versions of the same dataset? If so, Croissant has support for representing versions, so you could model them that way. However, there is no mechanism currently available to enumerate all existing versions of a dataset.
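As a rough illustration of the versioning idea, the sketch below builds one JSON-LD record per crawl, all sharing a dataset name and differing only in the schema.org `version` field. The dataset name, URLs, and the `crawl_as_version` helper are hypothetical, and the records are deliberately incomplete: they omit the full Croissant `@context`, `conformsTo`, distributions, and record sets that a valid Croissant file would need.

```python
import json


def crawl_as_version(name, version, url):
    """Return a minimal, incomplete JSON-LD record describing one crawl.

    Hypothetical helper: only the schema.org fields relevant to
    versioning are included.
    """
    return {
        "@context": {"@vocab": "https://schema.org/"},
        "@type": "Dataset",
        "name": name,
        "version": version,
        "url": url,
    }


# Two crawls modeled as two versions of the same dataset (placeholder values).
crawls = [
    crawl_as_version("example-crawl", "1.0.0", "https://example.org/crawl/1.0.0"),
    crawl_as_version("example-crawl", "2.0.0", "https://example.org/crawl/2.0.0"),
]

# The list of versions is maintained by hand here, outside the metadata itself.
versions = [c["version"] for c in crawls]

print(json.dumps(crawls[0], indent=2))
```

Because nothing in the metadata enumerates sibling versions, the `versions` list in this sketch has to be tracked separately by whoever publishes the crawls.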

wumpus commented 2 weeks ago

I'm not sure how helpful that is for this task; it's more something you might do once I can write down the Croissant for one dataset.