mlcommons / croissant

Croissant is a high-level format for machine learning datasets that brings together four rich layers.
https://mlcommons.org/croissant
Apache License 2.0

Data-level annotations #737

Open benjelloun opened 2 months ago

benjelloun commented 2 months ago

Add a mechanism to Croissant to define data-level annotations. Annotations are a general mechanism to attach additional information to other pieces of data. We plan to use annotations for a number of use cases, including:

benjelloun commented 1 month ago

Strawman proposal

Make annotation a first-class property, so that we can clearly represent the fact that some contents of a RecordSet are annotations. You can think of an annotation as a special kind of field that annotates its container.

Here is an example of what a field-level annotation looks like:

{"@type": "cr:RecordSet", "@id": "images",
  "field": [
    { "@type": "cr:Field", "@id": "images/image", ... ,
      "annotation": {
        "@type": "cr:Field", "@id": "images/label", 
        "dataType": ["sc:Text", "cr:Label"]
      }
    }
  ]
}

In this example, the annotation "images/label" applies to the field "images/image".
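To make the containment relationship concrete, here is a minimal Python sketch of how a consumer might pair each field-level annotation with the field it annotates. It works on plain dicts only and does not use any Croissant library API. The helper names `as_list` and `field_annotations` are hypothetical, the elided properties (`...`) are simply left out, and accepting either a single annotation object or a list of them is an assumption that the proposal does not spell out.

```python
import json

# The RecordSet from the example above, using the proposed (strawman)
# "annotation" property. Elided properties ("...") are omitted.
record_set = json.loads("""
{
  "@type": "cr:RecordSet", "@id": "images",
  "field": [
    {
      "@type": "cr:Field", "@id": "images/image",
      "annotation": {
        "@type": "cr:Field", "@id": "images/label",
        "dataType": ["sc:Text", "cr:Label"]
      }
    }
  ]
}
""")


def as_list(value):
    """Treat a missing value as empty and a single object as a one-item list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def field_annotations(rs):
    """Return (annotation @id, annotated field @id) pairs for one RecordSet."""
    pairs = []
    for field in as_list(rs.get("field")):
        # Assumption: "annotation" may hold one object or a list of objects.
        for ann in as_list(field.get("annotation")):
            pairs.append((ann["@id"], field["@id"]))
    return pairs


print(field_annotations(record_set))
# [('images/label', 'images/image')]
```

Because the annotation is nested inside the field it describes, the target can be read off the container and no separate reference property is needed.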

Annotations can also appear at the level of a RecordSet. A RecordSet-level annotation applies to the entire record. For example:

{
  "@type": "cr:RecordSet",
  "@id": "movies",
  "field": [
    { "@type": "cr:Field", "@id": "movies/movie_id", ...},
    { "@type": "cr:Field", "@id": "movies/title", ...},
    { "@type": "cr:Field", "@id": "movies/genre", ...}
  ],
  "annotation" : {
    "@type": "cr:Field", "@id": "movies/ratings", 
    "subField": [
      { "@type": "cr:Field", "@id": "movies/ratings/user_id", ...},
      { "@type": "cr:Field", "@id": "movies/ratings/rating", ...}
    ]
  }
}

In this example, ratings is a structured annotation that contains a user_id and a rating.
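As a rough illustration of how a structured, RecordSet-level annotation could be read, the Python sketch below lists the leaf fields under each annotation attached to a RecordSet. It operates on plain dicts rather than any Croissant library API. The helper names `as_list`, `leaf_ids`, and `record_set_annotations` are hypothetical, and treating `annotation` as either a single object or a list is an assumption.

```python
# The "movies" RecordSet from the example above, written as a Python dict.
# Elided properties ("...") are omitted; "annotation" is the proposed
# (strawman) property, not an established part of the format.
movies = {
    "@type": "cr:RecordSet",
    "@id": "movies",
    "field": [
        {"@type": "cr:Field", "@id": "movies/movie_id"},
        {"@type": "cr:Field", "@id": "movies/title"},
        {"@type": "cr:Field", "@id": "movies/genre"},
    ],
    "annotation": {
        "@type": "cr:Field",
        "@id": "movies/ratings",
        "subField": [
            {"@type": "cr:Field", "@id": "movies/ratings/user_id"},
            {"@type": "cr:Field", "@id": "movies/ratings/rating"},
        ],
    },
}


def as_list(value):
    """Treat a missing value as empty and a single object as a one-item list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def leaf_ids(field):
    """Collect the @id of every leaf field under a (possibly nested) field."""
    subs = as_list(field.get("subField"))
    if not subs:
        return [field["@id"]]
    ids = []
    for sub in subs:
        ids.extend(leaf_ids(sub))
    return ids


def record_set_annotations(rs):
    """Map each RecordSet-level annotation @id to its leaf field @ids."""
    return {ann["@id"]: leaf_ids(ann) for ann in as_list(rs.get("annotation"))}


print(record_set_annotations(movies))
# {'movies/ratings': ['movies/ratings/user_id', 'movies/ratings/rating']}
```

The regular fields (`movie_id`, `title`, `genre`) are not touched here: under the proposal, the annotation sits beside the `field` list and applies to the record as a whole.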

omshinde commented 3 weeks ago

Some examples of NetCDF files for hierarchical data annotation -