mlcommons / croissant

Croissant is a high-level format for machine learning datasets that brings together four rich layers.
https://mlcommons.org/croissant
Apache License 2.0

[Neurips]: any support for large h5 files? tried various encodings but no luck. #697

Open jhirschm opened 3 months ago

jhirschm commented 3 months ago

Our data is organized into groups within an h5 file. It seems the few encoding options available do not support h5?

pierrot0 commented 3 months ago

Thanks for reaching out!

That is correct. At the moment, Croissant does not support h5 files: one cannot define FileSets from h5 groups within an h5 file, or have fields whose source is an h5 dataset.

Support for h5 files could be added in a future version of the Croissant format spec.

In the meantime, we suggest creating a Croissant dataset that specifies the dataset-level information and the resources, while omitting any RecordSets that would need data from h5 files. I would use the MIME type application/x-hdf5 as the encodingFormat of the corresponding FileObject definitions. For example:

{
  "name": "my-h5-dataset",
  "license": "...",
  "description": "...",
  "distribution": [
    {
      "@type": "cr:FileObject",
      "@id": "data",
      "name": "data",
      "contentUrl": "data/my_h5_file.h5",
      "encodingFormat": "application/x-hdf5",
      "sha256": "..."
    },
    ...
  ]
}

This would allow tools that only need such metadata to support your dataset already (e.g., indexing the dataset or downloading the raw data). It would also signal to the Croissant contributors how important h5 support is, and give us an example dataset to test implementations against once h5 support is added.
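For illustration, here is a minimal Python sketch of how such a metadata-only definition could be generated and then consumed. It uses only the standard library, not the mlcroissant package. The helper names (`sha256_of`, `build_metadata`, `hdf5_resources`) are hypothetical, and using the file stem as `@id` and `name` is an assumption made for the example. Like the JSON above, it omits the JSON-LD `@context` and the other top-level keys that a full Croissant file carries.

```python
import hashlib
import json
from pathlib import Path

# MIME type suggested above for h5 FileObjects.
HDF5_MIME = "application/x-hdf5"


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large h5 files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_metadata(name, description, license_, h5_paths):
    """Build dataset-level metadata with one FileObject per h5 file.

    No RecordSets are emitted, since Croissant cannot read h5 content yet.
    """
    distribution = []
    for path in map(Path, h5_paths):
        distribution.append({
            "@type": "cr:FileObject",
            "@id": path.stem,    # assumption: file stem as identifier
            "name": path.stem,   # assumption: same value as @id
            "contentUrl": path.as_posix(),  # path is used as given
            "encodingFormat": HDF5_MIME,
            "sha256": sha256_of(path),
        })
    return {
        "name": name,
        "license": license_,
        "description": description,
        "distribution": distribution,
    }


def hdf5_resources(metadata):
    """Return the contentUrl of every FileObject declared as HDF5."""
    return [
        entry["contentUrl"]
        for entry in metadata.get("distribution", [])
        if entry.get("@type") == "cr:FileObject"
        and entry.get("encodingFormat") == HDF5_MIME
    ]
```

Calling `json.dumps(build_metadata(...), indent=2)` produces a fragment shaped like the example above, and `hdf5_resources` shows the kind of lookup a metadata-only tool (such as a downloader) could perform without reading the h5 content itself.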

Please let us know if you run into problems defining such an incomplete Croissant definition, and we will look into it.