[Neurips]: any support for large h5 files? tried various encodings but no luck.

Thanks for reaching out!

That is correct: at the moment, Croissant does not support h5 files. You cannot define FileSets from h5 groups within an h5 file, or have fields whose source is an h5 dataset.

Support for h5 files could be added in a future version of the Croissant format spec.

In the meantime, we suggest creating a Croissant dataset that specifies the dataset-level information and the resources, while omitting any RecordSets that would need data from h5 files. I would use the MIME type application/x-hdf5 as the encodingFormat of the corresponding FileObject definitions. For example:

{
  "name": "my-h5-dataset",
  "license": "...",
  "description": "...",
  "distribution": [
    {
      "@type": "cr:FileObject",
      "@id": "data",
      "name": "data",
      "contentUrl": "data/my_h5_file.h5",
      "encodingFormat": "application/x-hdf5",
      "sha256": "..."
    },
    ...
  ]
}

This would allow tools that only need this metadata to support your dataset already (e.g., to index it or download the raw data). It would also signal to the Croissant contributors how important h5 support is, and give them an example dataset that uses h5 for testing implementations once h5 support is added.
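
If it helps, here is a minimal sketch of how such a resources-only definition could be generated with the Python standard library. It does not use the mlcroissant package; the helper names (sha256_of_file, h5_file_object, dataset_metadata, render) are hypothetical, and the output mirrors the abbreviated example above, so any fields that example leaves out are left out here as well. The file is hashed in chunks so that a large h5 file is never read into memory at once.

```python
import hashlib
import json
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so a large h5 file never sits fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def h5_file_object(path, content_url):
    """Build one FileObject entry shaped like the example above (hypothetical helper).

    Assumption: the file name without its extension is used as @id and name.
    """
    stem = Path(path).stem
    return {
        "@type": "cr:FileObject",
        "@id": stem,
        "name": stem,
        "contentUrl": content_url,
        "encodingFormat": "application/x-hdf5",
        "sha256": sha256_of_file(path),
    }


def dataset_metadata(name, license_, description, h5_files):
    """Assemble dataset-level info and resources only; no RecordSets are emitted.

    h5_files is a list of (local_path, content_url) pairs.
    """
    return {
        "name": name,
        "license": license_,
        "description": description,
        "distribution": [h5_file_object(p, url) for p, url in h5_files],
    }


def render(metadata):
    """Serialize the metadata dict as indented JSON text."""
    return json.dumps(metadata, indent=2)
```

Calling render(dataset_metadata("my-h5-dataset", "...", "...", [("data/my_h5_file.h5", "data/my_h5_file.h5")])) would produce JSON with the same shape as the example, with the sha256 value filled in from the local file.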

Please let us know if you run into problems with such an incomplete Croissant definition, and we will look into it.

mlcommons / croissant #697