mlcommons / croissant

Croissant is a high-level format for machine learning datasets that brings together four rich layers.
https://mlcommons.org/croissant
Apache License 2.0

[NeurIPS] Data type extension (Video and multichannel time series) #690

Open pollytur opened 3 weeks ago

pollytur commented 3 weeks ago

It seems that videos are not supported in the Data types (here it is only ImageObject, while rdflib.namespace.SDO has a separate VideoObject).

Also, I have a custom data type: technically, it is a multidimensional time series (n channels $\times$ timepoints), so rdflib.namespace.SDO.ListItem would probably be the best fit for it. For now, is it fine to use AudioObject for it?

Thanks a lot in advance!


Related to #371

pierrot0 commented 3 weeks ago

Thanks for reaching out!

Is the bug about the Croissant spec or about the mlcroissant python library?

One should be able to describe a dataset containing videos using Croissant, similar to what is done in https://github.com/mlcommons/croissant/blob/main/datasets/1.0/audio_test/metadata.json (replacing sc:AudioObject with sc:VideoObject and audio/mpeg with video/mpeg, for example).

It is, however, possible that libraries (including the mlcroissant library) do not support videos at the moment.
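As a rough illustration of that substitution, here is a minimal Python sketch. The fragment below is a hypothetical, trimmed-down stand-in modeled on the audio example (it omits `@context` and other required properties, and the ids and paths are made up), so it is not a valid Croissant file on its own. It only shows the two replacements mentioned above being applied to a JSON-LD structure.

```python
import json

# Hypothetical fragment loosely modeled on the audio_test example.
# It is NOT a complete Croissant file: @context, name, etc. are omitted.
AUDIO_FRAGMENT = {
    "distribution": [
        {
            "@type": "cr:FileSet",
            "@id": "files",
            "encodingFormat": "audio/mpeg",
            "includes": "data/*.mp3",
        }
    ],
    "recordSet": [
        {
            "@type": "cr:RecordSet",
            "@id": "records",
            "field": [
                {
                    "@type": "cr:Field",
                    "@id": "records/clip",
                    "dataType": "sc:AudioObject",
                    "source": {
                        "fileSet": {"@id": "files"},
                        "extract": {"fileProperty": "content"},
                    },
                }
            ],
        }
    ],
}

# The two substitutions suggested in the comment above.
SWAPS = {"sc:AudioObject": "sc:VideoObject", "audio/mpeg": "video/mpeg"}


def swap_values(node):
    """Return a copy of a JSON-like structure with string values swapped."""
    if isinstance(node, dict):
        return {key: swap_values(value) for key, value in node.items()}
    if isinstance(node, list):
        return [swap_values(item) for item in node]
    if isinstance(node, str):
        return SWAPS.get(node, node)
    return node


VIDEO_FRAGMENT = swap_values(AUDIO_FRAGMENT)
# The file glob is not covered by the swap; this path is an assumption.
VIDEO_FRAGMENT["distribution"][0]["includes"] = "data/*.mpg"

print(json.dumps(VIDEO_FRAGMENT, indent=2))
```

Whether a given tool can actually load the resulting video records is a separate question from whether the metadata can be written this way.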

As in https://github.com/mlcommons/croissant/issues/696, when the data is stored in a file format that is not supported, we advise creating a Croissant dataset that specifies the dataset-level information and the resources, while omitting RecordSets that contain data stored in files with an unsupported format.

This would unblock you and allow tools that can work with only such metadata to support your dataset already (e.g., index the dataset, download raw data), while signaling to the Croissant contributors which formats to support first, in the spec and/or the various implementations.

Please let us know if there are problems with defining such an incomplete Croissant definition, and we will look into it.
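A minimal sketch of such a metadata-only description, written as a Python dict that could be dumped to JSON-LD. The dataset name, description, URL, and file paths are all hypothetical, and the `@context` block that a real Croissant file needs is left out for brevity. The point is only that `distribution` is present and `recordSet` is absent.

```python
import json


def metadata_only_dataset(name, description, url, distribution):
    """Build dataset-level metadata plus resources, with no RecordSets."""
    return {
        # A real file also needs the Croissant "@context" block here.
        "@type": "sc:Dataset",
        "name": name,
        "description": description,
        "conformsTo": "http://mlcommons.org/croissant/1.0",
        "url": url,
        "distribution": distribution,
        # "recordSet" is deliberately omitted for the unsupported format.
    }


# All values below are placeholders for illustration.
dataset = metadata_only_dataset(
    name="multichannel_recordings",
    description="Videos with aligned multichannel time series.",
    url="https://example.org/multichannel_recordings",
    distribution=[
        {
            "@type": "cr:FileSet",
            "@id": "videos",
            "name": "videos",
            "encodingFormat": "video/mpeg",
            "includes": "videos/*.mpg",
        }
    ],
)

print(json.dumps(dataset, indent=2))
```

RecordSets can be added later, once the spec or an implementation supports the format in question.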

pierrot0 commented 3 weeks ago

OK, I see that the checker raises an error in case of an unknown MIME type. We should extend that list and add a flag to allow unknown MIME types; we'll try to add that shortly.

pierrot0 commented 3 weeks ago

OK, so I ran validation (e.g., mlcroissant validate --jsonld ../../datasets/1.0/titanic/metadata.json) on a Croissant config containing an unknown encoding format, and it did not raise an error (neither the one at https://github.com/mlcommons/croissant/blob/0f95e04763557929e4f4c6711c108c0d9cf7b818/python/mlcroissant/mlcroissant/_src/operation_graph/operations/read.py#L136-L139 nor any other).

Looking at the code, it seems to me that validation should work fine. Do you have a command line that reproduces a failure to validate a Croissant file due to an unknown encodingFormat?
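For anyone trying to put together such a reproduction, a small Python wrapper around the command quoted above might look like the sketch below. It assumes the `mlcroissant` command-line tool is installed and on the PATH; the helper names are made up for this example, and only the `validate --jsonld` invocation is taken from this thread.

```python
import shutil
import subprocess


def validate_command(jsonld_path):
    """Build the argv for the validation command quoted in this thread."""
    return ["mlcroissant", "validate", "--jsonld", str(jsonld_path)]


def run_validation(jsonld_path):
    """Run validation and return the completed process, or None if the
    mlcroissant CLI is not available on this machine."""
    if shutil.which("mlcroissant") is None:
        return None
    return subprocess.run(
        validate_command(jsonld_path),
        capture_output=True,
        text=True,
        check=False,  # a failing validation is a result, not an exception
    )


if __name__ == "__main__":
    # Placeholder path: point this at the metadata file under test.
    result = run_validation("metadata.json")
    if result is None:
        print("mlcroissant CLI not found")
    else:
        print("exit code:", result.returncode)
        print(result.stdout or result.stderr)
```

Posting the exact command, the metadata file, and the captured output makes it easier to tell whether the error comes from validation or from a later loading step.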