mlcommons / croissant

Croissant is a high-level format for machine learning datasets that brings together four rich layers.
https://mlcommons.org/croissant
Apache License 2.0

[NeurIPS] Support for (Sequential) Dataset Stored in TDMS Files #696

Open junzeliu opened 3 months ago

junzeliu commented 3 months ago

Hi,

I found that sequential/signal data stored in TDMS files is not supported in the Data types (code). The AudioObject seems to be designed for MP3 files. So, should I choose AudioObject?

Thank you!


Related to #690 and #371

pierrot0 commented 3 months ago

Thanks for reaching out!

That is correct. At the moment, the Croissant spec does not support TDMS files: one cannot refer to TDMS concepts (properties, channel groups, and channels) to define Croissant FileSets or RecordSets.

Support for TDMS files could be added in a future version of the Croissant format spec.

In the meantime, we suggest creating a Croissant dataset that specifies the dataset-level information and the resources, while omitting any RecordSets that would need data coming from TDMS files. I would use the MIME type application/x-tdms for the encodingFormat of your corresponding FileObject definitions. For example:

{
  "name": "my-tdms-dataset",
  "license": "...",
  "description": "...",
  "distribution": [
    {
      "@type": "cr:FileObject",
      "@id": "data",
      "name": "data",
      "contentUrl": "data/my_tdms_file.tdms",
      "encodingFormat": "application/x-tdms",
      "sha256": "..."
    },
    ...
  ]
}
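
For readers who would rather generate such an entry than type it by hand, here is a minimal Python sketch using only the standard library. It fills in the same fields as the example above and computes the sha256 value from the file. The function names (sha256_of_file, tdms_file_object, build_metadata) are hypothetical helpers, not part of any Croissant tool, and a complete Croissant file would likely need further fields (such as the JSON-LD @context) that the example above leaves out.

```python
import hashlib
import json


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tdms_file_object(path, content_url, object_id):
    """Build one FileObject entry shaped like the example in this thread.

    Hypothetical helper: the field set mirrors the example only.
    """
    return {
        "@type": "cr:FileObject",
        "@id": object_id,
        "name": object_id,
        "contentUrl": content_url,
        "encodingFormat": "application/x-tdms",
        "sha256": sha256_of_file(path),
    }


def build_metadata(name, license_, description, file_objects):
    """Dataset-level information plus resources only; no RecordSets."""
    return {
        "name": name,
        "license": license_,
        "description": description,
        "distribution": list(file_objects),
    }


def write_metadata(metadata, out_path):
    """Save the metadata dictionary as an indented JSON file."""
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(metadata, f, indent=2)
```

Calling tdms_file_object("data/my_tdms_file.tdms", "data/my_tdms_file.tdms", "data"), passing the result to build_metadata, and then to write_metadata would produce a file with the same shape as the example, with the real checksum in place of the "..." placeholder.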

This would allow tools that can work with such metadata alone to support your dataset already (e.g., indexing the dataset or downloading the raw data). It would also signal to the Croissant contributors how important TDMS support is, and provide an example TDMS dataset for testing implementations once TDMS support is added to the spec and to the various tools.

Please let us know if you run into problems defining such an incomplete Croissant definition, and we will look into it.

junzeliu commented 3 months ago

Thank you very much, @pierrot0. If I'm understanding correctly, I need to manually enter this metadata in the JSON file, which then serves as the Croissant metadata file, right? So far, the easiest way I have found to obtain a Croissant metadata file is to upload my dataset to a Hugging Face repository and use the Croissant Editor hosted by Hugging Face to produce the metadata. So, anyone can actually just enter this information manually and save it as a JSON file, which will function as Croissant metadata?

I know I might have asked too many questions. Many thanks in advance :)

pierrot0 commented 3 months ago

The procedure you are describing might indeed work. You can enter the metadata in the JSON file manually, but you don't have to. As you describe yourself, you might be able to use the Croissant editor hosted on Hugging Face to create and edit the Croissant config.
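
Whichever route is used to produce the file, a quick sanity check can catch missing fields before sharing it. The following is a minimal sketch, assuming the file follows the shape of the example earlier in this thread. The key lists and the check_metadata function are hypothetical and only mirror that example; they are not the requirements of the Croissant spec and do not replace the official validation tools.

```python
import json

# Fields taken from the example in this thread (an assumption, not the spec).
EXAMPLE_DATASET_KEYS = ("name", "license", "description", "distribution")
EXAMPLE_FILE_KEYS = (
    "@type", "@id", "name", "contentUrl", "encodingFormat", "sha256",
)


def check_metadata(metadata):
    """Return a list of human-readable problems; empty if none were found."""
    problems = []
    for key in EXAMPLE_DATASET_KEYS:
        if key not in metadata:
            problems.append(f"missing dataset field: {key}")

    distribution = metadata.get("distribution", [])
    if not isinstance(distribution, list):
        problems.append("distribution is not a list")
        return problems

    for i, entry in enumerate(distribution):
        if not isinstance(entry, dict):
            problems.append(f"distribution[{i}] is not an object")
            continue
        for key in EXAMPLE_FILE_KEYS:
            if key not in entry:
                problems.append(f"distribution[{i}] missing field: {key}")
    return problems


def check_metadata_file(path):
    """Load a JSON metadata file from disk and run check_metadata on it."""
    with open(path, encoding="utf-8") as f:
        return check_metadata(json.load(f))
```

An empty list from check_metadata_file means only that the fields shown in the example are present; it says nothing about whether the values are correct.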