mlcommons / croissant

Croissant is a high-level format for machine learning datasets that brings together four rich layers.
https://mlcommons.org/croissant
Apache License 2.0

[NEURIPS] Hosted Editor doesn't allow nested Fields in RecordSets #675

Open francois-rd opened 1 month ago

francois-rd commented 1 month ago

Using the editor app hosted on HuggingFace (https://huggingface.co/spaces/MLCommons/croissant-editor), I'm trying to add a RecordSet to represent a nested JSON structure.

The format specification (https://docs.mlcommons.org/croissant/docs/croissant-spec.html#recordsets) seems to suggest that nested fields are possible, but the editor does not seem to support a nested data type (see image).

I considered using a 'join' to another record set to build the nested data, but my understanding is that 'join' is meant to cross-link files. My dataset consists of standalone (not cross-linked) files in JSON Lines format, each containing a series of nested JSON structures (one per instance).

[Image: Screen Shot 2024-06-05 at 11 16 30]

super-dainiu commented 1 month ago

I got the same issue.

benjelloun commented 1 month ago

Indeed the Croissant editor does not support nested fields yet.

You can export the JSON-LD for your dataset and add the nested fields manually.
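
As a rough illustration of that manual step, here is a minimal sketch that uses only the standard `json` module. It assumes that nested fields are written as a parent `cr:Field` with a `subField` list, and that each leaf reads its value through `source.extract.jsonPath`, which is my reading of the RecordSets section of the spec and should be checked against it. The helper `make_nested_field`, the IDs `examples`, `examples/author`, and `data-jsonl`, and the JSON paths are all hypothetical.

```python
import json


def make_nested_field(parent_id, file_object_id, leaves):
    """Build a parent Field whose subField entries each read one JSON path.

    `leaves` maps a sub-field name to a (dataType, jsonPath) pair.
    The property names follow my reading of the spec (an assumption).
    """
    return {
        "@type": "cr:Field",
        "@id": parent_id,
        "name": parent_id,
        "subField": [
            {
                "@type": "cr:Field",
                "@id": f"{parent_id}/{name}",
                "name": f"{parent_id}/{name}",
                "dataType": data_type,
                "source": {
                    "fileObject": {"@id": file_object_id},
                    "extract": {"jsonPath": json_path},
                },
            }
            for name, (data_type, json_path) in leaves.items()
        ],
    }


# Stand-in for the JSON-LD exported from the editor; in practice this
# would come from json.load() on the exported file.
metadata = {
    "@type": "sc:Dataset",
    "name": "example",
    "recordSet": [{"@type": "cr:RecordSet", "@id": "examples", "field": []}],
}

# Hypothetical nested structure: each JSON Lines record has an "author" object.
nested = make_nested_field(
    "examples/author",
    "data-jsonl",
    {"name": ("sc:Text", "$.author.name"), "age": ("sc:Integer", "$.author.age")},
)
metadata["recordSet"][0]["field"].append(nested)

print(json.dumps(nested, indent=2))
```

Generating the fields from a small mapping like this keeps the hand-written part short even when the nested structure has many leaves.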

The mlcroissant Python library can then be used to validate your Croissant file.
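
A minimal sketch of that validation step, assuming `mlcroissant` is installed (`pip install mlcroissant`) and that loading a file with `mlc.Dataset(jsonld=...)` raises `mlc.ValidationError` when the metadata is invalid. The wrapper `validate_croissant` and the file name in the usage comment are hypothetical.

```python
def validate_croissant(jsonld_path):
    """Return a list of validation messages; empty if the file loads cleanly."""
    # Imported lazily so this snippet can be defined without the library installed.
    import mlcroissant as mlc

    try:
        # Loading the dataset parses and validates the JSON-LD.
        mlc.Dataset(jsonld=jsonld_path)
    except mlc.ValidationError as err:
        return [str(err)]
    return []


# Example usage (hypothetical file name):
#   problems = validate_croissant("croissant.json")
#   print(problems or "No validation errors reported.")
```

Running this after each manual edit gives quick feedback on whether the hand-added fields are still well-formed.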

francois-rd commented 1 month ago

Are there any plans to expand the capabilities of the editor? My dataset has a fairly complex structure and the prospect of having to manually create a file with several hundred lines of esoteric machine-friendly metadata is daunting to say the least...