mlcommons / croissant

Croissant is a high-level format for machine learning datasets that brings together four rich layers.
https://mlcommons.org/croissant
Apache License 2.0

[NeurIPS] Compatibility with HuggingFace Image dataset #689

Closed zwcolin closed 4 months ago

zwcolin commented 5 months ago

Hi,

We currently host our data on Hugging Face, organized into parquet files so that it works well with the Hugging Face viewer. However, when an image dataset is built for Hugging Face datasets, the image is encoded in a nested manner, i.e.,

{
  image: {'bytes': ..., 'path': ...},
  Attr2: ...,
  Attr3: ...,
  ...
}

While other attributes work very well with the framework by calling extract=mlc.Extract(column="Attrx"), the image cannot be processed with extract=mlc.Extract(column="image") alone, because that returns a dict with the keys bytes and path, and the value of bytes is all we need. The current framework does not seem to support transforming a dict further by extracting the value of a key (I didn't find such a function). Is there a good workaround for this?

(Practically, I could just create another column holding a copy of the image bytes in a form that is compatible with Croissant, but that would double the file size, which is not desirable.)

zwcolin commented 4 months ago

Resolved by adding the following line:

transforms=[mlc.Transform(json_path="bytes")],
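
For readers who land here with the same problem, the effect of that transform can be pictured with a plain-Python sketch. It is only an illustration of selecting the bytes key from the nested cell, not mlcroissant's actual implementation. The helper select_key and the sample row values are hypothetical, and it is an assumption that the transforms line sits next to the extract argument in the field's source.

```python
# Illustrative sketch only: plain Python, not mlcroissant's implementation.
from typing import Any


def select_key(cell: dict[str, Any], key: str) -> Any:
    """Return the value stored under `key` in a nested cell (hypothetical helper)."""
    return cell[key]


# A row shaped like the one in the question; the values are made up.
row = {
    "image": {"bytes": b"\x89PNG\r\n", "path": "0001.png"},
    "Attr2": "cat",
}

# Extracting the column alone yields the whole dict with `bytes` and `path` ...
cell = row["image"]

# ... while selecting the "bytes" key yields only the raw image bytes,
# which is what the json_path="bytes" transform is used for in the thread.
image_bytes = select_key(cell, "bytes")

print(type(cell).__name__, type(image_bytes).__name__)
```

This avoids storing a second copy of the image bytes in a separate column, which was the workaround the original question wanted to avoid.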