[NeurIPS] Compatibility with HuggingFace Image dataset

Hi,

We currently host our data on Hugging Face, organized into Parquet files so that it works well with the Hugging Face dataset viewer. However, when an image dataset is built with Hugging Face Datasets, each image is encoded in a nested structure, i.e.,

{
  image: {'bytes': ..., 'path': ...},
  Attr2: ...,
  Attr3: ...,
  ...
}

While the other attributes work well with the framework via extract=mlc.Extract(column="Attrx"), the image cannot be processed with just extract=mlc.Extract(column="image"): that returns a dict with the keys bytes and path, and only the value of bytes is needed. As far as I can tell, the framework currently has no way to transform a dict further by extracting the value of one of its keys (I couldn't find such a function). Is there a good workaround for this?

(In practice I could create another column holding a copy of the image bytes in a form compatible with Croissant, but that would double the file size, which is undesirable.)
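
For reference, one possible stopgap is to unwrap the nested value in plain Python after the records have been loaded, rather than inside the Croissant metadata. The sketch below is only an illustration under assumptions: it assumes each record arrives as a Python dict whose "image" entry is the {'bytes': ..., 'path': ...} dict described above, the helper name flatten_image_bytes is hypothetical, and the sample records are made-up stand-ins. It does not use or modify any Croissant API.

```python
from __future__ import annotations

from typing import Any, Iterable, Iterator


def flatten_image_bytes(
    records: Iterable[dict[str, Any]], column: str = "image"
) -> Iterator[dict[str, Any]]:
    """Yield each record with the nested image dict replaced by its raw bytes.

    Hypothetical helper: assumes the value under `column` is a dict with a
    "bytes" key, as in the nested layout described in this issue.
    """
    for record in records:
        value = record.get(column)
        if isinstance(value, dict) and "bytes" in value:
            # Build a shallow copy so the caller's record is left untouched.
            record = {**record, column: value["bytes"]}
        yield record


# Stand-in records mimicking the nested layout (values are invented).
sample = [
    {"image": {"bytes": b"\x89PNG", "path": "0.png"}, "Attr2": 1},
    {"image": {"bytes": b"\xff\xd8", "path": None}, "Attr2": 2},
]
flattened = list(flatten_image_bytes(sample))
```

This avoids storing a second copy of the bytes in the Parquet files, but it moves the unwrapping into user code, so it does not answer whether the extraction can be expressed in the Croissant metadata itself.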

mlcommons / croissant, issue #689