We currently host our data on Hugging Face, organized into parquet files so that it works well with the Hugging Face dataset viewer. However, when an image dataset is built for Hugging Face `datasets`, the image column is encoded in a nested manner, i.e., each cell is a dict rather than the raw image bytes.

Other attributes work very well with the framework via `extract=mlc.Extract(column="Attrx")`, but the image cannot be processed with just `extract=mlc.Extract(column="image")`, because that returns a dict with the keys `bytes` and `path`, and the value of `bytes` is all we need. As far as I can tell, the current framework has no way to transform a dict further by extracting the value of one key (I didn't find such a function). Is there a good workaround for this?

(In practice I could create another column holding a copy of the image bytes in a form compatible with croissant, but that would roughly double the file size, which is not desirable.)
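
To make the shape concrete, here is a minimal sketch in plain Python of what one row looks like and the operation I am after. The row values are placeholders, and `take_key` is a hypothetical helper written only for illustration; it is not a croissant function.

```python
# Illustrative row as it comes out of the parquet file (placeholder values).
row = {
    "Attrx": 42,
    "image": {"bytes": b"\x89PNG placeholder", "path": "0001.png"},
}


def take_key(record, column, key):
    """Hypothetical helper: read `column`, then pull one key out of the dict it holds."""
    cell = record[column]
    if not isinstance(cell, dict):
        raise TypeError(f"column {column!r} holds {type(cell).__name__}, not a dict")
    return cell[key]


# Extracting the column alone yields the whole dict; the second step is what is missing.
image_bytes = take_key(row, "image", "bytes")
```

What I am looking for is a way to express that second step (dict to value of `bytes`) inside the croissant source definition, without duplicating the data in the file.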