huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Image Encoding Issue when submitting a Parquet Dataset #5869

Closed PhilippeMoussalli closed 1 year ago

PhilippeMoussalli commented 1 year ago

Describe the bug

Hello,

I'd like to report an issue related to pushing a dataset represented as a Parquet file to a dataset repository using Dask. Here are the details:

We attempted to load an example dataset in Parquet format from the Hugging Face (HF) filesystem using Dask with the following code snippet:

import dask.dataframe as dd
df = dd.read_parquet("hf://datasets/lambdalabs/pokemon-blip-captions", index=False)

In this dataset, the "image" column is represented as a dictionary/struct with the format:

df = df.compute()
df["image"].iloc[0].keys()
-> dict_keys(['bytes', 'path'])

I think this is the format the Image feature from datasets encodes images into so that they can be stored in Arrow.
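For reference, here is a minimal sketch of how I believe that encoding happens (using the Image feature's encode_example method):

from datasets import Image
import PIL.Image

# Encode a freshly created PIL image the same way datasets does before
# storing it in an Arrow/Parquet column
encoded = Image().encode_example(PIL.Image.new("RGB", (8, 8)))
print(sorted(encoded))  # ['bytes', 'path']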

The next step was to push the dataset to a repository that I created:

dd.to_parquet(df, path="hf://datasets/philippemo/dummy_dataset/data")

However, after pushing the dataset using Dask, the "image" column is now represented as the encoded dictionary (['bytes', 'path']), and the images are not properly visualized. You can find the dataset here: Link to the problematic dataset.

It's worth noting that both the original dataset and the one pushed with Dask have the same schema, apart from minor differences in metadata:

Schema of the original dummy example:

image: struct<bytes: binary, path: null>
  child 0, bytes: binary
  child 1, path: null
text: string

Schema of the dataset pushed with Dask:

image: struct<bytes: binary, path: null>
  child 0, bytes: binary
  child 1, path: null
text: string

This issue seems to be related to an encoding step that occurs when pushing a dataset to the Hub. Normally, data would be represented as an HF dataset before pushing, but we are working on a use case where we need to push large datasets using Dask.
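For reference, the usual flow would look something like the sketch below (with a hypothetical target repo), where the datasets library takes care of the image metadata, but this isn't practical for the large datasets we need to push with Dask:

from datasets import load_dataset

# Load the source dataset and push it through the datasets API,
# which writes the Hugging Face schema metadata automatically
ds = load_dataset("lambdalabs/pokemon-blip-captions", split="train")
ds.push_to_hub("philippemo/dummy_dataset")  # hypothetical target repo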

Could you please provide clarification on how to resolve this issue?

Thank you!

Reproduction

To get the schema, I downloaded the Parquet files and used pyarrow.parquet to read it:

import pyarrow.parquet
pyarrow.parquet.read_schema(<path_to_parquet>, memory_map=True)
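For example, one way to fetch a single Parquet file from the Hub and inspect its schema (a sketch; the shard filename is an assumption based on Dask's default naming):

import pyarrow.parquet as pq
from huggingface_hub import hf_hub_download

# Download one Parquet shard of the pushed dataset and read its schema
path = hf_hub_download(
    repo_id="philippemo/dummy_dataset",
    filename="data/part.0.parquet",  # assumed shard name
    repo_type="dataset",
)
print(pq.read_schema(path, memory_map=True))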

Logs

No response

System info

- huggingface_hub version: 0.14.1
- Platform: Linux-5.19.0-41-generic-x86_64-with-glibc2.35
- Python version: 3.10.6
- Running in iPython ?: No
- Running in notebook ?: No
- Running in Google Colab ?: No
- Token path ?: /home/philippe/.cache/huggingface/token
- Has saved token ?: True
- Who am I ?: philippemo
- Configured git credential helpers: cache
- FastAI: N/A
- Tensorflow: N/A
- Torch: N/A
- Jinja2: 3.1.2
- Graphviz: N/A
- Pydot: N/A
- Pillow: 9.4.0
- hf_transfer: N/A
- gradio: N/A
- ENDPOINT: https://huggingface.co
- HUGGINGFACE_HUB_CACHE: /home/philippe/.cache/huggingface/hub
- HUGGINGFACE_ASSETS_CACHE: /home/philippe/.cache/huggingface/assets
- HF_TOKEN_PATH: /home/philippe/.cache/huggingface/token
- HF_HUB_OFFLINE: False
- HF_HUB_DISABLE_TELEMETRY: False
- HF_HUB_DISABLE_PROGRESS_BARS: None
- HF_HUB_DISABLE_SYMLINKS_WARNING: False
- HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
- HF_HUB_DISABLE_IMPLICIT_TOKEN: False
- HF_HUB_ENABLE_HF_TRANSFER: False
Wauplin commented 1 year ago

Hi @PhilippeMoussalli thanks for opening a detailed issue. It seems the issue is more related to the datasets library so I'll ping @lhoestq @mariosasko on this one :)

(edit: also can one of you move the issue to the datasets repo? Thanks in advance 🙏)

lhoestq commented 1 year ago

Hi! The Image() info is stored in the schema metadata. More precisely, there should be a "huggingface" field in the schema metadata that contains the datasets feature type of each column.

To fix your issue, you can use the same schema as the original Parquet files to write the new ones. You can also get the schema with metadata from a Features object, e.g.

from datasets import Features, Image, Value

features = Features({"image": Image(), "text": Value("string")})
schema = features.arrow_schema
print(schema.metadata)
# {b'huggingface': b'{"info": {"features": {"image": {"_type": "Image"}, "text": {"dtype": "string", "_type": "Value"}}}}'}
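That schema can then be passed to the Dask writer so the metadata is carried over to the new files (a sketch; df is the Dask dataframe from your snippet):

import dask.dataframe as dd

# Write the Parquet files with the HF schema (and its metadata) attached
dd.to_parquet(df, "hf://datasets/philippemo/dummy_dataset/data", schema=schema)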
lhoestq commented 1 year ago

It appears that the parquet files at hf://datasets/lambdalabs/pokemon-blip-captions don't have this metadata, and it is defined in the dataset_infos.json instead (legacy).

You can get the right schema with the HF metadata this way:

from datasets import load_dataset_builder

features = load_dataset_builder("lambdalabs/pokemon-blip-captions").info.features
schema = features.arrow_schema
lhoestq commented 1 year ago

Btw in the future we might add support for a dedicated Image extension type in Arrow so that you won't need to add the schema metadata anymore ;)

PhilippeMoussalli commented 1 year ago

Thanks @Wauplin @lhoestq for the quick reply :)!

I tried your approach of passing the huggingface schema to the dask writer:

import dask.dataframe as dd
from datasets import Features, Image, Value

df = dd.read_parquet("hf://datasets/lambdalabs/pokemon-blip-captions", index=False)
features = Features({"image": Image(), "text": Value("string")})
schema = features.arrow_schema
dd.to_parquet(df, path="hf://datasets/philippemo/dummy_dataset/data", schema=schema)

At first it didn't work, as I was not able to visualize the images, so I then manually added the dataset_infos.json from the example dataset and it worked :)

However, it's not ideal, since that file contains metadata that has to be computed in order to load the data properly, such as num_of_bytes and num_examples, which might be unknown in my use case.

[Screenshot from 2023-05-16 16-54-55]

Do you have any pointers there? You mentioned that dataset_infos.json will be deprecated/legacy. Could you point me to some example image datasets on the Hub that are stored as Parquet and don't have the dataset_infos.json?

lhoestq commented 1 year ago

You don't need the dataset_infos.json file as long as you have the schema with HF metadata ;) I was also able to check myself that it works fine on the git revision without the dataset_infos.json file.

What made you think it didn't work?

PhilippeMoussalli commented 1 year ago

> You don't need the dataset_infos.json file as long as you have the schema with HF metadata ;) I was also able to check myself that it works fine on the git revision without the dataset_infos.json file.
>
> What made you think it didn't work?

Those are two otherwise identical dataset repos, both pushed with Dask using the schema you mentioned above. I then manually uploaded the dataset_infos.json taken from the original example dataset into one of them.

You can see that in the example without it the images fail to render properly: when loaded with datasets they return a dict and not a Pillow Image.
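A quick way to see the difference (a sketch using the repos above; the split name is an assumption):

from datasets import load_dataset

# In the repo missing the HF metadata this prints <class 'dict'>;
# with the metadata present it is a PIL image instead
ds = load_dataset("philippemo/dummy_dataset_without_schema", split="train")
print(type(ds[0]["image"]))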

lhoestq commented 1 year ago

I see! I think it's a bug on our side - it should work without the metadata - let me investigate

lhoestq commented 1 year ago

Alright, it's fixed: https://huggingface.co/datasets/philippemo/dummy_dataset_without_schema

It shows the image correctly now - even without the extra metadata :)

PhilippeMoussalli commented 1 year ago

Thanks @lhoestq! I tested pushing a dataset again without the metadata and it works perfectly! I appreciate the help

PhilippeMoussalli commented 1 year ago

Hi @lhoestq,

I've tried pushing another dataset and I think the issue has reappeared:

import datasets
import dask.dataframe as dd

df = dd.read_parquet("hf://datasets/lambdalabs/pokemon-blip-captions")
features = datasets.Features({"image": datasets.Image(), "text": datasets.Value("string")})
schema = features.arrow_schema
dd.to_parquet(df, path="hf://datasets/philippemo/dummy_dataset_without_schema_12_06/data", schema=schema)

Here is the dataset:
https://huggingface.co/datasets/philippemo/dummy_dataset_without_schema_12_06

The one that was working 2 weeks ago still seems to be intact: https://huggingface.co/datasets/philippemo/dummy_dataset_without_schema. It might be that it rendered properly when it was initially submitted, and something was reverted on your side afterwards.

It's weird because nothing really changed in the implementation, so it might be another issue in the Hub backend. Do you have any pointers on how to resolve this?

lhoestq commented 1 year ago

We're making some changes to the way we handle image Parquet datasets right now. We'll include the fix from https://github.com/huggingface/datasets/pull/5921 in the new datasets-server version in the coming days.

PhilippeMoussalli commented 1 year ago

Alright, thanks for the update :). Would that be part of the new release of datasets, or is it something separate? If so, where can I track it?

lhoestq commented 1 year ago

Once the new version of datasets is released (tomorrow probably) we'll open an issue on https://github.com/huggingface/datasets-server to update to this version :)

lhoestq commented 1 year ago

Alright we did the update :) This is fixed for good now

PhilippeMoussalli commented 1 year ago

Yes thanks 🎉🎉🎉