Hi @PhilippeMoussalli, thanks for opening a detailed issue. It seems the issue is more related to the `datasets` library, so I'll ping @lhoestq @mariosasko on this one :)
(edit: also, can one of you move the issue to the datasets repo? Thanks in advance 🙏)
Hi! The `Image()` info is stored in the schema metadata. More precisely, there should be a "huggingface" field in the schema metadata that contains the `datasets` feature type of each column.
To fix your issue, you can use the same schema as the original Parquet files to write the new ones. You can also get the schema with metadata from a `Features` object, e.g.
```python
from datasets import Features, Image, Value

features = Features({"image": Image(), "text": Value("string")})
schema = features.arrow_schema
print(schema.metadata)
# {b'huggingface': b'{"info": {"features": {"image": {"_type": "Image"}, "text": {"dtype": "string", "_type": "Value"}}}}'}
```
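For illustration, a minimal sketch (not from the original thread, using a tiny in-memory table with placeholder values) of writing a Parquet file with this annotated schema via pyarrow, so the `huggingface` metadata ends up in the file:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Build a small table conforming to the HF-annotated schema;
# `schema` is the `features.arrow_schema` from the snippet above.
table = pa.Table.from_pydict(
    {"image": [{"bytes": b"...", "path": "img.png"}], "text": ["a caption"]},
    schema=schema,
)
pq.write_table(table, "data.parquet")  # schema.metadata is written along with the data
```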
It appears that the parquet files at `hf://datasets/lambdalabs/pokemon-blip-captions` don't have this metadata; it is defined in the `dataset_infos.json` instead (legacy).
You can get the right schema with the HF metadata this way:
```python
from datasets import load_dataset_builder

features = load_dataset_builder("lambdalabs/pokemon-blip-captions").info.features
schema = features.arrow_schema
```
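To double-check that this builder-derived schema actually carries the HF annotations (a quick sanity check, added here for illustration):

```python
# The feature info should be stored under the b'huggingface' metadata key
assert b"huggingface" in schema.metadata
print(schema.metadata[b"huggingface"])
```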
Btw, in the future we might add support for a dedicated Image extension type in Arrow, so that you won't need to add the schema metadata anymore ;)
Thanks @Wauplin @lhoestq for the quick reply :)!
I tried your approach by passing the `huggingface` schema to the dask writer:
```python
import dask.dataframe as dd  # import added for completeness
from datasets import Features, Image, Value

df = dd.read_parquet("hf://datasets/lambdalabs/pokemon-blip-captions", index=False)
features = Features({"image": Image(), "text": Value("string")})
schema = features.arrow_schema
dd.to_parquet(df, path="hf://datasets/philippemo/dummy_dataset/data", schema=schema)
```
At first it didn't work, as I was not able to visualize the images, so I manually added the `dataset_infos.json` from the example dataset and it worked :)
However, it's not ideal, since some metadata in that file has to be computed in order to load the data properly, such as `num_bytes` and `num_examples`, which might be unknown in my use case.
Do you have any pointers there? You mentioned that `dataset_infos.json` will be deprecated/legacy. Could you point me to some example image datasets on the Hub that are stored as Parquet and don't have a `dataset_infos.json`?
You don't need the `dataset_infos.json` file as long as you have the schema with HF metadata ;) I can also check myself that it works fine on the git revision without the `dataset_infos.json` file.
What made you think it didn't work?
Those are two identical dataset repos, both pushed with dask with the schema you specified above. I then manually uploaded the `dataset_infos.json` taken from the original example dataset into one of them.
You can see that in the example without the schema metadata the images fail to render properly. When loaded with `datasets` they return a dict and not a Pillow Image.
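A quick way to see the symptom (a sketch; the repo name is taken from this thread, and the split name is an assumption):

```python
from datasets import load_dataset

# Without the HF schema metadata (or the legacy dataset_infos.json),
# the image column decodes to a plain dict instead of a PIL image.
ds = load_dataset("philippemo/dummy_dataset", split="train")
print(type(ds[0]["image"]))  # expected: a PIL image; observed: dict
```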
I see! I think it's a bug on our side - it should work without the metadata - let me investigate
Alright, it's fixed: https://huggingface.co/datasets/philippemo/dummy_dataset_without_schema
It shows the image correctly now - even without the extra metadata :)
Thanks @lhoestq! I tested pushing a dataset again without the metadata and it works perfectly! I appreciate the help
Hi @lhoestq,
I've tried pushing another dataset and I think the issue has reappeared:
```python
import dask.dataframe as dd  # imports added for completeness
import datasets

df = dd.read_parquet("hf://datasets/lambdalabs/pokemon-blip-captions")
features = datasets.Features({"image": datasets.Image(), "text": datasets.Value("string")})
schema = features.arrow_schema
dd.to_parquet(df, path="hf://datasets/philippemo/dummy_dataset_without_schema_12_06/data", schema=schema)
```
Here is the dataset:
https://huggingface.co/datasets/philippemo/dummy_dataset_without_schema_12_06
The one that was working two weeks ago still seems to be intact, though; it might be that it rendered properly when it was initially submitted, and something was later reverted on your side:
https://huggingface.co/datasets/philippemo/dummy_dataset_without_schema
It's weird, because nothing really changed in the implementation; it might be another issue in the hub backend. Do you have any pointers on how to resolve this?
We're making some changes to the way we handle image parquet datasets right now. We'll include the fix from https://github.com/huggingface/datasets/pull/5921 in the new datasets-server version in the coming days
Alright, thanks for the update :) Would that be part of the new release of `datasets`, or is it something separate? If so, where can I track it?
Once the new version of `datasets` is released (tomorrow, probably), we'll open an issue on https://github.com/huggingface/datasets-server to update to this version :)
Alright we did the update :) This is fixed for good now
Yes thanks 🎉🎉🎉
Describe the bug
Hello,
I'd like to report an issue related to pushing a dataset represented as a Parquet file to a dataset repository using Dask. Here are the details:
We attempted to load an example dataset in Parquet format from the Hugging Face (HF) filesystem using Dask with the following code snippet:
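(The snippet itself was lost from this copy of the report; a reconstruction based on the commands shared in the thread above:)

```python
import dask.dataframe as dd

# Read the example dataset's parquet files from the HF filesystem
df = dd.read_parquet("hf://datasets/lambdalabs/pokemon-blip-captions", index=False)
```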
In this dataset, the "image" column is represented as a dictionary/struct with the format:
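(The example is missing here; based on the `['bytes', 'path']` keys mentioned below, the struct looks roughly like this, with the actual values elided:)

```python
{"bytes": b"...", "path": "..."}
```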
I think this is the format produced by the `Image` feature from `datasets` to encode images in a format suitable for Arrow. The next step was to push the dataset to a repository that I created:
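(This snippet is also missing from this copy; a reconstruction from the thread above, presumably without an explicit schema at the time of the report:)

```python
# Push the dask dataframe to the dataset repo on the Hub
dd.to_parquet(df, path="hf://datasets/philippemo/dummy_dataset/data")
```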
However, after pushing the dataset using Dask, the "image" column is now represented as the encoded dictionary (`['bytes', 'path']`), and the images are not properly visualized. You can find the dataset here: Link to the problematic dataset. It's worth noting that both the original dataset and the one submitted with Dask have the same schema, with minor alterations related to metadata:
- Schema of original dummy example
- Schema of pushed dataset with dask
This issue seems to be related to an encoding step that occurs when pushing a dataset to the hub. Normally, data should be represented as an HF dataset before pushing, but we are working with a use case where we need to push large datasets using Dask.
Could you please provide clarification on how to resolve this issue?
Thank you!
Reproduction
To get the schema, I downloaded the parquet files and used `pyarrow.parquet` to read the schema.
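A minimal sketch of that check (the local file name is an assumption):

```python
import pyarrow.parquet as pq

schema = pq.read_schema("train.parquet")  # one of the downloaded parquet files
print(schema)
print(schema.metadata)  # the HF feature info, if present, lives under b'huggingface'
```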
Logs
No response
System info