huggingface / datasets

πŸ€— The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Dataset Viewer issue for darragh/demo_data_raw3 #4840

Open severo opened 2 years ago

severo commented 2 years ago

Link

https://huggingface.co/datasets/darragh/demo_data_raw3

Description

Exception:     ValueError
Message:       Arrow type extension<arrow.py_extension_type<pyarrow.lib.UnknownExtensionType>> does not have a datasets dtype equivalent.

reported by @NielsRogge

Owner: No

severo commented 2 years ago

Do you have an idea of why this can occur, @huggingface/datasets? The dataset consists of a single Parquet file.

albertvillanova commented 2 years ago

Thanks for reporting @severo.

I'm not able to reproduce that error. I get instead:

FileNotFoundError: [Errno 2] No such file or directory: 'orix/data/ChiSig/ε”εˆδΉ-9-3.jpg'

Which pyarrow version are you using? Mine is 6.0.1.

albertvillanova commented 2 years ago

OK, I now get your error when not streaming.

severo commented 2 years ago

OK!

If it's useful, the pyarrow version is 7.0.0:

https://github.com/huggingface/datasets-server/blob/487c39d87998f8d5a35972f1027d6c8e588e622d/services/worker/poetry.lock#L1537-L1543

albertvillanova commented 2 years ago

Apparently, there is something weird with that Parquet file: its schema is:

images: extension<arrow.py_extension_type<pyarrow.lib.UnknownExtensionType>>

I forced the right schema:

from datasets import Features, Image, load_dataset

features = Features({"images": Image()})
ds = load_dataset("parquet", split="train", data_files="train-00000-of-00001.parquet", features=features)

and then recreated the Parquet file:

ds.to_parquet("train.parquet")

Now this Parquet file has the right schema:

images: struct<bytes: binary, path: string>
  child 0, bytes: binary
  child 1, path: string

and can be loaded normally:

In [26]: ds = load_dataset("parquet", split="train", data_files="train.parquet")
In [27]: ds
Out[27]: 
Dataset({
    features: ['images'],
    num_rows: 20
})