Open severo opened 2 years ago
Do you have any idea why this can occur, @huggingface/datasets? The dataset consists of a single Parquet file.
Thanks for reporting @severo.
I'm not able to reproduce that error. I get instead:
FileNotFoundError: [Errno 2] No such file or directory: 'orix/data/ChiSig/εεδΉ-9-3.jpg'
Which pyarrow version are you using? Mine is 6.0.1.
OK, I now get your error when not streaming.
OK!
If it's useful, the pyarrow version is 7.0.0:
Apparently, there is something weird with that Parquet file: its schema is:
images: extension<arrow.py_extension_type<pyarrow.lib.UnknownExtensionType>>
I have forced the right schema:
from datasets import Features, Image, load_dataset
features = Features({"images": Image()})
ds = load_dataset("parquet", split="train", data_files="train-00000-of-00001.parquet", features=features)
and then recreated a new Parquet file:
ds.to_parquet("train.parquet")
Now this Parquet file has the right schema:
images: struct<bytes: binary, path: string>
child 0, bytes: binary
child 1, path: string
and can be loaded normally:
In [26]: ds = load_dataset("parquet", split="train", data_files="dataset.parquet")
In [27]: ds
Out[27]:
Dataset({
    features: ['images'],
    num_rows: 20
})
Link
https://huggingface.co/datasets/darragh/demo_data_raw3
Description
reported by @NielsRogge
Owner
No