huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0
19.29k stars, 2.7k forks

JSON lines with missing struct fields raise TypeError: Couldn't cast array #7159

Closed (albertvillanova closed this issue 2 months ago)

albertvillanova commented 2 months ago

JSON lines with missing struct fields raise TypeError: Couldn't cast array of type.

See example: https://huggingface.co/datasets/wikimedia/structured-wikipedia/discussions/5

One would expect the missing struct fields to be added with null values.
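To make the expected behavior concrete, here is a minimal pure-Python sketch of "add the missing struct fields with null values". It is not how `datasets` or Arrow implements the cast. The helper `fill_missing_fields` and the nested-dict schema format are hypothetical and exist only for this illustration; the field names come from the error reported in the linked discussion.

```python
from typing import Any


def fill_missing_fields(value: Any, schema: Any) -> Any:
    """Return a copy of `value` in which every field named in `schema` is present.

    `schema` is a hypothetical description invented for this sketch:
    a dict maps field names to sub-schemas, a one-element list describes
    the items of a list, and None marks a leaf value.
    Keys that are not listed in `schema` are dropped.
    """
    if isinstance(schema, dict):
        if value is None:
            return None  # a null struct stays null
        # Missing keys yield None via dict.get, i.e. a null field.
        return {key: fill_missing_fields(value.get(key), sub) for key, sub in schema.items()}
    if isinstance(schema, list):
        if value is None:
            return None  # a null list stays null
        return [fill_missing_fields(item, schema[0]) for item in value]
    return value


# Struct layout taken from the error message; all four fields are leaves.
image_schema = {"content_url": None, "width": None, "height": None, "alternative_text": None}

# A row that lacks `alternative_text`, as some JSON lines do.
row = {"content_url": "https://example.org/a.png", "width": 640, "height": 480}
filled = fill_missing_fields(row, image_schema)
```

After this call, `filled` has the same three values as `row` plus `alternative_text` set to `None`, which is the shape one would expect the loader to produce on its own.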

Aremaki commented 1 month ago

Hello,

I still have the same issue when loading the dataset with the new version: https://huggingface.co/datasets/wikimedia/structured-wikipedia/discussions/5

I downloaded and unzipped the wikimedia/structured-wikipedia dataset locally, but loading it raises the same error.

import datasets

dataset = datasets.load_dataset("/gpfsdsdir/dataset/HuggingFace/wikimedia/structured-wikipedia/20240916.fr")
TypeError: Couldn't cast array of type
struct<content_url: string, width: int64, height: int64, alternative_text: string>
to
{'content_url': Value(dtype='string', id=None), 'width': Value(dtype='int64', id=None), 'height': Value(dtype='int64', id=None)}

The above exception was the direct cause of the following exception:

My version of datasets is 3.0.1.
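In case it helps with narrowing this down, below is a small standard-library sketch for finding which nested fields are absent from some JSON lines. It does not use `datasets` at all. The function names `field_paths` and `inconsistent_fields` are made up for this example, and the top-level key `image` in the sample rows is an assumption; only the struct fields themselves appear in the error above.

```python
import json
from collections import Counter
from typing import Any, Dict, Iterable, Set


def field_paths(value: Any, prefix: str = "") -> Set[str]:
    """Collect dotted paths of every dict key in `value`; list items add '[]'."""
    paths: Set[str] = set()
    if isinstance(value, dict):
        for key, sub in value.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.add(path)
            paths |= field_paths(sub, path)
    elif isinstance(value, list):
        for item in value:
            paths |= field_paths(item, prefix + "[]")
    return paths


def inconsistent_fields(lines: Iterable[str]) -> Dict[str, int]:
    """Map each path that is missing from at least one record to the
    number of records that do contain it."""
    counts: Counter = Counter()
    total = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        total += 1
        # A set per record, so each path counts at most once per record.
        counts.update(field_paths(json.loads(line)))
    return {path: n for path, n in counts.items() if n < total}


# Hypothetical sample rows; the key "image" is assumed for illustration.
sample = [
    '{"image": {"content_url": "a", "width": 1, "height": 2}}',
    '{"image": {"content_url": "b", "width": 3, "height": 4, "alternative_text": "x"}}',
]
report = inconsistent_fields(sample)
```

On a real file, passing the open file object (for example `inconsistent_fields(open(path, encoding="utf-8"))`, with `path` pointing at one of the unzipped JSON Lines files) would list the fields that appear in only some rows, such as `alternative_text` here.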