Closed hlky closed 3 months ago
Since `datasets` is built on Arrow to store the data, it requires each sample to have the same columns. This can be fixed by specifying in advance the names of all the possible columns in `dataset_info` in the YAML, and missing values will be `None`.
Thanks. This currently doesn't work for WebDataset because there's no `BuilderConfig` with `features`, and in turn `_info` is missing `features=self.config.features`. I'll prepare a PR to fix this.
Note it may be useful to add the expected format of `features` to the documentation for Builder Parameters.
Oh, good catch! Thanks.
> Note it may be useful to add the expected format of `features` to the documentation for Builder Parameters
Good idea, let me open a PR
Actually I just tried with `datasets` on the `main` branch, and having `features` defined in `dataset_info` worked for me:
```python
>>> list(load_dataset("/Users/quentinlhoest/tmp", streaming=True, split="train"))
[{'txt': 'hello there\n', 'other': None}]
```
where `tmp` contains `data.tar` with `"hello there\n"` in a text file, and this README.md:
```
---
dataset_info:
  features:
  - name: txt
    dtype: string
  - name: other
    dtype: string
---

This is a dataset card
```
What error did you get when you tried to specify the columns in `dataset_info`?
If you review the changes in #7060 you'll note that `features` are not passed to `DatasetInfo`. In your case the features are being extracted by this code.
Try with the Steps to reproduce the bug. It's the same error mentioned in Describe the bug, because `features` are not passed to `DatasetInfo`.
`features` are not used when the `BuilderConfig` has no `features` attribute, and WebDataset uses the default `BuilderConfig`.
There is a warning that `features` are ignored.
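The mechanism described above can be sketched in plain Python. This is a hedged, stdlib-only illustration with stand-in class names (not the actual `datasets` source): user-supplied features only reach the builder when its config class declares a `features` field, and the default config has none.

```python
# Stand-in sketch: why user features are silently dropped for WebDataset.
from dataclasses import dataclass, fields
from typing import Any, Optional

@dataclass
class BuilderConfig:
    # stand-in for the default config that WebDataset uses: no `features` field
    name: str = "default"

@dataclass
class CsvLikeConfig(BuilderConfig):
    # stand-in for configs like csv/json/parquet that do declare `features`
    features: Optional[Any] = None

def accepts_features(config_cls) -> bool:
    """Mimic the check: does this config class declare a `features` field?"""
    return any(f.name == "features" for f in fields(config_cls))

print(accepts_features(BuilderConfig))   # False -> warn and ignore features
print(accepts_features(CsvLikeConfig))   # True  -> features reach _info
```

With the default config the check fails, so the only path left for user-defined schemas is `dataset_info` in the README YAML.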
Note that, as mentioned in Describe the bug, this could also be resolved by removing the check here, because Arrow actually handles this itself: Arrow sets any missing fields to `None`, at least in my case.
Note for anyone else who encounters this issue: every dataset type except the folder-based types supports `features` in the documented manner (Arrow, csv, generator, json, pandas, parquet, spark, sql, and text). WebDataset is different and requires `dataset_info`, which is only vaguely documented under dataset loading scripts.
Thanks for explaining. I see the Dataset Viewer is still failing - I'll update `datasets` in the Viewer to fix this.
Describe the bug
Consider a WebDataset with multiple images for each item, where the number of images may vary: example
Due to this code, an error is raised. The purpose of this check is unclear, because PyArrow supports differing keys. Removing the check allows the dataset to be loaded, and there's no issue when iterating through the dataset.
Steps to reproduce the bug
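The original repro steps were not captured here. A minimal sketch of building such a shard (hypothetical file names; WebDataset groups members by the prefix before the first dot, so `sample0` gets an image key and `sample1` does not) might look like:

```python
import io
import tarfile
import tempfile

def add_member(tar: tarfile.TarFile, name: str, data: bytes) -> None:
    """Add an in-memory file to the tar archive."""
    info = tarfile.TarInfo(name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))

# Build a tiny shard where samples have a varying number of keys
shard_path = tempfile.mktemp(suffix=".tar")
with tarfile.open(shard_path, "w") as tar:
    add_member(tar, "sample0.txt", b"first")
    add_member(tar, "sample0.1.jpg", b"fake image bytes")  # sample0 has an image
    add_member(tar, "sample1.txt", b"second")              # sample1 has none

# Loading this shard, e.g. load_dataset("webdataset", data_files=shard_path),
# would then hit the key-mismatch error described above.
with tarfile.open(shard_path) as tar:
    print(tar.getnames())  # ['sample0.txt', 'sample0.1.jpg', 'sample1.txt']
```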
Expected behavior
Dataset loads without error
Environment info
- `datasets` version: 2.20.0
- `huggingface_hub` version: 0.23.4
- `fsspec` version: 2024.5.0