huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0
19.24k stars · 2.69k forks

WebDataset with different prefixes are unsupported #7055

Closed hlky closed 3 months ago

hlky commented 3 months ago

Describe the bug

Consider a WebDataset with multiple images for each item where the number of images may vary: example

Due to this code, the following error is raised:

The TAR archives of the dataset should be in WebDataset format, but the files in the archive don't share the same prefix or the same types.

The purpose of this check is unclear, because PyArrow supports samples with differing keys.

Removing the check allows the dataset to be loaded and there's no issue when iterating through the dataset.

>>> from datasets import load_dataset
>>> path = "shards/*.tar"
>>> dataset = load_dataset("webdataset", data_files={"train": path}, split="train", streaming=True)
Resolving data files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 152/152 [00:00<00:00, 56458.93it/s]
>>> dataset
IterableDataset({
    features: ['__key__', '__url__', '1.jpg', '2.jpg', '3.jpg', '4.jpg', 'json'],
    n_shards: 152
})
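For context, a WebDataset shard is just a TAR archive in which files sharing a key prefix form one sample. A minimal stdlib-only sketch of building such a shard with a varying number of images per sample (the payload bytes are placeholders, not real JPEG data):

```python
import io
import tarfile

def add_member(tar, name, payload):
    """Add one file with the given name and contents to the tar."""
    info = tarfile.TarInfo(name=name)
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

# Two samples: key "000" has two images, key "001" has three.
samples = {
    "000": {"json": b'{"caption": "first"}', "1.jpg": b"...", "2.jpg": b"..."},
    "001": {"json": b'{"caption": "second"}', "1.jpg": b"...", "2.jpg": b"...", "3.jpg": b"..."},
}

buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for key, files in samples.items():
        for suffix, payload in files.items():
            # WebDataset convention: "<key>.<suffix>" groups files into one sample.
            add_member(tar, f"{key}.{suffix}", payload)

buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as tar:
    names = tar.getnames()
print(names)
```

The check in question rejects such a shard because sample "001" has a `3.jpg` member that sample "000" lacks.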

Steps to reproduce the bug

from datasets import load_dataset
load_dataset("bigdata-pw/fashion-150k")

Expected behavior

The dataset loads without error.

Environment info

lhoestq commented 3 months ago

Since datasets is built on Arrow to store the data, it requires each sample to have the same columns.

This can be fixed by specifying in advance the names of all the possible columns in dataset_info in the YAML; missing values will then be None.
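For the dataset above, the dataset_info section of the README front matter could look something like this (column names taken from the features listed earlier in the thread; the dtypes are an assumption, e.g. image for the jpg columns):

```yaml
dataset_info:
  features:
  - name: json
    dtype: string
  - name: 1.jpg
    dtype: image
  - name: 2.jpg
    dtype: image
  - name: 3.jpg
    dtype: image
  - name: 4.jpg
    dtype: image
```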

hlky commented 3 months ago

Thanks. This currently doesn't work for WebDataset because there's no BuilderConfig with features and in turn _info is missing features=self.config.features. I'll prepare a PR to fix this.

Note it may be useful to add the expected format of features to the documentation for Builder Parameters.

lhoestq commented 3 months ago

Oh good catch! Thanks.

> Note it may be useful to add the expected format of features to the documentation for Builder Parameters.

Good idea, let me open a PR

hlky commented 3 months ago

#7060

lhoestq commented 3 months ago

Actually, I just tried with datasets on the main branch, and having features defined in dataset_info worked for me:

>>> list(load_dataset("/Users/quentinlhoest/tmp", streaming=True, split="train"))
[{'txt': 'hello there\n', 'other': None}]

where tmp contains data.tar with "hello there\n" in a text file and the README.md:

---
dataset_info:
  features:
  - name: txt
    dtype: string
  - name: other
    dtype: string
---

This is a dataset card

What error did you get when you tried to specify the columns in dataset_info ?

hlky commented 3 months ago

If you review the changes in #7060 you'll note that features are not passed to DatasetInfo.

In your case the features are being extracted by this code.

Try with the Steps to reproduce the bug. It's the same error mentioned in Describe the bug because features are not passed to DatasetInfo.

features are not used when the BuilderConfig has no features attribute. WebDataset uses the default BuilderConfig.

There is a warning that features are ignored.

Note that as mentioned in Describe the bug this could also be resolved by removing the check here because Arrow actually handles this itself, Arrow sets any missing fields to None, at least in my case.

hlky commented 3 months ago

Note for anyone else who encounters this issue: every dataset type except the folder-based types supports features in the documented manner, i.e. Arrow, csv, generator, json, pandas, parquet, spark, sql and text. WebDataset is different and requires dataset_info, which is only vaguely documented under dataset loading scripts.

lhoestq commented 3 months ago

Thanks for explaining. I see the Dataset Viewer is still failing - I'll update datasets in the Viewer to fix this.