huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Dataset slice splits can't load training and validation at the same time #6320

Closed timlac closed 10 months ago

timlac commented 11 months ago

Describe the bug

According to the documentation, it should be possible to run the following command:

train_test_ds = datasets.load_dataset("bookcorpus", split="train+test")

to load the train and test sets from the dataset.

However, executing the equivalent code:

speech_commands_v1 = load_dataset("superb", "ks", split="train+test")

only yields the following output:

Dataset({
    features: ['file', 'audio', 'label'],
    num_rows: 54175
})

Where loading the dataset without the split argument yields:

DatasetDict({
    train: Dataset({
        features: ['file', 'audio', 'label'],
        num_rows: 51094
    })
    validation: Dataset({
        features: ['file', 'audio', 'label'],
        num_rows: 6798
    })
    test: Dataset({
        features: ['file', 'audio', 'label'],
        num_rows: 3081
    })
})

Thus, the API seems to be broken in this regard.

This is a bit annoying, since I want to be able to use the split argument with split="train[:10%]+test[:10%]" to get a smaller dataset to work with while validating that my model works correctly.

Steps to reproduce the bug

speech_commands_v1 = load_dataset("superb", "ks", split="train+test")

Expected behavior

DatasetDict({
    train: Dataset({
        features: ['file', 'audio', 'label'],
        num_rows: 51094
    })
    test: Dataset({
        features: ['file', 'audio', 'label'],
        num_rows: 3081
    })
})

Environment info

import datasets
print(datasets.__version__)

2.14.5

import sys
print(sys.version)

3.9.17 (main, Jul 5 2023, 20:41:20) [GCC 11.2.0]

mariosasko commented 11 months ago

The expression "train+test" concatenates the splits.

To get the individual splits as separate datasets, pass a list of split names:

train_ds, test_ds = load_dataset("<dataset_name>", split=["train", "test"])
train_10pct_ds, test_10pct_ds = load_dataset("<dataset_name>", split=["train[:10%]", "test[:10%]"])