huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Dataset slice splits can't load training and validation at the same time #6320

Closed timlac closed 10 months ago

timlac commented 11 months ago

Describe the bug

According to the documentation, it should be possible to run the following command:

train_test_ds = datasets.load_dataset("bookcorpus", split="train+test")

to load the train and test sets from the dataset.

However, executing the equivalent code:

speech_commands_v1 = load_dataset("superb", "ks", split="train+test")

only yields the following output:

Dataset({
    features: ['file', 'audio', 'label'],
    num_rows: 54175
})

Where loading the dataset without the split argument yields:

DatasetDict({
    train: Dataset({
        features: ['file', 'audio', 'label'],
        num_rows: 51094
    })
    validation: Dataset({
        features: ['file', 'audio', 'label'],
        num_rows: 6798
    })
    test: Dataset({
        features: ['file', 'audio', 'label'],
        num_rows: 3081
    })
})

Thus, the API seems to be broken in this regard.

This is a bit annoying, since I want to be able to use the split argument with split="train[:10%]+test[:10%]" to get a smaller dataset to work with while validating that my model works correctly.

Steps to reproduce the bug

speech_commands_v1 = load_dataset("superb", "ks", split="train+test")

Expected behavior

DatasetDict({
    train: Dataset({
        features: ['file', 'audio', 'label'],
        num_rows: 51094
    })
    test: Dataset({
        features: ['file', 'audio', 'label'],
        num_rows: 3081
    })
})

Environment info

import datasets
print(datasets.__version__)

2.14.5

import sys
print(sys.version)

3.9.17 (main, Jul 5 2023, 20:41:20) [GCC 11.2.0]

mariosasko commented 11 months ago

The expression "train+test" concatenates the splits.

To get the individual splits as separate datasets, pass a list of split names:

train_ds, test_ds = load_dataset("<dataset_name>", split=["train", "test"])
train_10pct_ds, test_10pct_ds = load_dataset("<dataset_name>", split=["train[:10%]", "test[:10%]"])