huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Faulty datasets.exceptions.ExpectedMoreSplitsError #7282

Open meg-huggingface opened 2 weeks ago

Describe the bug

I am trying to download only the 'validation' split of my dataset, but the call fails with datasets.exceptions.ExpectedMoreSplitsError. This appears to be the same undesired behavior as reported in #6939, but triggered by data_files rather than data_dir.

Here is the Traceback:

Traceback (most recent call last):
  File "/home/user/app/app.py", line 12, in <module>
    ds = load_dataset('datacomp/imagenet-1k-random0.0', token=GATED_IMAGENET, data_files={'validation': 'data/val*'}, split='validation', trust_remote_code=True)
  File "/usr/local/lib/python3.10/site-packages/datasets/load.py", line 2154, in load_dataset
    builder_instance.download_and_prepare(
  File "/usr/local/lib/python3.10/site-packages/datasets/builder.py", line 924, in download_and_prepare
    self._download_and_prepare(
  File "/usr/local/lib/python3.10/site-packages/datasets/builder.py", line 1018, in _download_and_prepare
    verify_splits(self.info.splits, split_dict)
  File "/usr/local/lib/python3.10/site-packages/datasets/utils/info_utils.py", line 68, in verify_splits
    raise ExpectedMoreSplitsError(str(set(expected_splits) - set(recorded_splits)))
datasets.exceptions.ExpectedMoreSplitsError: {'train', 'test'}
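The last two frames of the traceback show that the error comes from a set difference between the expected and the recorded splits. Below is a minimal, self-contained sketch of that check, not the library's actual code: the exception class and `verify_splits` are simplified stand-ins, and the three expected split names are an assumption inferred from the error message.

```python
class ExpectedMoreSplitsError(Exception):
    """Stand-in for datasets.exceptions.ExpectedMoreSplitsError (simplified)."""


def verify_splits(expected_splits, recorded_splits):
    # Same set difference as in the traceback: every split that the dataset
    # metadata lists but this run did not produce is reported as missing.
    missing = set(expected_splits) - set(recorded_splits)
    if missing:
        raise ExpectedMoreSplitsError(str(missing))


# Assumption: the dataset metadata lists three splits, while the run with
# data_files={'validation': ...} only records 'validation'.
expected = ["train", "validation", "test"]
recorded = ["validation"]

try:
    verify_splits(expected, recorded)
except ExpectedMoreSplitsError as err:
    print("missing splits:", err)
```

Under these assumptions, the check fails whenever fewer splits are prepared than the metadata lists, even though only one split was requested.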

Note: I am passing the data_files argument only to restrict the download to the 'validation' split. With split='validation' alone, the whole dataset is still downloaded; you also have to specify data_files, as described here: https://discuss.huggingface.co/t/how-can-i-download-a-specific-split-of-a-dataset/79027
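A possible stopgap, sketched here and not verified against this dataset, is to keep the same arguments and additionally pass `verification_mode="no_checks"`, which should skip the split verification that raises the error. The helper names `build_load_kwargs` and `load_validation_only` are hypothetical, as is reading the token from an environment variable.

```python
def build_load_kwargs(token=None):
    """Arguments from the report, plus verification_mode (an assumed workaround)."""
    return {
        "path": "datacomp/imagenet-1k-random0.0",
        "data_files": {"validation": "data/val*"},
        "split": "validation",
        "token": token,
        "trust_remote_code": True,
        # Assumption: disabling checks bypasses the expected-vs-recorded
        # split comparison seen in the traceback.
        "verification_mode": "no_checks",
    }


def load_validation_only(token=None):
    # Imported lazily so that the argument helper above has no dependencies.
    from datasets import load_dataset

    return load_dataset(**build_load_kwargs(token=token))
```

In app.py this could be called as `load_validation_only(token=GATED_IMAGENET)`. It only sidesteps the check and does not address the underlying behavior.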

Steps to reproduce the bug

  1. Create a Space with the default blank 'gradio' SDK https://huggingface.co/new-space
  2. Create a file 'app.py' that loads only the 'validation' split of a dataset:

from datasets import load_dataset
ds = load_dataset('datacomp/imagenet-1k-random0.0', token=GATED_IMAGENET, data_files={'validation': 'data/val*'}, split='validation', trust_remote_code=True)

Expected behavior

Only the 'validation' split is downloaded, and no error is raised.

Environment info

The default environment of a newly created Space. The parts relevant to this bug are:

FROM docker.io/library/python:3.10@sha256:fd0fa50d997eb56ce560c6e5ca6a1f5cf8fdff87572a16ac07fb1f5ca01eb608

--> RUN pip install --no-cache-dir pip==22.3.1 && pip install --no-cache-dir datasets "huggingface-hub>=0.19" "hf-transfer>=0.1.4" "protobuf<4" "click<8.1"