huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0
18.73k stars 2.59k forks source link

Uncaught exception when generating the splits from a dataset that miss data #5926

Open severo opened 1 year ago

severo commented 1 year ago

Describe the bug

Dataset https://huggingface.co/datasets/blog_authorship_corpus has an issue with its hosting platform, since https://drive.google.com/u/0/uc?id=1cGy4RNDV87ZHEXbiozABr9gsSrZpPaPz&export=download returns 404 error.

But when trying to generate the split names, we get an exception which is now correctly caught.

Seen originally in https://github.com/huggingface/datasets-server/blob/adbdcd6710ffed4e2eb2e4cd905b5e0dff530a15/services/worker/src/worker/job_runners/config/parquet_and_info.py#L435

Steps to reproduce the bug

>>> from datasets import StreamingDownloadManager, load_dataset_builder
>>> builder = load_dataset_builder(path="blog_authorship_corpus")
Downloading builder script: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5.60k/5.60k [00:00<00:00, 23.1MB/s]
Downloading metadata: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.81k/2.81k [00:00<00:00, 14.7MB/s]
Downloading readme: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7.30k/7.30k [00:00<00:00, 30.8MB/s]
>>> dl_manager = StreamingDownloadManager(base_path=builder.base_path)
>>> builder._split_generators(dl_manager)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/slesage/.cache/huggingface/modules/datasets_modules/datasets/blog_authorship_corpus/6f5d78241afd8313111956f877a57db7a0e9fc6718255dc85df0928197feb683/blog_authorship_corpus.py", line 79, in _split_generators
    data = dl_manager.download_and_extract(_DATA_URL)
  File "/home/slesage/hf/datasets-server/services/worker/.venv/lib/python3.9/site-packages/datasets/download/streaming_download_manager.py", line 1087, in download_and_extract
    return self.extract(self.download(url_or_urls))
  File "/home/slesage/hf/datasets-server/services/worker/.venv/lib/python3.9/site-packages/datasets/download/streaming_download_manager.py", line 1039, in extract
    urlpaths = map_nested(self._extract, url_or_urls, map_tuple=True)
  File "/home/slesage/hf/datasets-server/services/worker/.venv/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 435, in map_nested
    return function(data_struct)
  File "/home/slesage/hf/datasets-server/services/worker/.venv/lib/python3.9/site-packages/datasets/download/streaming_download_manager.py", line 1044, in _extract
    protocol = _get_extraction_protocol(urlpath, use_auth_token=self.download_config.use_auth_token)
  File "/home/slesage/hf/datasets-server/services/worker/.venv/lib/python3.9/site-packages/datasets/download/streaming_download_manager.py", line 433, in _get_extraction_protocol
    with fsspec.open(urlpath, **kwargs) as f:
  File "/home/slesage/hf/datasets-server/services/worker/.venv/lib/python3.9/site-packages/fsspec/core.py", line 439, in open
    return open_files(
  File "/home/slesage/hf/datasets-server/services/worker/.venv/lib/python3.9/site-packages/fsspec/core.py", line 194, in __getitem__
    out = super().__getitem__(item)
IndexError: list index out of range

Expected behavior

We should have an Exception raised by the datasets library.

Environment info

albertvillanova commented 1 year ago

Thanks for reporting, @severo.

This is a known issue with fsspec: