>>> from datasets import StreamingDownloadManager, load_dataset_builder
>>> builder = load_dataset_builder(path="blog_authorship_corpus")
Downloading builder script: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5.60k/5.60k [00:00<00:00, 23.1MB/s]
Downloading metadata: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.81k/2.81k [00:00<00:00, 14.7MB/s]
Downloading readme: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7.30k/7.30k [00:00<00:00, 30.8MB/s]
>>> dl_manager = StreamingDownloadManager(base_path=builder.base_path)
>>> builder._split_generators(dl_manager)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/slesage/.cache/huggingface/modules/datasets_modules/datasets/blog_authorship_corpus/6f5d78241afd8313111956f877a57db7a0e9fc6718255dc85df0928197feb683/blog_authorship_corpus.py", line 79, in _split_generators
    data = dl_manager.download_and_extract(_DATA_URL)
  File "/home/slesage/hf/datasets-server/services/worker/.venv/lib/python3.9/site-packages/datasets/download/streaming_download_manager.py", line 1087, in download_and_extract
    return self.extract(self.download(url_or_urls))
  File "/home/slesage/hf/datasets-server/services/worker/.venv/lib/python3.9/site-packages/datasets/download/streaming_download_manager.py", line 1039, in extract
    urlpaths = map_nested(self._extract, url_or_urls, map_tuple=True)
  File "/home/slesage/hf/datasets-server/services/worker/.venv/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 435, in map_nested
    return function(data_struct)
  File "/home/slesage/hf/datasets-server/services/worker/.venv/lib/python3.9/site-packages/datasets/download/streaming_download_manager.py", line 1044, in _extract
    protocol = _get_extraction_protocol(urlpath, use_auth_token=self.download_config.use_auth_token)
  File "/home/slesage/hf/datasets-server/services/worker/.venv/lib/python3.9/site-packages/datasets/download/streaming_download_manager.py", line 433, in _get_extraction_protocol
    with fsspec.open(urlpath, **kwargs) as f:
  File "/home/slesage/hf/datasets-server/services/worker/.venv/lib/python3.9/site-packages/fsspec/core.py", line 439, in open
    return open_files(
  File "/home/slesage/hf/datasets-server/services/worker/.venv/lib/python3.9/site-packages/fsspec/core.py", line 194, in __getitem__
    out = super().__getitem__(item)
IndexError: list index out of range
Describe the bug
The dataset https://huggingface.co/datasets/blog_authorship_corpus has a problem with its hosting platform: https://drive.google.com/u/0/uc?id=1cGy4RNDV87ZHEXbiozABr9gsSrZpPaPz&export=download returns a 404 error.
But when trying to generate the split names, we get an exception which is not correctly caught: a bare IndexError surfaces from fsspec instead of an error raised by datasets.
Seen originally in https://github.com/huggingface/datasets-server/blob/adbdcd6710ffed4e2eb2e4cd905b5e0dff530a15/services/worker/src/worker/job_runners/config/parquet_and_info.py#L435
Steps to reproduce the bug
Run the interpreter session shown at the top of this report.
Expected behavior
We should have an Exception raised by the datasets library.
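Until the library raises its own exception here, a caller-side workaround could look roughly like the sketch below. The names `SplitGenerationError` and `safe_split_generators` are hypothetical, not part of datasets; the only assumption is that the builder exposes `_split_generators(dl_manager)` as in the session above.

```python
class SplitGenerationError(Exception):
    """Hypothetical wrapper error; NOT part of the datasets library."""


def safe_split_generators(builder, dl_manager):
    """Call builder._split_generators and re-raise low-level errors with context.

    IndexError is what fsspec surfaces in this report when the data URL
    is unreachable; FileNotFoundError is included as a plausible sibling.
    """
    try:
        return builder._split_generators(dl_manager)
    except (IndexError, FileNotFoundError) as err:
        raise SplitGenerationError(
            f"Could not generate splits for {type(builder).__name__}: "
            "a data file may be unreachable"
        ) from err
```

The original exception stays available as `__cause__`, so the full traceback is not lost. This only hides the symptom on the caller side; the fix that is actually wanted belongs in datasets itself.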
Environment info
datasets version: 2.12.0