huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0
19.26k stars 2.7k forks

fsspec.exceptions.FSTimeoutError when downloading dataset #7164

Open timonmerk opened 1 month ago

timonmerk commented 1 month ago

Describe the bug

I am trying to download the librispeech_asr clean dataset, which results in an FSTimeoutError after downloading around 61% of the data.

Steps to reproduce the bug

import datasets
datasets.load_dataset("librispeech_asr", "clean")

The output is as follows:

Downloading data:  61%|██████████████▋   | 3.92G/6.39G [05:00<03:06, 13.2MB/s]
Traceback (most recent call last):
  File "/Users/Timon/Documents/iEEG_deeplearning/wav2vec_pretrain/.venv/lib/python3.12/site-packages/fsspec/asyn.py", line 56, in _runner
    result[0] = await coro
                ^^^^^^^^^^
  File "/Users/Timon/Documents/iEEG_deeplearning/wav2vec_pretrain/.venv/lib/python3.12/site-packages/fsspec/implementations/http.py", line 262, in _get_file
    chunk = await r.content.read(chunk_size)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/Timon/Documents/iEEG_deeplearning/wav2vec_pretrain/.venv/lib/python3.12/site-packages/aiohttp/streams.py", line 393, in read
    await self._wait("read")
  File "/Users/Timon/Documents/iEEG_deeplearning/wav2vec_pretrain/.venv/lib/python3.12/site-packages/aiohttp/streams.py", line 311, in _wait
    with self._timer:
         ^^^^^^^^^^^
  File "/Users/Timon/Documents/iEEG_deeplearning/wav2vec_pretrain/.venv/lib/python3.12/site-packages/aiohttp/helpers.py", line 713, in __exit__
    raise asyncio.TimeoutError from None
TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/Timon/Documents/iEEG_deeplearning/wav2vec_pretrain/load_dataset.py", line 3, in <module>
    datasets.load_dataset("librispeech_asr", "clean")
  File "/Users/Timon/Documents/iEEG_deeplearning/wav2vec_pretrain/.venv/lib/python3.12/site-packages/datasets/load.py", line 2096, in load_dataset
    builder_instance.download_and_prepare(
  File "/Users/Timon/Documents/iEEG_deeplearning/wav2vec_pretrain/.venv/lib/python3.12/site-packages/datasets/builder.py", line 924, in download_and_prepare
    self._download_and_prepare(
  File "/Users/Timon/Documents/iEEG_deeplearning/wav2vec_pretrain/.venv/lib/python3.12/site-packages/datasets/builder.py", line 1647, in _download_and_prepare
    super()._download_and_prepare(
  File "/Users/Timon/Documents/iEEG_deeplearning/wav2vec_pretrain/.venv/lib/python3.12/site-packages/datasets/builder.py", line 977, in _download_and_prepare
    split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/Timon/.cache/huggingface/modules/datasets_modules/datasets/librispeech_asr/2712a8f82f0d20807a56faadcd08734f9bdd24c850bb118ba21ff33ebff0432f/librispeech_asr.py", line 115, in _split_generators
    archive_path = dl_manager.download(_DL_URLS[self.config.name])
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/Timon/Documents/iEEG_deeplearning/wav2vec_pretrain/.venv/lib/python3.12/site-packages/datasets/download/download_manager.py", line 159, in download
    downloaded_path_or_paths = map_nested(
                               ^^^^^^^^^^^
  File "/Users/Timon/Documents/iEEG_deeplearning/wav2vec_pretrain/.venv/lib/python3.12/site-packages/datasets/utils/py_utils.py", line 512, in map_nested
    _single_map_nested((function, obj, batched, batch_size, types, None, True, None))
  File "/Users/Timon/Documents/iEEG_deeplearning/wav2vec_pretrain/.venv/lib/python3.12/site-packages/datasets/utils/py_utils.py", line 380, in _single_map_nested
    return [mapped_item for batch in iter_batched(data_struct, batch_size) for mapped_item in function(batch)]
                                                                                              ^^^^^^^^^^^^^^^
  File "/Users/Timon/Documents/iEEG_deeplearning/wav2vec_pretrain/.venv/lib/python3.12/site-packages/datasets/download/download_manager.py", line 216, in _download_batched
    self._download_single(url_or_filename, download_config=download_config)
  File "/Users/Timon/Documents/iEEG_deeplearning/wav2vec_pretrain/.venv/lib/python3.12/site-packages/datasets/download/download_manager.py", line 225, in _download_single
    out = cached_path(url_or_filename, download_config=download_config)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/Timon/Documents/iEEG_deeplearning/wav2vec_pretrain/.venv/lib/python3.12/site-packages/datasets/utils/file_utils.py", line 205, in cached_path
    output_path = get_from_cache(
                  ^^^^^^^^^^^^^^^
  File "/Users/Timon/Documents/iEEG_deeplearning/wav2vec_pretrain/.venv/lib/python3.12/site-packages/datasets/utils/file_utils.py", line 415, in get_from_cache
    fsspec_get(url, temp_file, storage_options=storage_options, desc=download_desc, disable_tqdm=disable_tqdm)
  File "/Users/Timon/Documents/iEEG_deeplearning/wav2vec_pretrain/.venv/lib/python3.12/site-packages/datasets/utils/file_utils.py", line 334, in fsspec_get
    fs.get_file(path, temp_file.name, callback=callback)
  File "/Users/Timon/Documents/iEEG_deeplearning/wav2vec_pretrain/.venv/lib/python3.12/site-packages/fsspec/asyn.py", line 118, in wrapper
    return sync(self.loop, func, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/Timon/Documents/iEEG_deeplearning/wav2vec_pretrain/.venv/lib/python3.12/site-packages/fsspec/asyn.py", line 101, in sync
    raise FSTimeoutError from return_result
fsspec.exceptions.FSTimeoutError
Downloading data:  61%|██████████████▋   | 3.92G/6.39G [05:00<03:09, 13.0MB/s]

Expected behavior

Complete the download

Environment info

Python version 3.12.6

Dependencies:

dependencies = [
    "accelerate>=0.34.2",
    "datasets[audio]>=3.0.0",
    "ipython>=8.18.1",
    "librosa>=0.10.2.post1",
    "torch>=2.4.1",
    "torchaudio>=2.4.1",
    "transformers>=4.44.2",
]

MacOS 14.6.1 (23G93)

lhoestq commented 1 month ago

Hi! If you check the dataset loading script here, you'll see that it downloads the data from OpenSLR, and apparently their storage has timeout issues. It would be great to ultimately host the dataset on Hugging Face instead.

In the meantime I can only recommend to try again later :/
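Until the data is rehosted, one way to "try again later" automatically is to wrap the call in a retry loop. This is only a sketch; the `retry_on_timeout` helper and its parameters are made up here, not part of datasets, and depending on how downloads are cached a retry may restart the file from scratch:

```python
import time
from fsspec.exceptions import FSTimeoutError

def retry_on_timeout(fn, max_attempts=5, wait=60):
    """Call fn(), retrying on FSTimeoutError with a fixed wait between attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except FSTimeoutError:
            # Re-raise on the last attempt, otherwise back off and retry.
            if attempt == max_attempts:
                raise
            time.sleep(wait)
```

Usage would then look like `retry_on_timeout(lambda: datasets.load_dataset("librispeech_asr", "clean"))`.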

timonmerk commented 1 month ago

Ok, still many thanks!

Epiphero commented 3 weeks ago

I'm also getting this same error but for CSTR-Edinburgh/vctk, so I don't think it's the remote host that's timing out, since I also time out at exactly 5 minutes. It seems there is a universal fsspec timeout that's getting hit starting in v3.
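The exact 5-minute cutoff is consistent with aiohttp's default client timeout, which an HTTP session created without explicit `client_kwargs` would fall back to (this points at aiohttp's default, not a verified trace through the datasets code path):

```python
import aiohttp.client

# Sessions created without an explicit timeout use aiohttp's default
# ClientTimeout(total=5 * 60), i.e. the whole transfer aborts after 300 s.
print(aiohttp.client.DEFAULT_TIMEOUT)
```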

lhoestq commented 3 weeks ago

In v3 we cleaned up the download parts of the library to make HF downloads more robust and to simplify support of script-based datasets. As a side effect, other hosts no longer go through the same code, so timeout handling may have changed. Anyway, it should be possible to tweak fsspec to use retries.

For example using aiohttp_retry maybe (haven't tried) ?

import fsspec
from aiohttp_retry import RetryClient

fsspec.filesystem("http")._session = RetryClient()

related topic : https://github.com/huggingface/datasets/issues/7175

JonasLoos commented 3 weeks ago

Adding a timeout argument to the fs.get_file call in fsspec_get in datasets/utils/file_utils.py might fix this (source code):

fs.get_file(path, temp_file.name, callback=callback, timeout=3600)

Setting timeout=1 fails after about one second, so setting it to 3600 should give us 1h. Haven't really tested this though. I'm also not sure what implications this has, or whether it causes errors for other fs implementations/configurations.

This is using datasets==3.0.1 and Python 3.11.6.


Edit: This doesn't seem to change the timeout time, but adds a second timeout counter (probably in fsspec/asyn.py/sync). So one can reduce the time allowed for downloading like this, but not extend it.


Edit 2: fs is of type fsspec.implementations.http.HTTPFileSystem, which initializes an aiohttp.ClientSession using client_kwargs. We can pass these when calling load_dataset.

TLDR; This fixes it:

import datasets
import aiohttp

dataset = datasets.load_dataset(
    dataset_name,  # e.g. "librispeech_asr"
    # Raise aiohttp's total timeout from the 5 min default to 1 h:
    storage_options={'client_kwargs': {'timeout': aiohttp.ClientTimeout(total=3600)}}
)