Open timonmerk opened 1 month ago
Hi ! If you check the dataset loading script here you'll see that it downloads the data from OpenSLR, and apparently their storage has timeout issues. It would be great to ultimately host the dataset on Hugging Face instead.
In the meantime I can only recommend trying again later :/
Ok, still many thanks!
I'm also getting this same error, but for CSTR-Edinburgh/vctk, so I don't think it's the remote host that's timing out, since I also time out at exactly 5 minutes. It seems there is a universal fsspec timeout that's getting hit starting in v3.
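For what it's worth, the five-minute mark matches aiohttp's default client timeout, which fsspec's HTTP filesystem inherits when no explicit timeout is passed. A quick sanity check (note that `DEFAULT_TIMEOUT` is an internal aiohttp constant, not documented public API):

```python
import aiohttp.client

# aiohttp's fallback client timeout is 300 s total, i.e. exactly 5 minutes.
# DEFAULT_TIMEOUT is internal to aiohttp.client, so treat this as a probe only.
print(aiohttp.client.DEFAULT_TIMEOUT.total)  # 300
```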
In v3 we cleaned up the download parts of the library to make them more robust for HF downloads and to simplify support of script-based datasets. As a side effect, it's not the same code that is used for other hosts, and the timeout handling may have changed. Anyway, it should be possible to tweak fsspec to use retries.
For example using aiohttp_retry maybe (haven't tried)?

```python
import fsspec
from aiohttp_retry import RetryClient

# Swap the HTTP filesystem's aiohttp session for one that retries failed requests
fsspec.filesystem("http")._session = RetryClient()
```
related topic : https://github.com/huggingface/datasets/issues/7175
Adding a timeout argument to the fs.get_file call in fsspec_get in datasets/utils/file_utils.py might fix this (source code):

```python
fs.get_file(path, temp_file.name, callback=callback, timeout=3600)
```
Setting timeout=1 fails after about one second, so setting it to 3600 should give us 1 h. I haven't really tested this, though. I'm also not sure what implications it has and whether it causes errors for other fs implementations/configurations.

This is using datasets==3.0.1 and Python 3.11.6.
Edit: This doesn't seem to change the timeout, but rather adds a second timeout counter (probably in fsspec/asyn.py's sync). So one can shorten the time allowed for downloading this way, but not extend it.
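That second counter can be seen in isolation: fsspec's sync wraps the coroutine in its own asyncio.wait_for and raises FSTimeoutError independently of aiohttp. A small sketch, assuming fsspec's internal get_loop/sync helpers keep their current signatures:

```python
import asyncio

import fsspec.asyn
import fsspec.exceptions

# A coroutine that takes 2 s, run through fsspec's sync wrapper with a
# 0.1 s timeout: fsspec's own counter fires, not aiohttp's.
async def slow():
    await asyncio.sleep(2)
    return "done"

loop = fsspec.asyn.get_loop()
try:
    fsspec.asyn.sync(loop, slow, timeout=0.1)
except fsspec.exceptions.FSTimeoutError:
    print("FSTimeoutError raised by fsspec itself")
```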
Edit 2: fs is of type fsspec.implementations.http.HTTPFileSystem, which initializes an aiohttp.ClientSession using client_kwargs. We can pass these when calling load_dataset.
TL;DR: this fixes it:

```python
import datasets, aiohttp

dataset = datasets.load_dataset(
    dataset_name,
    storage_options={"client_kwargs": {"timeout": aiohttp.ClientTimeout(total=3600)}},
)
```
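If an hour is still not enough, the same client_kwargs route can remove the limit altogether: total=None disables aiohttp's overall timeout. A variant I haven't run against a full download:

```python
import aiohttp

# Same mechanism as the fix above, but with no overall limit at all;
# pass this dict as storage_options= to load_dataset.
storage_options = {"client_kwargs": {"timeout": aiohttp.ClientTimeout(total=None)}}
print(storage_options["client_kwargs"]["timeout"].total)  # None
```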
Describe the bug

I am trying to download the librispeech_asr clean dataset, which results in an FSTimeoutError exception after downloading around 61% of the data.

Steps to reproduce the bug

The output is as follows:
Expected behavior
Complete the download
Environment info
Python version 3.12.6
Dependencies:
macOS 14.6.1 (23G93)