huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

[FSTimeoutError] load_dataset #7175

Closed cosmo3769 closed 1 month ago

cosmo3769 commented 2 months ago

Describe the bug

When using load_dataset to load HuggingFaceM4/VQAv2, I am getting FSTimeoutError.

Error

TimeoutError: 

The above exception was the direct cause of the following exception:

FSTimeoutError                            Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/fsspec/asyn.py in sync(loop, func, timeout, *args, **kwargs)
     99     if isinstance(return_result, asyncio.TimeoutError):
    100         # suppress asyncio.TimeoutError, raise FSTimeoutError
--> 101         raise FSTimeoutError from return_result
    102     elif isinstance(return_result, BaseException):
    103         raise return_result

FSTimeoutError:

The download usually fails after around 5-6 GB.

[Screenshot attached: 2024-09-26, 9:10 PM]

Steps to reproduce the bug

To reproduce it, run this in colab notebook:

!pip install -q -U datasets

from datasets import load_dataset
ds = load_dataset('HuggingFaceM4/VQAv2', split="train[:10%]")

Expected behavior

The dataset should download completely without timing out.

Environment info

Using Colab Notebook.

cosmo3769 commented 2 months ago

Is this FSTimeoutError due to a network issue when downloading from the remote resource (i.e. the host the data is fetched from)?

crlotwhite commented 2 months ago

It seems to happen for all datasets, not just a specific one, and especially for versions after 3.0 (both 3.0.0 and 3.0.1 have this problem).

I had the same error on a different dataset, but after downgrading to datasets==2.21.0, the problem was solved.
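
For reference, in a Colab cell the downgrade is just a pinned install (pinning an older release may of course conflict with packages that require datasets>=3.0):

!pip install -q "datasets==2.21.0"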

lhoestq commented 2 months ago

Same as https://github.com/huggingface/datasets/issues/7164

This dataset is backed by a Python script that downloads the data from a host other than HF, so availability depends on that original host. Ultimately it would be nice to host the files of this dataset on HF.

In datasets <3.0 there were lots of mechanisms that were removed after the decision to make datasets with Python loading scripts legacy for security and maintenance reasons (we only provide very basic support for them now).

cosmo3769 commented 1 month ago

@lhoestq Thank you for the clarification! Closing the issue.

Epiphero commented 1 month ago

I'm getting this too, and also at 5 minutes, but for CSTR-Edinburgh/vctk, so it's not just this dataset. It seems to be a timeout that was introduced and needs to be raised. The progress bar was moving along just fine before the timeout, and I get more or less of the download depending on how fast the network is.

JonasLoos commented 1 month ago

You can raise the aiohttp timeout from 5 minutes to 1 hour like this:

import datasets, aiohttp

# Pass a longer aiohttp timeout (1 hour) through to fsspec's HTTP filesystem
dataset = datasets.load_dataset(
    dataset_name,  # e.g. 'HuggingFaceM4/VQAv2'
    storage_options={'client_kwargs': {'timeout': aiohttp.ClientTimeout(total=3600)}}
)
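
Applied to the dataset from the original report, this would look something like the sketch below (untested; passing aiohttp.ClientTimeout(total=None) instead should disable the overall timeout entirely):

import datasets, aiohttp

# Same reproduction as in the report, but with a 1-hour total timeout for the HTTP downloads
ds = datasets.load_dataset(
    'HuggingFaceM4/VQAv2',
    split='train[:10%]',
    storage_options={'client_kwargs': {'timeout': aiohttp.ClientTimeout(total=3600)}},
)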