huggingface / huggingface_hub

The official Python client for the Hugging Face Hub.
https://huggingface.co/docs/huggingface_hub
Apache License 2.0

Timeout when downloading dataset metadata with 8 torchrun workers #2272

Closed samsja closed 1 month ago

samsja commented 2 months ago

Describe the bug

Hey, I am experiencing a timeout when downloading a dataset. I would like to be able to increase this timeout, either through a longer default or via an environment variable.

Reproduction

I am loading the following dataset in streaming mode, load_dataset("allenai/c4", "en", streaming=True), and get the error below.

This only happens when using torchrun with 8 workers; with 2 workers it works. My guess is that the workers fight for bandwidth, leading to the timeout when there are too many of them.

I actually "fixed" the issue locally by patching the timeout on this line: https://github.com/huggingface/huggingface_hub/blob/5ff2d150d121d04799b78bc08f2343c21b8f07a9/src/huggingface_hub/hf_api.py#L2306

I would like a more robust way to increase this timeout.

Thanks in advance :pray:
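For reference, a minimal sketch of raising the timeout without patching the source, using the HF_HUB_ETAG_TIMEOUT / HF_HUB_DOWNLOAD_TIMEOUT environment variables that appear in the system info below (the value 500 is just an example; the key assumption is that huggingface_hub reads these variables when it is imported, so they must be set before the import):

```python
import os

# Set the timeouts *before* importing huggingface_hub (or datasets,
# which imports it under the hood), since the library reads these
# environment variables once at import time.
os.environ["HF_HUB_ETAG_TIMEOUT"] = "500"      # seconds, metadata/ETag requests
os.environ["HF_HUB_DOWNLOAD_TIMEOUT"] = "500"  # seconds, file downloads

# Then load the dataset as usual, e.g.:
#   from datasets import load_dataset
#   ds = load_dataset("allenai/c4", "en", streaming=True)
```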

Logs

File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 2491, in repo_info
    return method(
  File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 2363, in dataset_info
    r = get_session().get(path, headers=headers, timeout=timeout, params=params)
  File "/opt/conda/lib/python3.10/site-packages/requests/sessions.py", line 602, in get
    return self.request("GET", url, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/opt/conda/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/utils/_http.py", line 66, in send
    return super().send(request, *args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/requests/adapters.py", line 532, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: (ReadTimeoutError("HTTPSConnectionPool(host='huggingface.co

System info

- huggingface_hub version: 0.23.0
- Platform: Linux-6.5.0-15-generic-x86_64-with-glibc2.35
- Python version: 3.10.14
- Running in iPython ?: No
- Running in notebook ?: No
- Running in Google Colab ?: No
- Token path ?: /root/.cache/huggingface/token
- Has saved token ?: False
- Configured git credential helpers:
- FastAI: N/A
- Tensorflow: N/A
- Torch: 2.2.2
- Jinja2: 3.1.3
- Graphviz: N/A
- keras: N/A
- Pydot: N/A
- Pillow: 10.2.0
- hf_transfer: N/A
- gradio: N/A
- tensorboard: N/A
- numpy: 1.26.4
- pydantic: 1.10.15
- aiohttp: 3.9.5
- ENDPOINT: https://huggingface.co
- HF_HUB_CACHE: /root/.cache/huggingface/hub
- HF_ASSETS_CACHE: /root/.cache/huggingface/assets
- HF_TOKEN_PATH: /root/.cache/huggingface/token
- HF_HUB_OFFLINE: False
- HF_HUB_DISABLE_TELEMETRY: False
- HF_HUB_DISABLE_PROGRESS_BARS: None
- HF_HUB_DISABLE_SYMLINKS_WARNING: False
- HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
- HF_HUB_DISABLE_IMPLICIT_TOKEN: False
- HF_HUB_ENABLE_HF_TRANSFER: False
- HF_HUB_ETAG_TIMEOUT: 10
- HF_HUB_DOWNLOAD_TIMEOUT: 10
Wauplin commented 1 month ago

Hi @samsja, thanks for reporting and sorry for the delay. This timeout value is actually hard-coded to 100s in the datasets library (see here). Do you think your workers are blocked in a way that requires more than 100s to complete the HTTP call?

cc @lhoestq who maintains datasets

samsja commented 1 month ago

I managed to solve my problem by setting HF_HUB_ETAG_TIMEOUT=500 as an environment variable.

Do you think your workers are blocked in a way that requires more than 100s to complete the HTTP call?

I guess yes, since increasing the timeout allowed my run to start.

Feel free to close the issue now that I have a working solution.
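A note on why a single environment variable fixes all 8 workers: torchrun spawns its workers as child processes, which inherit the launcher's environment. A minimal sketch of that propagation, where the plain child process stands in for a torchrun worker (an assumption for illustration only):

```python
import os
import subprocess
import sys

# Setting the variable once in the launcher environment covers every
# worker process spawned from it.
env = dict(os.environ, HF_HUB_ETAG_TIMEOUT="500")

# Stand-in for `torchrun --nproc_per_node=8 train.py`: a child process
# that reports the variable it inherited.
out = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['HF_HUB_ETAG_TIMEOUT'])"],
    env=env,
    capture_output=True,
    text=True,
)
print(out.stdout.strip())  # -> 500
```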

Wauplin commented 1 month ago

Thanks for sharing your solution @samsja! I'll close this issue then :)