Closed by samsja 1 month ago
Hi @samsja, thanks for reporting and sorry for the delay. This timeout value is actually hard-coded to 100s in the datasets
library (see here). Do you think your workers are blocked in a way that requires more than 100s to complete the HTTP call?
cc @lhoestq who maintains datasets
I managed to solve my problem by using HF_HUB_ETAG_TIMEOUT=500 as an env variable.
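A minimal sketch of this workaround. It assumes huggingface_hub reads the variable when it is first imported, so it must be set before importing datasets or huggingface_hub; 500 is the value reported above:

```python
import os

# Set the longer ETag timeout BEFORE importing datasets / huggingface_hub,
# since the library reads this env variable at import time.
os.environ["HF_HUB_ETAG_TIMEOUT"] = "500"  # seconds; value from this issue

# The variable is now visible to later imports and to spawned workers.
print(os.environ["HF_HUB_ETAG_TIMEOUT"])  # → 500
```

Exporting the variable in the shell before launching torchrun has the same effect, since worker processes inherit the environment.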
Do you think your workers are blocked in a way that requires more than 100s to complete the HTTP call?
I guess yes, since increasing the timeout allows my run to start.
Feel free to close the issue now that I have a working solution.
Thanks for sharing your solution @samsja! I'll close this issue then :)
Describe the bug
hey, I am experiencing a timeout when downloading a dataset. I would like to be able to increase this timeout, either through a longer default or via an env variable.
Reproduction
I am using the following dataset
load_dataset("allenai/c4", "en", streaming=True)
in streaming mode and get the error below. This only happens when using torchrun with 8 workers; with 2 workers it works. My guess is that the workers fight for bandwidth, leading to the timeout when there are too many workers.
I actually "fixed" the issue locally by patching the timeout on this line: https://github.com/huggingface/huggingface_hub/blob/5ff2d150d121d04799b78bc08f2343c21b8f07a9/src/huggingface_hub/hf_api.py#L2306
I would like to increase this timeout in a more secure way.
Thanks in advance :pray:
Logs
System info