huggingface / huggingface_hub

The official Python client for the Huggingface Hub.
https://huggingface.co/docs/huggingface_hub
Apache License 2.0

TQDM: AttributeError on del tqdm_class._lock with async / threaded usage #1994

Closed michaelfeil closed 9 months ago

michaelfeil commented 9 months ago

Describe the bug

I intermittently hit what looks like a race condition when downloading models in threads. It happens in roughly 1 out of 100 CI runs.

This is roughly what my code looks like:

import asyncio
import os

from huggingface_hub import snapshot_download
from huggingface_hub.utils import disable_progress_bars

disable_progress_bars()
os.environ.setdefault("HF_HUB_DISABLE_PROGRESS_BARS", "True")

def download_sync(model_name):
    # Blocking download; runs in a worker thread.
    return snapshot_download(model_name)

async def download(model_name):
    return await asyncio.to_thread(download_sync, model_name)

async def collect():
    # Run both downloads concurrently.
    return await asyncio.gather(download("model1"), download("model2"))

asyncio.run(collect())

Sorry for the ugly stack trace from GCP Cloud Build:

2024-01-19 01:11:22.306 CET
Step #1 - "build-image":

mydownloader/downloader.py:50: in async_download
    self._model_full_path = await asyncio.to_thread(
/usr/lib/python3.10/asyncio/threads.py:25: in to_thread
    return await loop.run_in_executor(None, func_call)
/usr/lib/python3.10/concurrent/futures/thread.py:58: in run
    result = self.fn(*self.args, **self.kwargs)
mydownloader/downloader.py:72: in _download
    downloaded_path: str = snapshot_download(
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py:118: in _inner_fn
    return fn(*args, **kwargs)
/usr/local/lib/python3.10/dist-packages/huggingface_hub/_snapshot_download.py:239: in snapshot_download
    thread_map(
/usr/local/lib/python3.10/dist-packages/tqdm/contrib/concurrent.py:69: in thread_map
    return _executor_map(ThreadPoolExecutor, fn, *iterables, **tqdm_kwargs)
/usr/local/lib/python3.10/dist-packages/tqdm/contrib/concurrent.py:47: in _executor_map
    with ensure_lock(tqdm_class, lock_name=lock_name) as lk:
/usr/lib/python3.10/contextlib.py:142: in __exit__
    next(self.gen)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

tqdm_class = <class 'huggingface_hub.utils.tqdm.tqdm'>, lock_name = ''

    @contextmanager
    def ensure_lock(tqdm_class, lock_name=""):
        """get (create if necessary) and then restore `tqdm_class`'s lock"""
        old_lock = getattr(tqdm_class, '_lock', None)  # don't create a new lock
        lock = old_lock or tqdm_class.get_lock()  # maybe create a new lock
        lock = getattr(lock, lock_name, lock)  # maybe subtype
        tqdm_class.set_lock(lock)
        yield lock
        if old_lock is None:
>           del tqdm_class._lock
E           AttributeError: _lock

/usr/local/lib/python3.10/dist-packages/tqdm/contrib/concurrent.py:24: AttributeError
---
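
For illustration, here is a minimal sketch of one interleaving that would produce this exact error. It is a guess at the mechanism, not a confirmed diagnosis. `ensure_lock` is copied from the trace above; `FakeTqdm`, `run_race`, and the two barriers are hypothetical stand-ins that force two threads to both read `old_lock = None` before either one sets the lock, so both later try to `del` the same attribute.

```python
import threading
from contextlib import contextmanager


@contextmanager
def ensure_lock(tqdm_class, lock_name=""):
    # Same logic as the helper printed in the trace.
    old_lock = getattr(tqdm_class, "_lock", None)
    lock = old_lock or tqdm_class.get_lock()
    lock = getattr(lock, lock_name, lock)
    tqdm_class.set_lock(lock)
    yield lock
    if old_lock is None:
        del tqdm_class._lock


def run_race():
    """Force two threads through ensure_lock in the problematic order."""
    # Hypothetical synchronization points, only here to make the race deterministic.
    both_read_old_lock = threading.Barrier(2, timeout=5)
    both_inside_body = threading.Barrier(2, timeout=5)
    errors = []

    class FakeTqdm:
        """Stand-in for a tqdm class; NOT the real tqdm implementation."""

        @classmethod
        def get_lock(cls):
            # Wait until both threads have already seen old_lock = None.
            both_read_old_lock.wait()
            if not hasattr(cls, "_lock"):
                cls._lock = threading.RLock()
            return cls._lock

        @classmethod
        def set_lock(cls, lock):
            cls._lock = lock

    def worker():
        try:
            with ensure_lock(FakeTqdm):
                # Wait until both threads have set the lock, so neither
                # re-creates it after the other has deleted it.
                both_inside_body.wait()
        except AttributeError as exc:
            errors.append(exc)

    threads = [threading.Thread(target=worker) for _ in range(2)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return errors


if __name__ == "__main__":
    # One thread deletes _lock successfully; the other finds it already gone.
    print(run_race())
```

In this sketch exactly one of the two threads fails on `del tqdm_class._lock`, which matches the shape of the trace (the error surfaces from `next(self.gen)` in `__exit__`). Whether the real failure follows this path is an assumption.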

Reproduction

No response

Logs

No response

System info

hf-transfer = "0.1.3"
huggingface-hub = "0.17.1"
python = "~3.10.2"

Wauplin commented 9 months ago

@michaelfeil Thanks for reporting, and sorry for the inconvenience. This looks like a known issue that has already been fixed (see https://github.com/huggingface/huggingface_hub/pull/1629). Could you update huggingface_hub to a more recent version (the latest is 0.20.2)? That should solve your issue :)
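
To confirm which version a CI image actually ends up with, a small check along these lines might help. This is only a sketch: `release_tuple` and `installed_version` are hypothetical helper names, and 0.20.2 is used as the comparison point only because it is the latest release named above (this thread does not say which release first shipped the fix).

```python
from importlib import metadata


def release_tuple(version):
    """Turn '0.17.1' into (0, 17, 1); stops at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def installed_version(package="huggingface_hub"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    current = installed_version()
    if current is None:
        print("huggingface_hub is not installed")
    elif release_tuple(current) < release_tuple("0.20.2"):
        print(f"huggingface_hub {current} is older than 0.20.2; consider upgrading")
    else:
        print(f"huggingface_hub {current} is at least 0.20.2")
```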

michaelfeil commented 9 months ago

I'll try bumping the version.

Wauplin commented 9 months ago

Great! Please let me know if the issue arises again and I'll reopen :)