huggingface / huggingface_hub

The official Python client for the Huggingface Hub.
https://huggingface.co/docs/huggingface_hub
Apache License 2.0

Lock acquisition fails on download #2543

Open AlpinDale opened 2 months ago

AlpinDale commented 2 months ago

Describe the bug

I've been trying to download NousResearch/Meta-Llama-3.1-8B-Instruct with and without hf-transfer, but it consistently hangs at the 10GB point (2 shards with hf-transfer, half of each without), with this message being repeated every few seconds:

still waiting to acquire lock on /home/austin/.cache/huggingface/hub/.locks/models--NousResearch--Meta-Llama-3.1-8B-Instruct/fc1cdddd6bfa91128d6e94ee73d0ce62bfcdb7af29e978ddcab30c66ae9ea7fa.lock

Reproduction

pip install -U huggingface-hub[cli] hf-transfer

HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download NousResearch/Meta-Llama-3.1-8B-Instruct --exclude *.pth

Logs

$ huggingface-cli download NousResearch/Meta-Llama-3.1-8B-Instruct --exclude *.pth
Downloading '.gitattributes' to '/home/austin/.cache/huggingface/hub/models--NousResearch--Meta-Llama-3.1-8B-Instruct/blobs/a6344aac8c09253b3b630fb776ae94478aa0275b.incomplete'
.gitattributes: 100%|███████████████████████████████████████████████████████████████████████████| 1.52k/1.52k [00:00<00:00, 12.4MB/s]
Download complete. Moving file to /home/austin/.cache/huggingface/hub/models--NousResearch--Meta-Llama-3.1-8B-Instruct/blobs/a6344aac8c09253b3b630fb776ae94478aa0275b
Downloading 'LICENSE' to '/home/austin/.cache/huggingface/hub/models--NousResearch--Meta-Llama-3.1-8B-Instruct/blobs/a7c3ca16cee30425ed6ad841a809590f2bcbf290.incomplete'
LICENSE: 100%|██████████████████████████████████████████████████████████████████████████████████| 7.63k/7.63k [00:00<00:00, 22.6MB/s]
Download complete. Moving file to /home/austin/.cache/huggingface/hub/models--NousResearch--Meta-Llama-3.1-8B-Instruct/blobs/a7c3ca16cee30425ed6ad841a809590f2bcbf290
Downloading 'README.md' to '/home/austin/.cache/huggingface/hub/models--NousResearch--Meta-Llama-3.1-8B-Instruct/blobs/71ce2f59177b48e3da2ac1b559393f4fcd9b3ea1.incomplete'
README.md: 100%|████████████████████████████████████████████████████████████████████████████████| 41.8k/41.8k [00:00<00:00, 63.5MB/s]
Download complete. Moving file to /home/austin/.cache/huggingface/hub/models--NousResearch--Meta-Llama-3.1-8B-Instruct/blobs/71ce2f59177b48e3da2ac1b559393f4fcd9b3ea1
Downloading 'USE_POLICY.md' to '/home/austin/.cache/huggingface/hub/models--NousResearch--Meta-Llama-3.1-8B-Instruct/blobs/81ebb55902285e8dd5804ccf423d17ffb2a622ee.incomplete'
USE_POLICY.md: 100%|████████████████████████████████████████████████████████████████████████████| 4.69k/4.69k [00:00<00:00, 12.7MB/s]
Download complete. Moving file to /home/austin/.cache/huggingface/hub/models--NousResearch--Meta-Llama-3.1-8B-Instruct/blobs/81ebb55902285e8dd5804ccf423d17ffb2a622ee
Downloading 'config.json' to '/home/austin/.cache/huggingface/hub/models--NousResearch--Meta-Llama-3.1-8B-Instruct/blobs/0bb6fd75b3ad2fe988565929f329945262c2814e.incomplete'
config.json: 100%|██████████████████████████████████████████████████████████████████████████████████| 855/855 [00:00<00:00, 3.32MB/s]
Download complete. Moving file to /home/austin/.cache/huggingface/hub/models--NousResearch--Meta-Llama-3.1-8B-Instruct/blobs/0bb6fd75b3ad2fe988565929f329945262c2814e
Downloading 'generation_config.json' to '/home/austin/.cache/huggingface/hub/models--NousResearch--Meta-Llama-3.1-8B-Instruct/blobs/cc7276afd599de091142c6ed3005faf8a74aa257.incomplete'
generation_config.json: 100%|████████████████████████████████████████████████████████████████████████| 184/184 [00:00<00:00, 735kB/s]
Download complete. Moving file to /home/austin/.cache/huggingface/hub/models--NousResearch--Meta-Llama-3.1-8B-Instruct/blobs/cc7276afd599de091142c6ed3005faf8a74aa257
Downloading 'model-00001-of-00004.safetensors' to '/home/austin/.cache/huggingface/hub/models--NousResearch--Meta-Llama-3.1-8B-Instruct/blobs/2b1879f356aed350030bb40eb45ad362c89d9891096f79a3ab323d3ba5607668.incomplete'
model-00001-of-00004.safetensors: 100%|█████████████████████████████████████████████████████████▉| 4.98G/4.98G [00:11<00:00, 443MB/s]
Download complete. Moving file to /home/austin/.cache/huggingface/hub/models--NousResearch--Meta-Llama-3.1-8B-Instruct/blobs/2b1879f356aed350030bb40eb45ad362c89d9891096f79a3ab323d3ba5607668
Downloading 'model-00002-of-00004.safetensors' to '/home/austin/.cache/huggingface/hub/models--NousResearch--Meta-Llama-3.1-8B-Instruct/blobs/09d433f650646834a83c580877bd60c6d1f88f7755305c12576b5c7058f9af15.incomplete'
model-00002-of-00004.safetensors: 100%|█████████████████████████████████████████████████████████▉| 5.00G/5.00G [00:08<00:00, 600MB/s]
Download complete. Moving file to /home/austin/.cache/huggingface/hub/models--NousResearch--Meta-Llama-3.1-8B-Instruct/blobs/09d433f650646834a83c580877bd60c6d1f88f7755305c12576b5c7058f9af15
still waiting to acquire lock on /home/austin/.cache/huggingface/hub/.locks/models--NousResearch--Meta-Llama-3.1-8B-Instruct/fc1cdddd6bfa91128d6e94ee73d0ce62bfcdb7af29e978ddcab30c66ae9ea7fa.lock
still waiting to acquire lock on /home/austin/.cache/huggingface/hub/.locks/models--NousResearch--Meta-Llama-3.1-8B-Instruct/fc1cdddd6bfa91128d6e94ee73d0ce62bfcdb7af29e978ddcab30c66ae9ea7fa.lock
still waiting to acquire lock on /home/austin/.cache/huggingface/hub/.locks/models--NousResearch--Meta-Llama-3.1-8B-Instruct/fc1cdddd6bfa91128d6e94ee73d0ce62bfcdb7af29e978ddcab30c66ae9ea7fa.lock
still waiting to acquire lock on /home/austin/.cache/huggingface/hub/.locks/models--NousResearch--Meta-Llama-3.1-8B-Instruct/fc1cdddd6bfa91128d6e94ee73d0ce62bfcdb7af29e978ddcab30c66ae9ea7fa.lock
still waiting to acquire lock on /home/austin/.cache/huggingface/hub/.locks/models--NousResearch--Meta-Llama-3.1-8B-Instruct/fc1cdddd6bfa91128d6e94ee73d0ce62bfcdb7af29e978ddcab30c66ae9ea7fa.lock
still waiting to acquire lock on /home/austin/.cache/huggingface/hub/.locks/models--NousResearch--Meta-Llama-3.1-8B-Instruct/fc1cdddd6bfa91128d6e94ee73d0ce62bfcdb7af29e978ddcab30c66ae9ea7fa.lock
still waiting to acquire lock on /home/austin/.cache/huggingface/hub/.locks/models--NousResearch--Meta-Llama-3.1-8B-Instruct/fc1cdddd6bfa91128d6e94ee73d0ce62bfcdb7af29e978ddcab30c66ae9ea7fa.lock
still waiting to acquire lock on /home/austin/.cache/huggingface/hub/.locks/models--NousResearch--Meta-Llama-3.1-8B-Instruct/fc1cdddd6bfa91128d6e94ee73d0ce62bfcdb7af29e978ddcab30c66ae9ea7fa.lock
^CTraceback (most recent call last):
  File "/home/austin/disk1/aphrodite-engine/conda/envs/aphrodite-runtime/bin/huggingface-cli", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/austin/disk1/aphrodite-engine/conda/envs/aphrodite-runtime/lib/python3.11/site-packages/huggingface_hub/commands/huggingface_cli.py", line 52, in main
    service.run()
  File "/home/austin/disk1/aphrodite-engine/conda/envs/aphrodite-runtime/lib/python3.11/site-packages/huggingface_hub/commands/download.py", line 146, in run
    print(self._download())  # Print path to downloaded files
          ^^^^^^^^^^^^^^^^
  File "/home/austin/disk1/aphrodite-engine/conda/envs/aphrodite-runtime/lib/python3.11/site-packages/huggingface_hub/commands/download.py", line 180, in _download
    return snapshot_download(
           ^^^^^^^^^^^^^^^^^^
  File "/home/austin/disk1/aphrodite-engine/conda/envs/aphrodite-runtime/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/austin/disk1/aphrodite-engine/conda/envs/aphrodite-runtime/lib/python3.11/site-packages/huggingface_hub/_snapshot_download.py", line 297, in snapshot_download
    _inner_hf_hub_download(file)
  File "/home/austin/disk1/aphrodite-engine/conda/envs/aphrodite-runtime/lib/python3.11/site-packages/huggingface_hub/_snapshot_download.py", line 273, in _inner_hf_hub_download
    return hf_hub_download(
           ^^^^^^^^^^^^^^^^
  File "/home/austin/disk1/aphrodite-engine/conda/envs/aphrodite-runtime/lib/python3.11/site-packages/huggingface_hub/utils/_deprecation.py", line 101, in inner_f
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/austin/disk1/aphrodite-engine/conda/envs/aphrodite-runtime/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/austin/disk1/aphrodite-engine/conda/envs/aphrodite-runtime/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 1240, in hf_hub_download
    return _hf_hub_download_to_cache_dir(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/austin/disk1/aphrodite-engine/conda/envs/aphrodite-runtime/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 1388, in _hf_hub_download_to_cache_dir
    with WeakFileLock(lock_path):
  File "/home/austin/disk1/aphrodite-engine/conda/envs/aphrodite-runtime/lib/python3.11/contextlib.py", line 137, in __enter__
    return next(self.gen)
           ^^^^^^^^^^^^^^
  File "/home/austin/disk1/aphrodite-engine/conda/envs/aphrodite-runtime/lib/python3.11/site-packages/huggingface_hub/utils/_fixes.py", line 91, in WeakFileLock
    lock.acquire()
  File "/home/austin/disk1/aphrodite-engine/conda/envs/aphrodite-runtime/lib/python3.11/site-packages/filelock/_api.py", line 344, in acquire
    time.sleep(poll_interval)
KeyboardInterrupt

System info

- huggingface_hub version: 0.24.7
- Platform: Linux-5.15.0-119-generic-x86_64-with-glibc2.35
- Python version: 3.11.9
- Running in iPython ?: No
- Running in notebook ?: No
- Running in Google Colab ?: No
- Token path ?: /home/austin/.cache/huggingface/token
- Has saved token ?: True
- Who am I ?: alpindale
- Configured git credential helpers: store
- FastAI: N/A
- Tensorflow: N/A
- Torch: 2.4.0
- Jinja2: 3.1.4
- Graphviz: N/A
- keras: N/A
- Pydot: N/A
- Pillow: 10.4.0
- hf_transfer: 0.1.8
- gradio: N/A
- tensorboard: N/A
- numpy: 1.26.4
- pydantic: 2.8.2
- aiohttp: 3.10.5
- ENDPOINT: https://huggingface.co
- HF_HUB_CACHE: /home/austin/.cache/huggingface/hub
- HF_ASSETS_CACHE: /home/austin/.cache/huggingface/assets
- HF_TOKEN_PATH: /home/austin/.cache/huggingface/token
- HF_HUB_OFFLINE: False
- HF_HUB_DISABLE_TELEMETRY: False
- HF_HUB_DISABLE_PROGRESS_BARS: None
- HF_HUB_DISABLE_SYMLINKS_WARNING: False
- HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
- HF_HUB_DISABLE_IMPLICIT_TOKEN: False
- HF_HUB_ENABLE_HF_TRANSFER: False
- HF_HUB_ETAG_TIMEOUT: 10
- HF_HUB_DOWNLOAD_TIMEOUT: 10

Wauplin commented 2 months ago

Hi @AlpinDale, sorry for the inconvenience. What type of drive is it: a fairly standard one or a specially mounted drive? Asking because filelock doesn't always work properly on some filesystems. Independently of that, you can try killing all huggingface_hub/hf_transfer processes and then running rm -rf /home/austin/.cache/huggingface/hub/.locks to delete all current locks. This should fix your issue ( :crossed_fingers: ), though I can't explain why it happened in the first place.
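
Before deleting the .locks directory, it may help to check whether some process is still holding any of the lock files. The following is a minimal sketch, not part of huggingface_hub: find_held_locks is a hypothetical helper name, it only works on Unix-like systems, and it assumes the locks are taken with flock (which, as far as I know, is what filelock does on Linux; treat that as an assumption). Note that the probe briefly takes each free lock itself before releasing it.

```python
import fcntl
import os
from pathlib import Path


def find_held_locks(locks_dir):
    """Return (held, free) lists of *.lock files found under locks_dir.

    A lock file counts as "held" when a non-blocking exclusive flock on it
    fails, i.e. some other open file description currently owns the lock.
    Assumption: the locks are flock-based (Unix only).
    """
    held, free = [], []
    for lock_path in sorted(Path(locks_dir).rglob("*.lock")):
        fd = os.open(lock_path, os.O_RDONLY)
        try:
            # LOCK_NB makes the call fail immediately instead of waiting.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            held.append(lock_path)
        else:
            # Nobody held it; release our probe lock right away.
            fcntl.flock(fd, fcntl.LOCK_UN)
            free.append(lock_path)
        finally:
            os.close(fd)
    return held, free
```

Calling find_held_locks(Path.home() / ".cache/huggingface/hub/.locks") would cover the cache location shown in the logs above. If a file is reported as held, some process still owns it and is worth tracking down before removing the directory; if every file is reported as free and the download still waits, that would point towards the filesystem rather than a leftover process.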

JakubCzarlinski commented 2 weeks ago

Same issue here. I tried deleting the .locks directory, but unfortunately it didn't help. Instead, reducing --max-workers to something like 2 worked.

E.g.:

huggingface-cli download stabilityai/stable-diffusion-3.5-medium --max-workers 2

This is without using hf_transfer, and for a different model. In my case this did not hinder performance, but I imagine that depends heavily on your network speed.

EDIT: Spoke too soon. It didn't solve the problem, but it did at least reduce how often it happens.