huggingface / huggingface_hub

The official Python client for the Huggingface Hub.
https://huggingface.co/docs/huggingface_hub
Apache License 2.0

Oversized cache folder, multiple downloads of the same checkpoints. #1922

Closed jquintanilla4 closed 11 months ago

jquintanilla4 commented 11 months ago

Describe the bug

When I use HF diffusers to download a model/checkpoint from the Hub with the `.from_pretrained()` method, it appears to download the same model/checkpoint over and over, usually on different days or at different times of the day. Between those repeated downloads there are no updates to the checkpoint/model I'm downloading, yet over time I end up with a cache folder for sdxl-turbo that is 20GB. And when I run `huggingface-cli delete-cache`, it only gives me the option of deleting the entire cache folder for that model/checkpoint.

As an example, here are the last two checkpoints when I run the `huggingface-cli scan-cache` command:

stabilityai/sdxl-turbo                      model            20.5G       19 a few seconds ago a few seconds ago main /home/xxxx/.cache/huggingface/hub/models--stabilityai--sdxl-turbo
stabilityai/stable-diffusion-xl-base-1.0    model            21.3G       22 1 day ago         1 week ago        main /home/xxxx/.cache/huggingface/hub/models--stabilityai--stable-diffusion-xl-base-1.0

I use local_folders, if I remember correctly. I normally work in a Jupyter notebook (.ipynb) inside VS Code.

Output from `tree -alh ~/.cache/huggingface/hub/models--stabilityai--sdxl-turbo`:

[4.0K]  /home/remko/.cache/huggingface/hub/models--stabilityai--sdxl-turbo
├── [4.0K]  blobs
│   ├── [160M]  02ee4bd18e5d16e7fe5fc5b85b4aefa2cba6db28897f674226c9d6ddd2f34f06
│   ├── [ 459]  0359d7abb0b9c7b4a433be2db87cefea03c06ea5
│   ├── [9.6G]  1968fc61aa8449ab3d3f9b9a05bce88c611760c01e0c4a7a3785911b546fe582
│   ├── [ 704]  1bf819b6621086dc92428e2c9c8bbab39211fd55
│   ├── [ 586]  1f467f3b057c46a21f88d9ab2f1070af9916c78c
│   ├── [1.7K]  220d7ae3e59ce3484c7eeb47ef2ac9db5097e29a
│   ├── [1.0M]  469be27c5c010538f845f518c4f5e8574c78f7c8
│   ├── [4.8G]  48fa46161a745f48d4054df3fe13804ee255486bca893403b60373c188fd1bdb
│   ├── [235M]  660c6f5b1abae9dc498ac2d21e1347d2abdb0cf6c0c0c8576cd796491d9a6cdd
│   ├── [512K]  76e821f1b6f0a9709293c3b6b51ed90980b3166b
│   ├── [469M]  778d02eb9e707c3fbaae0b67b79ea0d1399b52e624fb634f2f19375ae7c047c3
│   ├── [ 565]  8e91c97936ad0b2c1356f03de8d47589b5232704
│   ├── [ 460]  ae0c5be6f35217e51c4c000fd325d8de0294e99c
│   ├── [ 607]  ae14cf90e29b12134a53383691c98c73dee5d422
│   ├── [ 855]  bd2abe19377557ff5771584921f9b65fa041fef0
│   ├── [1.3G]  ec310df2af79c318e24d20511b601a591ca8cd4f1fce1d8dff822a356bcdb1f4
│   ├── [ 685]  f857cf2f828fff2ee319b1a47e6ce820e8affb9d
│   ├── [ 575]  f9e084535c55110233f44ccc6c7f9d0e1540f8be
│   └── [2.6G]  fa5b2e6f4c2efc2d82e4b8312faec1a5540eabfc6415126c9a05c8436a530ef4
├── [4.0K]  refs
│   └── [  40]  main
└── [4.0K]  snapshots
    └── [4.0K]  f4b0486b498f84668e828044de1d0c8ba486e05b
        ├── [  52]  model_index.json -> ../../blobs/f857cf2f828fff2ee319b1a47e6ce820e8affb9d
        ├── [4.0K]  scheduler
        │   └── [  55]  scheduler_config.json -> ../../../blobs/0359d7abb0b9c7b4a433be2db87cefea03c06ea5
        ├── [4.0K]  text_encoder
        │   ├── [  55]  config.json -> ../../../blobs/8e91c97936ad0b2c1356f03de8d47589b5232704
        │   ├── [  79]  model.fp16.safetensors -> ../../../blobs/660c6f5b1abae9dc498ac2d21e1347d2abdb0cf6c0c0c8576cd796491d9a6cdd
        │   └── [  79]  model.safetensors -> ../../../blobs/778d02eb9e707c3fbaae0b67b79ea0d1399b52e624fb634f2f19375ae7c047c3
        ├── [4.0K]  text_encoder_2
        │   ├── [  55]  config.json -> ../../../blobs/f9e084535c55110233f44ccc6c7f9d0e1540f8be
        │   ├── [  79]  model.fp16.safetensors -> ../../../blobs/ec310df2af79c318e24d20511b601a591ca8cd4f1fce1d8dff822a356bcdb1f4
        │   └── [  79]  model.safetensors -> ../../../blobs/fa5b2e6f4c2efc2d82e4b8312faec1a5540eabfc6415126c9a05c8436a530ef4
        ├── [4.0K]  tokenizer
        │   ├── [  55]  merges.txt -> ../../../blobs/76e821f1b6f0a9709293c3b6b51ed90980b3166b
        │   ├── [  55]  special_tokens_map.json -> ../../../blobs/1f467f3b057c46a21f88d9ab2f1070af9916c78c
        │   ├── [  55]  tokenizer_config.json -> ../../../blobs/1bf819b6621086dc92428e2c9c8bbab39211fd55
        │   └── [  55]  vocab.json -> ../../../blobs/469be27c5c010538f845f518c4f5e8574c78f7c8
        ├── [4.0K]  tokenizer_2
        │   ├── [  55]  merges.txt -> ../../../blobs/76e821f1b6f0a9709293c3b6b51ed90980b3166b
        │   ├── [  55]  special_tokens_map.json -> ../../../blobs/ae0c5be6f35217e51c4c000fd325d8de0294e99c
        │   ├── [  55]  tokenizer_config.json -> ../../../blobs/bd2abe19377557ff5771584921f9b65fa041fef0
        │   └── [  55]  vocab.json -> ../../../blobs/469be27c5c010538f845f518c4f5e8574c78f7c8
        ├── [4.0K]  unet
        │   ├── [  55]  config.json -> ../../../blobs/220d7ae3e59ce3484c7eeb47ef2ac9db5097e29a
        │   ├── [  79]  diffusion_pytorch_model.fp16.safetensors -> ../../../blobs/48fa46161a745f48d4054df3fe13804ee255486bca893403b60373c188fd1bdb
        │   └── [  79]  diffusion_pytorch_model.safetensors -> ../../../blobs/1968fc61aa8449ab3d3f9b9a05bce88c611760c01e0c4a7a3785911b546fe582
        └── [4.0K]  vae
            ├── [  55]  config.json -> ../../../blobs/ae14cf90e29b12134a53383691c98c73dee5d422
            └── [  79]  diffusion_pytorch_model.fp16.safetensors -> ../../../blobs/02ee4bd18e5d16e7fe5fc5b85b4aefa2cba6db28897f674226c9d6ddd2f34f06

Output from `tree -alh ~/.cache/huggingface/hub/models--stabilityai--stable-diffusion-xl-base-1.0`:

[4.0K]  /home/remko/.cache/huggingface/hub/models--stabilityai--stable-diffusion-xl-base-1.0
├── [4.0K]  blobs
│   ├── [319M]  1598f3d24932bcfe6634e8b618ea1e30ab1d57f5aad13a6d2de446d2199f2341
│   ├── [319M]  27ed3b02e09638568e99d4398c67bc654dde04e6c0db61fb2d21dba630e7058a
│   ├── [ 472]  2c2130b544c0c5a72d5d00da071ba130a9800fb2
│   ├── [ 737]  2e8612a429492973fe60635b3f44a28b065cfac0
│   ├── [9.6G]  357650fbfb3c7b4d94c1f5fd7664da819ad1ff5a839430484b4ec422d03f710a
│   ├── [2.6G]  3a6032f63d37ae02bbc74ccd6a27440578cd71701f96532229d0154f55a8d3ff
│   ├── [1.0M]  469be27c5c010538f845f518c4f5e8574c78f7c8
│   ├── [469M]  5c3d6454dd2d23414b56aa1b5858a72487a656937847b6fea8d0606d7a42cdbc
│   ├── [235M]  660c6f5b1abae9dc498ac2d21e1347d2abdb0cf6c0c0c8576cd796491d9a6cdd
│   ├── [ 609]  6cc5138ddb67e9309c9b4e058e33e52087f1d215
│   ├── [512K]  76e821f1b6f0a9709293c3b6b51ed90980b3166b
│   ├── [4.8G]  83e012a805b84c7ca28e5646747c90a243c65c8ba4f070e2d7ddc9d74661e139
│   ├── [ 642]  a66a171ba7c8efb1a8fc3bdc64e65318eade8e13
│   ├── [ 725]  a8438e020c4497a429240d6b89e0bf9a6e2ffa92
│   ├── [ 460]  ae0c5be6f35217e51c4c000fd325d8de0294e99c
│   ├── [160M]  bcb60880a46b63dea58e9bc591abe15f8350bde47b405f9c38f4be70c6161e68
│   ├── [1.6K]  c8714c90f0e2409156da42781954416cb7df36af
│   ├── [ 565]  cde352ada4bb95fdad2fc503b8121257cef215a6
│   ├── [ 575]  da1848b5ed17b676f021578838f12c4023b86379
│   ├── [ 479]  e5bc8421e047838523be7acfb6720f167f7382f6
│   ├── [160M]  eb6516ab7e1104d5d1a174a4d65c57835ae38061531d0a2192103aecfb790cc1
│   └── [1.3G]  ec310df2af79c318e24d20511b601a591ca8cd4f1fce1d8dff822a356bcdb1f4
├── [4.0K]  refs
│   └── [  40]  main
└── [4.0K]  snapshots
    └── [4.0K]  462165984030d82259a11f4367a4eed129e94a7b
        ├── [  52]  model_index.json -> ../../blobs/6cc5138ddb67e9309c9b4e058e33e52087f1d215
        ├── [4.0K]  scheduler
        │   └── [  55]  scheduler_config.json -> ../../../blobs/e5bc8421e047838523be7acfb6720f167f7382f6
        ├── [4.0K]  text_encoder
        │   ├── [  55]  config.json -> ../../../blobs/cde352ada4bb95fdad2fc503b8121257cef215a6
        │   ├── [  79]  model.fp16.safetensors -> ../../../blobs/660c6f5b1abae9dc498ac2d21e1347d2abdb0cf6c0c0c8576cd796491d9a6cdd
        │   └── [  79]  model.safetensors -> ../../../blobs/5c3d6454dd2d23414b56aa1b5858a72487a656937847b6fea8d0606d7a42cdbc
        ├── [4.0K]  text_encoder_2
        │   ├── [  55]  config.json -> ../../../blobs/da1848b5ed17b676f021578838f12c4023b86379
        │   ├── [  79]  model.fp16.safetensors -> ../../../blobs/ec310df2af79c318e24d20511b601a591ca8cd4f1fce1d8dff822a356bcdb1f4
        │   └── [  79]  model.safetensors -> ../../../blobs/3a6032f63d37ae02bbc74ccd6a27440578cd71701f96532229d0154f55a8d3ff
        ├── [4.0K]  tokenizer
        │   ├── [  55]  merges.txt -> ../../../blobs/76e821f1b6f0a9709293c3b6b51ed90980b3166b
        │   ├── [  55]  special_tokens_map.json -> ../../../blobs/2c2130b544c0c5a72d5d00da071ba130a9800fb2
        │   ├── [  55]  tokenizer_config.json -> ../../../blobs/2e8612a429492973fe60635b3f44a28b065cfac0
        │   └── [  55]  vocab.json -> ../../../blobs/469be27c5c010538f845f518c4f5e8574c78f7c8
        ├── [4.0K]  tokenizer_2
        │   ├── [  55]  merges.txt -> ../../../blobs/76e821f1b6f0a9709293c3b6b51ed90980b3166b
        │   ├── [  55]  special_tokens_map.json -> ../../../blobs/ae0c5be6f35217e51c4c000fd325d8de0294e99c
        │   ├── [  55]  tokenizer_config.json -> ../../../blobs/a8438e020c4497a429240d6b89e0bf9a6e2ffa92
        │   └── [  55]  vocab.json -> ../../../blobs/469be27c5c010538f845f518c4f5e8574c78f7c8
        ├── [4.0K]  unet
        │   ├── [  55]  config.json -> ../../../blobs/c8714c90f0e2409156da42781954416cb7df36af
        │   ├── [  79]  diffusion_pytorch_model.fp16.safetensors -> ../../../blobs/83e012a805b84c7ca28e5646747c90a243c65c8ba4f070e2d7ddc9d74661e139
        │   └── [  79]  diffusion_pytorch_model.safetensors -> ../../../blobs/357650fbfb3c7b4d94c1f5fd7664da819ad1ff5a839430484b4ec422d03f710a
        ├── [4.0K]  vae
        │   ├── [  55]  config.json -> ../../../blobs/a66a171ba7c8efb1a8fc3bdc64e65318eade8e13
        │   ├── [  79]  diffusion_pytorch_model.fp16.safetensors -> ../../../blobs/bcb60880a46b63dea58e9bc591abe15f8350bde47b405f9c38f4be70c6161e68
        │   └── [  79]  diffusion_pytorch_model.safetensors -> ../../../blobs/1598f3d24932bcfe6634e8b618ea1e30ab1d57f5aad13a6d2de446d2199f2341
        └── [4.0K]  vae_1_0
            ├── [  79]  diffusion_pytorch_model.fp16.safetensors -> ../../../blobs/eb6516ab7e1104d5d1a174a4d65c57835ae38061531d0a2192103aecfb790cc1
            └── [  79]  diffusion_pytorch_model.safetensors -> ../../../blobs/27ed3b02e09638568e99d4398c67bc654dde04e6c0db61fb2d21dba630e7058a

Reproduction

Pretty standard SDXL T2I-Adapter diffusers pipeline; the same thing also happens with the SDXL ControlNet pipeline and the standard SDXL pipeline:

import torch
import cv2 as cv
import numpy as np
from PIL import Image
from IPython.display import display

from diffusers import StableDiffusionXLAdapterPipeline, T2IAdapter
from diffusers.utils import load_image, make_image_grid

model_id = 'stabilityai/sdxl-turbo'
adapter_id = 'TencentARC/t2i-adapter-canny-sdxl-1.0'

adapter = T2IAdapter.from_pretrained(adapter_id, torch_dtype=torch.float16, variant='fp16').to('cuda')

pipe = StableDiffusionXLAdapterPipeline.from_pretrained(model_id, adapter=adapter, torch_dtype=torch.float16, variant='fp16').to('cuda')

Logs

No response

System info

- huggingface_hub version: 0.19.4
- Platform: Linux-5.15.133.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
- Python version: 3.9.18
- Running in iPython ?: No
- Running in notebook ?: No
- Running in Google Colab ?: No
- Token path ?: /home/remko/.cache/huggingface/token
- Has saved token ?: False
- Configured git credential helpers:
- FastAI: N/A
- Tensorflow: N/A
- Torch: 2.1.0+cu121
- Jinja2: 3.1.2
- Graphviz: N/A
- Pydot: N/A
- Pillow: 10.1.0
- hf_transfer: N/A
- gradio: N/A
- tensorboard: N/A
- numpy: 1.26.2
- pydantic: N/A
- aiohttp: N/A
- ENDPOINT: https://huggingface.co
- HF_HUB_CACHE: /home/remko/.cache/huggingface/hub
- HF_ASSETS_CACHE: /home/remko/.cache/huggingface/assets
- HF_TOKEN_PATH: /home/remko/.cache/huggingface/token
- HF_HUB_OFFLINE: False
- HF_HUB_DISABLE_TELEMETRY: False
- HF_HUB_DISABLE_PROGRESS_BARS: None
- HF_HUB_DISABLE_SYMLINKS_WARNING: False
- HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
- HF_HUB_DISABLE_IMPLICIT_TOKEN: False
- HF_HUB_ENABLE_HF_TRANSFER: False
- HF_HUB_ETAG_TIMEOUT: 10
- HF_HUB_DOWNLOAD_TIMEOUT: 10
Wauplin commented 11 months ago

Thanks for the detailed bug report @jquintanilla4! I haven't got the time to check it today (working on 0.20.0 release) so I'll have a look tomorrow :)

Wauplin commented 11 months ago

Hey @jquintanilla4, after looking into it, there doesn't seem to be any duplicated file in your cache, am I right? Or is it expected that weights from model stable-diffusion-xl-base-1.0 and sdxl-turbo happen to be the same? Because yes, this is a known limitation: identical files shared between repos are not deduplicated in the local cache, since we keep files separated per repo.

> As an example here's the last two checkpoints when i run the huggingface-cli scan-cache command:

Regarding the delete-cache command, this is expected since you only have one revision per model (snapshot f4b0486b498f84668e828044de1d0c8ba486e05b for sdxl-turbo and snapshot 462165984030d82259a11f4367a4eed129e94a7b for xl-base-1.0). The delete-cache command is a tool meant to help clean the cache when several revisions of the same model have been downloaded.
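To make this concrete, here's a minimal sketch in plain Python of what delete-cache has to work with (the helper name is made up for illustration; this is not the huggingface_hub implementation): each subfolder of `snapshots/` is one cached revision, so with a single revision there is nothing granular to delete.

```python
import tempfile
from pathlib import Path

def list_revisions(repo_cache: Path) -> list[str]:
    """Each subfolder of snapshots/ is one cached revision of the repo."""
    return sorted(p.name for p in (repo_cache / "snapshots").iterdir() if p.is_dir())

# Tiny mock cache mirroring the sdxl-turbo tree above: a single snapshot.
repo = Path(tempfile.mkdtemp()) / "models--stabilityai--sdxl-turbo"
(repo / "snapshots" / "f4b0486b498f84668e828044de1d0c8ba486e05b").mkdir(parents=True)

print(list_revisions(repo))
# ['f4b0486b498f84668e828044de1d0c8ba486e05b']
```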

> When I use HF diffusers to download a model/checkpoint from HF using the .from_pretrained() method, it downloads the same model/checkpoint over and over, usually on different days or different times of the day. Between those repeated downloads, there's no updates to the checkpoint/model I try to download.

Do you have an example of this? If you're using different models, they will each be downloaded, which will eventually take space on your disk. But if a model file already exists in the cache, it is not re-downloaded. You might see progress bars when downloading a repo, but it should be almost instant (just the time to check that the files did not change).

The only possibility for duplicated files is if you download to a local folder in multiple locations while symlinks are deactivated. Symlinks are deactivated on Windows (if not admin / not in developer mode) or if you passed local_dir_use_symlinks=False. If that's the case, duplication is expected, but we can't do much about it.
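The disk-space difference between the two modes can be sketched with plain Python symlinks (a toy illustration, not the huggingface_hub implementation): with symlinks active, snapshot entries point at a single blob; with symlinks deactivated, each local folder holds its own copy of the bytes.

```python
import os
import tempfile
from pathlib import Path

tmp = Path(tempfile.mkdtemp())

# One blob in the cache, standing in for a multi-GB checkpoint.
blob = tmp / "blobs" / "abc123"
blob.parent.mkdir(parents=True)
blob.write_bytes(b"x" * 1024)

# Symlinked snapshot entry: no extra disk space used.
link = tmp / "snapshot" / "model.safetensors"
link.parent.mkdir()
os.symlink(blob, link)

# Symlinks deactivated: the local folder gets a full copy of the bytes.
copy = tmp / "local_dir" / "model.safetensors"
copy.parent.mkdir()
copy.write_bytes(blob.read_bytes())

print(link.is_symlink(), copy.stat().st_size)
# True 1024
```

With a real multi-GB checkpoint, the copy path is where the duplicated gigabytes come from.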


I hope this answers your question. I might be missing something and if this is the case, please let me know!

jquintanilla4 commented 11 months ago

Just to make sure I'm understanding your reply correctly: are you saying that the Hugging Face Hub cache keeps multiple copies of the same model/checkpoint if they're used by different local repos?

So for example, I am using SDXL Turbo on two different projects/repos:

So once those checkpoints are downloaded for their respective repos, they shouldn't download again, because from that point forward they should load from the cache. Correct?

@Wauplin

Wauplin commented 11 months ago

@jquintanilla4 The generic answer is "checkpoints from the same model repository on the Hub are downloaded only once into the cache".

There are some flaws to this logic in some specific cases:

Apart from those corner cases, you can always expect checkpoint files to be reused, without re-downloading or duplication, when calling .from_pretrained from different projects. And this is the case if you look at your report (`tree -alh ~/.cache/huggingface/hub/models--stabilityai--sdxl-turbo`). For more details on how the cache works internally, I invite you to read this guide. :hugs:
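For reference, the repo-to-folder mapping visible in the scan-cache output follows a simple scheme. A sketch reconstructed from the paths in this thread (the helper name is made up, not the library's actual code):

```python
from pathlib import Path

def cached_repo_dir(repo_id: str, repo_type: str = "model",
                    cache_dir: str = "~/.cache/huggingface/hub") -> Path:
    """Map a Hub repo id to its cache folder,
    e.g. 'stabilityai/sdxl-turbo' -> 'models--stabilityai--sdxl-turbo'."""
    folder = f"{repo_type}s--" + repo_id.replace("/", "--")
    return Path(cache_dir).expanduser() / folder

print(cached_repo_dir("stabilityai/sdxl-turbo").name)
# models--stabilityai--sdxl-turbo
```

Because the folder name encodes only the repo id, every project that loads the same repo resolves to the same cache directory, which is why files are shared across projects.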

jquintanilla4 commented 11 months ago

Your response makes sense. That's how I expected the cache system to work. But then how did I end up with a 20GB sdxl-turbo directory when the checkpoint itself is only ~6GB?

Am I missing something? Am I being super dense?

From your explanation, if I understand it correctly, it should only be ~6GB, unless there was an update to the checkpoint, after which I would have a copy in the cache for each corresponding version.

Thanks for the link. I've read that before I created the issue and commented on the other issue in diffusers. But I'm still confused as to why I have such an oversized directory for SDXL turbo and SDXL.

Honestly feeling a little dumb 😅. Thanks for your patience.

If you think my cache is working correctly, I can close this issue.

Wauplin commented 11 months ago

> Your response makes sense. That's how I expected the cache system to work. But then how did I end up with 20GB sized SDXL turbo directory when the checkpoint itself is only ~6GB?

I think the question is more related to diffusers here. If you look at your cache, you have much more than a single 6GB file. For example, ./unet/diffusion_pytorch_model.safetensors alone weighs 9.6GB. When you load the model, more than "just" the sd_xl_turbo_1.0_fp16.safetensors (6GB) file gets downloaded. It also looks like you have a copy of each weight file in both fp16.safetensors and plain .safetensors form. That can happen if you've instantiated the pipeline twice, once with fp16 precision and once without.
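A quick back-of-envelope in plain Python, using the blob sizes from the sdxl-turbo `tree` output earlier in the thread, shows the two precision variants together accounting for roughly the full folder size:

```python
# (full-precision GiB, fp16 GiB) per component, read from the tree output.
components = {
    "unet":           (9.6, 4.8),
    "text_encoder":   (0.469, 0.235),
    "text_encoder_2": (2.6, 1.3),
    "vae":            (0.0, 0.160),  # only the fp16 VAE appears in the tree
}

full = sum(f for f, _ in components.values())
fp16 = sum(h for _, h in components.values())
print(f"full: {full:.1f} GiB, fp16: {fp16:.1f} GiB, total: {full + fp16:.1f} GiB")
# full: 12.7 GiB, fp16: 6.5 GiB, total: 19.2 GiB
```

19.2 GiB lines up with the reported 20.5G if scan-cache prints decimal gigabytes (19.2 GiB ≈ 20.6 GB); either way, the fp16/full duplication, not repeated downloads, explains the size.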

jquintanilla4 commented 11 months ago

Oh I see. Wow, how did I miss that? 🤦🏽‍♂️ Makes sense. My oversight.

No issue or bug here.

Thanks for your help. I'll be closing the issue.

Wauplin commented 11 months ago

Great! Glad that's clarified :hugs: