vladmandic opened this issue 1 year ago
Hi @vladmandic, thanks for reporting. This is definitely not an expected behavior. It can happen that scan_cache takes a lot of time, but only when scanning repos with a lot of files, not repos with a few large files.
Would you mind trying to profile the execution? Here is a small script that should print some helpful information.
import cProfile, pstats
from huggingface_hub import scan_cache_dir

with cProfile.Profile() as profiler:
    scan_cache_dir(cache_dir="models/Diffusers")
stats = pstats.Stats(profiler).sort_stats("cumtime")
stats.print_stats()
Output should look like this:
50091 function calls (49765 primitive calls) in 0.023 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.023 0.023 /home/wauplin/projects/huggingface_hub/src/huggingface_hub/utils/_cache_manager.py:496(scan_cache_dir)
105 0.001 0.000 0.022 0.000 /home/wauplin/projects/huggingface_hub/src/huggingface_hub/utils/_cache_manager.py:617(_scan_cached_repo)
120 0.000 0.000 0.009 0.000 /usr/lib/python3.10/pathlib.py:1064(resolve)
120 0.000 0.000 0.007 0.000 /usr/lib/python3.10/posixpath.py:391(realpath)
239/120 0.002 0.000 0.006 0.000 /usr/lib/python3.10/posixpath.py:400(_joinrealpath)
297 0.000 0.000 0.004 0.000 /usr/lib/python3.10/pathlib.py:1023(glob)
297 0.000 0.000 0.003 0.000 /usr/lib/python3.10/pathlib.py:487(_select_from)
982 0.000 0.000 0.003 0.000 /usr/lib/python3.10/pathlib.py:1092(stat)
361 0.000 0.000 0.003 0.000 /usr/lib/python3.10/pathlib.py:569(_parse_args)
982 0.002 0.000 0.003 0.000 {built-in method posix.stat}
...
Can you copy-paste the output? Thanks in advance!
sure. i've added a print of the scan_cache_dir output so you have something to compare the profile with
{'name': 'DucHaiten/DucHaitenAIart', 'size': 5482653204, 'files': 17}
{'name': 'SG161222/Realistic_Vision_V3.0_VAE', 'size': 1589683, 'files': 9}
{'name': 'stabilityai/stable-diffusion-xl-base-0.9', 'size': 20815320185, 'files': 24}
{'name': 'kandinsky-community/kandinsky-2-2-prior', 'size': 10575374030, 'files': 14}
{'name': 'runwayml/stable-diffusion-v1-5', 'size': 5482653293, 'files': 17}
{'name': 'stabilityai/stable-diffusion-xl-refiner-0.9', 'size': 12153213446, 'files': 14}
{'name': 'kandinsky-community/kandinsky-2-2-decoder', 'size': 5283997489, 'files': 8}
{'name': 'kandinsky-community/kandinsky-2-1', 'size': 7451029477, 'files': 14}
{'name': 'kandinsky-community/kandinsky-2-1-prior', 'size': 5800286042, 'files': 14}
{'name': 'thu-ml/unidiffuser-v1', 'size': 5502692481, 'files': 24}
{'name': 'SG161222/Realistic_Vision_V1.4', 'size': 5483044374, 'files': 17}
60888 function calls (60537 primitive calls) in 4.017 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 4.018 4.018 /home/vlado/.local/lib/python3.10/site-packages/huggingface_hub/utils/_cache_manager.py:497(scan_cache_dir)
12 0.003 0.000 4.014 0.334 /home/vlado/.local/lib/python3.10/site-packages/huggingface_hub/utils/_cache_manager.py:618(_scan_cached_repo)
189 0.001 0.000 2.055 0.011 /usr/lib/python3.10/pathlib.py:1064(resolve)
189 0.000 0.000 1.823 0.010 /usr/lib/python3.10/posixpath.py:391(realpath)
358/189 0.008 0.000 1.820 0.010 /usr/lib/python3.10/posixpath.py:400(_joinrealpath)
2002 1.503 0.001 1.503 0.001 {built-in method posix.lstat}
916 0.001 0.000 1.296 0.001 /usr/lib/python3.10/pathlib.py:1092(stat)
916 1.292 0.001 1.295 0.001 {built-in method posix.stat}
296 0.000 0.000 0.866 0.003 /usr/lib/python3.10/pathlib.py:1023(glob)
296 0.001 0.000 0.842 0.003 /usr/lib/python3.10/pathlib.py:487(_select_from)
268/120 0.097 0.000 0.624 0.005 /usr/lib/python3.10/pathlib.py:468(_iterate_directories)
319 0.001 0.000 0.597 0.002 /usr/lib/python3.10/pathlib.py:1300(is_dir)
273 0.407 0.001 0.407 0.001 {method 'is_dir' of 'posix.DirEntry' objects}
169 0.298 0.002 0.298 0.002 {built-in method posix.readlink}
212 0.000 0.000 0.245 0.001 /usr/lib/python3.10/pathlib.py:1285(exists)
194 0.237 0.001 0.237 0.001 {built-in method posix.scandir}
370 0.097 0.000 0.216 0.001 /usr/lib/python3.10/pathlib.py:438(_select_from)
24 0.000 0.000 0.024 0.001 /usr/lib/python3.10/pathlib.py:1316(is_file)
23 0.000 0.000 0.023 0.001 /usr/lib/python3.10/pathlib.py:398(select_from)
36 0.000 0.000 0.021 0.001 /usr/lib/python3.10/pathlib.py:1013(iterdir)
12 0.021 0.002 0.021 0.002 {built-in method posix.listdir}
11 0.000 0.000 0.012 0.001 /usr/lib/python3.10/pathlib.py:1111(open)
Wow, thanks for the stats. 10ms for a single PosixPath.resolve() seems reaaaaally long, which explains why scanning the cache directory takes so long. I realized that you are running on WSL and apparently that can explain very poor filesystem access performance (see stackoverflow). Could that be the case? Would you be able to run this script in a pure Windows environment just to check that this is the root cause?
If it really is a limitation on WSL side, I'm not sure there is much we can do on our side to fix it :confused:
yes, wsl2 is slow when accessing external storage on an ntfs volume - typically half the performance, but this is just over the top. there is no reason why resolve on that many files should be that slow, so can't blame just wsl2.
so i took a closer look and pathlib in general is pretty slow. i tried pathlib, glob, os.walk and os.scandir - os.scandir is nearly double the performance, while glob pretty much hangs on wsl2.
for example, this is simple code using scandir that runs 2-3x faster:
import os

def scandir(folder, search):
    subfolders, files = [], []
    for f in os.scandir(folder):
        if f.is_dir():
            subfolders.append(f.path)
        elif f.is_file() and search in f.name:
            files.append(f.path)
    for dir in list(subfolders):
        f, sf = scandir(dir, search)
        subfolders.extend(sf)
        files.extend(f)
    return files, subfolders
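For reference, a self-contained version of the snippet above, exercised on a throwaway directory tree (the file and folder names below are invented for illustration):

```python
import os
import tempfile

def scandir(folder, search):
    # recursively collect files whose name contains `search`
    # (same logic as the snippet above)
    subfolders, files = [], []
    for f in os.scandir(folder):
        if f.is_dir():
            subfolders.append(f.path)
        elif f.is_file() and search in f.name:
            files.append(f.path)
    for d in list(subfolders):
        f, sf = scandir(d, search)
        subfolders.extend(sf)
        files.extend(f)
    return files, subfolders

# build a small throwaway tree to exercise it
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "a", "b"))
    open(os.path.join(root, "model.safetensors"), "w").close()
    open(os.path.join(root, "a", "b", "weights.safetensors"), "w").close()
    open(os.path.join(root, "a", "notes.txt"), "w").close()
    files, dirs = scandir(root, ".safetensors")
    print(len(files), len(dirs))  # 2 matching files, 2 subfolders visited
```

The speedup comes from os.DirEntry caching the is_dir()/is_file() results from the directory listing itself, so no extra stat() call is needed per entry.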
Thanks for testing this. While I understand that pathlib can be slower than os methods, I still don't see how it can explain this order of magnitude of difference. On my laptop (not the best one, not the worst), a pathlib.resolve takes ~75μs/call, which is very far from the ~11000μs/call you are experiencing. I am not against the idea of optimizing some parts of the logic if it makes sense, but it has to have a real benefit compared to the complexity introduced in the code (if I compare scandir to pathlib.glob in your snippet).
Here I don't think the problem lies in the difference between os.scandir and pathlib.glob, which in both cases should be really quick. IMO what we should compare instead is pathlib.resolve against os.path.realpath. These methods are key since they follow the symbolic links that are created between blobs and snapshots in the cache folder. Do you think you can give it a try?
using this code (files is the list of cached file paths)

import os
from time import time
from pathlib import Path

t0 = time()
resolved = [Path(f).resolve() for f in files]
print('pathlib resolve:', time() - t0)
t0 = time()
resolved = [os.path.realpath(f) for f in files]
print('os realpath:', time() - t0)
t0 = time()
resolved = [os.path.abspath(f) for f in files]
print('os abspath:', time() - t0)
on wsl2 accessing files on ntfs drive
pathlib resolve: 1.8370428562164307
os realpath: 1.261924648284912
os abspath: 0.0007224082946777344
on windows accessing same files:
pathlib resolve: 0.0560002326965332
os realpath: 0.04100513458251953
os abspath: 0.0009999275207519531
so yes, wsl2 has major issues with path resolving when accessing a foreign filesystem. but os.path is still 50% faster than pathlib.
bigger issue is why the huggingface model format relies on symlinks instead of a config file. not all filesystems even natively support symlinks, this is a really bad design choice. for example, a completely normal use-case for larger deployments is to use object storage mounted as a virtual drive. but that's not something we can solve here.
let's go back to the primary use case - the primary use case for hf.scan_cache is to get a list of models in cache. actually checking them, resolving, verifying, what-not - all that should be optional. to return a list of models, none of those operations should be needed.
@vladmandic a few different things to answer here:
so yes, wsl2 has major issues with path resolving when accessing a foreign filesystem. but os.path is still 50% faster than pathlib.
Agree with you that pathlib is indeed slower than os.path by a certain margin (though it's more like 34% faster on a "normal" setup - 50% on wsl2). Would you like to make a PR to use os.path instead of pathlib in scan_cache_dir? I would still return Path objects instead of str in the returned values, at least for backward compatibility.
bigger issue is why huggingface model format relies on symlinks instead of config file. not all filesystems even natively support symlinks, this is really bad design choice.
It's definitely a design choice to use symlinks over other alternatives. But qualifying it as a bad design choice really depends on what you want to achieve. The thing is, using symlinks is by far easier and more user-friendly than using a config file. For systems that do not support symlinks (e.g. Windows without dev mode/admin) we still support downloading and caching models, although it has its limitations. Here is a comment where I tried to summarize why we chose this design, and we will not change that.
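To make the trade-off concrete, here is a toy sketch of the blobs/snapshots symlink layout (not the actual huggingface_hub code; directory names and hashes are invented, and it needs a symlink-capable filesystem). It shows why computing per-file sizes requires following the links, which is the expensive resolve() work discussed above:

```python
import os
import tempfile

# toy reproduction of the cache layout: real content lives in blobs/,
# snapshots/<revision>/ only holds symlinks pointing at the blobs
with tempfile.TemporaryDirectory() as cache:
    blobs = os.path.join(cache, "blobs")
    snapshot = os.path.join(cache, "snapshots", "abc123")
    os.makedirs(blobs)
    os.makedirs(snapshot)
    blob = os.path.join(blobs, "0a1b2c")
    with open(blob, "wb") as f:
        f.write(b"x" * 1024)  # stand-in for a large weight file
    link = os.path.join(snapshot, "model.bin")
    os.symlink(blob, link)
    # the real file size is only reachable by following the link to the blob,
    # which is the per-file resolution work a cache scan has to do
    size = os.stat(link).st_size   # follows the symlink
    is_link = os.path.islink(link)
    print(size, is_link)  # 1024 True
```

The upside of this layout is deduplication: several snapshots can link to the same blob without storing the content twice.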
primary use case for hf.scan_cache is to get a list of models in cache. actually checking them, resolving, verifying, what-not - all that should be optional. to return a list of models, none of those operations should be needed.
The primary use case for hf.scan_cache is to get a complete scan of your HF cache, nothing less. It has been primarily built to help users clean their cache but nothing more. Otherwise, returning a list of cached models is a quite hard task. huggingface_hub is a tool to download files from the Hub, but it doesn't have any abstraction of what a "model" is. If you download only a README file for a model, would you list it as a "downloaded model"? If a third-party library wants to list the cached models, it would need to define a rule about what a cached model is. In any case, that's not really something huggingface_hub is meant to do.
Would you like to make a PR to use os.path instead of pathlib in scan_cache_dir
i'd love to, but unfortunately no time right now - about to jump on a plane and will be ooo for a while.
But qualifying it as a bad design choice really depends on what you want to achieve
it's not about what i'm trying to achieve, it's a simple fact that plenty of filesystems do not have symlink capabilities.
for example, most external usb drives are pre-formatted as exFAT and exFAT does not have symlinks at all. or how about network mounts? most don't support symlinks (smb doesn't for sure, nfs does), but even if it works it's a definite way to cause issues as different clients may resolve it differently. or any cloud mounted storage? behavior is undefined as aws/google/azure do not guarantee that symlinks will work when using a cloud filesystem.
this basically creates a massive limitation on what kind of storage can be used for cache_dir, and any org will want to use a shared cache_dir so not every client needs to have its own local copy of models. so the impact is not tiny, it basically prevents usage in larger environments where a shared storage volume is a necessity.
i understand that this topic is beyond the scope of this issue, but i would advise to raise it for discussion inside hf.
The primary use case for hf.scan_cache is to get a complete scan of your HF cache, nothing less
i understand that, but think of it as a feature request then - a param list_only that when set to true only returns the list of models. if there is no such functionality, i will have to abandon hf.scan_cache and re-implement it separately.
again, going back to larger orgs - yes, i do want to run a full scan with consistency checks from the main server, but all other clients just want to get the list of models, nothing else.
fyi, i've implemented my own "quick model list" and completely removed any dependency on hf.scan_cache.
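As an illustration of what such a "quick model list" could look like, here is a minimal sketch. It assumes the standard "models--&lt;org&gt;--&lt;name&gt;" naming of cache folders; quick_model_list is a hypothetical helper, not a huggingface_hub API:

```python
import os
import tempfile

def quick_model_list(cache_dir):
    # hypothetical helper: list cached repos from directory names only,
    # so no stat()/resolve() ever touches the large blob files.
    # assumes the standard "models--<org>--<name>" cache folder naming.
    models = []
    for entry in os.scandir(cache_dir):
        if entry.is_dir() and entry.name.startswith("models--"):
            models.append(entry.name[len("models--"):].replace("--", "/"))
    return models

# demo on a fake cache directory with invented repo names
with tempfile.TemporaryDirectory() as cache:
    os.makedirs(os.path.join(cache, "models--runwayml--stable-diffusion-v1-5"))
    os.makedirs(os.path.join(cache, "models--thu-ml--unidiffuser-v1"))
    names = sorted(quick_model_list(cache))
    print(names)  # ['runwayml/stable-diffusion-v1-5', 'thu-ml/unidiffuser-v1']
```

Since this never follows symlinks or stats blob files, its cost depends only on the number of cached repos, not their size - which matches the "list only" behavior requested above.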
this issue has been open with zero progress for more than a month. although i have a workaround for the slow scan-cache, that is just a workaround. and the bigger concern is heavy usage of symlinks in general - that is actually blocking usage of hf in many environments.
this issue has been open with zero progress for more than a month.
Yes indeed and I'm not sure it will evolve. To be honest I don't know what we should do here between:
- scan_cache was primarily designed to check where to save disk space in the cache, which requires more OS calls.
- use os.path instead of pathlib.Path (as per https://github.com/huggingface/huggingface_hub/issues/1564#issuecomment-1646088533) => I am not willing to invest time in this without a clear advantage. I expect most users to only scan the cache once in a while.

I am open to suggestions if you think there is a low hanging fruit we can work on to make your life easier but we will most probably not start to make big changes to the current system. Hoping you understand the decision.
I am open to suggestions if you think there is a low hanging fruit we can work on to make your life easier but we will most probably not start to make big changes to the current system. Hoping you understand the decision.
this is not about making my life easier as i've implemented my own list_cache instead of using the hf scan_cache method. but would i consider that a valuable addition to hf_hub code? absolutely. is it a low-hanging fruit? absolutely.
regarding usage of symlinks, i understand that is a much bigger conversation and i don't expect any changes to happen overnight. but the current system is fundamentally flawed and perhaps needs a rethink for the future of diffusers. problems are not going away just because "it is what it is". (and yes, any possible future changes should be done with backward compatibility in mind, but that is really not that much of a problem)
one more example that several of my users experienced - they wanted to MOVE the hf cache dir since it was filling up their home folder. but because of symlinks they ended up with a 10x bigger hf cache on the target since all symlinks were resolved to actual files. the net result if you want to move the hf cache dir? delete it and download everything from scratch.
as far as sdnext is concerned, i'm recommending safetensors instead of hf model format for any and all model types that support it. only use hf model format when there is absolutely no alternative.
this is not about making my life easier as i've implemented my own list_cache instead of using the hf scan_cache method. but would i consider that a valuable addition to hf_hub code? absolutely. is it a low-hanging fruit? absolutely.
If I understand correctly, the main feature request of this issue is now: "Can we have a helper to scan the cache, returning for each model a list of its snapshots and for each snapshot a list of the downloaded files, but without resolving the symlinks for performance reasons (meaning only the file presence is returned, no information about size/lastaccessed/...)". Am I right? The conversation started to deviate quite a lot so I'm trying to understand what would be the good addition to add here.
one more example that several of my users experienced - they wanted to MOVE the hf cache dir since it was filling up their home folder. but because of symlinks they ended up with a 10x bigger hf cache on the target since all symlinks were resolved to actual files. the net result if you want to move the hf cache dir? delete it and download everything from scratch.
If I understand correctly, a simple CLI command to move the cache from one path to another would be great for some users? Do you know why the symlinks were resolved into actual files instead of copying the refs only? And do you know their platform and what they used to transfer them?
If I understand correctly, the main feature request of this issue is now: "Can we have a helper to scan the cache, returning for each model a list of its snapshots and for each snapshot a list of the downloaded files, but without resolving the symlinks for performance reasons (meaning only the file presence is returned, no information about size/lastaccessed/...)". Am I right? The conversation started to deviate quite a lot so I'm trying to understand what would be the good addition to add here.
fair question. and yes, that is correct.
If I understand correctly, a simple CLI command to move the cache from one path to another would be great for some users?
correct.
Do you know why the symlinks were resolved in actual files instead of copying the ref only? And do you know their platform and what they used to transfer them?
windows explorer simple copy will fail to copy diffusers folder to a different drive as symlinks on windows only work within a drive and cannot be moved.
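On symlink-capable filesystems, a copy that preserves the links instead of materializing them can be sketched with the standard library (illustrative paths only, not a real cache-moving tool):

```python
import os
import shutil
import tempfile

# demo: shutil.copytree(symlinks=True) copies links as links instead of
# duplicating blob contents; needs a symlink-capable target filesystem
with tempfile.TemporaryDirectory() as root:
    src = os.path.join(root, "cache")
    os.makedirs(os.path.join(src, "blobs"))
    blob = os.path.join(src, "blobs", "deadbeef")
    with open(blob, "w") as f:
        f.write("weights")
    os.symlink(blob, os.path.join(src, "model.bin"))
    dst = os.path.join(root, "moved")
    shutil.copytree(src, dst, symlinks=True)
    copied_is_link = os.path.islink(os.path.join(dst, "model.bin"))
    print(copied_is_link)  # True: the link was copied, not resolved
```

Note that an absolute symlink still points at its original target after the copy, so a real move tool would also need to rewrite link targets (or rely on relative links within the cache).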
Created 2 new issues to be addressed separately:
windows explorer simple copy will fail to copy diffusers folder to a different drive as symlinks on windows only work within a drive and cannot be moved.
The simplest option here will be to only move the blob files and let hf_hub_download recreate the symlinks at runtime if needed. I added a comment about that to the issue.
i'm ok with that.
note that the biggest use case for move is users wanting to move the entire ~/.cache/huggingface to a different drive.
note that the biggest use case for move is users wanting to move the entire ~/.cache/huggingface to a different drive.
Yes, that's also what I have in mind!
Describe the bug
when using either hf.scan_cache via python or simply huggingface-cli scan-cache to enumerate already downloaded models, it works fine regardless of the number of models if the models are small. but once models become larger, it starts slowing down - to about 0.6sec for each 10gb of models. so having 100gb of models in cache_dir (which is not that much nowadays) results in 6 seconds to do anything.
given i need to enumerate available models to show which ones are available in the application ui (sd.next in my case), this is pretty bad.
Reproduction
download 10 large models and run scan-cache
Logs
System info