vladmandic opened this issue 1 year ago
Hi @vladmandic, thanks for reporting. This is definitely not an expected behavior. It can happen that scan_cache takes a lot of time, but only when scanning repos with a lot of files, not repos with a few large files.
Would you mind trying to profile the execution? Here is a small script that should print some helpful information.
import cProfile, pstats
from huggingface_hub import scan_cache_dir

with cProfile.Profile() as profiler:
    scan_cache_dir(cache_dir="models/Diffusers")
stats = pstats.Stats(profiler).sort_stats("cumtime")
stats.print_stats()
Output should look like this:
50091 function calls (49765 primitive calls) in 0.023 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.023 0.023 /home/wauplin/projects/huggingface_hub/src/huggingface_hub/utils/_cache_manager.py:496(scan_cache_dir)
105 0.001 0.000 0.022 0.000 /home/wauplin/projects/huggingface_hub/src/huggingface_hub/utils/_cache_manager.py:617(_scan_cached_repo)
120 0.000 0.000 0.009 0.000 /usr/lib/python3.10/pathlib.py:1064(resolve)
120 0.000 0.000 0.007 0.000 /usr/lib/python3.10/posixpath.py:391(realpath)
239/120 0.002 0.000 0.006 0.000 /usr/lib/python3.10/posixpath.py:400(_joinrealpath)
297 0.000 0.000 0.004 0.000 /usr/lib/python3.10/pathlib.py:1023(glob)
297 0.000 0.000 0.003 0.000 /usr/lib/python3.10/pathlib.py:487(_select_from)
982 0.000 0.000 0.003 0.000 /usr/lib/python3.10/pathlib.py:1092(stat)
361 0.000 0.000 0.003 0.000 /usr/lib/python3.10/pathlib.py:569(_parse_args)
982 0.002 0.000 0.003 0.000 {built-in method posix.stat}
...
Can you copy-paste the output? Thanks in advance!
sure. i've added a print of the scan_cache_dir output so you have something to compare the profile with
{'name': 'DucHaiten/DucHaitenAIart', 'size': 5482653204, 'files': 17}
{'name': 'SG161222/Realistic_Vision_V3.0_VAE', 'size': 1589683, 'files': 9}
{'name': 'stabilityai/stable-diffusion-xl-base-0.9', 'size': 20815320185, 'files': 24}
{'name': 'kandinsky-community/kandinsky-2-2-prior', 'size': 10575374030, 'files': 14}
{'name': 'runwayml/stable-diffusion-v1-5', 'size': 5482653293, 'files': 17}
{'name': 'stabilityai/stable-diffusion-xl-refiner-0.9', 'size': 12153213446, 'files': 14}
{'name': 'kandinsky-community/kandinsky-2-2-decoder', 'size': 5283997489, 'files': 8}
{'name': 'kandinsky-community/kandinsky-2-1', 'size': 7451029477, 'files': 14}
{'name': 'kandinsky-community/kandinsky-2-1-prior', 'size': 5800286042, 'files': 14}
{'name': 'thu-ml/unidiffuser-v1', 'size': 5502692481, 'files': 24}
{'name': 'SG161222/Realistic_Vision_V1.4', 'size': 5483044374, 'files': 17}
60888 function calls (60537 primitive calls) in 4.017 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 4.018 4.018 /home/vlado/.local/lib/python3.10/site-packages/huggingface_hub/utils/_cache_manager.py:497(scan_cache_dir)
12 0.003 0.000 4.014 0.334 /home/vlado/.local/lib/python3.10/site-packages/huggingface_hub/utils/_cache_manager.py:618(_scan_cached_repo)
189 0.001 0.000 2.055 0.011 /usr/lib/python3.10/pathlib.py:1064(resolve)
189 0.000 0.000 1.823 0.010 /usr/lib/python3.10/posixpath.py:391(realpath)
358/189 0.008 0.000 1.820 0.010 /usr/lib/python3.10/posixpath.py:400(_joinrealpath)
2002 1.503 0.001 1.503 0.001 {built-in method posix.lstat}
916 0.001 0.000 1.296 0.001 /usr/lib/python3.10/pathlib.py:1092(stat)
916 1.292 0.001 1.295 0.001 {built-in method posix.stat}
296 0.000 0.000 0.866 0.003 /usr/lib/python3.10/pathlib.py:1023(glob)
296 0.001 0.000 0.842 0.003 /usr/lib/python3.10/pathlib.py:487(_select_from)
268/120 0.097 0.000 0.624 0.005 /usr/lib/python3.10/pathlib.py:468(_iterate_directories)
319 0.001 0.000 0.597 0.002 /usr/lib/python3.10/pathlib.py:1300(is_dir)
273 0.407 0.001 0.407 0.001 {method 'is_dir' of 'posix.DirEntry' objects}
169 0.298 0.002 0.298 0.002 {built-in method posix.readlink}
212 0.000 0.000 0.245 0.001 /usr/lib/python3.10/pathlib.py:1285(exists)
194 0.237 0.001 0.237 0.001 {built-in method posix.scandir}
370 0.097 0.000 0.216 0.001 /usr/lib/python3.10/pathlib.py:438(_select_from)
24 0.000 0.000 0.024 0.001 /usr/lib/python3.10/pathlib.py:1316(is_file)
23 0.000 0.000 0.023 0.001 /usr/lib/python3.10/pathlib.py:398(select_from)
36 0.000 0.000 0.021 0.001 /usr/lib/python3.10/pathlib.py:1013(iterdir)
12 0.021 0.002 0.021 0.002 {built-in method posix.listdir}
11 0.000 0.000 0.012 0.001 /usr/lib/python3.10/pathlib.py:1111(open)
Wow, thanks for the stats. 10ms for a single PosixPath.resolve() seems reaaaaally long, which explains why scanning the cache directory takes so long. I realized that you are running on WSL and apparently that can explain very poor filesystem access performance (see stackoverflow). Could that be the case? Would you be able to run this script in a pure Windows environment just to check that this is the root cause?
If it really is a limitation on WSL side, I'm not sure there is much we can do on our side to fix it :confused:
yes, wsl2 is slow when accessing external storage on an ntfs volume - typically half the performance, but this is just over the top. there is no reason why resolve on that many files should be that slow, so can't blame just wsl2.
so i took a closer look and pathlib in general is pretty slow. i tried pathlib, glob, os.walk and os.scandir - os.scandir is nearly double the performance, while glob pretty much hangs on wsl2.
for example, this is simple code using scandir that runs 2-3x faster:
import os

def scandir(folder, search):
    subfolders, files = [], []
    for f in os.scandir(folder):
        if f.is_dir():
            subfolders.append(f.path)
        elif f.is_file() and search in f.name:
            files.append(f.path)
    for dir in list(subfolders):
        f, sf = scandir(dir, search)
        subfolders.extend(sf)
        files.extend(f)
    return files, subfolders
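For reference, a self-contained version of the snippet above, exercised on a throwaway directory tree (the file and folder names below are invented for illustration):

```python
import os
import tempfile

def scandir(folder, search):
    # recursively collect files whose name contains `search`
    # (same logic as the snippet above)
    subfolders, files = [], []
    for f in os.scandir(folder):
        if f.is_dir():
            subfolders.append(f.path)
        elif f.is_file() and search in f.name:
            files.append(f.path)
    for d in list(subfolders):
        f, sf = scandir(d, search)
        subfolders.extend(sf)
        files.extend(f)
    return files, subfolders

# build a small throwaway tree to exercise it
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "a", "b"))
    open(os.path.join(root, "model.safetensors"), "w").close()
    open(os.path.join(root, "a", "b", "weights.safetensors"), "w").close()
    open(os.path.join(root, "a", "notes.txt"), "w").close()
    files, dirs = scandir(root, ".safetensors")
    print(len(files), len(dirs))  # 2 matching files, 2 subfolders visited
```

The speedup comes from os.DirEntry caching the is_dir()/is_file() results from the directory listing itself, so no extra stat() call is needed per entry.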
Thanks for testing this. While I understand that pathlib can be slower than os methods, I still don't see how it can explain this order of magnitude of difference. On my laptop (not the best one, not the worst), a pathlib.resolve takes ~75μs/call, which is very far from the ~11000μs/call you are experiencing. I am not against the idea of optimizing some parts of the logic if it makes sense, but it has to have a real benefit compared to the complexity introduced in the code (if I compare scandir to pathlib.glob in your snippet).
Here I don't think the problem lies in the difference between os.scandir and pathlib.glob, which in both cases should be really quick. IMO what we should compare instead is pathlib.resolve against os.path.realpath. These methods are key since they follow the symbolic links that are created between blobs and snapshots in the cache folder. Do you think you can give it a try?
using this code (files is the list of cached file paths)

import os
from time import time
from pathlib import Path

t0 = time()
resolved = [Path(f).resolve() for f in files]
print('pathlib resolve:', time() - t0)
t0 = time()
resolved = [os.path.realpath(f) for f in files]
print('os realpath:', time() - t0)
t0 = time()
resolved = [os.path.abspath(f) for f in files]
print('os abspath:', time() - t0)
on wsl2 accessing files on ntfs drive
pathlib resolve: 1.8370428562164307
os realpath: 1.261924648284912
os abspath: 0.0007224082946777344
on windows accessing same files:
pathlib resolve: 0.0560002326965332
os realpath: 0.04100513458251953
os abspath: 0.0009999275207519531
so yes, wsl2 has major issues with path resolving when accessing a foreign filesystem. but os.path is still 50% faster than pathlib.
bigger issue is why the huggingface model format relies on symlinks instead of a config file. not all filesystems even natively support symlinks, this is a really bad design choice. for example, a completely normal use-case for larger deployments is to use object storage mounted as a virtual drive. but that's not something we can solve here.
let's go back to the primary use case - the primary use case for hf.scan_cache is to get a list of models in cache. actually checking them, resolving, verifying, what-not - all that should be optional. to return a list of models, none of those operations should be needed.
@vladmandic a few different things to answer here:
so yes, wsl2 has major issues with path resolving when accessing a foreign filesystem. but os.path is still 50% faster than pathlib.
Agree with you that pathlib is indeed slower than os.path by a certain margin (though it's more like 34% faster on a "normal" setup - 50% on wsl2). Would you like to make a PR to use os.path instead of pathlib in scan_cache_dir? I would still return Path objects instead of str in the returned values, at least for backward compatibility.
bigger issue is why huggingface model format relies on symlinks instead of config file. not all filesystems even natively support symlinks, this is really bad design choice.
It's definitely a design choice to use symlinks over other alternatives. But qualifying it as a bad design choice really depends on what you want to achieve. The thing is, using symlinks is by far easier and more user-friendly than using a config file. For systems that do not support symlinks (e.g. Windows without dev mode/admin) we still support downloading and caching models, although it has its limitations. Here is a comment where I tried to summarize why we chose this design, and we will not change that.
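To make the trade-off concrete, here is a toy sketch of the blobs/snapshots symlink layout (not the actual huggingface_hub code; directory names and hashes are invented, and it needs a symlink-capable filesystem). It shows why computing per-file sizes requires following the links, which is the expensive resolve() work discussed above:

```python
import os
import tempfile

# toy reproduction of the cache layout: real content lives in blobs/,
# snapshots/<revision>/ only holds symlinks pointing at the blobs
with tempfile.TemporaryDirectory() as cache:
    blobs = os.path.join(cache, "blobs")
    snapshot = os.path.join(cache, "snapshots", "abc123")
    os.makedirs(blobs)
    os.makedirs(snapshot)
    blob = os.path.join(blobs, "0a1b2c")
    with open(blob, "wb") as f:
        f.write(b"x" * 1024)  # stand-in for a large weight file
    link = os.path.join(snapshot, "model.bin")
    os.symlink(blob, link)
    # the real file size is only reachable by following the link to the blob,
    # which is the per-file resolution work a cache scan has to do
    size = os.stat(link).st_size   # follows the symlink
    is_link = os.path.islink(link)
    print(size, is_link)  # 1024 True
```

The upside of this layout is deduplication: several snapshots can link to the same blob without storing the content twice.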
primary use case for hf.scan_cache is to get a list of models in cache. actually checking them, resolving, verifying, what-not - all that should be optional. to return a list of models, none of those operations should be needed.
The primary use case for hf.scan_cache is to get a complete scan of your HF cache, nothing less. It has been primarily built to help users clean their cache but nothing more. Otherwise, returning a list of cached models is a quite hard task. huggingface_hub is a tool to download files from the Hub, but it doesn't have any abstraction of what a "model" is. If you download only a README file for a model, would you list it as a "downloaded model"? If a third-party library wants to list the cached models, it would need to define a rule about what a cached model is. In any case, that's not really something huggingface_hub is meant to do.
Would you like to make a PR to use os.path instead of pathlib in scan_cache_dir
i'd love to, but unfortunately no time right now - about to jump on a plane and will be ooo for a while.
But qualifying it as a bad design choice really depends on what you want to achieve
it's not about what i'm trying to achieve, it's a simple fact that plenty of filesystems do not have symlink capabilities.
for example, most external usb drives are pre-formatted as exFAT and exFAT does not have symlinks at all. or how about network mounts? most don't support symlinks (smb doesn't for sure, nfs does), but even if it works it's a definite way to cause issues as different clients may resolve it differently. or any cloud mounted storage? behavior is undefined as aws/google/azure do not guarantee that symlinks will work when using a cloud filesystem.
this basically creates a massive limitation on what kind of storage can be used for cache_dir, and any org will want to use a shared cache_dir so not every client needs to have its own local copy of models. so the impact is not tiny, it basically prevents usage in larger environments where a shared storage volume is a necessity.
i understand that this topic is beyond the scope of this issue, but i would advise to raise it for discussion inside hf.
The primary use case for hf.scan_cache is to get a complete scan of your HF cache, nothing less
i understand that, but think of it as a feature request then - a param list_only that when set to true only returns the list of models. if there is no such functionality, i will have to abandon hf.scan_cache and re-implement it separately.
again, going back to larger orgs - yes, i do want to run a full scan with consistency checks from the main server, but all other clients just want to get the list of models, nothing else.
fyi, i've implemented my own "quick model list" and completely removed any dependency on hf.scan_cache.
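As an illustration of what such a "quick model list" could look like, here is a minimal sketch. It assumes the standard "models--&lt;org&gt;--&lt;name&gt;" naming of cache folders; quick_model_list is a hypothetical helper, not a huggingface_hub API:

```python
import os
import tempfile

def quick_model_list(cache_dir):
    # hypothetical helper: list cached repos from directory names only,
    # so no stat()/resolve() ever touches the large blob files.
    # assumes the standard "models--<org>--<name>" cache folder naming.
    models = []
    for entry in os.scandir(cache_dir):
        if entry.is_dir() and entry.name.startswith("models--"):
            models.append(entry.name[len("models--"):].replace("--", "/"))
    return models

# demo on a fake cache directory with invented repo names
with tempfile.TemporaryDirectory() as cache:
    os.makedirs(os.path.join(cache, "models--runwayml--stable-diffusion-v1-5"))
    os.makedirs(os.path.join(cache, "models--thu-ml--unidiffuser-v1"))
    names = sorted(quick_model_list(cache))
    print(names)  # ['runwayml/stable-diffusion-v1-5', 'thu-ml/unidiffuser-v1']
```

Since this never follows symlinks or stats blob files, its cost depends only on the number of cached repos, not their size - which matches the "list only" behavior requested above.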
this issue has been open with zero progress for more than a month. although i have a workaround for the slow scan-cache, that is just a workaround. and the bigger concern is heavy usage of symlinks in general - that is actually blocking usage of hf in many environments.
this issue has been open with zero progress for more than a month.
Yes indeed and I'm not sure it will evolve. To be honest I don't know what we should do here between:
- scan_cache was primarily designed to check where to save disk space in the cache, which requires more OS calls.
- use os.path instead of pathlib.Path (as per https://github.com/huggingface/huggingface_hub/issues/1564#issuecomment-1646088533) => I am not willing to invest time in this without a clear advantage. I expect most users to only scan the cache once in a while.

I am open to suggestions if you think there is a low hanging fruit we can work on to make your life easier but we will most probably not start to make big changes to the current system. Hoping you understand the decision.
I am open to suggestions if you think there is a low hanging fruit we can work on to make your life easier but we will most probably not start to make big changes to the current system. Hoping you understand the decision.
this is not about making my life easier as i've implemented my own list_cache instead of using the hf scan_cache method. but would i consider that a valuable addition to hf_hub code? absolutely. is it a low-hanging fruit? absolutely.
regarding usage of symlinks, i understand that is a much bigger conversation and i don't expect any changes to happen overnight. but the current system is fundamentally flawed and perhaps needs a rethink for the future of diffusers. problems are not going away just because "it is what it is". (and yes, any possible future changes should be done with backward compatibility in mind, but that is really not that much of a problem)
one more example that several of my users experienced - they wanted to MOVE the hf cache dir since it was filling up their home folder. but because of symlinks they ended up with a 10x bigger hf cache on the target since all symlinks were resolved to actual files. the net result if you want to move the hf cache dir? delete it and download everything from scratch.
as far as sdnext is concerned, i'm recommending safetensors instead of hf model format for any and all model types that support it. only use hf model format when there is absolutely no alternative.
this is not about making my life easier as i've implemented my own list_cache instead of using the hf scan_cache method. but would i consider that a valuable addition to hf_hub code? absolutely. is it a low-hanging fruit? absolutely.
If I understand correctly, the main feature request of this issue is now: "Can we have a helper to scan the cache, returning for each model a list of its snapshots and for each snapshot a list of the downloaded files, but without resolving the symlinks for performance reasons (meaning only the file presence is returned, no information about size/lastaccessed/...)". Am I right? The conversation started to deviate quite a lot so I'm trying to understand what would be the good addition to add here.
one more example that several of my users experienced - they wanted to MOVE the hf cache dir since it was filling up their home folder. but because of symlinks they ended up with a 10x bigger hf cache on the target since all symlinks were resolved to actual files. the net result if you want to move the hf cache dir? delete it and download everything from scratch.
If I understand correctly, a simple CLI command to move the cache from one path to another would be great for some users? Do you know why the symlinks were resolved into actual files instead of copying the refs only? And do you know their platform and what they used to transfer them?
If I understand correctly, the main feature request of this issue is now: "Can we have a helper to scan the cache, returning for each model a list of its snapshots and for each snapshot a list of the downloaded files, but without resolving the symlinks for performance reasons (meaning only the file presence is returned, no information about size/lastaccessed/...)". Am I right? The conversation started to deviate quite a lot so I'm trying to understand what would be the good addition to add here.
fair question. and yes, that is correct.
If I understand correctly, a simple CLI command to move the cache from one path to another would be great for some users?
correct.
Do you know why the symlinks were resolved in actual files instead of copying the ref only? And do you know their platform and what they used to transfer them?
windows explorer simple copy will fail to copy diffusers folder to a different drive as symlinks on windows only work within a drive and cannot be moved.
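On symlink-capable filesystems, a copy that preserves the links instead of materializing them can be sketched with the standard library (illustrative paths only, not a real cache-moving tool):

```python
import os
import shutil
import tempfile

# demo: shutil.copytree(symlinks=True) copies links as links instead of
# duplicating blob contents; needs a symlink-capable target filesystem
with tempfile.TemporaryDirectory() as root:
    src = os.path.join(root, "cache")
    os.makedirs(os.path.join(src, "blobs"))
    blob = os.path.join(src, "blobs", "deadbeef")
    with open(blob, "w") as f:
        f.write("weights")
    os.symlink(blob, os.path.join(src, "model.bin"))
    dst = os.path.join(root, "moved")
    shutil.copytree(src, dst, symlinks=True)
    copied_is_link = os.path.islink(os.path.join(dst, "model.bin"))
    print(copied_is_link)  # True: the link was copied, not resolved
```

Note that an absolute symlink still points at its original target after the copy, so a real move tool would also need to rewrite link targets (or rely on relative links within the cache).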
Created 2 new issues to be addressed separately:
windows explorer simple copy will fail to copy diffusers folder to a different drive as symlinks on windows only work within a drive and cannot be moved.
The simplest option here will be to only move the blob files and let hf_hub_download recreate the symlinks at runtime if needed. I added a comment about that to the issue.
i'm ok with that.
note that the biggest use case for move is users wanting to move the entire ~/.cache/huggingface to a different drive.
note that the biggest use case for move is users wanting to move the entire ~/.cache/huggingface to a different drive.
Yes, that's also what I have in mind!
Describe the bug
when using either hf.scan_cache via python or simply huggingface-cli scan-cache to enumerate already downloaded models, it works fine regardless of the number of models if the models are small. but once models become larger, it starts slowing down - to about 0.6sec for each 10gb of models. so having 100gb of models in cache_dir (which is not that much nowadays) results in 6 seconds to do anything.
given i need to enumerate available models to show which ones are available in the application ui (sd.next in my case), this is pretty bad.
Reproduction
download 10 large models and run scan-cache
Logs
System info