Open sealad886 opened 6 months ago
The offending code can be found here, where the default cache location is sourced from the environment variable `HF_HUB_CACHE`: https://github.com/huggingface/huggingface_hub/blame/ebba9ef2c338149783978b489ec142ab122af42a/src/huggingface_hub/utils/_cache_manager.py#L500

I say 'offending code', but that's just the original commit of that code. It was how it was designed at the time, I suppose, but I imagine a shared blob download location was adopted later to allow datasets with shared files to deduplicate them? I'm guessing...
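For anyone following along, the resolution works roughly like this (a simplified sketch of the env-var fallback chain, not the library's exact code):

```python
# Simplified sketch of how huggingface_hub picks its default cache directory.
# The real constants module has more indirection (e.g. XDG_CACHE_HOME handling).
import os

HF_HOME = os.path.expanduser(
    os.getenv("HF_HOME", os.path.join("~", ".cache", "huggingface"))
)
# HF_HUB_CACHE wins if set; otherwise fall back to $HF_HOME/hub.
HF_HUB_CACHE = os.getenv("HF_HUB_CACHE", os.path.join(HF_HOME, "hub"))

print(HF_HUB_CACHE)  # e.g. /Users/me/.cache/huggingface/hub
```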
Thanks for pointing that out @sealad886! The `datasets` library is indeed managing its own cache and therefore not using the `huggingface_hub` cache. This problem has already been reported in our ecosystem, but fixing it is not as straightforward as it seems, namely because `datasets` works with other providers as well. I will keep this issue open as long as the `datasets` <> `huggingface_hub` integration is not consistent. Stay tuned :wink:
I've recently noticed that I'm unable to use `huggingface-cli scan-cache` to view datasets in my cache folder - see this Colab notebook for an example.

What seems to be happening is the following:

- `HF_DATASETS_CACHE` points to `~/.cache/huggingface/datasets`
- `HF_HUB_CACHE` points to `~/.cache/huggingface/hub`
- Even when `HF_DATASETS_CACHE` is pointed at `HF_HUB_CACHE`, the cache scan fails because datasets are downloaded with a triple underscore `___` between org / dataset name, while `hfh` looks for folder names like `datasets--{org}--{dataset_name}`. See this line.

Is there a simple workaround in the setting of the env vars so that one can use `huggingface-cli scan-cache` and `huggingface-cli delete-cache` for both models and datasets?
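For what it's worth, the same mismatch shows up programmatically (a sketch using `scan_cache_dir`, the function behind the CLI command):

```python
# Sketch: scan_cache_dir() is what `huggingface-cli scan-cache` calls internally.
import os

from huggingface_hub import scan_cache_dir

# Point the scan at the datasets cache to illustrate the mismatch:
info = scan_cache_dir(os.path.expanduser("~/.cache/huggingface/datasets"))

for repo in info.repos:
    print(repo.repo_type, repo.repo_id)

# Folders named like `wikimedia___wikisource` don't match the expected
# `datasets--{org}--{dataset_name}` pattern, so they surface as warnings
# instead of being listed as repos:
for warning in info.warnings:
    print(warning)
```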
Hi @lewtun, thanks for the feedback. This is something specific to `datasets` internals that is getting fixed in https://github.com/huggingface/datasets/pull/7105 by @lhoestq and @albertvillanova. Once released, all data will be downloaded to `~/.cache/huggingface/hub` by default. The `~/.cache/huggingface/datasets` cache will still be used, but only for unzipping content / generating arrow files / etc. All files downloaded from the Hub will then be eligible for `scan-cache` and `delete-cache` :)
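Once that lands, a flow like this should work end to end (a sketch of the expected behavior, not a promise about the final API):

```python
# Expected flow once `datasets` downloads through `huggingface_hub` (sketch):
from datasets import load_dataset
from huggingface_hub import scan_cache_dir

ds = load_dataset("wikimedia/wikisource")  # this dataset may require a config name

# Raw downloaded files land in HF_HUB_CACHE, so the scan can now find them:
for repo in scan_cache_dir().repos:
    if repo.repo_type == "dataset":
        print(repo.repo_id, repo.size_on_disk_str)
```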
Thanks for the pointer @Wauplin! If I'm not mistaken, there's still an issue with the cache directory naming in `datasets` from this line, which replaces `/` with `___`, while `huggingface-cli scan-cache` looks for folders with this format: `datasets--{org}--{dataset_name}`
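Concretely, the two layouts disagree like this (a toy illustration of the naming mismatch, not actual code from either library):

```python
repo_id = "wikimedia/wikisource"

# What `datasets` currently writes to disk (slash replaced by triple underscore):
datasets_dir = repo_id.replace("/", "___")            # 'wikimedia___wikisource'

# What huggingface_hub's cache scanner expects for a dataset repo:
hub_dir = "datasets--" + repo_id.replace("/", "--")   # 'datasets--wikimedia--wikisource'

assert datasets_dir != hub_dir  # hence scan-cache cannot see datasets' folders
```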
That only concerns the `~/.cache/huggingface/datasets` cache, which is used only for unzipping content / generating arrow files / etc. and is not eligible for `scan-cache`.
Describe the bug
The cache location of datasets varies depending on how you download them from Hugging Face:
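The original snippets aren't reproduced here; a plausible reconstruction of the first case is downloading the raw repo through `huggingface_hub` (the CLI equivalent, `huggingface-cli download wikimedia/wikisource --repo-type dataset`, behaves the same):

```python
# Case 1 (reconstruction): fetching the raw dataset repo via huggingface_hub.
from huggingface_hub import snapshot_download

path = snapshot_download("wikimedia/wikisource", repo_type="dataset")
print(path)  # resolves to a snapshot folder under $HF_HUB_CACHE
```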
In this case, the default location (I'll use macOS since that's what I have, but I'm assuming some level of overall consistency here) is `$HOME/.cache/huggingface/hub/`. In the above example, the directory created is `datasets--wikimedia--wikisource`, such that:
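The resulting folder follows the standard hub-cache layout (hashes are placeholders):

```
~/.cache/huggingface/hub/datasets--wikimedia--wikisource/
├── blobs/
├── refs/
└── snapshots/<revision-hash>/...
```

The second case is presumably loading the dataset through the `datasets` library (again a reconstruction of the elided snippet):

```python
# Case 2 (reconstruction): downloading via the datasets library.
from datasets import load_dataset

ds = load_dataset("wikimedia/wikisource")  # may require a config name
```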
In this case, the default location is no longer controlled by the environment variable `HF_HUB_CACHE`. The naming convention is also slightly different. The default location is `$HOME/.cache/huggingface/datasets` and the data structure is:
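For reference, the `datasets` cache lays things out roughly like this (a sketch; exact subfolders vary by library version and config):

```
~/.cache/huggingface/datasets/wikimedia___wikisource/
└── <config>/<version>/<fingerprint>/*.arrow
```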
Using `huggingface-cli scan-cache`, a user is unable to access the (actually useful) second cache location. I say "actually useful" because to date I haven't yet been able to figure out how to easily get a dataset cached with the CLI to be used in any models in code.

Other issues that may or may not need separate tickets
The same issues affect `huggingface-cli delete-cache`.

Reproduction
Well...use the code and examples above.
Logs
No response
System info