huggingface / huggingface_hub

The official Python client for the Hugging Face Hub.
https://huggingface.co/docs/huggingface_hub
Apache License 2.0

huggingface-cli scan-cache doesn't capture cached datasets #2218

Open sealad886 opened 6 months ago

sealad886 commented 6 months ago

Describe the bug

The cache location of datasets varies depending on how you download them from Hugging Face:

  1. Download using the CLI:
    > huggingface-cli download 'wikimedia/wikisource' --repo-type dataset

    In this case, the default location (I'll use macOS since that's what I have, but I'm assuming some level of consistency across platforms) is $HOME/.cache/huggingface/hub/. In the above example, the directory created is datasets--wikimedia--wikisource, such that:

    datasets--wikimedia--wikisource
    |--blobs
    |   |--<blobs>
    |--refs
    |   |--<?>   # only one file in mine anyway
    |--snapshots
        |--<snapshot hash>
            |--<symlinked content to blobs>
  2. Download using the Hugging Face datasets library:
    >>> from datasets import load_dataset
    >>> ds = load_dataset('wikimedia/wikisource')

    In this case, the location is no longer controlled by the environment variable HF_HUB_CACHE, and the naming convention is slightly different. The default location is $HOME/.cache/huggingface/datasets and the directory structure is as follows (a short sketch after this list shows how this cache is controlled):

    datasets
    |--downloads
    |   |--<shared blobs location>
    |--wikimedia___wikisource     # note the 3 underscores
        |--<symlinked content to downloads folder>
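For illustration, here is a minimal Python sketch of how the second cache is controlled. HF_DATASETS_CACHE and the cache_dir argument are the knobs exposed by the datasets library; HF_HUB_CACHE has no effect on this cache. The dataset name is just the example from above.

    import os

    from datasets import load_dataset

    # The datasets cache is governed by HF_DATASETS_CACHE / cache_dir,
    # not by HF_HUB_CACHE (which only affects the hub cache).
    cache_dir = os.path.expanduser("~/.cache/huggingface/datasets")  # the default, shown explicitly
    ds = load_dataset("wikimedia/wikisource", cache_dir=cache_dir)

    # The resulting Arrow files live under .../datasets/wikimedia___wikisource/...
    print(ds.cache_files)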

Using huggingface-cli scan-cache, a user is unable to see the (actually useful) second cache location. I say "actually useful" because, to date, I haven't been able to figure out how to easily use a dataset cached with the CLI in any model code.
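For completeness, the programmatic counterpart of scan-cache shows the same limitation; this is just a short sketch using the public scan_cache_dir() helper:

    from huggingface_hub import scan_cache_dir

    # Only walks HF_HUB_CACHE (default: ~/.cache/huggingface/hub), so anything
    # under ~/.cache/huggingface/datasets is never reported.
    cache_info = scan_cache_dir()
    for repo in cache_info.repos:
        print(repo.repo_type, repo.repo_id, repo.size_on_disk_str)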

Other issues that may or may not need separate tickets

  1. Datasets will be downloaded twice if both methods are used.
  2. Datasets used by one download method are inaccessible (using standard tools and defaults) to the other method.
  3. Datasets cached by the second method can't be deleted with huggingface-cli delete-cache (a workaround sketch follows below).
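A hedged workaround sketch for the third point (not an official API): simply delete the dataset's folder under the datasets cache by hand. The path below is the default location from above.

    import shutil
    from pathlib import Path

    # Manually remove a dataset cached by load_dataset(), since delete-cache
    # does not see this cache. Note the triple underscore in the folder name.
    datasets_cache = Path.home() / ".cache" / "huggingface" / "datasets"
    target = datasets_cache / "wikimedia___wikisource"
    if target.exists():
        shutil.rmtree(target)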

Reproduction

Well...use the code and examples above.

Logs

No response

System info

- huggingface_hub version: 0.22.2
- Platform: macOS-14.4.1-arm64-arm-64bit
- Python version: 3.12.2
- Running in iPython ?: No
- Running in notebook ?: No
- Running in Google Colab ?: No
- Token path ?: /Users/andrew/.cache/huggingface/token
- Has saved token ?: True
- Who am I ?: sealad886
- Configured git credential helpers: osxkeychain
- FastAI: N/A
- Tensorflow: N/A
- Torch: 2.2.2
- Jinja2: 3.1.3
- Graphviz: N/A
- keras: N/A
- Pydot: N/A
- Pillow: 10.3.0
- hf_transfer: 0.1.6
- gradio: 4.21.0
- tensorboard: N/A
- numpy: 1.26.4
- pydantic: 2.6.4
- aiohttp: 3.9.3
- ENDPOINT: https://huggingface.co
- HF_HUB_CACHE: /Users/andrew/.cache/huggingface/hub
- HF_ASSETS_CACHE: /Users/andrew/.cache/huggingface/assets
- HF_TOKEN_PATH: /Users/andrew/.cache/huggingface/token
- HF_HUB_OFFLINE: False
- HF_HUB_DISABLE_TELEMETRY: False
- HF_HUB_DISABLE_PROGRESS_BARS: None
- HF_HUB_DISABLE_SYMLINKS_WARNING: False
- HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
- HF_HUB_DISABLE_IMPLICIT_TOKEN: False
- HF_HUB_ENABLE_HF_TRANSFER: False
- HF_HUB_ETAG_TIMEOUT: 10
- HF_HUB_DOWNLOAD_TIMEOUT: 10
sealad886 commented 6 months ago

The offending code can be found here, where the default cache location is sourced from the environment variable HF_HUB_CACHE: https://github.com/huggingface/huggingface_hub/blame/ebba9ef2c338149783978b489ec142ab122af42a/src/huggingface_hub/utils/_cache_manager.py#L500
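For context, a simplified paraphrase of what that default looks like (not the exact source, just the shape of it):

    from huggingface_hub import constants

    def scan_cache_dir(cache_dir=None):
        # When no directory is passed, fall back to the hub cache (HF_HUB_CACHE),
        # so the separate datasets cache is never considered.
        if cache_dir is None:
            cache_dir = constants.HF_HUB_CACHE
        ...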

I say 'offending code', but that link just points to the original commit of that code. That's how it was designed at the time, I suppose; I imagine it was later decided to have a shared blob download location to allow for datasets with shared files? I'm guessing...

Wauplin commented 6 months ago

Thanks for pointing that out @sealad886!

The datasets library is indeed managing its own cache and therefore not using the huggingface_hub cache. This problem has already been reported in our ecosystem, but fixing it is not as straightforward as it seems, namely because datasets works with other providers as well. I will keep this issue open as long as the datasets <> huggingface_hub integration is not consistent. Stay tuned :wink:

lewtun commented 2 months ago

I've recently noticed that I'm unable to use huggingface-cli scan-cache to view datasets in my cache folder; see this Colab notebook for an example.

What seems to be happening is the following:

Is there a simple workaround in how the env vars are set so that one can use huggingface-cli scan-cache and huggingface-cli delete-cache for both models and datasets?

Wauplin commented 2 months ago

Hi @lewtun, thanks for the feedback. This is something specific to datasets internals that is getting fixed in https://github.com/huggingface/datasets/pull/7105 by @lhoestq and @albertvillanova. Once released, all data will be downloaded to ~/.cache/huggingface/hub by default. The ~/.cache/huggingface/datasets folder will still be used, but only for unzipping content, generating Arrow files, etc. All files downloaded from the Hub will then be eligible for scan-cache and delete-cache :)
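For illustration, once that lands, something along these lines should work for datasets as well (a sketch of the expected behaviour based on the description above, not something that works today):

    from huggingface_hub import scan_cache_dir

    cache_info = scan_cache_dir()

    # Dataset repos should then show up in the hub cache alongside models...
    dataset_revisions = [
        rev.commit_hash
        for repo in cache_info.repos
        if repo.repo_type == "dataset"
        for rev in repo.revisions
    ]

    # ...and be deletable through the same machinery that backs delete-cache.
    strategy = cache_info.delete_revisions(*dataset_revisions)
    print(f"Would free {strategy.expected_freed_size_str}")
    # strategy.execute()  # uncomment to actually delete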

lewtun commented 2 months ago

Thanks for the pointer @Wauplin! If I'm not mistaken, there's still an issue with the cache directory naming in datasets from this line, which replaces / with ___, while huggingface-cli scan-cache looks for folders with the format datasets--{org}--{dataset_name}.
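A tiny illustration of the mismatch, using the example repo id from this thread (plain string manipulation, nothing library-specific):

    repo_id = "wikimedia/wikisource"

    # Folder name used by the datasets cache (the line linked above):
    datasets_folder = repo_id.replace("/", "___")           # "wikimedia___wikisource"

    # Folder name expected by huggingface-cli scan-cache in the hub cache:
    hub_folder = "datasets--" + repo_id.replace("/", "--")  # "datasets--wikimedia--wikisource"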

lhoestq commented 2 months ago

That only concerns the ~/.cache/huggingface/datasets cache, which is used only for unzipping content, generating Arrow files, etc., and is not eligible for scan-cache.