huggingface / huggingface_hub

The official Python client for the Huggingface Hub.
https://huggingface.co/docs/huggingface_hub
Apache License 2.0

Symlink `snapshot_download` files from cache #2284

Closed · joecummings closed 4 months ago

joecummings commented 4 months ago

It appears that symlinking by default was removed from `huggingface-cli download` in #2223.

The Problem

A user downloads a model with `from_pretrained`, then tries to use torchtune, which requires the model checkpoints to be available in an easily accessible directory structure. Under the hood, torchtune calls `snapshot_download` with a `local_dir`. After #2223, this process copies the files from `~/.cache/huggingface/hub` to `local_dir`, resulting in 2x the disk space being used.
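For context, a minimal sketch of the pattern described above, with an illustrative repo id and path (not the actual torchtune code):

```python
from huggingface_hub import snapshot_download

# Step 1 (e.g. via transformers' from_pretrained): files land in the shared
# cache, typically ~/.cache/huggingface/hub.
cached_path = snapshot_download(repo_id="my-org/my-model")  # illustrative repo id

# Step 2 (what torchtune does under the hood): the same files are materialized
# in a user-chosen directory. Since #2223 this is a copy rather than a symlink,
# so the checkpoints occupy roughly twice the disk space.
local_path = snapshot_download(repo_id="my-org/my-model", local_dir="./my-model")
```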

The Ask

Is there any way to utilize symlinking capabilities so users have a way to reduce disk usage?


Obviously, my example and interest lie in the use case for torchtune, but I think this is applicable to anyone trying to use the `huggingface-cli download` command in conjunction with the `from_pretrained` API.

cc: @Wauplin

Wauplin commented 4 months ago

> Is there any way to utilize symlinking capabilities so users have a way to reduce disk usage?

Hi @joecummings, no, there is currently no way to do that and we most likely won't support it in the future. Usage of downloading to a local dir + symlinking to the real cache was quite low compared to the drawbacks it had (especially the massive confusion for users not knowing where the files are actually cached, but also problems on Windows, on shared clusters, on mounted volumes, etc.). In the end we took the decision to completely separate the "cached process" that shares the cache directory between libraries and the "local process" that is designed to be managed by the users, including potential duplication. I'm sorry that this affects torchtune users and I hope we can find a suitable solution to either:

Note that file duplication was already silently happening before (on Windows, shared clusters, mounted volumes, etc.). The difference now is that it is made explicit in all use cases, including common Unix usage.

joecummings commented 4 months ago

Thanks for your response @Wauplin, and for the several possible options. For now, I think we want to avoid loading with transformers as we have our own model definitions. For the first suggestion, do the checkpoint files exist in the `cache_dir` as binary files or as readable safetensors and config JSON files?

Wauplin commented 4 months ago

> For the first suggestion, do the checkpoint files exist in the `cache_dir` as binary files or as readable safetensors and config JSON files?

I'm not sure I understand this question. Safetensors files are binary files (as opposed to JSON files, which are UTF-8 text files). Whether you download to the cache or to a local directory, you will have the exact same files. In one case they are shared with other libraries, in the other they live solely in a directory managed by the user.
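To illustrate, a minimal sketch of inspecting the cached snapshot directly (illustrative repo id, default cache location assumed):

```python
import os
from huggingface_hub import snapshot_download

# Without local_dir, snapshot_download returns the snapshot folder inside the
# shared cache, e.g. ~/.cache/huggingface/hub/models--my-org--my-model/snapshots/<sha>/
snapshot_path = snapshot_download(repo_id="my-org/my-model")  # illustrative repo id

# The cached snapshot holds the exact same files you would get with local_dir:
# *.safetensors (binary) alongside config.json, tokenizer files, etc.
for name in sorted(os.listdir(snapshot_path)):
    print(name)
```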

joecummings commented 4 months ago

> I'm not sure I understand this question. Safetensors files are binary files (as opposed to JSON files, which are UTF-8 text files). Whether you download to the cache or to a local directory, you will have the exact same files. In one case they are shared with other libraries, in the other they live solely in a directory managed by the user.

Yep, I misspoke; thanks for the clarification! The point I was trying to confirm was whether all the same files exist in the cache as in the `local_dir`, which appears to be the case.

Wauplin commented 4 months ago

Yes, I confirm!

joecummings commented 4 months ago

Closing for now to clean up your Issues - thanks for the help!