huggingface / huggingface_hub

The official Python client for the Hugging Face Hub.
https://huggingface.co/docs/huggingface_hub
Apache License 2.0

Add support for remote filesystem paths in Hugging Face CLI (--local-dir) #2407

Open kimminw00 opened 1 month ago

kimminw00 commented 1 month ago

huggingface-cli download provides a convenient way to interact with pre-trained models and datasets. However, when working with large models and datasets, it can be cumbersome to download and manage them locally. To improve the user experience, I request a feature that supports remote filesystem paths (such as S3 paths) in the Hugging Face CLI.

Wauplin commented 1 month ago

To improve the user experience, I request a feature which supports S3 paths in the Hugging Face CLI.

What would be the goal of such a feature @kimminw00? Do you want to download models that are on the Hugging Face Hub to an S3 bucket? Or from an S3 bucket to the Hugging Face Hub? Or something else? What CLI command are you expecting? Could you provide me with an example? Thanks in advance!

kimminw00 commented 1 month ago

What would be the goal of such a feature @kimminw00?

The goal of this feature is to support remote filesystems for --local-dir and --cache-dir.

Do you want to download models that are on the Hugging Face Hub to an S3 bucket? Or from an S3 bucket to the Hugging Face Hub? Or something else?

Download models that are on the Hugging Face Hub to an S3 bucket. (It would be nice to support other remote filesystems as well.)

What would be the CLI command you are expecting? Could you provide me with an example? Thanks in advance!

huggingface-cli repo download meta-llama/Meta-Llama-3.1-405B \
  --cache-dir s3://BUCKET_NAME/cache \
  --save-dir s3://BUCKET_NAME/models

(To emphasize that it also works on remote filesystems, I replaced --local-dir with --save-dir.)

Wauplin commented 1 month ago

Oooh, I see. Thanks for the examples! I don't think this will be supported in the medium term. The download process relies on some low-level IO features (filelock, symlinks, chmod), and making it work with generic filesystems would require heavy changes to that process. Furthermore, such a change would only be possible for --local-dir (which you renamed --save-dir), since the cache system uses symlinks, which most remote filesystems do not support.

I think the best short-term solution would be to build an ad-hoc tool (i.e. one that transfers from the HF Hub to S3) and share it with the community to see if there is interest in such a feature.
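
For illustration, here is a minimal sketch of what such an ad-hoc tool might look like. It is not part of huggingface_hub. It assumes the boto3 package is installed with AWS credentials configured, and the names plan_uploads, transfer_repo, bucket and key_prefix are hypothetical. The snapshot is staged in a local temporary directory first, because the download step needs a real local filesystem, and is then copied file by file to S3.

```python
import tempfile
from pathlib import Path


def plan_uploads(local_root, key_prefix):
    """Map every file under local_root to an S3 object key under key_prefix."""
    root = Path(local_root)
    plan = []
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root)
        # Skip directories and anything under a .cache folder (assumed here
        # to hold only download metadata, not repo files).
        if not path.is_file() or ".cache" in rel.parts:
            continue
        key = "/".join([key_prefix.strip("/"), rel.as_posix()]).lstrip("/")
        plan.append((path, key))
    return plan


def transfer_repo(repo_id, bucket, key_prefix):
    """Download a Hub repo to a temporary directory, then upload it to S3."""
    # Imported lazily so plan_uploads can be used without these packages.
    import boto3
    from huggingface_hub import snapshot_download

    s3 = boto3.client("s3")
    with tempfile.TemporaryDirectory() as tmp:
        # Stage the whole snapshot locally (needs enough free disk space).
        snapshot_download(repo_id=repo_id, local_dir=tmp)
        for path, key in plan_uploads(tmp, key_prefix):
            s3.upload_file(str(path), bucket, key)
```

A call such as transfer_repo("meta-llama/Meta-Llama-3.1-405B", "BUCKET_NAME", "models") would mirror the example command above, with the caveat that the full snapshot must fit on local disk before it is uploaded. A variant that downloads and uploads one file at a time would reduce that requirement.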