huggingface / huggingface_hub

The official Python client for the Huggingface Hub.
https://huggingface.co/docs/huggingface_hub
Apache License 2.0

Checksum validation with hf_hub_download on model files. #2364

Open JGSweets opened 2 days ago

JGSweets commented 2 days ago

Is your feature request related to a problem? Please describe.

After reviewing #1738 and #2223, it looks like file checksums are only computed on the cache dir under specific conditions. Ideally, a user could explicitly force a checksum verification after download, as well as on retrieval from the cache, to ensure file integrity on every use.

It's possible I misunderstood the code or discussion though.

Describe the solution you'd like

Add an input argument and an environment variable to enforce checksum validation of the retrieved files on each hf_hub_download call.

Describe alternatives you've considered

Pre-downloading the files manually and checking their integrity by hand before using the cached files.
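For illustration, the manual alternative could look roughly like the sketch below. It uses only the Python standard library and is not part of huggingface_hub. The helper names (`sha256_of_file`, `checksum_enforced`, `verify_file`) and the `MY_APP_ENFORCE_CHECKSUM` environment variable are hypothetical, and the expected digest is assumed to come from a source the user trusts.

```python
import hashlib
import os
from pathlib import Path


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_enforced(explicit=None):
    """Hypothetical toggle: an explicit argument wins, else a made-up env var decides."""
    if explicit is not None:
        return explicit
    # MY_APP_ENFORCE_CHECKSUM is an illustrative name, not a huggingface_hub setting.
    return os.environ.get("MY_APP_ENFORCE_CHECKSUM", "0") == "1"


def verify_file(path, expected_sha256, enforce=None):
    """Return the path unchanged, raising ValueError on a digest mismatch when enforced."""
    if not checksum_enforced(enforce):
        return Path(path)
    actual = sha256_of_file(path)
    if actual != expected_sha256.lower():
        raise ValueError(
            f"Checksum mismatch for {path}: expected {expected_sha256}, got {actual}"
        )
    return Path(path)
```

In practice, `path` would be the local path returned by hf_hub_download, and `verify_file(path, expected, enforce=True)` would run before the file is loaded. Note that every enforced call re-reads the whole file, so the cost grows with file size.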

Wauplin commented 1 day ago

Hi @JGSweets, thanks for opening the issue. The 2 PRs you've linked are only related to "downloading to a local directory", not the generic "downloading into the HF cache directory" workflow. If we add such a validation, we would do it for both. The main problem with checking the file integrity after a download is the time it takes to do it:

cc @Pierrci @julien-c in case you have another opinion