Open JohnnyRacer opened 1 year ago
it is not currently supported, but cc'ing @McPatate (and @krampstudio for visibility)
thanks for the feature request!
Hey @JohnnyRacer, we are finally getting back to you about your feature request... and with good news!
The security scan details are now returned by the server on the `/tree` endpoint with `expand=True` as a parameter. For example, https://huggingface.co/api/models/distilgpt2/tree/main?expand=True returns:
Results are paginated (50 files per page), which means you need to iterate over the "next" links in the `Link` header.
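The flow described above (query `/tree` with `expand=True`, then follow the `rel="next"` links) can be sketched like this using only the standard library. The exact name of the security-scan field on each entry is an assumption here, so the example only hints at it in a comment:

```python
import json
import re
import urllib.request
from typing import Iterator, Optional

TREE_URL = "https://huggingface.co/api/models/distilgpt2/tree/main?expand=True"

def parse_next_link(link_header: str) -> Optional[str]:
    """Extract the URL tagged rel="next" from an HTTP Link header, if any."""
    for part in link_header.split(","):
        match = re.match(r'\s*<([^>]+)>;\s*rel="next"\s*$', part)
        if match:
            return match.group(1)
    return None

def iter_tree_entries(url: str) -> Iterator[dict]:
    """Yield each file entry, following the paginated rel="next" links."""
    while url:
        with urllib.request.urlopen(url) as response:
            yield from json.loads(response.read())
            url = parse_next_link(response.headers.get("Link", ""))

# Example (requires network access):
# for entry in iter_tree_entries(TREE_URL):
#     # The name of the scan field is an assumption -- inspect one entry
#     # (e.g. print(entry.keys())) to confirm before relying on it.
#     print(entry["path"], entry.get("security"))
```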
An implementation is in progress in `huggingface_hub` (see PR: https://github.com/huggingface/huggingface_hub/pull/1451), so you can also benefit from it by installing the package from source. Please let me know what you think about this API :)
@Wauplin Thanks a lot for adding it as a new feature, I just tested out the new API and it works very well! I was wondering if, in the future, a user downloading a potentially malicious Pickle file would get a confirmation warning to make the risks clear to them before allowing the model to load? This seems like a good addition now that the scan data is available via the API.
I was wondering if, in the future, a user downloading a potentially malicious Pickle file would get a confirmation warning to make the risks clear to them before allowing the model to load?
That's actually a pretty good idea! I'll keep it in mind to do it properly. I think that if we do something like that, it will be a warning but without asking confirmation from the user (or at least not by default). Always a balance to find between usability and security :)
I think having an extra option `disable_unsafe_download` (defaulting to `False`) could be a nice to have 😄
Yes, I was thinking about an environment variable directly so that users can be sure to disable them globally.
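The idea above could look something like the following sketch. The variable name `HF_HUB_DISABLE_UNSAFE_DOWNLOAD` is taken from this discussion and is hypothetical, not an official setting:

```python
import os
import warnings

# Hypothetical variable name from this thread; not an official setting.
_DISABLE_UNSAFE = os.environ.get("HF_HUB_DISABLE_UNSAFE_DOWNLOAD", "0") == "1"

def check_scan_result(filename: str, scan_level: str) -> None:
    """Warn on non-innocuous files, or raise when unsafe downloads are disabled."""
    if scan_level == "innocuous":
        return
    message = f"{filename} was flagged as '{scan_level}' by the security scan"
    if _DISABLE_UNSAFE:
        raise ValueError(message + " and HF_HUB_DISABLE_UNSAFE_DOWNLOAD is set")
    warnings.warn(message)
```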
@McPatate before implementing this, what's your take on how to consider a file as "unsafe"? I saw that the security scan has 3 levels: `innocuous`, `suspicious` and `dangerous`.
I see four options here:

1. `HF_HUB_DISABLE_UNSAFE_DOWNLOAD=1` raises an Exception on dangerous files but not suspicious ones.
2. `HF_HUB_DISABLE_UNSAFE_DOWNLOAD=1` raises an Exception on dangerous files AND suspicious ones.
3. `HF_HUB_DISABLE_DANGEROUS_DOWNLOAD=1` and `HF_HUB_DISABLE_SUSPICIOUS_DOWNLOAD=1` separately. If "disable suspicious" is True, "disable dangerous" is automatically True as well => more explicit and flexible, but it adds two env variables instead of one.
4. `HF_HUB_UNSAFE_DOWNLOAD_LEVEL` (or `HF_HUB_SAFETY_LEVEL`?) set to either "minimal" (default), "intermediate" or "maximal" => as much flexibility as 3. but with only 1 variable (caveat: I'd prefer a better naming if anyone has a suggestion).

For me it really depends on what `suspicious` means and when it happens. IMO if a user sets `HF_HUB_DISABLE_UNSAFE_DOWNLOAD=1`, I'd rather forbid any non-innocuous download (i.e. option 2). Otherwise, if flexibility is important, I'd go for option 4 but with a better naming.
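Option 4 above boils down to a small lookup table. A pure-function sketch, where all names (the levels and the mapping) are placeholders from this discussion rather than an implemented API:

```python
# Which scan results each hypothetical safety level would block.
BLOCKED_BY_LEVEL = {
    "minimal": set(),                        # warn only, block nothing
    "intermediate": {"dangerous"},           # like option 1
    "maximal": {"dangerous", "suspicious"},  # like option 2
}

def is_download_blocked(scan_result: str, safety_level: str = "minimal") -> bool:
    """Return True when the scan result should abort the download."""
    return scan_result in BLOCKED_BY_LEVEL[safety_level]
```

A single variable with three values covers both stricter behaviours without the two-variable coupling of option 3.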
Also cc @JohnnyRacer if you are opinionated on this topic.
@Wauplin
I think it would be a good idea to prevent downloads of dangerous Pickle checkpoints by default, with a warning message that can be disabled via an environment variable like you mentioned above, or by calling a function like `transformers.env_configs.disable_unsafe_download(True)` to toggle the ability to allow unsafe downloads on or off. I think a way to disable unsafe downloads is needed, as Pickle can conveniently be used as a malware dropper, especially when models are loaded onto production environments. For suspicious models I think a toggleable warning will suffice, since most do not contain dangerous calls like `eval` which would allow for arbitrary code execution.
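The behaviour proposed above (block `dangerous` by default with an explicit opt-out toggle, warn on `suspicious`) could be sketched as follows; the function names are made up for illustration and mirror the hypothetical `disable_unsafe_download` toggle, not a real `transformers` API:

```python
import warnings

_unsafe_downloads_allowed = False  # dangerous pickles are blocked by default

def allow_unsafe_downloads(allowed: bool) -> None:
    """Toggle whether 'dangerous' pickle checkpoints may be downloaded."""
    global _unsafe_downloads_allowed
    _unsafe_downloads_allowed = allowed

def gate_download(filename: str, scan_result: str) -> bool:
    """Return True if the download may proceed; warn on suspicious files."""
    if scan_result == "dangerous" and not _unsafe_downloads_allowed:
        raise PermissionError(
            f"{filename} was flagged as dangerous; "
            "call allow_unsafe_downloads(True) to override"
        )
    if scan_result == "suspicious":
        warnings.warn(f"{filename} was flagged as suspicious by the scanner")
    return True
```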
I think the future-proof way is to emit a warning on any model loading from Pickle, as we default more and more to the safetensors format (and its usage is growing rapidly)
Does it cost a lot to send a request to the hub to check before starting a download if there are dangerous imports?
It could also be done asynchronously, as in no changes in the download logic, except you add a cancellation token to the download job and if the file contains a dangerous import, the asynchronous job cancels the download, wdyt @julien-c?
EDIT: nvm my comment, the solution I'm proposing doesn't change much to the additional load issue
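Even though the idea was set aside, the cancellation-token pattern described above is a common concurrency technique. A minimal sketch using `threading.Event` as the token, with sleeps standing in for real network work:

```python
import threading
import time

def download_file(cancel: threading.Event) -> str:
    """Simulated chunked download that checks a cancellation token per chunk."""
    for _ in range(10):
        if cancel.is_set():
            return "cancelled"
        time.sleep(0.01)  # stand-in for downloading one chunk
    return "complete"

def scan_file(cancel: threading.Event, is_dangerous: bool) -> None:
    """Simulated scan running alongside the download; cancels on danger."""
    time.sleep(0.02)  # stand-in for querying the Hub's scan results
    if is_dangerous:
        cancel.set()

def guarded_download(is_dangerous: bool) -> str:
    """Run the scan asynchronously while the download proceeds."""
    cancel = threading.Event()
    scanner = threading.Thread(target=scan_file, args=(cancel, is_dangerous))
    scanner.start()
    result = download_file(cancel)
    scanner.join()
    return result
```

The download logic itself is unchanged; it only gains the periodic `cancel.is_set()` check.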
I think the future-proof way is to emit a warning on any model loading from Pickle, as we default more and more to the safetensors format (and its usage is growing rapidly)
That would definitely close this issue without additional load indeed. But that should rather live in libraries like `transformers`, `diffusers`, ..., shouldn't it? `huggingface_hub` is kinda agnostic about the content of the files it downloads.
But that should rather live in libraries
Yes, correct
or even in torch
Hello, I am trying to programmatically obtain from Python the pickle analysis data associated with pickled model checkpoints, but am unable to find anything in the docs on how to do this. Is there a clear way to do this?
A screenshot of the example of the analysis information is attached below.