huggingface / huggingface_hub

The official Python client for the Hugging Face Hub.
https://huggingface.co/docs/huggingface_hub
Apache License 2.0

Getting the pickle analysis data from within Python #1212

Open JohnnyRacer opened 1 year ago

JohnnyRacer commented 1 year ago

Hello, I am trying to obtain the pickle analysis data that is associated with pickled model checkpoints programmatically from Python, but am unable to find anything in the docs on how to do this. Is there a clear way to do this?

A screenshot of an example of the analysis information is attached below.

julien-c commented 1 year ago

it is not currently supported, but cc'ing @McPatate (and @krampstudio for visibility)

thanks for the feature request!

Wauplin commented 1 year ago

Hey @JohnnyRacer, we are finally getting back to you about your feature request... and with good news!

The security scan details are now returned by the server on the /tree endpoint when the expand=True parameter is set. For example, https://huggingface.co/api/models/distilgpt2/tree/main?expand=True returns:


Results are paginated (50 files per page), which means you need to follow the "next" links from the Link response header.
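Assuming the endpoint shape described above, fetching every entry while following the paginated Link headers could be sketched like this (the helper name and the injectable `session` parameter are my own for illustration, not part of huggingface_hub):

```python
def iter_tree_files(repo_id, repo_type="models", revision="main", session=None):
    """Yield file entries (including security scan details when expanded)
    from the Hub /tree endpoint, following paginated "next" Link headers.

    `session` is injectable for testing; any object with a requests-like
    `get(url)` returning a response with `.json()`, `.links` and
    `.raise_for_status()` works.
    """
    if session is None:
        # Fall back to a real HTTP session only when none is provided.
        import requests
        session = requests.Session()

    url = f"https://huggingface.co/api/{repo_type}/{repo_id}/tree/{revision}?expand=True"
    while url:
        resp = session.get(url)
        resp.raise_for_status()
        yield from resp.json()
        # requests parses the RFC 5988 Link header into resp.links;
        # absence of a "next" link means this was the last page.
        url = resp.links.get("next", {}).get("url")
```

Each yielded entry is a dict mirroring the JSON shown above; stopping is driven entirely by the server-provided Link header, so the 50-per-page limit never needs to be hard-coded.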

An implementation is in progress in huggingface_hub (see PR: https://github.com/huggingface/huggingface_hub/pull/1451), so you can also try it out by installing the package from source. Please let me know what you think about this API :)

JohnnyRacer commented 1 year ago

@Wauplin Thanks a lot for adding this as a new feature. I just tested out the new API and it works very well! I was wondering whether, in the future, downloading a potentially malicious pickle file could raise a confirmation warning, to make the risks clear to the user before allowing the model to load. This seems like a good addition now that the scan data is available via the API.

Wauplin commented 1 year ago

I was wondering whether, in the future, downloading a potentially malicious pickle file could raise a confirmation warning, to make the risks clear to the user before allowing the model to load.

That's actually a pretty good idea! I'll keep it in mind so we do it properly. I think that if we do something like that, it will be a warning, but without asking for confirmation from the user (or at least not by default). There's always a balance to find between usability and security :)

McPatate commented 1 year ago

I think having an extra option disable_unsafe_download (defaulting to False) could be a nice-to-have 😄

Wauplin commented 1 year ago

Yes, I was thinking about an environment variable directly so that users can be sure to disable them globally.

Wauplin commented 1 year ago

@McPatate before implementing this, what's your take on when to consider a file "unsafe"? I saw that the security scan has 3 levels: innocuous, suspicious and dangerous.

I see four options here:

  1. set HF_HUB_DISABLE_UNSAFE_DOWNLOAD=1 to raise an exception on dangerous files, but not suspicious ones
  2. set HF_HUB_DISABLE_UNSAFE_DOWNLOAD=1 to raise an exception on dangerous AND suspicious files
  3. set HF_HUB_DISABLE_DANGEROUS_DOWNLOAD=1 and HF_HUB_DISABLE_SUSPICIOUS_DOWNLOAD=1 separately. If "disable suspicious" is True, "disable dangerous" is automatically True as well => more explicit and flexible, but it adds two env variables instead of one.
  4. set HF_HUB_UNSAFE_DOWNLOAD_LEVEL (or HF_HUB_SAFETY_LEVEL?) to either "minimal" (default), "intermediate" or "maximal" => as much flexibility as 3. but with only 1 variable (caveat: I'd prefer a better name if anyone has a suggestion)

For me it really depends on what "suspicious" means and when it happens. IMO, if a user sets HF_HUB_DISABLE_UNSAFE_DOWNLOAD=1, I'd rather forbid any non-innocuous download (i.e. option 2). Otherwise, if flexibility is important, I'd go for option 4, but with a better name.
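To make option 4 concrete, here is a minimal sketch of how a single env variable could map scan levels to a blocking policy (the variable name comes from the list above; the function, exception, and mapping are hypothetical, not an actual huggingface_hub API):

```python
import os

# Hypothetical policy map for option 4: which scan levels block a download.
_BLOCKED_LEVELS = {
    "minimal": frozenset(),                             # never block
    "intermediate": frozenset({"dangerous"}),           # block dangerous only
    "maximal": frozenset({"dangerous", "suspicious"}),  # block both
}


class UnsafeDownloadError(Exception):
    """Raised when a file's security scan level is blocked by the policy."""


def check_scan_level(scan_level, policy=None):
    """Raise UnsafeDownloadError if `scan_level` (one of "innocuous",
    "suspicious", "dangerous") is blocked by the configured policy."""
    if policy is None:
        policy = os.environ.get("HF_HUB_UNSAFE_DOWNLOAD_LEVEL", "minimal")
    if scan_level in _BLOCKED_LEVELS[policy]:
        raise UnsafeDownloadError(
            f"Refusing to download a {scan_level!r} file (policy: {policy!r})"
        )
```

One variable covers all three behaviours, and the ordering ("maximal" blocks a superset of "intermediate") keeps the semantics predictable.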

Also cc @JohnnyRacer if you are opinionated on this topic.

JohnnyRacer commented 1 year ago

@Wauplin

I think it would be a good idea to prevent downloads of dangerous pickle checkpoints by default, with a warning message that can be disabled via an environment variable like you mentioned above, or by calling a function such as transformers.env_configs.disable_unsafe_download(True) to toggle unsafe downloads on or off. I think a way to disable unsafe downloads is needed because a dangerous pickle can conveniently be used as a malware dropper, especially when models are loaded in production environments. For suspicious models, I think a toggleable warning will suffice, since most do not contain a dangerous `eval` that would allow arbitrary code execution.

julien-c commented 1 year ago

I think the future-proof way is to emit a warning on any model loading from Pickle, as we default more and more to the safetensors format (and its usage is growing rapidly)
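The blanket warning described here could be sketched as follows (the suffix list and function name are illustrative only, not the actual transformers or torch implementation):

```python
import warnings

# Common pickle-based checkpoint extensions (illustrative, not exhaustive).
PICKLE_SUFFIXES = (".bin", ".pkl", ".pt", ".ckpt")


def warn_if_pickle(filename):
    """Emit a warning for any pickle-based checkpoint, nudging users
    toward the safetensors format. Hypothetical sketch of the idea above."""
    if filename.endswith(PICKLE_SUFFIXES):
        warnings.warn(
            f"{filename} is a pickle-based checkpoint and can execute "
            "arbitrary code when loaded; prefer the safetensors format.",
            UserWarning,
        )
```

Since the warning fires on the file format rather than on a scan result, it needs no extra Hub request and keeps working as repos migrate to safetensors.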

McPatate commented 1 year ago

Does it cost a lot to send a request to the hub to check before starting a download if there are dangerous imports?

It could also be done asynchronously, as in no changes in the download logic, except you add a cancellation token to the download job and if the file contains a dangerous import, the asynchronous job cancels the download, wdyt @julien-c?

EDIT: nvm my comment, the solution I'm proposing doesn't do much about the additional-load issue

Wauplin commented 1 year ago

I think the future-proof way is to emit a warning on any model loading from Pickle, as we default more and more to the safetensors format (and its usage is growing rapidly)

That would definitely close this issue without additional load indeed. But shouldn't that rather live in libraries like transformers, diffusers, etc.? huggingface_hub is kind of agnostic about the content of the files it downloads.

julien-c commented 1 year ago

But that should rather live in libraries

Yes, correct

julien-c commented 1 year ago

or even in torch