pytorch / pytorch

pickle is a security issue #52596

Open KOLANICH opened 3 years ago

KOLANICH commented 3 years ago

🚀 Feature

We need to do something about PyTorch's reliance on pickle for serialization.

Motivation

Pickle executes arbitrary code on deserialization, so a pickled checkpoint can be used to hide backdoors. Unfortunately, lots of projects keep using torch.save and torch.load.
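For context, a minimal sketch of why this matters (plain pickle shown here; torch.load uses pickle under the hood): unpickling runs whatever callable the payload's __reduce__ specifies, so simply loading a malicious "checkpoint" executes attacker code.

```python
# Minimal illustration of the risk -- never do this with untrusted data.
import os
import pickle

class Backdoor:
    def __reduce__(self):
        # At unpickling time this turns into os.system("echo pwned").
        return (os.system, ("echo pwned",))

payload = pickle.dumps(Backdoor())
pickle.loads(payload)  # prints "pwned": code ran just by loading the bytes
```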

Pitch

Alternatives

cc @mruberry @nairbv @NicolasHug @vmoens @jdsgomes @ailzhang

vadimkantorov commented 3 years ago

Related: https://github.com/pytorch/pytorch/issues/52181 ... One of my suggestions was supporting hdf5 for safe weights storage
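As a rough sketch of that suggestion (assuming h5py is installed; this is not an existing PyTorch API, and dtypes that NumPy lacks, e.g. bfloat16, would need extra handling): store the state_dict as plain named arrays, so loading never executes code.

```python
import h5py
import torch

def save_state_dict_hdf5(state_dict, path):
    # Each tensor becomes a named HDF5 dataset: pure numeric data, no code objects.
    with h5py.File(path, "w") as f:
        for name, tensor in state_dict.items():
            f.create_dataset(name, data=tensor.detach().cpu().numpy())

def load_state_dict_hdf5(path):
    with h5py.File(path, "r") as f:
        return {name: torch.from_numpy(f[name][()]) for name in f.keys()}

# Usage: save_state_dict_hdf5(model.state_dict(), "weights.h5")
#        model.load_state_dict(load_state_dict_hdf5("weights.h5"))
```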

rgommers commented 3 years ago

It already has a clear warning in https://pytorch.org/docs/master/generated/torch.load.html:

[Screenshot of the security warning from the torch.load documentation.]

Deprecating doesn't seem warranted; the Python pickle module itself (https://docs.python.org/3/library/pickle.html) has a similar warning. As does numpy.load (the issue is smaller there, but it exists and will remain with a docs warning).

A good alternative does indeed seem needed. That part of the issue seems to be a duplicate of gh-52181.

vadimkantorov commented 3 years ago

torch.hub.load doesn't have such a big, clear notice. It might also be good if it printed a UserWarning, which is easier to notice (not everyone checks the docs for these "simple" methods). I also haven't seen such big red warnings at model zoos like Hugging Face.

A nightmare scenario: a "cool" recent model is uploaded to Hugging Face by malicious actors, publicized on Twitter, and then provides backdoor access at research labs all over the world... Since loading it must execute code, I propose a mandatory keyword argument like this_is_dangerous_and_can_in_principle_execute_malicious = True on these code-executing methods, so that the user is actively aware of the risk.
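As a rough user-side sketch of that proposal (the wrapper and its argument name are hypothetical, not an existing torch API):

```python
import warnings
import torch

def load_untrusted(path, *,
                   this_is_dangerous_and_can_in_principle_execute_malicious=False,
                   **kwargs):
    # Refuse to load unless the caller explicitly acknowledges the risk.
    if not this_is_dangerous_and_can_in_principle_execute_malicious:
        raise ValueError(
            "Loading a pickled checkpoint can execute arbitrary code; pass "
            "this_is_dangerous_and_can_in_principle_execute_malicious=True to proceed."
        )
    warnings.warn("Unpickling data from a potentially untrusted source.", UserWarning)
    return torch.load(path, **kwargs)
```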

KOLANICH commented 3 years ago

It already has a clear warning in https://pytorch.org/docs/master/generated/torch.load.html:

It doesn't prevent people from loading models with it or from storing models in this format.

i propose to have some this_is_dangerous_and_can_in_principle_execute_malicious = True mandatory keyword argument to these code-executing methods, so that the user is actively aware of this

Not enough. Phasing it out is the only feasible way in the long term. In the short term, the solution is to defuse the pickle usage and make it clear not only to the developers of the software but also to its users that the developers still using pickle are to blame. Such dangerous temporary solutions, when introduced, should come with a clear deadline. Also, IMHO it was a big mistake to introduce pickle into Python at all.

zou3519 commented 3 years ago

cc @ailzhang for hub -- what do you think? Folks are concerned about who can post to hub and if downloaded models can contain viruses.

cc @suo for package or deploy: I remember there was some discussion around unifying the serialization behavior of torch.save/load (and using zipfile serialization?). One of the things proposed in this issue is to change the serialization format for PyTorch. We probably don't want to go down the route of yet another serialization format here.

vadimkantorov commented 3 years ago

For zipfile: I hope it doesn't resort to good old pickle-based torch.load to handle loading of the weights from inside the zipfile.

In my opinion, all existing frameworks, including TF / tfjs, have screwed up weights serialization, so we have already ended up with 10 different formats :( So adding another "good" one should not be a problem, if it is secure, performant and interoperable.

vadimkantorov commented 3 years ago

Also, I wonder if Hugging Face does anything to verify who submits models there. I think the Hugging Face hub is now getting more popular than the slow-moving torch.hub one.

KOLANICH commented 3 years ago

I don't think WHO submits the model is the main issue. If the models themselves cannot be Turing-complete, if the combination of a model plus the PyTorch machinery is not Turing-complete by itself, and if models don't allow privileged operations within the computation graph (such as calling any API whose effect is not pure computation), then we are likely safe, provided the operations on numbers are implemented correctly (they may not be: https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-3177 ).

So we should concentrate first not on WHO can upload models, but on WHAT models can do. Authenticating models via digital signatures is clearly useful, but its effect will be very minor.

The problem with models is that they are large, binary, and hard for humans to audit. It won't help much to know that a model was published by Hugging Face if we still don't trust Hugging Face and have no real reason to trust them.

Also, there are parties other than Hugging Face. Should we not use their models because we don't trust them?

So what we need in the first place is not attribution, but a proof that no model can compromise the integrity of a system or the confidentiality of the information processed in it, under some sensible threat model. Then we trust not the persons issuing the models, but the belief that our threat model is sensible.

malfet commented 2 years ago

After looking at the implementation in https://github.com/python/cpython/blob/0d04b8d9e1953d2311f78b772f21c9e07fbcbb6d/Lib/pickle.py#L1137 (which can be considered the reference implementation of the feature), it looks like the above requirements can be achieved by restricting which classes/methods are safe to instantiate via the find_class method, which torch.load already overrides in https://github.com/pytorch/pytorch/blob/1a33e944b58a75efe6154f1d02a32b80b7661edf/torch/serialization.py#L1086-L1093
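A minimal sketch of that approach (this mirrors the RestrictedUnpickler recipe from the Python pickle docs, not PyTorch's actual serialization.py; the whitelist entries shown are placeholders):

```python
import io
import pickle

_ALLOWED_GLOBALS = {
    ("collections", "OrderedDict"),
    # Real checkpoints would need a vetted list of torch rebuild helpers here.
}

class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Only resolve globals that are explicitly whitelisted; refuse the rest.
        if (module, name) not in _ALLOWED_GLOBALS:
            raise pickle.UnpicklingError(
                f"global '{module}.{name}' is forbidden"
            )
        return super().find_class(module, name)

def restricted_loads(data: bytes):
    """Unpickle bytes while refusing any non-whitelisted global."""
    return RestrictedUnpickler(io.BytesIO(data)).load()
```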

dhrubo-os commented 1 year ago

Is there any update on this issue?

Is this also true for TorchScript files? If I trace a model and then load it, is the risk still there? In my understanding, when we trace the model, only the weights are picked up.

Can anybody please confirm?

@KOLANICH, @malfet @vadimkantorov

Thanks.

JaimeArboleda commented 7 months ago

Hello! I am concerned about this issue. In our organization, the security team wants to avoid these types of risks, but on the other hand, we data scientists want to work with torch hub and the Hugging Face hub. I was wondering if there is any way of checking the safety of a torch model in an environment like Google Colab before downloading it to our private servers. Do you think this could work?
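For what it's worth, one thing that can be done without ever unpickling (and therefore without executing anything) is to statically list the globals a checkpoint references. A rough sketch, assuming the file is a zip-format torch.save archive with a data.pkl entry inside; this surfaces suspicious imports but is not a proof of safety:

```python
import pickletools
import zipfile

def list_pickle_globals(checkpoint_path):
    """Print the GLOBAL/STACK_GLOBAL references in a zip-format checkpoint
    without unpickling it (pickletools only disassembles the opcode stream)."""
    with zipfile.ZipFile(checkpoint_path) as zf:
        for entry in zf.namelist():
            if not entry.endswith("data.pkl"):
                continue
            prev_arg = None
            for opcode, arg, _pos in pickletools.genops(zf.read(entry)):
                if opcode.name == "GLOBAL":
                    print(entry, "->", arg)                           # "module name"
                elif opcode.name == "STACK_GLOBAL":
                    print(entry, "-> (stack global near)", prev_arg)  # approximate
                prev_arg = arg
```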