skops-dev / skops

skops is a Python library helping you share your scikit-learn based models and put them in production
https://skops.readthedocs.io/en/stable/
MIT License
442 stars 52 forks source link

Secure persistence: Idea: Add fingerprint #144

Open BenjaminBossan opened 2 years ago

BenjaminBossan commented 2 years ago

Maybe it's possible to add a cryptographically secure fingerprint of the whole saved object. Then, during loading, we can compare whether the loaded object's fingerprint is identical. This could prevent certain types of attacks where the object that's being loaded is not identical to the one being advertised.

adrinjalali commented 1 year ago

Are you talking about a hash of the stored object, or a fingerprint created by a private key of the person/entity creating the object?

We can add the capability to sign the stored object, if that's what you mean.

BenjaminBossan commented 1 year ago

I'm not an expert in cryptography, so maybe I'm not always using the correct words.

a hash of the stored object

That's what I meant. I'm not exactly sure how to do that, if we just want a checksum and then the platform can provide some code to verify the checksum (which would be outside of the scope of skops).

Maybe we can also do something within skops, where skops checks that the checksum/fingerprint is as expected. My initial idea is that each attribute we store creates a hash of itself and then on the object containing these attributes creates a hash of these hashes, recursively up to the root (which would be the schema), like a Merkle tree. Then the user only needs to verify this top level hash against a secure source and skops can check the hashes of all nodes when loading.

We can add the capability to sign the stored object

Not sure how useful that is, maybe?

adrinjalali commented 1 year ago

I don't think that's very useful, since whoever can tamper the artifact, can tamper with the checksum as well.

Signing would be nicer since we can trust certain people who have signed artifacts, the same way we can do for git commits.

BenjaminBossan commented 1 year ago

I don't think that's very useful, since whoever can tamper the artifact, can tamper with the checksum as well.

That's true, but say, I download the model from a trusted website but get MITMed. Then I can still check that the final checksum is identical to the one on the website and decide not to open it if it differs.

Whether there is an additional benefit in skops checking internally, I'm not really sure.

Signing would be nicer since we can trust certain people who have signed artifacts, the same way we can do for git commits.

Could be, I would consider this to be a separate issue though.

adrinjalali commented 1 year ago

That's true, but say, I download the model from a trusted website but get MITMed. Then I can still check that the final checksum is identical to the one on the website and decide not to open it if it differs.

If you get a MITM for the file, you can also get a MITM for the checksum.

Generally I'm not sure how much checksums add to the security, I see them more like "check if a file is corrupted" kinda thing.

Maybe @narsil or @mcpatate would have other ideas.

BenjaminBossan commented 1 year ago

If you get a MITM for the file, you can also get a MITM for the checksum.

I think there can be situations where the model file was tampered with but I can still trust the website. But maybe that's too narrow, I don't know.

McPatate commented 1 year ago

Usually if the HTTPS certificate is valid, it's pretty unlikely that you got MiTM-ed.

As Adrin says, imo signing is the best way to do this, but I don't see added value from signing commits - do you want to do something separate from the Hub ?

EDIT: though I guess signing and checksum are basically the same thing, you need a trusted authority to compare the result to.

BenjaminBossan commented 1 year ago

A possible exploit could be if there is a website that mirrors the content. Of course, those could have been tampered with, but if the fingerprint matches with the one from the original source, I could still trust the file. But maybe that situation is too narrow?

McPatate commented 1 year ago

Since github uses SHA-1 to create ids for files, I you could always compute it when you download a file from a mirror and then compare that to the blob's id like this :

$ git ls-tree HEAD
100644 blob 4c850c0da7bec718eb41dec4e96a8980ec1b3ea9    .gitignore
from hashlib import sha1

with open('.gitignore', 'r') as f:
    gitignore = f.read()
content = "blob " + f"{len(gitignore)}" + "\0" + gitignore
print(sha1(content.encode("utf8")).hexdigest())
# prints 4c850c0da7bec718eb41dec4e96a8980ec1b3ea9

I don't see why people would need to go through mirrors when interacting with the hub, but I may be mistaken :)

BenjaminBossan commented 1 year ago

I don't see why people would need to go through mirrors when interacting with the hub, but I may be mistaken :)

If skops persistence finds adoption, it could be used completely independently of the hub. Though of course we can wait and see before implementing a feature that's not needed.

Narsil commented 1 year ago

@adrinjalali is on point.

Checksum is for verifying validity. It could be just flaky network random flips, or actual MiTM attacks. In order to check against MiTM you need two different channels for the file and the checksum (so that the attacker needs to handle BOTH channels to change things under you).

If you really want to protect against MiTM attacks, then signing files is more efficient. That's because usually the certificate/public key distribution mecanism is by design on a different channel. There is a 3rd trusted party that handles the management of these certificates, usually hardcoded in your browser/package manager. For iOS apps, Apple is responsible for handling the signed binaries.

It's still a pretty big endeavor if you have ever done iOS/MacOS dev you would know that the signing/registering of apps is a bit of a pain.