Playbook for hub integration

laurentsimon commented 1 month ago

We need a playbook that explains how a hub would integrate our library and which kinds of verification it needs to support. Here is a list of integration paths (not necessarily in priority order, which may depend on the hub):

Hub + UX

The hub verifies the signature and shows a "Signed by X" label in the UX. The hub accepts any signer.

Hub with identity enforcement

The hub enforces that the "right" signer has signed the model. The right signer is not always obvious. One proposal is TOFU (trust on first use), i.e., trust the first identity that uploaded the model. Any change in identity needs to be challenged, e.g., via 2FA. Because certain models may be uploaded by different users, we may have to trust the first k signers. This may be a configuration setting that, if it needs to be changed, requires a user challenge (2FA). Details TBD.
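To make the TOFU idea concrete, here is a minimal sketch of a per-model policy that trusts the first k signer identities and gates everything else on a challenge. The `TofuPolicy` class, its method names, and the `challenge_passed` flag are hypothetical and only for illustration; how the hub actually runs the 2FA challenge is out of scope.

```python
from dataclasses import dataclass, field


@dataclass
class TofuPolicy:
    """Hypothetical per-model record of trusted signer identities."""

    max_trusted: int = 1  # the "k" in "trust the first k signers"
    trusted: set[str] = field(default_factory=set)

    def check_upload(self, signer: str, challenge_passed: bool = False) -> bool:
        """Return True if an upload signed by `signer` should be accepted."""
        if signer in self.trusted:
            return True
        # Fewer than k identities recorded so far: trust on first use.
        if len(self.trusted) < self.max_trusted:
            self.trusted.add(signer)
            return True
        # Any identity beyond the first k is accepted only after a challenge
        # (e.g. 2FA) has succeeded; the challenge itself is assumed to
        # happen elsewhere.
        if challenge_passed:
            self.trusted.add(signer)
            return True
        return False

    def set_max_trusted(self, k: int, challenge_passed: bool) -> bool:
        """Changing k is itself a sensitive operation gated on a challenge."""
        if not challenge_passed:
            return False
        self.max_trusted = k
        return True
```

With `TofuPolicy(max_trusted=2)`, the first two distinct identities are accepted silently, and a third is rejected until `challenge_passed=True` is supplied.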

Hub provides list of signing history

The hub keeps metadata and evidence (signatures) about who signed the model for different model versions. The list is displayed in the UX and may also be available through a REST API. This allows anyone to monitor for signing identity changes.
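As a rough illustration of what a monitor could do with such a history, the sketch below scans consecutive versions for a change of signer. The `SigningRecord` fields and the `identity_changes` helper are assumptions made for this example, not an existing hub API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SigningRecord:
    """Hypothetical history entry kept by the hub for one model version."""

    model_version: str
    signer: str
    signature_ref: str  # where the signature (evidence) can be fetched


def identity_changes(history: list[SigningRecord]) -> list[tuple[SigningRecord, SigningRecord]]:
    """Return (previous, current) pairs where the signer differs between
    consecutive entries of a history ordered from oldest to newest."""
    changes = []
    for previous, current in zip(history, history[1:]):
        if previous.signer != current.signer:
            changes.append((previous, current))
    return changes
```

A monitor would fetch the history (for instance through the REST API mentioned above), call `identity_changes`, and alert the model owner on any non-empty result.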

Hub provides immutable list of signing history

Like the previous point, but the hub stores the history in a transparency log. This reduces the trust placed in the hub. It can also pave the way to binary (model) transparency, given the right entry type / info in the log.
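To show why an append-only log reduces trust in the hub, here is a deliberately simplified hash-chain sketch. This is a stand-in for illustration only: real transparency logs such as Sigstore's Rekor use Merkle trees with inclusion and consistency proofs, and the function names and record layout below are hypothetical.

```python
import hashlib
import json


def entry_hash(prev_hash: str, record: dict) -> str:
    """Hash a record together with the hash of the entry before it."""
    payload = json.dumps(record, sort_keys=True).encode()
    return hashlib.sha256(prev_hash.encode() + payload).hexdigest()


def append(log: list, record: dict) -> None:
    """Append a record, chaining it to the current head of the log."""
    prev_hash = log[-1]["hash"] if log else ""
    log.append({"record": record, "hash": entry_hash(prev_hash, record)})


def verify_chain(log: list) -> bool:
    """Recompute every hash; any edit to an earlier record breaks the chain."""
    prev_hash = ""
    for entry in log:
        if entry["hash"] != entry_hash(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True
```

Note that a chain like this only detects tampering if monitors remember a previously observed head hash; otherwise the log operator could rewrite the whole chain. That is the gap that consistency proofs in a real transparency log are meant to close.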

Client verifies with expected identity

The hub framework API (e.g., the huggingface API) lets users verify a model against a set of expected identities (PKI, Sigstore identity). This is mostly useful for users who know who signed the model, so it is likely most useful when a user verifies their own model.
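A sketch of what the client-side check could look like, assuming the cryptographic signature verification has already succeeded and produced the signer's identity. `SignerIdentity`, `UntrustedSignerError`, and `require_expected_signer` are hypothetical names; this is not the API of huggingface_hub or of our library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SignerIdentity:
    """Hypothetical identity pair, modelled on an (issuer, subject) identity."""

    issuer: str   # e.g. the identity provider that vouched for the signer
    subject: str  # e.g. an email address or workload identity


class UntrustedSignerError(Exception):
    """Raised when a valid signature comes from an unexpected identity."""


def require_expected_signer(
    verified_signer: SignerIdentity, expected: set[SignerIdentity]
) -> None:
    """Accept the model only if the already-verified signer is in the
    caller-supplied set of expected identities."""
    if verified_signer not in expected:
        raise UntrustedSignerError(
            f"{verified_signer.subject} (issuer {verified_signer.issuer}) "
            "is not an expected signer"
        )
```

Matching on the full (issuer, subject) pair rather than on the subject alone avoids accepting the same email address vouched for by a different identity provider.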

Clients trust a third party to verify

In this case, the client trusts a third party for verification and does not provide a signer identity to the framework API. Trusting a third party is often necessary because the identities of the legitimate signers are difficult to determine, and legitimate changes to signer identities are hard to assess. The third party would typically be the hub (which performs verification, ideally with identity enforcement, as in the second path above) and / or the monitors (when a transparency log is used to monitor identities). The hub may attest to verifications by issuing an attestation: "I, the hub, have verified model X [with config Z], and the signature for this model can be found at Y". Z could be "identity verified and enforced through 2FA", etc. The attestation may be verified on the client side when loading a model. (Trusting the TLS connection to the hub is a good first step that does not require verifying an attestation from a hub / transparency log.)
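The attestation described above could be modelled roughly as follows. The `HubAttestation` fields map to X, Z, and Y from the text; the field names, the use of SHA-256 over the raw model bytes, and the `attestation_covers` helper are assumptions for illustration. Verifying the hub's own signature over the attestation is assumed to happen before this check.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class HubAttestation:
    """Hypothetical shape of "I, the hub, have verified model X with config Z,
    and the signature can be found at Y"."""

    model_digest: str               # X: digest of the verified model
    verification_config: frozenset  # Z: checks the hub claims to have performed
    signature_location: str         # Y: where the model signature can be found


def attestation_covers(
    attestation: HubAttestation, model_bytes: bytes, required_checks: set
) -> bool:
    """Client-side check when loading a model: the attestation must be about
    exactly these bytes and must claim every check the client requires."""
    digest = hashlib.sha256(model_bytes).hexdigest()
    return (
        attestation.model_digest == digest
        and required_checks <= attestation.verification_config
    )
```

A client that only wants a valid signature would pass a small `required_checks` set, while a stricter client could additionally require a hypothetical `"identity-enforced-2fa"` entry.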

@McPatate please feel free to share your thoughts.

mihaimaruseac commented 1 month ago

CC @haydentherapper as we were discussing this too

McPatate commented 1 month ago

Thanks for the writeup @laurentsimon.

FYI on our side we usually integrate standards/formats/3rd-party tooling on the hub when we observe sufficient usage within the community to justify the integration.

Nonetheless, the places where I see this having the most value are the web UI of the Hub and huggingface_hub, or any call that downloads model weights (or any other file? Are datasets in scope?).

The easy part imo is displaying the signer or some metadata contained in the cert in the UI.

Re:

Hub with identity enforcement

I'm not convinced that you must be the owner of a certificate to upload it to your repository; usage of a non-owned model could definitely be legit, e.g. a fork of a repo. And by owned I mean the identity linked with your account must be present in the cert. The above seems to make more sense in the context of an enterprise, but then again, can specific people sign on behalf of the org?

Side question, does a certificate support a signing chain? Say Meta releases Llama-4 and I fine tune that model, will I need to overwrite the cert entirely or can there be proof that I modified the original file?

I agree though that matching a Hub username to their private key / cryptographic identity is a useful feature (sorry if that's incorrect, I'm not very up to speed on how the certs are created). I would be careful though before displaying "safe" or "unsafe" badges all over the UI.

My gut feeling is telling me I'd rather rely on an external identity provider to confirm "ownership" than TOFU.

laurentsimon commented 1 month ago

I'm not convinced that you must be the owner of a certificate to upload it to your repository; usage of a non-owned model could definitely be legit, e.g. a fork of a repo. And by owned I mean the identity linked with your account must be present in the cert.

Good point. Maybe this can be viewed as a dependency rather than model ownership / maintainer? Do you have example repos we could look at?

Side question, does a certificate support a signing chain? Say Meta releases Llama-4 and I fine tune that model, will I need to overwrite the cert entirely or can there be proof that I modified the original file?

For signing, I'm not sure, because you can't trust what the signer says, i.e., how do you verify that they're not lying about how they transformed the original model? With SLSA, we can have such a cert chain + lineage data if the training platform is able to record runtime dependencies. (We have a demo for Jupyter notebooks.)

My gut feeling is telling me I'd rather rely on an external identity provider to confirm "ownership" than TOFU.

An identity provider provides unforgeable proof of an entity's identity. TOFU is for mapping an identity to a repository / model. So they go hand-in-hand.

mihaimaruseac commented 1 month ago

+1 to Laurent. We need SLSA to record the provenance of downstream models.

sigstore / model-transparency