slsa-framework / slsa

Supply-chain Levels for Software Artifacts
https://slsa.dev

Git URI and digests for SLSA provenance #214

Open AdamZWu opened 2 years ago

AdamZWu commented 2 years ago

SLSA provenance defines digests as "cryptographic digests for the contents of the artifact". However, the git examples use the commit hash, which does not match the specification.

A possible fix is to use the git "tree hash", e.g.:

"configSource": {
  "entryPoint": "build.yaml:build",
  // The git repo revision that contains the build.yaml referenced in the entrypoint.
  "uri": "git+https://github.com/foo/bar.git@<commit-hash>",
  // The tree hash reflecting the content of the repo used for this build.
  "digest": {"sha1": "<tree-hash>"}
}

However, the SPDX download location seems to allow either a commit hash or a branch, not both: https://spdx.github.io/spdx-spec/package-information/#77-package-download-location-field

So we'd lose the ability to describe the branch name for a git repo. :(
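
For readers unfamiliar with tree hashes, here is a minimal Python sketch of how a git tree ID can be derived from content alone, with no .git directory and no history. The function names `git_object_id` and `tree_hash` are hypothetical, the input is an in-memory nested dict rather than a real checkout, and the sketch assumes every file is a regular non-executable file (mode 100644).

```python
import hashlib


def git_object_id(kind: str, body: bytes) -> bytes:
    """Raw SHA-1 of a git object: the header "<kind> <size>" + NUL, then the body."""
    header = f"{kind} {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).digest()


def tree_hash(tree: dict) -> str:
    """Hex tree ID for a nested dict: bytes values are files, dict values are directories.

    Note: git does not record empty directories, so an empty sub-dict here
    would not correspond to anything a real repository contains.
    """
    def sort_key(item):
        name, value = item
        # git orders entries by name, comparing directories as if they ended in "/".
        return (name + "/" if isinstance(value, dict) else name).encode()

    body = b""
    for name, value in sorted(tree.items(), key=sort_key):
        if isinstance(value, dict):
            mode, oid = b"40000", bytes.fromhex(tree_hash(value))
        else:
            # Assumption of this sketch: regular, non-executable files only.
            mode, oid = b"100644", git_object_id("blob", value)
        body += mode + b" " + name.encode() + b"\x00" + oid
    return git_object_id("tree", body).hex()
```

Because the result depends only on names, modes, and file contents, two commits with identical content share one tree hash, which is the property the proposed "digest" field above relies on.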


An alternative is to change the SLSA provenance definition for this field,

from: "cryptographic digests for the contents of the artifact"

to: "cryptographically secure unique identifier for the specific revision of the artifact"

Since there is no requirement that prevents the same content from having multiple unique identifiers, we can comfortably use the git commit hash for digests, and use "@" for the branch name.

MarkLodato commented 2 years ago

I was interpreting "artifact" as a commit, not a tree. Most builds do a git clone so that the .git directory is available. Thus, the build may actually read git metadata, such as to build a changelog. In that case, I think it's more correct to use the commit ID, not the tree ID.

Is there a use case for indicating the tree ID instead?

AdamZWu commented 2 years ago

First of all, commit IDs are very useful and should be part of provenance; I'm not arguing with that. :D

Second, the points I am trying to raise have no immediate use case. So this is not a "bug" thread, but a "discussion".

With the above clarifications, I have two separate concerns / cases:

  1. Verifying attestations' binding with artifacts. The "digest" value is also used as the attestation subject.

    So far we have one well-defined attestation -- the build provenance; and to verify a build provenance's binding with an artifact, we check the digest of the artifact's content.

    I really like the elegance behind this operation - it can be done efficiently, black-box (without understanding of the internals of either the artifact or the attestation), and standalone (without needing any external information). I would like to see if we can maintain this as a shared underlying "macro property" for all attestations. The benefit is that this can serve as a cornerstone for building up higher-level trusts, at almost no cost.

    For example, there are cases where the builders do not have direct access to the source repo, so they will consume pre-packaged sources. Of course we can always treat all source packages as literal build artifacts, but the extra synthetic "build" hop would abstract away many useful properties that users could have checked in their policy -- if we were able to treat (non-mutated) source packages as if they were original sources. The easiest way to achieve that is for the source control attestations to use the source code content digest (i.e. tree hash) as subject.

    Not that this is the only way, but if we have this common attestation binding verification property, we get it basically free.

    On top of that, IIUC, we have to verify the binding between every attestation and its attested artifact at some point in the supply chain anyway, so why not have a simple and universal design and be done with it?

    The issue with using the commit id as subject is that it cannot be verified as easily as the tree hash. The verification will need to understand the internals of the git repo, and/or be able to contact the original source control.

  2. Processing of material URIs. The placement of the commit id also feels a bit strange when we imagine how the materials will be used.

    Suppose we have a reproducible build provenance, and I want to reproduce the build. I would enumerate the build materials and fetch everything. Normally, I'd take each "uri" and give it to some fetcher, which will give me what I want.

    But not for git.

    The SPDX download location for git repo entries is incomplete -- I won't be able to reproduce my build if I use that directly.

    I will need to watch out for git URIs, and when encountered:

    1. Take some information (commit hash) from the digest (which I normally won't do for fetching other artifacts);
    2. Parse the incomplete SPDX download location;
    3. Re-assemble a new SPDX download location with the commit hash.

    If the commit id is part of the URI, then fetching source code will be smooth sailing, just like fetching any other artifact from its URI.
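
The three git-specific steps above could be sketched in Python roughly as follows. This is only an illustration: `pin_git_location` is a hypothetical helper, it assumes the commit hash is stored under the "sha1" key of the material's digest (as in the git examples discussed here), and it assumes the SPDX form `git+<transport>://<host>/<path>[@<revision>][#<sub_path>]`.

```python
def pin_git_location(material: dict) -> str:
    """Rebuild an SPDX-style git download location pinned to the digest's commit."""
    uri = material["uri"]
    if not uri.startswith("git+"):
        # Non-git materials can be fetched from their URI as-is.
        return uri

    # Step 1: take the commit hash from the digest.
    commit = material["digest"]["sha1"]

    # Step 2: parse the location into repo part, optional "@revision", optional "#sub_path".
    location, sep, sub_path = uri.partition("#")
    last_slash = location.rfind("/")
    # Only look for "@" after the last "/", so "git@host" user info is left alone.
    # Limitation of this sketch: a branch name containing "/" is not handled.
    at = location.find("@", last_slash + 1)
    if at != -1:
        location = location[:at]

    # Step 3: re-assemble the location with the commit hash as the revision.
    return f"{location}@{commit}{sep}{sub_path}"
```

A fetcher would then call this for every material before downloading, which is exactly the git-only special case that putting the commit id in the URI would avoid.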

MarkLodato commented 2 years ago

Thanks! That definitely helps me understand.

On a related note, one idea I had was that we could pass around the literal git commit object to link the commit ID to the tree ID (but not necessarily the other way around.) The consumer could hash it (to verify the commit ID) and read out the tree ID. Not super desirable but might be an option for a workaround.
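
A rough Python sketch of that workaround, assuming the consumer receives the uncompressed commit object body (what `git cat-file commit <id>` prints) alongside the expected commit ID. The function name `tree_id_from_commit` is hypothetical.

```python
import hashlib


def tree_id_from_commit(raw_commit: bytes, expected_commit_id: str) -> str:
    """Check that raw_commit hashes to expected_commit_id, then return its tree ID."""
    # A git object ID is the SHA-1 of "<type> <size>" + NUL + body.
    header = b"commit " + str(len(raw_commit)).encode() + b"\x00"
    actual = hashlib.sha1(header + raw_commit).hexdigest()
    if actual != expected_commit_id:
        raise ValueError("commit object does not match the expected commit ID")

    # The first line of a commit object names the tree: "tree <hex id>".
    first_line = raw_commit.split(b"\n", 1)[0]
    if not first_line.startswith(b"tree "):
        raise ValueError("commit object has no leading tree line")
    return first_line[len(b"tree "):].decode()
```

This only goes in one direction, as noted above: the commit object proves which tree a commit ID refers to, but a tree ID alone does not identify a commit.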

TomHennen commented 2 years ago

I wonder if the solution is to list all the relevant things in the digest?

This was discussed a bit here: https://github.com/in-toto/attestation/issues/28

What if digest contained both git-commit-sha1 and git-tree-sha1? (And whatever else might be useful)
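
As an illustration of how a consumer might use such a digest, here is a small Python sketch. The key names come from the suggestion above; the helper `digests_match` and its "any shared algorithm matches" rule are assumptions of this sketch, not something this thread has settled.

```python
def digests_match(recorded: dict, observed: dict) -> bool:
    """True if at least one algorithm present in both digest sets has the same value."""
    shared = recorded.keys() & observed.keys()
    # Assumption: a single matching entry is enough. A stricter policy could
    # instead require every shared entry to match.
    return any(recorded[key] == observed[key] for key in shared)


# Placeholder values, mirroring the <commit-hash>/<tree-hash> placeholders used earlier.
material_digest = {
    "git-commit-sha1": "<commit-hash>",
    "git-tree-sha1": "<tree-hash>",
}
```

With both entries present, a verifier that only has a source package could compare on "git-tree-sha1", while one with a full clone could compare on "git-commit-sha1".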

bureado commented 2 years ago

Possibly related, https://hackmd.io/@aeva/draft-gitbom-spec