slsa-framework / slsa-github-generator

Language-agnostic SLSA provenance generation for Github Actions
Apache License 2.0

[feature] bring your own builder #763

Open laurentsimon opened 2 years ago

laurentsimon commented 2 years ago

it may be a useful component for others to create provenance with the same format across GH builders. See https://github.com/sigstore/fulcio/issues/754#issuecomment-1227505585

laurentsimon commented 2 years ago

Something I wanted to note is that re-usable workflows cannot call each other. So on GH, it may not be feasible to separate the "builder" and "generator" as different "trusted entities", which is why I'm proposing a GHA to ensure the same provenance format. The original thread (https://github.com/sigstore/fulcio/issues/754) suggested separating the entities, but it's not currently possible, AFAIK.

znewman01 commented 2 years ago

Following up on conversation from https://github.com/sigstore/fulcio/issues/754:

I don't entirely follow. At least in our case, the build and the provenance generation are separate jobs. The format remains the same, and only the buildConfig / builder.id change across builders. Agreed that if the code that's responsible for populating the buildConfig can be hijacked, it could forge the steps. But this code is part of the TCB, IIUC.

Originally posted by @laurentsimon in https://github.com/sigstore/fulcio/issues/754#issuecomment-1227505585

I want separate TCBs for provenance generation (which IMO should be higher-security, and change very infrequently) and the compilation stage. I should be able to quickly verify that I trust the provenance even if I'm not convinced about the compilation workflow. The provenance should include what compilation stage I used (without trusting the calling workflow).

Maybe you're proposing having a dedicated project for provenance generation only? We kind of have this in the generator repo. We don't expose it and only use it internally, though. We could, in theory, expose it thru a GitHub action.

Originally posted by @laurentsimon in https://github.com/sigstore/fulcio/issues/754#issuecomment-1227505585

Yes, exactly. But there are a couple of subtleties.

First, I don't want the compilation workflow to invoke the provenance generator component, because every time the compilation workflow changes, I'd have to worry about the provenance generator too (which is higher-consequence if compromised). I want direct control over the version of provenance generator used in a repo.

But I can't call a generic provenance generator: this means I have to audit the repository to trust that the contents of the generic provenance generator are okay, even if it has a valid signature from a release of the generic provenance generator (because I may have modified the calling workflow to feed it bad artifacts).

Basically, my use case is the npm one: I want to, in an automated fashion, verify the provenance of an artifact, and that it was built using a specific compilation workflow from a publicly-known source repo at a known hash. This is fine if we trust the builder to invoke the provenance generator correctly.

Really what I want is a generic provenance generator that wraps other compilation actions: you would give it some-org/nodejs-builder@ABCDEF as an input, and it would call that. Then, it would grab an artifact from a standard location.

Then, if I have a certificate indicating that the provenance generator came from a trusted workflow, I know that all of the provenance information is accurate, even if the build was malicious, including the build workflow that was invoked.


I have no idea if GitHub supports such a thing, and I'm probably being a little pedantic—at the end of the day, I have to trust both the provenance generation and the compilation steps, and "it's really easy to integrate standard, signed provenance into my builder" is probably good enough.

Also, I could be confused, and there could be a reasonable way to check that we have a signature from a specific version of the provenance-generating workflow over provenance that captures the compilation workflow that ran.

znewman01 commented 2 years ago

Ahh, yeah per @laurentsimon's comment upthread, what I want isn't possible on GitHub. I'm glad I wrote it down, though. Maybe a FR for GitHub Actions?

EDIT: seems like I'm not the only one to want this: https://github.com/actions/runner/issues/2079, https://github.com/actions/runner/issues/1541

znewman01 commented 2 years ago

Oh, and

Something I wanted to note is that re-usable workflows cannot call each other

may not be true anymore (as of 3 days ago): https://docs.github.com/en/actions/using-workflows/reusing-workflows#nesting-reusable-workflows

I still don't think this gets us what we want (see my comment above).

laurentsimon commented 2 years ago

may not be true anymore (as of 3 days ago): https://docs.github.com/en/actions/using-workflows/reusing-workflows#nesting-reusable-workflows

Great find! I wish they announced such changes... or maybe I'm not subscribing to the right repos...

I still don't think this gets us what we want (see my comment above).

Why not?

laurentsimon commented 2 years ago

Really what I want is a generic provenance generator that wraps other compilation actions: you would give it some-org/nodejs-builder@ABCDEF as an input, and it would call that. Then, it would grab an artifact from a standard location.

That's basically what we do in our generators, except that we use VMs within the same workflow to call the generator and the builder, instead of a re-usable workflow for each. Within our trusted workflow, we can call an external trusted builder via a GitHub Action. If you trust the top-level re-usable workflow, you trust it to call the Action properly. All we really need is for the builder Action to support a "dry-run" option to get the steps in a trusted way in order to populate the buildConfig of the provenance. Today we only call internal code, but it's technically do-able to call external Actions.

One issue may be to make the builder name dynamic. If we're really confident about parsing the input, we could turn script injection in our favor, but it's a little dangerous :-)

If we can call the trusted builder via an API instead of an Action, that would work. Another way would be to simply fork the repo of the trusted builder and run the code of the trusted builder, or even just grab their signed binaries from their releases and run that in a VM / job. The latter is what we do today in our builders - not sure why I got hung up on GH Actions :).

So I think this is technically do-able, without the need for re-usable workflow chains.

znewman01 commented 2 years ago

Very cool! I think that would do what I'm looking for 🚀

ianlewis commented 2 years ago

Responding to the comments on the previous issue.

Plus, now there's one provenance generator for each ecosystem that needs to be audited. Even if we're reusing components, there's not an easy way to check "my provenance came from a trusted generator, even if the builder is bad" without going into the source of the builder and parsing the workflow.

I think the plan is to share the provenance generation code with other builders for a given CI. On GitHub, we could theoretically create an Action for this. /cc @ianlewis

For a single build service with some common info this is doable. It becomes a lot harder when talking about completely different build services that might have very different data and methods for retrieving the trusted data from the builder. So doable for GitHub Actions. Much less doable for other build services like Gitlab etc.

"my provenance came from a trusted generator, even if the builder is bad" without going into the source of the builder and parsing the workflow.

IIUC This is a requirement of SLSA 3's Non-falsifiable provenance requirement. i.e. builds have to be isolated from the provenance generation such that build service users cannot change the provenance. In practice this means there must be a hard security boundary between the user's actual build code, and the provenance generator. In the case of slsa-github-generator we use different GHA jobs and thus different VMs.
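That job-level isolation can be sketched as a minimal caller workflow. This is an illustrative sketch only (the runner label, build command, and artifact paths are placeholders): the untrusted build and the trusted provenance generation run as separate jobs, i.e. separate VMs, and the only data crossing the boundary is the subject hashes.

```yaml
# Sketch: build and provenance generation as separate jobs / VMs,
# so the user's build steps cannot tamper with provenance generation.
jobs:
  build:
    runs-on: ubuntu-latest
    outputs:
      hashes: ${{ steps.hash.outputs.hashes }}
    steps:
      - uses: actions/checkout@v4
      - run: make release   # untrusted, user-controlled build
      - id: hash
        run: echo "hashes=$(sha256sum out/* | base64 -w0)" >> "$GITHUB_OUTPUT"

  provenance:
    needs: [build]   # runs on a fresh VM; only receives the subject hashes
    permissions:
      id-token: write   # for keyless signing
      contents: write
      actions: read
    uses: slsa-framework/slsa-github-generator/.github/workflows/generator_generic_slsa3.yml@v1.10.0
    with:
      base64-subjects: ${{ needs.build.outputs.hashes }}
```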

Generally I expect SLSA 3 builders to implement this functionality, and thus you should be able to determine this via the builder.id.

laurentsimon commented 2 years ago

This feature is also useful to onboard scanners, like Syft. We'd just run their CLI and attest to the output. /cc @lumjjb

laurentsimon commented 2 years ago

What would be even better is if we could turn an existing GHA into an attested one. This may be possible as well. If the input to our workflow is an action, we dynamically generate the call to the action. Inputs would need to be a map, so that we can pass them to the action... I don't think this is supported yet. This means users would need to pass the input as a JSON string for us to parse... as a v0 PoC this may be acceptable.

Note: if the input is a binary CLI instead of an action, we also need the arguments, so argument passing is a common problem to solve.
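A v0 sketch of the JSON-string idea above. The workflow path and input names here are hypothetical, not an existing API:

```yaml
jobs:
  attest:
    # Hypothetical trusted workflow that wraps an arbitrary GHA.
    uses: slsa-framework/slsa-github-generator/.github/workflows/attested_action.yml@vX
    with:
      # The GHA to wrap and attest to.
      action: "some-org/some-action@abcdef"
      # workflow_call inputs cannot be maps, so the action's inputs are
      # passed as a JSON string that the trusted workflow parses before
      # dynamically generating the call to the action.
      action-inputs: '{"target": "release", "verbose": "true"}'
```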

laurentsimon commented 2 years ago

A simpler solution to start with could be to operate a service like https://github.com/actions/starter-workflows, but for SLSA. Ask interested builders, scanners, etc to submit a config file that describes the input, output, and commands to run.

ianlewis commented 2 years ago

I also thought about a "command" workflow that would run a command or set of commands and attest to some output. I like the idea of turning an existing GHA into an attested one but I wonder if we can get enough info from outside of the workflow.

asraa commented 2 years ago

For subject output measurement: this would require one of

  1. User specifying some output to measure for subject
  2. Measure the output files of some workflow directly

I like (2) a lot in the case that we have an existing GHA.

For a CLI: we would have to resort to (1)

laurentsimon commented 2 years ago

Would we have a config file on our repo, on the user's repo... or everything via workflow input parameters?

laurentsimon commented 2 years ago

Something to think about: where do we store the artifact type? DSSE? Intoto predicate type? Inside the buildConfig?

laurentsimon commented 1 year ago

Another angle to think about is branding. For an ecosystem like npm, users will be more receptive if the re-usable workflow is located in an npm org / repository, rather than in the slsa-framework repo. However, building re-usable workflows is hard, takes time, and requires maintenance. Given that re-usable workflows can now call each other, we could have:

  1. npm/npm/.github/workflows/build-and-publish.yml. Users call this re-usable workflow. It acts as a proxy and simply calls a re-usable workflow hosted on our repository, using some lower-level arguments, such as the "type" of predicate (SBOM, scanner, etc.), a path to the binary, and the command to run.
  2. slsa-framework/slsa-github-generator/.github/workflows/builder_node_slsa3.yml does all the work, and can report the arguments (binary, command) in the provenance. For different types of artifacts (SBOM, scans) and different vendors (Syft, Trivy), we'd need to agree on a type: sbom/syft/whatever, scan/syft/syft-scan-type, or something to this effect.

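A sketch of what the proxy in (1) could look like. The input names and argument values below are hypothetical; only the two workflow paths come from the proposal above:

```yaml
# Hypothetical npm/npm/.github/workflows/build-and-publish.yml: a thin proxy
# that forwards to the lower-level slsa-framework re-usable workflow.
on:
  workflow_call:
    inputs:
      node-version:
        type: string
        required: false

jobs:
  slsa-build:
    # Pinned version, so npm controls which generator runs on its users' behalf.
    uses: slsa-framework/slsa-github-generator/.github/workflows/builder_node_slsa3.yml@v1.2.3
    with:
      predicate-type: "sbom"      # hypothetical lower-level "type" argument
      binary-path: "dist/cli.js"  # hypothetical path to the binary
      command: "npm publish"      # hypothetical command to run
```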
laurentsimon commented 1 year ago

Here's a way we can implement this feature, at a high-level. Let's say goreleaser is the builder / toolchain here.

Goreleaser maintainers create a re-usable workflow to wrap their Action, say at https://github.com/goreleaser/goreleaser-action/tree/master/.github/workflows/builder.yml

This workflow looks roughly like the following (I've omitted a lot of details for simplicity):

    run-tool:
      steps:
        # Checkout the developer repository to scan / build / create an SBOM for.
        - uses: actions/checkout@xxx

        # Run the action.
        - uses: goreleaser/goreleaser-action
          with:
            ...

        # Create SLSA subjects.
        - id: hash
          env:
            ARTIFACTS: ...
          run: |
            checksum_file=$(sha256sum "$ARTIFACTS")
            echo "::set-output name=hashes::$(echo "$checksum_file" | base64 -w0)"

    slsa-attest:
      uses: slsa-framework/slsa-github-generator/path/to/some_workflow.yml@v1.2.3
      with:
        base64-subjects: ${{ needs.run-tool.outputs.hashes }}
        args: <>

Users of Goreleaser can use the https://github.com/goreleaser/goreleaser-action/tree/master/.github/workflows/builder.yml instead of calling the Action.

From Goreleaser's point of view:

We would need to decide what the format of the provenance looks like, so that we can report the re-usable workflow that called us inside the provenance. Here's a possible format to get the ball rolling:

  "buildConfig": {
      "version": 1,
      "builder": {
           "id":  "https://github.com/goreleaser/goreleaser-action/tree/master/.github/workflows/builder.yml@refs/tags/v1.2.0",
           "sha1": "abcdef...",
           "type": "delegatedReusableWorkflow",
        },
       // Optional steps or buildConfig provided by the re-usable workflow calling us (?)
      "steps": {
           "https://github.com/goreleaser/goreleaser-action.yaml@v1.2.3",
           "argument1", "argument1-value", etc
       }
laurentsimon commented 1 year ago

Something that's not covered in the above provenance is when there are several re-usable workflows involved. Say:

developer -> org-reusable-workflow -> goreleaser-reusable-workflow -> slsa-github-generator
                                   -> scan-reusable-workflow  -> slsa-github-generator
                                   -> slsa-github-generator (example above)

Do we want to support this? What would the provenance format look like?

@ckotzbauer please chime in if you have some advice / ideas.

znewman01 commented 1 year ago

Broadly, that plan makes a lot of sense!

The main hesitation I have is just that there's a lot of copy-paste involved in the wrapper reusable workflow, which makes it harder for a verifier to audit whether it's trusted. Not sure we can do anything about that without some big enhancements to GH Actions.

vlsi commented 1 year ago

    # Create SLSA subjects
    - id: hash
      env:
        ARTIFACTS: ...
      run: |
        checksum_file=$(sha256sum "$ARTIFACTS")
        echo "::set-output name=hashes::$(echo "$checksum_file" | base64 -w0)"

Is there a "standard" format for it? In other words, does slsa-framework/slsa-github-generator consume just a list of hashes? Is the hash sha256 always or is it "bytes + algorithm"?

laurentsimon commented 1 year ago

It's currently always sha256, because it's the most popular one used in the slsa specs. The "format" we use is the output of sha256sum command. We've seen potential use cases for goDirhash recently for Go packages, but have not added support.
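Concretely, the expected input is the raw `sha256sum` output, base64-encoded. A caller step might look like the following sketch (artifact names are illustrative; it uses the newer `$GITHUB_OUTPUT` mechanism rather than the since-deprecated `::set-output`):

```yaml
- id: hash
  run: |
    # sha256sum emits "<64-hex-digest>  <filename>" per line; the
    # generator decodes the base64 input and parses those lines back out.
    sha256sum artifact1.tar.gz artifact2.tar.gz > checksums.txt
    echo "hashes=$(base64 -w0 < checksums.txt)" >> "$GITHUB_OUTPUT"
```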

vlsi commented 1 year ago

The Sigstore bundle spec leans toward algorithm + digest bytes for hashes.

// Only a subset of the secure hash standard algorithms are supported.
// See https://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.180-4.pdf for more
// details.
enum HashAlgorithm {
        SHA2_256 = 0;
        SHA2_512 = 1;
}

// HashOutput captures a digest of a 'message' and the corresponding hash
// algorithm used.
message HashOutput {
        HashAlgorithm algorithm = 1;
        // This is the raw octets of the message digest as computed by
        // the hash algorithm.
        bytes digest = 2;
}

https://github.com/sigstore/cosign/pull/2204/files#diff-0c949f5460747a137445ecd0069d02014cc93ab8210e1ec3588c87634d872865R27


Can the provenance be split/grouped into several files? For instance, Java project releases might have multiple "folders" within a single release:

There are cases when a single release includes 100+ modules.

They all are a part of the "JUnit 5.9.1" release. However, it would be great if all those junit-jupiter, junit-jupiter-api, ... had their own attestations. In other words, junit-jupiter/5.9.1/junit-jupiter.intoto, junit-jupiter-api/5.9.1/junit-jupiter-api.intoto, and so on. It would help with discoverability, so those who download only junit-jupiter-api know they can resolve the attestation at junit-jupiter-api/5.9.1/junit-jupiter-api.intoto.

In that regard, if the sha256 hashes for all the released files are mixed together, the resulting attestation would grow large.

WDYT if the builder groups the checksums so it could get several attestations back?

laurentsimon commented 1 year ago

The Sigstore bundle spec leans toward algorithm + digest bytes for hashes.

we're trying to make it easier for the caller to call the builder without the need to format their input in JSON / protobuf / etc. But we can support other hashes if need be, by providing a hash-name option. Note that the provenance itself follows the SLSA specs and does contain the hash type in the output subject{ name:xxx, digests{sha256: xxx}}.

Can the provenance be split/grouped into several files?

Thanks for bringing this up, we want to be sure we support the options users need.

Technically this should be do-able. Several options I'd like to explore and get your feedback on:

  1. A single top-level .intoto.jsonl file with multiple subjects: junit-jupiter, junit-jupiter-api, etc. Is this acceptable for your case? Or is the main problem discoverability (users won't know where to find the provenance and having it next to the artifact is more natural)?
  2. Multiple {MODULE}-{VERSION}.intoto.jsonl files as you propose. Would you still want a top-level bundle in this case (a concatenation of all .intoto.jsonl files)?
  3. If we group hashes by subject name, what input format do you think would work for you (UX-wise)? (Note: I think we need to ensure the solution allows the use of strategy / matrix, but that's an implementation detail we can ignore for now)

@ianlewis this is also relevant for our generic generator
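For (3), one purely hypothetical input shape would key each attestation by module name (the `base64-subjects-map` input does not exist; this is only a UX sketch to react to):

```yaml
with:
  # Hypothetical grouped input: one attestation per key, each value being
  # the base64-encoded sha256sum output for that module's files only.
  base64-subjects-map: |
    junit-jupiter: "<base64 of sha256sum over junit-jupiter files>"
    junit-jupiter-api: "<base64 of sha256sum over junit-jupiter-api files>"
```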

ianlewis commented 1 year ago

Can the provenance be split/grouped into several files?

I think this is doable. Though some granularity in the data might be lost if you want to capture the build steps in the provenance and know what all the outputs of the build process are. Usually that's not a primary concern though.

However, it would be great if all those junit-jupiter, junit-jupiter-api, ... had their own attestations. In other words, junit-jupiter/5.9.1/junit-jupiter.intoto, junit-jupiter-api/5.9.1/junit-jupiter-api.intoto, and so on. It would help with discoverability, so those who download only junit-jupiter-api know they can resolve the attestation at junit-jupiter-api/5.9.1/junit-jupiter-api.intoto.

For the reasons you mentioned this is usually my preference. In the absence of some discovery mechanism that doesn't exist yet, it's easiest to look for <artifact name>.intoto.jsonl and know that the artifact and provenance are related.

behnazh-w commented 1 year ago

There are cases when a single release includes 100+ modules.

They all are a part of the "JUnit 5.9.1" release. However, it would be great if all those junit-jupiter, junit-jupiter-api, ... had their own attestations. In other words, junit-jupiter/5.9.1/junit-jupiter.intoto, junit-jupiter-api/5.9.1/junit-jupiter-api.intoto, and so on. It would help with discoverability, so those who download only junit-jupiter-api know they can resolve the attestation at junit-jupiter-api/5.9.1/junit-jupiter-api.intoto.

In that regard, if the sha256 hashes for all the released files are mixed together, the resulting attestation would grow large.

I have a similar use case for a Gradle project that includes several artifacts as part of a single release. I find it useful to both have a top-level provenance that is published as a GitHub release asset with multiple subjects (however that could grow very large, hence this issue: https://github.com/slsa-framework/slsa-github-generator/issues/845) and separate provenances like <artifact name>.intoto.jsonl to be published to Maven Central next to the artifact using a Maven/Gradle plugin.

laurentsimon commented 1 year ago

There are cases when a single release includes 100+ modules. They all are a part of the "JUnit 5.9.1" release. However, it would be great if all those junit-jupiter, junit-jupiter-api, ... had their own attestations. In other words, junit-jupiter/5.9.1/junit-jupiter.intoto, junit-jupiter-api/5.9.1/junit-jupiter-api.intoto, and so on. It would help with discoverability, so those who download only junit-jupiter-api know they can resolve the attestation at junit-jupiter-api/5.9.1/junit-jupiter-api.intoto. In that regard, if the sha256 hashes for all the released files are mixed together, the resulting attestation would grow large.

I have a similar use case for a Gradle project that includes several artifacts as part of a single release. I find it useful to both have a top-level provenance that is published as a GitHub release asset with multiple subjects (however that could grow very large, hence issue #845) and separate provenances like <artifact name>.intoto.jsonl to be published to Maven Central next to the artifact using a Maven/Gradle plugin.

I wonder whether it would cause confusion that the provenance uploaded to Maven central is different from the one in GitHub release. Would users not expect to see the top-level provenance as well on Maven central?

ianlewis commented 1 year ago

@vlsi Do you have a link or something to the code or workflow that is used to generate releases like the one you described?

aalmiray commented 5 months ago

As far as I understand, the BYOB feature expects custom builders to run on a single worker node. What if the build requires creating artifacts on different nodes (Linux, macOS, Windows)?

Is there a way to let the build part of the BYOB be run on separate nodes, collect artifacts, and continue with attestation?

laurentsimon commented 5 months ago

We have not yet implemented support for different runners because nobody asked (and to keep things simple to start), but it would be possible. I think the way it'd work is that you'd call the BYOB multiple times, once for each runner you want. Would that work?

Has anyone asked for support for runners besides Linux?

aalmiray commented 5 months ago

Well, invoking jreleaser multiple times is doable, but there's the problem of collating all the artifacts and perhaps updating an existing release. It definitely gets trickier this way than building on separate workers, collecting all the artifacts on a single worker, and performing a release.

laurentsimon commented 5 months ago

How about the following: expose a runner list to users. The jreleaser workflow then calls BYOB in different jobs (using if: ${{ contains(inputs.runners, "runner-name") }}). The jreleaser workflow would then collect and update. Would that work, or do you think BYOB itself should be in charge of running on each runner and aggregating the information?
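Roughly, the runner-list idea could look like this in the jreleaser workflow. A sketch only: the job names, the `runners` input, and the gating pattern are hypothetical; only the delegator workflow path exists in the repo:

```yaml
on:
  workflow_call:
    inputs:
      runners:
        type: string              # e.g. "ubuntu-latest,windows-latest"
        default: "ubuntu-latest"

jobs:
  build-linux:
    if: ${{ contains(inputs.runners, 'ubuntu-latest') }}
    uses: slsa-framework/slsa-github-generator/.github/workflows/delegator_generic_slsa3.yml@v1.10.0
    # with: setup outputs omitted

  build-windows:
    if: ${{ contains(inputs.runners, 'windows-latest') }}
    uses: slsa-framework/slsa-github-generator/.github/workflows/delegator_generic_slsa3.yml@v1.10.0
    # with: setup outputs omitted

  collect:
    # Collate artifacts from whichever builds ran, then update the release.
    needs: [build-linux, build-windows]
    if: ${{ !cancelled() }}
    runs-on: ubuntu-latest
    steps:
      - run: echo "download artifacts and update the GitHub release here"
```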

aalmiray commented 5 months ago

That'd be great but I don't think it's feasible with the current impl of the jreleaser/java-builder. As far as I understand it, this step informs the SLSA delegator where to find the BYOB action

https://github.com/jreleaser/release-action/blob/183737d738aec2490de55a9ff35fe9ed801453b2/.github/workflows/builder_slsa3.yml#L67-L82

Then this step executes the builder action inside a worker node setup by the delegator

https://github.com/jreleaser/release-action/blob/183737d738aec2490de55a9ff35fe9ed801453b2/.github/workflows/builder_slsa3.yml#L84-L95

During this step, artifacts are built and released, then attested.

Wouldn't setting different worker nodes outside of the delegator be considered a possible break in trust?

laurentsimon commented 5 months ago

That'd be great but I don't think it's feasible with the current impl of the jreleaser/java-builder. As far as I understand it, this step informs the SLSA delegator where to find the BYOB action

If jreleaser workflow calls BYOB for each runner, it could pass a different Action path if needed. Essentially you would have 3 setup calls, and 3 delegator_generic_slsa3 calls.

Note: you should update to using v1.10.0. There was a breaking change in Sigstore a few weeks ago and we released a new version https://github.com/slsa-framework/slsa-github-generator/blob/v1.10.0/CHANGELOG.md#v1100

During this step, artifacts are built and released, then attestated.

Correct.

Wouldn't setting different worker nodes outside of the delegator be considered a possible break in trust?

I don't think it would. The user still needs to trust the jreleaser workflow, which is considered the builder. BYOB is merely a framework to help write your own builders.

aalmiray commented 5 months ago

If jreleaser workflow calls BYOB for each runner, it could pass a different Action path if needed. Essentially you would have 3 setup calls, and 3 delegator_generic_slsa3 calls.

Then the delegator would have to accept a parameter specifying which OS should be used by the worker, right? At the moment the worker is explicitly set to Linux

https://github.com/slsa-framework/slsa-github-generator/blob/4534a0b24500dfdd11685f2950cba9a35086c4d2/.github/workflows/delegator_generic_slsa3.yml#L129-L135

It would have to be parameterized to support all managed runners, with Windows being problematic because of scripts, and macOS because of Docker (for some builds, I suppose).

Note: you should update to using v1.10.0.

I'll do so shortly.

laurentsimon commented 5 months ago

If jreleaser workflow calls BYOB for each runner, it could pass a different Action path if needed. Essentially you would have 3 setup calls, and 3 delegator_generic_slsa3 calls.

Then the delegator would have to accept a parameter specifying which OS should be used by the worker, right? At the moment the worker is explicitly set to Linux

Yes, you're correct. We would add support for other runners. Let me know if this is something people have asked for.