Discussion: Use a Docker image for building binaries/files prior to generating SLSA provenances

rbehjati commented 2 years ago

Follow-up discussion about the idea of using a Docker image as the builder/releaser, as we have in project Oak.

In project Oak, and as part of our transparent-release work, we use a builder image for building binaries. The builder image is a Docker image, which has all the tools required for building the binary installed, and the required environment variables set. It might be interesting to use a similar idea here for building the binaries and generating the provenances. This can be used as an alternative to tools like go-releaser.

Currently in our tooling for transparent-release, the build command is a docker run command that runs a given command in the builder image. When generating SLSA provenances, we include this information as the BuildConfig. See also our custom buildType. In addition, we include the builder image in the list of materials. The build tool fetches the specified docker image and ensures that the command for building the binary is executed using the fetched builder image. The builder image is identified by a URI containing the digest of the image. If the versions of the toolchains are fixed in the Dockerfile (example from Oak) and the checksums are verified, then this can get very close to the idea of a trusted builder.

Here is an example of such a SLSA provenance file, with BuildConfig and materials as described above.

This is our GitHub action that generates provenances. We generate provenances for each commit that is merged into the main branch. It currently doesn’t use the build tool from transparent-release (because the build does not yet generate a provenance file), but we plan to use this build toll with a simple TOML file similar to this example. The idea is to have the TOML file checked into the repo as a static file (containing only the command, the output_path and a few other fields), and let the GitHub actions job fill out the commit_hash and the builder_image URI that are different for each commit and invocation of the build tool.

A similar minimal TOML can be used here for building the binary using a builder image provided by the maintainers of the repo.

Note that in our approach, we don’t fill out the invocation part in the SLSA predicate, as we think all the information is provided in the buildConfig, and materials.

cc @laurentsimon, @tiziano88

laurentsimon commented 2 years ago

Thanks for sharing @rbehjati !

I really like the approach of using docker to accommodate complex build systems. We could "easily" wrap this up in a reusable workflow to streamline the the work.

@MarkLodato @asraa wdut?

asraa commented 2 years ago

I really like the approach of using docker to accommodate complex build systems. We could "easily" wrap this up in a reusable workflow to streamline the the work.

+1!!!

I really like this -- currently this provenance generator is responsible for creating signed provenance populated with github context information. I was just chatting with Laurent on how we could easily create shared code that could apply to these docker image builds and other use cases on github workflows. For general use cases like yours:

(1) We can use the output of HostedActionProvenance https://github.com/slsa-framework/slsa-github-generator/blob/9a875d0adc1f3d8339e210938a4b2543f5cd3984/slsa/provenance.go#L34 to create a base statement with Github Workflow context information (2) Provide library functions to augment with specific information like the buildConfig, custom buildType, and materials (3) Generate and output signed provenances with either (A) Use Fulcio signer in this library if a raw output is needed or (B) cosign CLI signers in the case of uploading the provenance attached directly to the image (we can also raw output this).

In this tool the AI's for the above process would be:

Support augmenting the provenance in library
Output just the predicate json instead of the intoto statement. this would allow passing the predicate json to cosign

asraa commented 2 years ago

Are you planning on signing the provenance in order for the output of the trusted builder to be non-forgeable?

If so, where would you hold the provenance? Maybe on the provenance branch with a file commit like .sig? The .sig file would contain something like this, which contains the sig, certificate, and offline info to verify the transparency log information. Cosign understands this format as input. https://gist.github.com/asraa/6471825cb23aaa053292348edcea0e2e

ianlewis commented 2 years ago

This looks pretty neat. I like the idea of having the build and provenance generation be more fully encapsulated.

If we can make sure that the context where the builder image gets executed is safe enough and can't alter the provenance down the line, we can maybe allow folks to specify their own image and the reusable workflow will be responsible for executing it and incorporating the information on the commands run into the provenance.

It seems worth exploring more.

laurentsimon commented 2 years ago

I also thought about users giving their own dockerfile: I was wondering if there are options (say, root docker) which may compromise security. But I think the consensus is that this is out of scope of SLSA, so I think it would work too.

ianlewis commented 2 years ago

Yeah, I from a technical standpoint we need to make sure that whatever does the build can't modify the provenance (i.e. command X was run but command Y was added to the provenance) but other than that we don't necessarily care what the build actually does).

I think we should be able to to things like get things like the entrypoint from the image? or just add the docker run command w/ image & image hash as a build step?

We probably do need a way to extract the build artifacts from the build job in a safe way too. Maybe the docker image would need to get the sha256sum of the artifacts and upload them itself? and then pass the names off to the provenance step somehow?

tiziano88 commented 2 years ago

I think we should be able to to things like get things like the entrypoint from the image? or just add the docker run command w/ image & image hash as a build step?

We probably do need a way to extract the build artifacts from the build job in a safe way too. Maybe the docker image would need to get the sha256sum of the artifacts and upload them itself? and then pass the names off to the provenance step somehow?

I believe both these are covered by @rbehjati 's suggestion: the repo owner declares the Docker image digest (which is content-addressed, and needs to have been pre-built and pushed to some Docker registry) and the current git commit (which is also content-addressed) is mounted at a specific (possibly configurable) path under that docker image; also the owner specifies the command to run under this configuration, and what output file to measure; the measurement needs to be done in a trustworthy way by the workflow after the command has finished running (i.e. we don't need to trust the owner for this either).

These parameters may be provided to the workflow directly, or ideally as a separate TOML file. The latter has the advantage that we can build (and in fact have already built) tooling that parses it and runs the same exact steps locally, i.e. on the developer's machine, without having to rely on GitHub actions for instance while debugging things. Also it can be ported trivially to other CI systems; e.g. the same TOML file may be the input to the GitHub actions workflow, but also a Google Cloud Build workflow (without having to maintain two distinct workflow files in sync).

Finally a provenance file is created with all these pieces of information, plus whatever other metadata we get from the builder via the OIDC token.

tiziano88 commented 2 years ago

Are you planning on signing the provenance in order for the output of the trusted builder to be non-forgeable?

If so, where would you hold the provenance? Maybe on the provenance branch with a file commit like .sig? The .sig file would contain something like this, which contains the sig, certificate, and offline info to verify the transparency log information. Cosign understands this format as input. https://gist.github.com/asraa/6471825cb23aaa053292348edcea0e2e

Another option that I would like to explore at some point is to set up an instance of https://github.com/google/ent (a universal Content Addressable Store) and store provenances there. It would be a nice counterpart to Rekor: Rekor stores signatures over hashes, Ent stores the actual artifacts (or provenances, or anything else really) indexed by their hash.

rbehjati commented 2 years ago

If we can make sure that the context where the builder image gets executed is safe enough and can't alter the provenance down the line, we can maybe allow folks to specify their own image and the reusable workflow will be responsible for executing it and incorporating the information on the commands run into the provenance.

That is exactly the idea with the build tool in transparent-release. It would be nice to further generalize this tool and turn it into a reusable library that can be used in GitHub workflows.

ianlewis commented 2 years ago

That is exactly the idea with the build tool in transparent-release. It would be nice to further generalize this tool and turn it into a reusable library that can be used in GitHub workflows.

Yeah, I'm sure we could do something similar and then to run the provenance generation step in a separate job from the builder as I'd not like to trust the builder job at least after the untrusted container gets executed.

It seems the path to the repo could be provided or you fetch it. I'm sure we could just checkout the repo in a build step, but I'm curious if there is any any benefit you saw to having the builder fetch the repo itself? https://github.com/project-oak/transparent-release/blob/58721e709f89052ccbec8282543354f9a396dfdb/common/common.go#L351

rbehjati commented 2 years ago

It seems the path to the repo could be provided or you fetch it. I'm sure we could just checkout the repo in a build step, but I'm curious if there is any any benefit you saw to having the builder fetch the repo itself?

It is just for convenience. When running as a GitHub action, the repo is already checked out, so the option of passing the path to the builder would have to be used.

laurentsimon commented 2 years ago

cc @loosebazooka working on distroless - this may be the way we can generate provenance using a script/dockerfile

ianlewis commented 2 years ago

Just for clarity, there are two discussions we can have

Allow folks to provide Dockerfile the workflow builds and/or pushes
Allow folks to build an arbitrary artifact using a Docker image (this is what the Oak folks are proposing)

I think both have merit but probably need to be discussed separately. Probably implemented as different workflows.

laurentsimon commented 2 years ago

Once the container workflows are available, we can also verify the container image provenance before using it as a builder (in the re-usable workflow).

Any strong reasons for using TOML vs yaml?

laurentsimon commented 2 years ago

follow-up: do we actually need a config file? The container should be able to do everything, especially for complex build that require fetching sources from different places. I would imagine the interface to the builder could be just:

uses: path/to/builder
with:
  output-folder: ./some/folder/
  dockerfile: path/to/Dockerfile   OR.  # Note: this would assume the Dockerfile is stored on the same repository
  image: gcr.io/bla@sha256:xxx

dockerfile is great for a human to read, but poor in terms of immutability (apt-get, etc). The image is great for immutability, but harder for a human to inspect unless it itself has provenance.

rbehjati commented 2 years ago

Any strong reasons for using TOML vs yaml?

I don't have a strong preference :)

follow-up: do we actually need a config file?

Perhaps not. The snippet you have provided should work. I think our original design was intended at reusing the same builder image for building several binaries. But you could wrap all the options for building each binary into a new Dockerfile and a separate docker image, and just use that for building the binaries. This is perhaps a better solution, especially if we are providing provenances for docker images too (which should be straightforward).

tiziano88 commented 2 years ago

follow-up: do we actually need a config file?

I think it's still nice to have a dedicated config file (in fact, one per target, of which there may be multiple). The main use case for this is that it would then be possible to build tooling to make it possible to run the build locally (even though of course it would not generate a signed provenance). For instance, my main problem with GitHub actions at the moment is that it's impossible to run something similar to their own builder locally. At least GCB does allow triggering a job from a local machine, but ideally we should be able to run something equivalent completely locally and offline, and verifying the output.

In principle we could also build something that parses a GitHub actions workflow file to extract these fields from there, but it seems backwards to me.

I am thinking of these TOML / YAML files as targets in a Makefile, and it should be possible to invoke any of them, without involving GitHub actions at all. In fact, GitHub actions should itself delegate to our builder, and the GitHub actions workflow file should simply point to the relevant TOML / YAML files IMO.

Re: TOML vs YAML, I don't mind too much, but TOML is substantially simpler to parse correctly than YAML, while still being sort of human readable. I don't think most users would even notice the difference anyways.

rbehjati commented 2 years ago

I agree with Tiziano. Minimal BuildConfigs like the ones we currently have in transparent-release would be nice.

laurentsimon commented 2 years ago

follow-up: do we actually need a config file?

I think it's still nice to have a dedicated config file (in fact, one per target, of which there may be multiple). The main use case for this is that it would then be possible to build tooling to make it possible to run the build locally (even though of course it would not generate a signed provenance). For instance, my main problem with GitHub actions at the moment is that it's impossible to run something similar to their own builder locally. At least GCB does allow triggering a job from a local machine, but ideally we should be able to run something equivalent completely locally and offline, and verifying the output.

You can trigger a remote build from a local machine using a GH API. Local builds are not possible, though. Unless you use something like https://github.com/nektos/act. But that's not universal for all builders.

I don't know if the repo config is needed, because it's up to the container to do what it wants with it.. and maybe even ignore it. So maybe we could simply the config?

The two options that seems necessary are builder_image and output_path: the rest can be defined in the docker container, it seems? (I may be missing some nuances).

For a rebuilder, the expected_sha256 is necessary. But maybe this should be handled by another entity which applies some policy on the builder's results instead? That'd allow us to simplify the builder.

Or we need to think differentiate the 2 use cases: builders and re-builders...?

rbehjati commented 2 years ago

If you are calling docker run directly in the GitHub actions, then I agree that reop and commit_hash are not needed. They are already clear from the context, and can be included in the provenance statement.

If the idea is to have a separate builder tool, for instance for better testability, similar to what we have in transparent-release, then repo and commit_hash must be explicitly provided to the builder tool. Alternatively, you could include those configs directly in the builder image, but then it means that a different builder image would be needed for each commit, which sounds inconvenient. This would also require more code to be reviewed (i.e., a separate Dockerfile file must be reviewed for each commit).

For a rebuilder, the expected_sha256 is necessary. But maybe this should be handled by another entity which applies some policy on the builder's results instead? That'd allow us to simplify the builder.

I agree. We are going to completely remove expected_sha256.

tiziano88 commented 2 years ago

We should also clarify exactly what the trust model is, and in particular what gets cryptographically bound to what and by whom. For instance, AFAICT, Fulcio is the root of trust that binds the identity of a job with a fresh signing certificate. Presumably information about the job is embedded in the certificate itself, but we need to start from that. Similarly, Fulcio in turn trusts a token generated by GitHub actions itself, so we should also look at what that token contains, and how it is bound to the workload. For instance, I expect the commit hash to be bound to the certificate somehow, but it would be good to clarify what a verifier would have to do to confirm this; in particular, a verifier would probably not trust the commit hash field in the provenance file, but it would actually look at the one bound in the certificate (or perhaps compare both of these, and ensure they are correct).

laurentsimon commented 2 years ago

If you are calling docker run directly in the GitHub actions, then I agree that reop and commit_hash are not needed. They are already clear from the context, and can be included in the provenance statement.

How about the following:

The builder in this repo need not be aware of a config file. All it needs to take as input is:

uses: path/to/builder
with:
output-folder: ./some/folder/
dockerfile: path/to/Dockerfile   OR.  # Note: this would assume the Dockerfile is stored on the same repository
image: gcr.io/bla@sha256:xxx
configuration: something # An opaque string interpreted by the dockerfile / container image

Here users may want to access some env variables / GitHub context, so we would forward them docker -e _bla_. This is flexible enough that any maintainer can call it the way they want.

The provenance file would attest to the repo / hash and container image (TBD where we'd report it)

The transparent release is one use case for the generic builder. A user who wants to use it would use the container image provided by the Oak team, and call it as:
```
uses: path/to/builder
with:
output-folder: ./some/folder/
image: gcr.io/oak-builder@sha256:xxx
configuration: ./path/config.toml
```

Would the above work?

laurentsimon commented 2 years ago

For instance, I expect the commit hash to be bound to the certificate somehow, but it would be good to clarify what a verifier would have to do to confirm this; in particular, a verifier would probably not trust the commit hash field in the provenance file, but it would actually look at the one bound in the certificate (or perhaps compare both of these, and ensure they are correct).

the repo / hash is bound to a cert, you're correct. It's also inside the provenance and a verifier should be able to trust it so long as they trust our builder, which is also embedded in the cert that Fulcio signs.

Let me know if this clarifies the trust model or not.

rbehjati commented 2 years ago

Would the above work?

I think we can make it work. But I am still a bit worried about the testability of this approach. Before writing the workflow action, we'd want to be able to test it locally. I suppose occasionally people might need to debug the build as well. What is required for testing this locally?

What does the builder (obtained from path/to/builder) do? I expect it to only build the binary, and then the workflow will (1) generate the provenance, (2) sign the provenance, and (3) publish the provenance to Rekor. Is that correct?

laurentsimon commented 2 years ago

Would the above work?

I think we can make it work. But I am still a bit worried about the testability of this approach. Before writing the workflow action, we'd want to be able to test it locally. I suppose occasionally people might need to debug the build as well. What is required for testing this locally?

Since the Dockerfile + config defines everything, a user should be able to run the equivalent docker build -e ... command locally. Let me know if I missed something.

What does the builder (obtained from path/to/builder) do? I expect it to only build the binary, and then the workflow will (1) generate the provenance, (2) sign the provenance, and (3) publish the provenance to Rekor. Is that correct?

Correct. Nothing else.

rbehjati commented 2 years ago

Thanks for yesterday's meeting. The following is my summary of the discussions, referring heavily to the following suggestion from @laurentsimon:

uses: path/to/builder
with:
  output-folder: ./some/folder/
  image: gcr.io/oak-builder@sha256:xxx
  configuration: ./path/config.toml

The builder in uses: path/to/builder is a reusable workflow that the SLSA team will provide. It can potentially reuse some of the functionality in transparent-release/cmd/builder.
The builder runs docker run .... Additional arguments, similar to the ones in transparent-release/cmd/builder may be required, for instance to mount the current working directory (i.e., the root of the repo) to workspace.
Main inputs to the builder are:
- The git commit hash and the URI to the git repo (fetched from the GitHub context),
- The id of the builder image (specified explicitly by the user),
- The output path (specified explicitly by the user),
The BuildConfig in transparent release has currently an additional command filed. Assuming that the command is baked into the builder docker image (which should be possible), we don't need to pass command to the builder.
Similarly, configuration is most likely not required. It could be made optional or dropped entirely. Perhaps start without it in the initial implementation and only add it if there are use cases that need it.
A test tool that simulates the behaviour of the builder should be provided to allow local testing and experimentation. In particular, this will save the user from having to manually build the docker run ... command with all the flags and options.
Hopefully the work will be completed in early October.

Remaining questions:

Should Dockerfile be supported in addition to a docker image?
Decide on the format and contents of the buildConfig and materials in the generated provenance. The builder image, as well as the repo (together with the git commit hash) must be included as materials. For the buildConfig the most important piece of information that is not covered in materials is the output-folder. Anything else that should go to the buildConfig?
[Bonus question, not discussed in the meeting]: Should not output-folder be output-path? The build command may generate additional files that we might not want to include in the provenance.

Please add to or correct my summary if anything is missing or incorrect.

tiziano88 commented 2 years ago

Thanks for the update @rbehjati and @laurentsimon !

I am not sure I understand how this would be used for transparent release in practice, and especially how someone would verify the generated provenance file. Could we go through an example? It would help me understand things better.

Perhaps let's consider this build config file from Oak: https://github.com/project-oak/oak/blob/87a33746f3f512ec3ece204fa26704bdf9a08846/buildconfigs/oak_functions_loader_base.toml

What would the corresponding workflow be?

In particular, how would things work without having the configuration field?

And why is the output path in the workflow instead of the build config? Is it because the workflow needs to do something special with it, and cannot parse the config file?

The builder runs docker run .... Additional arguments, similar to the ones in transparent-release/cmd/builder may be required, for instance to mount the current working directory (i.e., the root of the repo) to workspace.

I think we should not allow providing extra arguments if we can avoid it at all. I would prefer we literally hardcode a single "standard" mount path (I think /workspace is a good candidate) at least for now, and we can change it if necessary in the future. Anyways if you all really think it should be configurable from the start, we can add it as a field in the build config file.

The BuildConfig in transparent release has currently an additional command filed. Assuming that the command is baked into the builder docker image (which should be possible), we don't need to pass command to the builder.

I don't think baking the command in the docker image scales well; for instance, in Oak, we want to use the same docker image, but run different commands for different targets. Hardcoding the command in the image would require us to create (and maintain) as many images as commands, plus additional images for local development. But maybe you meant something else and I misunderstood the point?

Should Dockerfile be supported in addition to a docker image?

I suggest not supporting Dockerfile, since in general building a docker image is not an idempotent operation, even from the same Dockerfile. This may introduce subtle issues when things appear to run on the same image, but actually the images are completely different. e.g. imagine the Dockerfile has a command that fetches a resource from a URL, and the target of the URL changes over time. This would be solved if we used content addressable stores for everything, but we are not there yet :) (shameless plug for https://github.com/google/ent )

laurentsimon commented 2 years ago

And why is the output path in the workflow instead of the build config? Is it because the workflow needs to do something special with it, and cannot parse the config file?

I think this is to let the docker image parse it. This way the re-usable workflow we provide in this repo could be used for any purposes: users who just want to define their pipeline via a Dockerfile / image, or transparent release users who use a common container image provided by you. Someone who wants to re-build will have the "configuration" available in the provenance , so they should be able to re-run the build. In a nutshell, we're just saying that this configuration is opaque to the the builder in this repo, and the container image can interpret the way it wants.

The BuildConfig in transparent release has currently an additional command filed. Assuming that the command is baked into the builder docker image (which should be possible), we don't need to pass command to the builder.

I don't think baking the command in the docker image scales well; for instance, in Oak, we want to use the same docker image, but run different commands for different targets. Hardcoding the command in the image would require us to create (and maintain) as many images as commands, plus additional images for local development. But maybe you meant something else and I misunderstood the point?

Using a "configuration" should solve the problem. The container image (which you control) will get the repository from the GH env variables, checkout the repo and read the config, then extract the relevant information, including the path to the script.

When re-running locally, we start with the provenance instead. The provenance records the env variables, GH context, the configuration string, and the command to run docker run -e GITHUB_BLA -e GITHUB_CONFIGURATION, etc. So you can re-play the run.

You could replay it in GCB as well.

Would that work?

AI on my end is to read your paper, so let me know if I mis-understood something.

rbehjati commented 2 years ago

I don't think baking the command in the docker image scales well; for instance, in Oak, we want to use the same docker image, but run different commands for different targets. Hardcoding the command in the image would require us to create (and maintain) as many images as commands, plus additional images for local development. But maybe you meant something else and I misunderstood the point?

Using a "configuration" should solve the problem. The container image (which you control) will get the repository from the GH env variables, checkout the repo and read the config, then extract the relevant information, including the path to the script.

My understanding is that the generic builder would not have to read or parse the config file. If so, this proposal sounds good to me. What do you mean by the path to the script?

When re-running locally, we start with the provenance instead. The provenance records the env variables, GH context, the configuration string, and the command to run docker run -e GITHUB_BLA -e GITHUB_CONFIGURATION, etc. So you can re-play the run.

I suppose the command will be more complicated than that. For instance, for the image to be able to use the config file, it should be mounted. Also, some testing might be required before setting up the workflow on GitHub. At that point there is no provenance to start from. So I still think some additional tooling for testing locally should be provided. But we can work out the details of that as we make progress with the design.

laurentsimon commented 2 years ago

I don't think baking the command in the docker image scales well; for instance, in Oak, we want to use the same docker image, but run different commands for different targets. Hardcoding the command in the image would require us to create (and maintain) as many images as commands, plus additional images for local development. But maybe you meant something else and I misunderstood the point?

Using a "configuration" should solve the problem. The container image (which you control) will get the repository from the GH env variables, checkout the repo and read the config, then extract the relevant information, including the path to the script.

My understanding is that the generic builder would not have to read or parse the config file. If so, this proposal sounds good to me. What do you mean by the path to the script?

Sorry for the confusion. What I meant is that if "configuration" option is a path to a script, then your image will be responsible for reading this file. If the "configuration" is JSON-like or any other format, then you can use it right away. It's up to you to decide. Does this work?

When re-running locally, we start with the provenance instead. The provenance records the env variables, GH context, the configuration string, and the command to run docker run -e GITHUB_BLA -e GITHUB_CONFIGURATION, etc. So you can re-play the run.

I suppose the command will be more complicated than that. For instance, for the image to be able to use the config file, it should be mounted. Also, some testing might be required before setting up the workflow on GitHub. At that point there is no provenance to start from. So I still think some additional tooling for testing locally should be provided. But we can work out the details of that as we make progress with the design.

I agree that some tooling will be required but hopefully the can be hidden from end users.

rbehjati commented 2 years ago

Yes. This generally sounds good to me. Thanks.

laurentsimon commented 2 years ago

suggest not supporting Dockerfile, since in general building a docker image is not an idempotent operation, even from the same Dockerfile. This may introduce subtle issues when things appear to run on the same image, but actually the images are completely different

If the image is the same, ie same hash this would be fine, correct?

I've been thinking about dockerfile support. I think many users will want to declare a Dockerfile, and not go thru the extra steps of generating a container image themselves. GHA today can do that for users, ie you can define a Dockerfile to build a GHA. One idea could be to support Dockerfiles, but cache the corresponding images for subsequent builds. Something like the following:

compute the dockerfile digest DD, and look for a container "builder:DD" on ghcr.io. If the container exists:
- Verify its provenance.
- Extract the commit from the provenance, fetch the dockerfile at the commit and compute its digest DC. Compare it to DC == DD. If true, we can use the image to build. If not, create the container image and push it to ghcr.io (permissions packages: write needed).
build using the container image

Would this work? Wdut?

Note: I don't think this is needed by the release transparency project, we can continue supporting an image as one of the inputs.

rbehjati commented 2 years ago

Would this work? Wdut?

I am not a Docker or GHA expert, so I cannot really say if it works or not, but the solution sounds good to me, especially given your note about the release transparency project. Thanks.

slsa-framework / slsa-github-generator

Discussion: Use a Docker image for building binaries/files prior to generating SLSA provenances #23