slsa-framework / slsa

Supply-chain Levels for Software Artifacts
https://slsa.dev

Discussion: Possibly ambiguous language regarding the use of cache artifacts #894

Open adityasaky opened 1 year ago

adityasaky commented 1 year ago

Currently, the Provenance spec reads as follows regarding caches:

During execution, the build process might communicate with the build platform’s control plane and/or build caches. This communication is not captured directly in the provenance, but is instead implied by builder.id and subject to SLSA Requirements. Such communication SHOULD NOT influence the definition of the build; if it does, it SHOULD go in resolvedDependencies instead.

Zeroing in on communicating with the cache specifically, I'm parsing this as "if using a cached artifact changes the build definition, record it in resolvedDependencies." First of all, is this a fair reading of the sentence?

If yes, it raises the question of what changing the build definition means. buildDefinition describes the inputs to the build. The two ways I see that using a cache can impact the inputs to the build are a) as a configuration to use the cache, and b) by treating a cached artifact as an input to the build, as it changes the build result in some way.

The first is entirely disconnected from actually using a cached artifact in my mind because a build can be configured to use a cache and end up with zero hits. Further, this configuration wouldn't actually end up in resolvedDependencies, but presumably one of the parameters fields? On the other hand, the second raises other questions about cache behavior.

  1. Does a cached artifact qualify as a build input? Possibly yes if the builds are not reproducible, but even then, we have variation between two build executions that each don't use the cache.
  2. How do we even know if a cached artifact impacts the build if builds aren't repeated or otherwise tested for reproducibility? (It also raises the possibility of a build being reproducible when using a cached artifact which is not reproducible itself.)

The requirements table calls out the impact of caches for the isolation requirement:

It MUST NOT be possible for one build to inject false entries into a build cache used by another build, also known as “cache poisoning”. In other words, the output of the build MUST be identical whether or not the cache is used.

This seems to indicate that using a cache must not have an impact on builds beyond a configuration to use or not use the cache (it also alludes to reproducibility if I'm reading this right). Which means cache-specific behavior is never expected to be recorded in Provenance? Regardless of what the intent is, I think some clarifying may be in order! :smile:

arewm commented 1 year ago

Interesting, the part that makes the least sense to me is

This communication is not captured directly in the provenance, but is instead implied by builder.id

I generally read these sections as indicating that all cached content used for the build must have an immutable reference (i.e. a hash) of the content to be used for addressing. If the build system leverages a cache in a build, the provenance should include the cached content as it was originally cached.

Therefore, a build system may cache dependencies pulled/requested by builds as long as the cache is immutably referenceable and the build system maintains all record of the cache's sources to be included in the provenance.

If, for some reason, the cache is a change from the original, then the resolvedDependencies would have to capture that. I would expect that this is an action that the build system would perform regardless of whether the cache is used in order for the following clause to be true:

In other words, the output of the build MUST be identical whether or not the cache is used.

adityasaky commented 1 year ago

If the build system leverages a cache in a build, the provenance should include the cached content as it was originally cached.

Could you restate this? I'm having trouble parsing this, sorry.

Therefore, a build system may cache dependencies pulled/requested by builds as long as the cache is immutably referenceable and the build system maintains all record of the cache's sources to be included in the provenance.

I think I agree with the first part but I'm not sure about the second, because the Provenance spec seems to indicate that the use of a cached artifact must be recorded only if it impacts the build.

If, for some reason, the cache is a change from the original, then the resolvedDependencies would have to capture that.

Just to clarify, do you mean that a cached artifact changes? So, say an artifact foo-1.1.0 is cached with hash A but at some point it's overwritten with hash B because it's not reproducible?

In other words, the output of the build MUST be identical whether or not the cache is used.

I think this is also a little confusing because it is dependent on the build being reproducible, which is not currently a requirement.

arewm commented 1 year ago

Let me try to restate my argument instead of directly responding to all of the questions.

I see a cache as an implementation detail of a build system. It saves time and processing for a build by saving off "some state" of a build so that a future build which has the same reference can reuse the same state without having to regenerate it. To this end, a cached object is almost like an artifact itself -- but an intermediate artifact. While SLSA isn't recursive, if we treat a cached item as an intermediate artifact, then that artifact should have a known provenance according to the build platform's targeted SLSA level.

Therefore, if a build system is producing an artifact by means of a cache, I would expect that it would be able to maintain the provenance of said cache and keep it associated when the cached artifact is created. When a future build comes along and pulls in that cached artifact, the build system will then be able to inject the same (original) provenance into that of the final resulting artifact.

The reproducibility of a cached artifact is a separate concern. Some caches may be reproducible, but others may not be. Caches could contain "raw" dependencies (e.g. go/npm/pip packages) or they could contain output from previous build steps. If any processing is required to produce a cached artifact then this should be represented in the resolvedDependencies in the final artifact's provenance. If all cached items used for a build are themselves considered to be consistent with some bar of reproducibility then they shouldn't disqualify an artifact pulling in those cached artifacts from a maximum of that same bar of reproducibility.

A cache could literally just be a set of intermediate artifacts which are then aggregated for the final artifact. In essence, this wouldn't be any different than executing all required steps in the build without the use of a cache.

Ultimately, the management of a cache including its population and retrieval from it are gated by the build system at some level. These processes should ensure (for build L3 at least) that appropriate controls are in place for builds to refer to cached artifacts with immutable references -- in a way to prevent cache poisoning. This may be using digests/hashes of requested resources, it may be using trusted output from previous build steps, it may be some other process. The important part is that the build system is capable of preventing the poisoning. If for some reason a cached artifact is overwritten then it shouldn't be retrievable using the old cached artifact's references.
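
One way to read "immutable references" is content addressing. The Python sketch below is only an illustration of that idea, not how any particular build platform works: the ContentAddressedCache class and the sample byte strings are hypothetical. Each entry's reference is the SHA-256 digest of its content, so a rebuilt artifact gets a new reference rather than silently replacing what the old reference returns, and a tampered entry is rejected on read.

```python
import hashlib


class ContentAddressedCache:
    """Hypothetical cache whose references are SHA-256 digests of the content."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        # The reference is derived from the content, so it cannot be chosen
        # by the build that stores the entry.
        digest = hashlib.sha256(data).hexdigest()
        self._blobs[digest] = data
        return digest

    def get(self, digest: str) -> bytes:
        data = self._blobs[digest]
        # Re-verify on read so a modified entry is never returned under
        # its old reference.
        if hashlib.sha256(data).hexdigest() != digest:
            raise ValueError("cache entry does not match its digest")
        return data


cache = ContentAddressedCache()
ref_a = cache.put(b"foo-1.1.0, build A")
ref_b = cache.put(b"foo-1.1.0, build B")  # a non-reproducible rebuild
original_a = cache.get(ref_a)

# Simulate someone overwriting the stored bytes behind the platform's back.
cache._blobs[ref_a] = b"tampered"
try:
    cache.get(ref_a)
    tamper_detected = False
except ValueError:
    tamper_detected = True
```

Here the two builds of foo-1.1.0 end up under different references, which matches the expectation that an overwritten artifact is not retrievable through the old reference.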

MarkLodato commented 1 year ago

Thanks for pointing this out. I agree that it is ambiguous and ill-defined. Here's a stab at a better set of definitions / requirements (we'll need diagrams, but hopefully this will do for now). Sorry for the long post.

Definitions (without build cache)

For example, suppose a build had the external parameters {"source": "https://git.example.com/foo.git", "command": "./build.sh"}. This does the following:

Then foo.git and libz.tar.gz are dependencies, foo.o (among others) is an intermediate artifact, and foo is an output artifact.

In provenance, intermediate artifacts SHOULD NOT be recorded while dependencies MUST be recorded at the prospective future Build L4.
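
As a rough illustration of that distinction, the Python sketch below assembles a provenance predicate for the example build. The field names (buildDefinition, externalParameters, resolvedDependencies, runDetails, builder.id) follow the SLSA v1 provenance format; the buildType and builder.id URIs, the location of libz.tar.gz, and both digests are hypothetical placeholders, since the thread does not give them.

```python
import json

# Placeholder digests, hypothetical values for illustration only.
FOO_GIT_COMMIT = "0" * 40
LIBZ_SHA256 = "1" * 64

provenance_predicate = {
    "buildDefinition": {
        # Hypothetical build type URI.
        "buildType": "https://example.com/build-types/shell/v1",
        "externalParameters": {
            "source": "https://git.example.com/foo.git",
            "command": "./build.sh",
        },
        # Dependencies are recorded; the intermediate artifact foo.o is not.
        "resolvedDependencies": [
            {
                "uri": "git+https://git.example.com/foo.git",
                "digest": {"gitCommit": FOO_GIT_COMMIT},
            },
            {
                # Hypothetical download location for libz.tar.gz.
                "uri": "https://example.com/libz.tar.gz",
                "digest": {"sha256": LIBZ_SHA256},
            },
        ],
    },
    "runDetails": {
        # Hypothetical builder identity.
        "builder": {"id": "https://example.com/builder/v1"},
    },
}


def dependency_uris(predicate: dict) -> list:
    """Return the URIs listed under resolvedDependencies."""
    deps = predicate["buildDefinition"]["resolvedDependencies"]
    return [dep["uri"] for dep in deps]


print(json.dumps(provenance_predicate, indent=2))
```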

Build cache

A build cache is an optimization to a build that reuses intermediate artifacts and/or dependencies from prior builds rather than building from scratch and/or fetching from an external resource, respectively. Logically a build cache SHOULD NOT have a material impact on the behavior of the build, meaning that the output SHOULD be identical whether or not the cache is used. However, in practice most build caches are vulnerable to "cache poisoning" attacks, where one build can insert build cache entries such that another build will behave differently than it would have had the build cache not been used.

Continuing our example above, suppose build.sh uses a build cache for foo.o, keyed by hash(foo.c). If hash(foo.c) exists in the cache, the build will use that value as foo.o rather than compiling from foo.c. This is vulnerable to cache poisoning because a prior build can insert a false entry for hash(foo.c) that was compiled using a different, possibly malicious process.
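
A toy Python model of that attack, under loud assumptions: the cache is a plain dict shared between builds, compile_honestly is a stand-in for a real compiler, and every name here is hypothetical. The point is only that a lookup keyed by hash(foo.c) trusts whatever value an earlier build stored under that key.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def compile_honestly(source: bytes) -> bytes:
    # Stand-in for a real compiler: derives an "object file" from the source.
    return b"OBJ:" + source


def build_with_cache(source: bytes, cache: dict) -> bytes:
    """Return foo.o, preferring a cache entry keyed by hash(foo.c)."""
    key = sha256_hex(source)
    if key in cache:
        # The entry is used as-is; nothing ties it to an honest compile.
        return cache[key]
    obj = compile_honestly(source)
    cache[key] = obj
    return obj


foo_c = b"int main(void) { return 0; }"

# A prior, malicious build writes a false entry under the honest key.
shared_cache = {sha256_hex(foo_c): b"OBJ:malicious payload"}

poisoned_output = build_with_cache(foo_c, shared_cache)
clean_output = build_with_cache(foo_c, {})  # same build, empty cache
```

The two outputs differ even though the source is identical, which is exactly the case the "identical whether or not the cache is used" wording is meant to rule out.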

Therefore, unless excepted below:

Exception: If a build platform guarantees through its design that a build cache is not vulnerable to cache poisoning attacks, then cached intermediate artifacts can be ignored in the provenance while cached external dependencies can be treated the same as coming from the original source. In practice, this requires the following:

Coming back to the example, if the build were rearchitected such that the compilation of foo.c into foo.o occurred in a separate build environment and the cache entry were keyed by a SHA256 hash of all inputs to that separate build environment, then the cache would likely not be vulnerable to cache poisoning attacks.
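
A minimal sketch of that rearchitected design, again with hypothetical names (cache_key, PlatformCache, fake_compile) and a made-up set of inputs. It assumes the platform, not the build, runs the isolated step and writes the entry, and that the key is a SHA-256 hash over a canonical encoding of every input to that step. Python does not enforce the privacy of _entries; in a real platform that separation would have to come from the isolation boundary.

```python
import hashlib
import json


def cache_key(inputs: dict) -> str:
    """SHA-256 over a canonical JSON encoding of all inputs to the step."""
    canonical = json.dumps(inputs, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()


class PlatformCache:
    """Hypothetical cache that only the build platform can populate."""

    def __init__(self, run_isolated_step):
        self._run_isolated_step = run_isolated_step
        self._entries = {}

    def get_or_build(self, inputs: dict) -> bytes:
        key = cache_key(inputs)
        if key not in self._entries:
            # Only the platform writes here, and only with the output of
            # the isolated step for exactly these inputs.
            self._entries[key] = self._run_isolated_step(inputs)
        return self._entries[key]


calls = []


def fake_compile(inputs: dict) -> bytes:
    # Stand-in for compiling foo.c in a separate build environment.
    calls.append(inputs)
    return ("OBJ:" + cache_key(inputs)).encode()


cache = PlatformCache(fake_compile)
inputs = {
    "source_sha256": hashlib.sha256(b"int main(void) { return 0; }").hexdigest(),
    "compiler": "cc-1.0",  # hypothetical toolchain identifier
    "flags": ["-O2"],
}
first = cache.get_or_build(inputs)
second = cache.get_or_build(inputs)  # served from the cache
other = cache.get_or_build({**inputs, "flags": ["-O0"]})  # different key
```

Because the flags and toolchain are part of the key, a build with different inputs cannot collide with, or be served, another build's entry.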

Open questions

If you agree with the above, then does the discussion of build caches really belong at Build L3, or is it just an L4 thing? If a build opts in to a cache, e.g. with https://github.com/actions/cache, then I think it's out of scope for L3. What if it's enabled by default - does that affect L3 now?

arewm commented 1 year ago

Cached intermediate artifacts MUST be considered dependencies and SHOULD have their own provenance. This mostly has an impact for the future Build L4 where all dependencies MUST be recorded in externalDependencies.

Do you mean that all dependencies MUST be recorded in resolvedDependencies?

The perspective that I shared earlier came from one where the build platform is managing the cache entirely. Builds might request data via the build platform and the platform is capable of determining whether a cached item can be returned or whether a new artifact must be built/retrieved. In this scenario, I think it is still valid to include restrictions on caches in L3 as indicated in the specification:

It MUST NOT be possible for one build to inject false entries into a build cache used by another build, also known as “cache poisoning”. In other words, the output of the build MUST be identical whether or not the cache is used.

If builds are managing/updating caches, then I think that falls along the same "well-intentioned build" statement that is included in L3 as well. A side effect of this is that the cached dependencies may not be represented in the provenance as the control plane may not know about the dependencies.

There are no sub-requirements on the build itself. Build L3 is limited to ensuring that a well-intentioned build runs securely. It does not require that a build platform prevents a producer from performing a risky or insecure build. In particular, the “Isolated” requirement does not prohibit a build from calling out to a remote execution service or a “self-hosted runner” that is outside the trust boundary of the build platform.

Therefore, I believe that the clarification belongs in L3. An L4 requirement would be to effectively disallow any caches that are not controlled by the build platform.

In provenance, intermediate artifacts SHOULD NOT be recorded while dependencies MUST be recorded at the prospective future Build L4.

This confused me initially, but after reading https://slsa.dev/provenance/v1#rundetails, it makes sense. These intermediate artifacts are those produced during a build and do not have use for future builds after completion.

MarkLodato commented 1 year ago

[...] future Build L4 where all dependencies MUST be recorded in externalDependencies.

Do you mean [...] resolvedDependencies?

Oops, yes. Edited to fix the typo.

where the build platform is managing the cache entirely. [...] In this scenario, I think it is still valid to include restrictions on caches in L3 as indicated in the specification

Yeah, that makes sense.

Right now, it says that, if a build cache is used, it MUST NOT be susceptible to cache poisoning from prior builds. What I was suggesting was perhaps we could relax this to either that OR you consider the cache untrusted and thus anything fetched from the cache equivalent to an external dependency. Though now that I say that, I'm not so sure. What do you think?

arewm commented 1 year ago

The cache requirements cannot be enforced if the build platform is not in full control of it. Therefore, I think that we can clarify in L3 to indicate that a cache run/operated by the build platform MUST NOT be poisonable. If any other cache is used in the build itself then the no build sub-requirements statement would continue to hold.

That being said, I would expect that anything the build platform pulls from the cache will ultimately be represented as a resolvedDependency as I noted earlier --

Therefore, if a build [platform] is producing an artifact by means of a cache, I would expect that it would be able to maintain the provenance of said cache and keep it associated when the cached artifact is created. When a future build comes along and pulls in that cached artifact, the build [platform] will then be able to inject the same (original) provenance into that of the final resulting artifact.

Therefore, when the build platform generates the provenance for the artifact, it should be able to resolve and include all of the dependencies which are pulled from the cache.

MarkLodato commented 1 year ago

That being said, I would expect that anything the build platform pulls from the cache will ultimately be represented as a resolvedDependency as I noted earlier

I'm not sure that is always practical or desirable. For example, consider a Bazel-based build platform that uses Remote Execution under the hood and caches intermediate artifacts using the Content Addressable Storage (CAS). Some builds have >100k intermediate artifacts. Recording all of these artifacts in the resolvedDependencies would make it enormous (though perhaps it would just have to list the direct dependencies), and generating provenance for all of them might be infeasible. (But I'm not 100% sure here.)

That's why I was thinking that one might just ignore the cache if the risk of cache poisoning is sufficiently low.

arewm commented 1 year ago

Would a Bazel-based system be able to differentiate between intermediate artifacts and the resolved dependencies? I wasn't trying to say that everything pulled from the cache should be indicated as a resolved dependency. Instead, the resolved dependencies used to create the cached artifacts used in a build must be captured in the resulting build's provenance.

PR #901 is an attempt to represent this.
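
To make the idea concrete, here is a small Python sketch under stated assumptions: each cache entry carries the resolved dependencies that were used to create it, and the platform merges those into the final provenance. The function names, the cached libz.o entry, the libz.tar.gz URL, and the digests are all hypothetical.

```python
def dep_key(dep: dict) -> tuple:
    """Identify a dependency by its URI and digest set."""
    return (dep["uri"], tuple(sorted(dep["digest"].items())))


def resolve_dependencies(fetched: list, cache_hits: list) -> list:
    """Merge directly fetched dependencies with those behind each cache hit."""
    merged = {dep_key(dep): dep for dep in fetched}
    for hit in cache_hits:
        # The cached artifact itself is not listed, only what it was built from.
        for dep in hit["resolvedDependencies"]:
            merged[dep_key(dep)] = dep
    return [merged[key] for key in sorted(merged)]


# Placeholder dependencies with hypothetical digests.
FOO_GIT = {
    "uri": "git+https://git.example.com/foo.git",
    "digest": {"gitCommit": "0" * 40},
}
LIBZ = {
    "uri": "https://example.com/libz.tar.gz",
    "digest": {"sha256": "1" * 64},
}

# Build without the cache: both dependencies are fetched directly.
without_cache = resolve_dependencies([FOO_GIT, LIBZ], [])

# Build with the cache: an object built earlier from libz.tar.gz is reused,
# and the cache entry remembers the dependency it was built from.
cached_libz_o = {"name": "libz.o", "resolvedDependencies": [LIBZ]}
with_cache = resolve_dependencies([FOO_GIT], [cached_libz_o])
```

In this sketch the two lists come out equal, so the resolvedDependencies of the final artifact do not reveal whether the cache was hit.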

MarkLodato commented 1 year ago

Oh, I see. Yeah, I think that aligns with my thinking. In other words, it would look the same whether or not the cache is used?

arewm commented 1 year ago

Yep, exactly.

adityasaky commented 1 year ago

Thanks for all the responses! I'm going to respond to some points by both of you here.

Logically a build cache SHOULD NOT have a material impact on the behavior of the build, meaning that the output SHOULD be identical whether or not the cache is used.

I think I agree with this statement. And this is why I was considering the reproducibility of the cached intermediate artifact. Even with strong protections against cache poisoning, using a previously built foo.o from the cache could mean foo is not identical to the foo we get when we build everything afresh. This is also why you use SHOULD instead of MUST there, I think?

Additionally, regardless of whether it's because of irreproducibility of the intermediate artifact or cache poisoning, we can't know the impact of using a cached artifact without repeating the build of foo, with and without the cached intermediate artifact. And this test depends on the build of foo being reproducible in the first place. So requiring the cache artifact to have no "material impact" becomes an L4 concern rather than L3, even if the cache is enabled by default. The current spec, to my reading, hints at these L4 concepts indirectly at L3, which is the cause of confusion. If you agree, then at L3 we probably want to say the provenance must record cache artifacts regardless of their impact, even if the build system has strong protections against cache poisoning simply because we also need reproducibility.
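
The dependency on reproducibility can be sketched as a small check in Python. Everything here is hypothetical: the build callables are toy stand-ins that take a use_cache flag and return output bytes, and cache_has_material_impact is an illustrative helper, not part of any SLSA tooling. The check first repeats the build without the cache; only if those runs agree does comparing against a cached run say anything.

```python
import hashlib
import itertools


def output_digest(build, use_cache: bool) -> str:
    return hashlib.sha256(build(use_cache)).hexdigest()


def cache_has_material_impact(build, runs: int = 2):
    """Return True/False, or None if the build is not reproducible."""
    baseline = {output_digest(build, False) for _ in range(runs)}
    if len(baseline) != 1:
        # Uncached runs already disagree, so the comparison is meaningless.
        return None
    return output_digest(build, True) not in baseline


def clean_build(use_cache: bool) -> bytes:
    # Reproducible build whose cache is faithful.
    return b"foo-binary"


def poisoned_build(use_cache: bool) -> bytes:
    # Reproducible build whose cache returns something different.
    return b"foo-binary-tampered" if use_cache else b"foo-binary"


_counter = itertools.count()


def irreproducible_build(use_cache: bool) -> bytes:
    # Every run embeds a different value, like a timestamp would.
    return b"foo-binary-%d" % next(_counter)


clean_result = cache_has_material_impact(clean_build)
poisoned_result = cache_has_material_impact(poisoned_build)
irreproducible_result = cache_has_material_impact(irreproducible_build)
```

The third case is the one at issue: without reproducibility the check cannot distinguish a harmless cache from a poisoned one.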

While SLSA isn't recursive, if we treat a cached item as an intermediate artifact, then that artifact should have a known provenance according to the build platform's targeted SLSA level. Therefore, if a build system is producing an artifact by means of a cache, I would expect that it would be able to maintain the provenance of said cache and keep it associated when the cached artifact is created. When a future build comes along and pulls in that cached artifact, the build system will then be able to inject the same (original) provenance into that of the final resulting artifact.

I'd argue that as SLSA is not recursive, the spec seems to indicate that we'd have recorded provenance for only the final artifacts and not for intermediate artifacts. I may be missing text that says otherwise though. I see Mark says something similar with "Cached intermediate artifacts MUST be considered dependencies and SHOULD have their own provenance" so I suspect I'm missing some information.

david-a-wheeler commented 1 year ago

A "SHOULD" seems weak. Is there a way to reword this into a MUST?

Here's one try:

A build cache (if used) MUST NOT invalidate the SLSA provenance claims.

I don't think that's quite right but that's the sense of the direction I was trying to go.

kpk47 commented 1 year ago

Discussed in community meeting on 24 July 2023. Action item for community: review PR #901