slsa-framework / slsa

Supply-chain Levels for Software Artifacts
https://slsa.dev

Semantic equivalency, reproducible builds, and a new "verifiable build" track #873

Open david-a-wheeler opened 1 year ago

david-a-wheeler commented 1 year ago

I have an idea that probably needs some refinement, but I think there may be something here. In short: there's a newer "backoff" idea of reproducible builds called "semantic equivalency" that is somewhat easier to achieve than reproducible builds. Since there's a backoff system, this suggests to me that it may be appropriate to have a whole new track. The current build track imposes requirements on protecting a build and sharing information about the build. The possible new track "verifiable build" imposes requirements on the ability to independently verify the results of a build. There are other possibilities, e.g., making "semantic equivalency" part of a new SLSA level 4, and reproducible builds in SLSA level 5 (I'm not sure where hermetic builds would go in that case). Below is my thinking, discussion welcome!

===

Reproducible builds are where you rebuild software from a given source and produce a bit-for-bit identical version of the built result. In many ways it's a "gold standard" for verifying builds. Reproducible builds were proposed for SLSA in #5.

Different source code typically produces different build results, but you can identify the source with a cryptographic hash. Different tools typically produce different results, so you must specify the tools used. But the biggest challenge for reproducible builds is that there are many factors, such as embedded timestamps, that can make it hard to create a reproducible build.
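The "identify the source with a cryptographic hash" step can be sketched as hashing every path and file in a checkout. This is a minimal illustration of the idea only; real systems typically pin a VCS commit hash or the digest of a release tarball instead:

```python
import hashlib
import os

def source_tree_digest(root: str) -> str:
    """SHA-256 over every relative path and file content under root,
    visited in sorted order so the digest is deterministic."""
    h = hashlib.sha256()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make os.walk's traversal order deterministic
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            h.update(os.path.relpath(path, root).encode())
            with open(path, "rb") as f:
                h.update(f.read())
    return h.hexdigest()
```

Two checkouts with identical contents produce the same digest; any changed byte produces a different one.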

In some situations reproducible builds are trivial or at least not hard. In others, they're easy because the developers have spent many hours to achieve reproducible builds. Good for them! But others find it challenging to create reproducible builds. The survey "SLSA++: A Survey of Supply Chain Security Practices and Beliefs" (published 2023; the survey was done in 2022) has data on reproducible builds. In particular, reproducible builds and hermetic builds were considered much more difficult than the other practices surveyed; over 50% of respondents stated that this practice was either extremely difficult or very difficult.

The tool "OSSGadget" includes a tool to measure what they're about to call "semantic equivalency" or "semantically equivalent". (They once used the term "reproducible build", but that was confusing, so they're going to switch names to make the idea clearer to everyone.) A project build is semantically equivalent if "its build results can be either (1) recreated exactly (a bit for bit reproducible build), or if (2) the differences between the release package and a rebuilt package are not expected to produce functional differences in normal cases."

For example, builds would be considered semantically equivalent if the differences only included differences in date/time stamps. It'd also be fine if the build added/removed files that would not affect the execution of the code, presuming that the code was not malicious to start with and followed "normal" practices. For example, adding/removing a ".gitignore" file would be fine (we would expect that a non-malicious program would not run ".gitignore" and wouldn't do something different depending on the presence of ".gitignore").
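The comparison described above can be sketched as follows. This is a minimal illustration, not OSSGadget's actual algorithm, and the allowlist of ignorable filenames is an assumption for the example:

```python
# Filenames assumed not to affect execution (illustrative allowlist only).
IGNORABLE_FILES = {".gitignore", ".npmignore"}

def semantically_equivalent(release: dict, rebuilt: dict) -> bool:
    """Each argument maps a relative path to file bytes.

    Packages are treated as equivalent if every file present in both has
    identical contents, and any file present in only one of them has a
    name on the allowlist of files not expected to affect execution.
    """
    only_one_side = set(release) ^ set(rebuilt)
    if any(p.rsplit("/", 1)[-1] not in IGNORABLE_FILES for p in only_one_side):
        return False
    return all(release[p] == rebuilt[p] for p in set(release) & set(rebuilt))
```

A fuller implementation would also ignore timestamp-only differences inside files and archives; this sketch treats any content difference as a failure.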

@scovetta pointed me towards this "semantically equivalent" measure, and I think it's really promising. Sure, it'd be best if builds were reproducible, but where that's unavailable and those involved are unwilling to change the build process, what's the alternative? This alternative enables end-users to estimate the likelihood of a package being maliciously built (presumably as part of deciding whether or not the package is safe to install).

I had previously mooted the idea of "reproducible builds but ignoring date/timestamps" (because date/timestamps are a common problem for creating reproducible builds). My commendation to the OSSGadget developers & others for developing this alternative.

The threat model is a little different in the case of "semantically equivalent". The assumption isn't that "it is impossible for these differences to cause damage". The assumption is that "the original source code was benign, reasonably coded, and did not do damage". The question is: "is this non-reproducible package likely to have been generated from it, even though it's not a reproducible build?"

Here's an example that might clarify the threat model. It's possible that a program could look for ".gitignore" and run it if present. The source code repo might not have a .gitignore file, but the malicious package added .gitignore and filled it with a malicious application. That would cause malicious code to be executed, but it would also be highly suspicious to run a ".gitignore" file (that's not what they are for), so it's reasonable to assume that the source code didn't do that. If an attacker can insert a file that would cause malicious code to execute in a reasonably-coded app, then that would be a problem. "What's reasonable" is hard to truly write down, but a whitelist of specific filenames seems like a reasonable place to start.

Sure, ideally everything would have a reproducible build. Since that day isn't here, what can we do to take piecemeal steps towards that?

Making this a separate track has its advantages. Semantically equivalent builds, and reproducible builds, imply the ability for independent verification by the recipient. Of course, this only matters if someone does the verification, but it is different from making assertions about the process used to create the build being acquired.

SLSA version 1.0 only had build levels 1-3 defined, because there were many challenges in working out how to define level 4. Maybe a separate track is the way forward. If not, maybe "semantically equivalent" goes in a new level 4, with "reproducible build" being in level 5. In any case, this idea of "semantically equivalent" gives us something new to discuss and think about.

Discussion welcome.

Links:

kpk47 commented 1 year ago

I'm not generally in favor of treating reproducibility as anything other than a strict binary. Putting that aside, I've got a few questions about how a semantic equivalency track would work:

Who is the intended user for a semantic equivalency track? What does their workflow look like? What value does the track bring to them?

The name "semantic equivalence" implies a property of two or more packages, whereas reproducibility is a property of a single package. Do we need more than one package for verification? Who provides that second package? If the second package is trustworthy enough to use as a benchmark for the one being verified, why not use it in the first place?

What would a semantic equivalency track look like? Assuming that L0 means "not reproducible" and L means "bit-for-bit reproducible," we'd need a principled way to divide the space in between into levels. Having a known end level also constrains our ability to react to changes in the state of the art for determining semantic equivalency -- we'd either have to modify existing levels or add levels that break the convention of using counting numbers in increasing order (e.g. Repro L5 and Repro L6 might be weaker than Repro L4).

What would we attach the level to for a semantic equivalence track? A package, a particular version of a package, a repo, a project, something else? Do we expect to pass around an attestation for that thing's level, or would we add it to the build provenance? What does the attestation mean (who is attesting to what)?

david-a-wheeler commented 1 year ago

@kpk47 :

I'm not generally in favor of treating reproducibility as anything other than a strict binary.

That's fine! But I wanted to start a discussion.

Who is the intended user for a semantic equivalency track? ... What value does the track bring to them?

A potential user of a package who wants to know that "the built package I'm installing corresponds to the source code it putatively was generated from". Ideally they want 100% confidence, but more confidence is better than less.

What does their workflow look like?

Something like, "Run OSSGadget & see if it reports semantic equivalence (or that it's a reproducible build)". I'm sure more details would need to be ironed out before it went anywhere.

The name "semantic equivalence" implies a property of two or more packages, whereas reproducibility is a property of a single package. Do we need more than one package for verification?

No. The idea is that "when I re-execute the build from known source, I get a package that is semantically equivalent to the package posted". Naming is hard; they originally called it "reproducible builds", but that was confusing since most people mean "bit-for-bit" when they say "reproducible build".

What would a semantic equivalency track look like? Assuming that L0 means "not reproducible" and L means "bit-for-bit reproducible," we'd need a principled way to divide the space in between into levels. Having a known end level also constrains our ability to react to changes in the state of the art for determining semantic equivalency -- we'd either have to modify existing levels or add levels that break the convention of using counting numbers in increasing order (e.g. Repro L5 and Repro L6 might be weaker than Repro L4).

Agreed. I currently propose only one intermediate state.

scovetta commented 1 year ago

To add a little context, and what I was thinking about when I started writing the tool -- I wanted to determine the likelihood that a (for example) npm package actually reflected the source repository it was linked to. IIRC, there were a bunch of cases of malware where the registry account was compromised but the source repo wasn't, and the malicious version published clearly didn't bear any resemblance to the repo contents.

Obviously, it's better if projects have clear build scripts defined, but many don't. So I came up with different strategies that seemed reasonable:

Based on which strategies work, we assign a rough confidence.
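The "rough confidence" idea could be sketched like this. The strategy names come from the oss-reproducible output shown later in this comment; the exact percentages (other than the 75% subset case, which matches the sample summary) are illustrative assumptions, not the tool's actual scoring:

```python
# Illustrative confidence per verification strategy (assumed values,
# except 0.75 for the subset case, which matches the sample output).
STRATEGY_CONFIDENCE = {
    "AutoBuildProducesSamePackage": 1.00,      # bit-for-bit rebuild succeeded
    "PackageContainedInSourceStrategy": 0.75,  # package is a subset of the repo
    "PackageMatchesSourceStrategy": 0.50,      # looser content match
}

def overall_confidence(passed_strategies) -> float:
    """Report the strongest confidence among the strategies that passed."""
    return max((STRATEGY_CONFIDENCE.get(s, 0.0) for s in passed_strategies),
               default=0.0)
```

So a package where only the subset strategy passes would report 75%, mirroring the left-pad summary below.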

There's definitely room for improvement here.

Here's sample output from my favorite string padding library:

$ oss-reproducible -a -d pkg:npm/left-pad@1.3.0

   ____   _____ _____    _____           _            _
  / __ \ / ____/ ____|  / ____|         | |          | |
 | |  | | (___| (___   | |  __  __ _  __| | __ _  ___| |_
 | |  | |\___ \\___ \  | | |_ |/ _` |/ _` |/ _` |/ _ \ __|
 | |__| |____) |___) | | |__| | (_| | (_| | (_| |  __/ |_
  \____/|_____/_____/   \_____|\__,_|\__,_|\__, |\___|\__|
                                            __/ |
                                           |___/
OSS Gadget - oss-reproducible 0.1.357+c946c93324 - github.com/Microsoft/OSSGadget

Analyzing pkg:npm/left-pad@1.3.0...
Downloading package...
Locating source...
Downloading source...
Out of 4 potential strategies, 3 apply. Analysis will continue even after a successful strategy is found.

Results:
 [FAIL] PackageMatchesSourceStrategy
  (S+) /github-stevemao-left-pad-1.3.0/left-pad-1.3.0/.gitignore
  Diffoscope results written to ddfd0b1b-2d03-4cae-bb5c-50459723de30.html.
 [PASS] PackageContainedInSourceStrategy
  Diffoscope results written to 7b556d86-96c1-4388-b6c2-31be34567ac5.html.
  [-] AutoBuildProducesSamePackage
 [FAIL] OryxBuildStrategy
  (P ) /npm-left-pad@1.3.0/npm-left-pad@1.3/package/package.json
  ( S) /node_modules/resolve/test/resolver/symlinked/package/package.json
        -   "name": "left-pad",
2)      +     "main": "bar.js"
        -   "version": "1.3.0",
        -   "description": "String left pad",
        -   "main": "index.js",
        -   "types": "index.d.ts",
        -   "scripts": {
        -     "test": "node test",
        -     "bench": "node perf/perf.js"
        -   },
        -   "devDependencies": {
        -     "benchmark": "^2.1.0",
        -     "fast-check": "0.0.8",
        -     "tape": "*"
        -   },
        -   "keywords": [
        -     "leftpad",
        -     "left",
        -     "pad",
        -     "padding",
        -     "string",
NOTE: Additional differences exist but are not shown. Pass --show-all-differences to view them all.

  (P ) /npm-left-pad@1.3.0/npm-left-pad@1.3/package/index.js
  ( S) /node_modules/resolve/test/resolver/multirepo/packages/package-a/index.js
        - /* This program is free software. It comes without any warranty, to
        -  * the extent permitted by applicable law. You can redistribute it
        -  * and/or modify it under the terms of the Do What The Fuck You Want
        -  * To Public License, Version 2, as published by Sam Hocevar. See
        -  * http://www.wtfpl.net/ for more details. */
        - module.exports = leftPad;
2)      +
3)      + var assert = require("assert");
4)      + var path = require("path");
5)      + var resolve = require("resolve");
6)      +
7)      + var basedir = __dirname + "/node_modules/@my-scope/package-b";
8)      +
9)      + var expected = path.join(__dirname, "../../node_modules/jquery/dist/jquery.js");
10)     +
11)     + /*
12)     +  * preserveSymlinks === false
13)     +  * will search NPM package from
14)     +  * - packages/package-b/node_modules
15)     +  * - packages/node_modules
16)     +  * - node_modules
NOTE: Additional differences exist but are not shown. Pass --show-all-differences to view them all.

  Diffoscope results written to 7d00796b-35d5-4ce8-b95c-e8907f7257d9.html.

Summary:
  [75%] Package is a subset of the source repository contents, with no ignored files.
arewm commented 1 year ago

I commented on potential levels for a "reproducible" track in https://github.com/slsa-framework/slsa/issues/230#issuecomment-1563332926.

One related set of requirements from the 0.1 spec is pinned dependencies. Content from the linked comment:

Splitting the pinning into another track makes sense. The build track and its included hardening is specifically focused on ensuring the generated provenance is accurate, complete, and authentic. With this being the focus of the build track, I think that we can include the previously-hermetic requirement without using the word (except for potentially calling out the replacement of the v0.1 hermetic requirement). I don't know what this requirement's "degree" summary would be, maybe something like "dependency-aware"?

Instead of focusing on dependencies for its own track, however, I can see it fitting within a track around reproducibility. Some potential levels for that might be

Sources and dependencies pulled via TLS
Sources and dependencies are pinned to a specific revision and the hashes are verified
[...?]
Bit-for-bit reproducibility

Binary reproducibility might be the highest bar, but that doesn't mean we cannot illuminate the lower levels on the way there, highlighting their benefit. One challenge here, of course, is whether the different levels are "common enough" to warrant being grouped together.

kpk47 commented 1 year ago

We discussed this in our weekly specification meeting and decided to move it to our backlog. It is a large issue that could be split up and possibly deduped with existing issues, but nobody volunteered to pick it up.

david-a-wheeler commented 1 year ago

I couldn't attend the meeting last week. I'm interested in supporting the work, so we have a volunteer :-).

sudo-bmitch commented 1 year ago

SLSA version 1.0 only had build levels 1-3 defined, because there were many challenges in working out how to define level 4. Maybe a separate track is the way forward. If not, maybe "semantically equivalent" goes in a new level 4, with "reproducible build" being in level 5.

This part, where we have tiers for "semantically equivalent" and then "reproducible build", gets a big +1 from me.

A project should have a motivation to get timestamps out of their build when possible, and I think this gives them that carrot. One of my back burner projects involves comparing the responses to external requests (e.g. dependency downloads during a build) and there are parts of those responses that will always be different (e.g. auth tokens). The focus for me has been excluding comparisons on parts that I expect to change while flagging parts that should be the same, which fits in nicely with the "semantically equivalent" that we're defining here.
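The "exclude parts expected to change, flag parts that should be the same" idea can be sketched as a normalized fingerprint of a response. The set of volatile header names below is an illustrative assumption:

```python
import hashlib

# Header names expected to vary between otherwise-identical runs
# (illustrative assumption, not an exhaustive list).
VOLATILE_HEADERS = {"date", "authorization", "set-cookie", "x-request-id"}

def response_fingerprint(headers: dict, body: bytes) -> str:
    """Hash a response, skipping headers expected to differ between runs,
    so two fetches of the same artifact compare equal."""
    stable = {k.lower(): v for k, v in headers.items()
              if k.lower() not in VOLATILE_HEADERS}
    h = hashlib.sha256()
    for name in sorted(stable):
        h.update(f"{name}:{stable[name]}\n".encode())
    h.update(body)
    return h.hexdigest()
```

Two downloads that differ only in a Date header or auth token then produce the same fingerprint, while a changed body or stable header is flagged.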

david-a-wheeler commented 1 year ago

The threat being countered is the case where the package on the repository is subverted, say by a subverted build process or rights to the distribution repository, but the attacker didn't have the permissions necessary to subvert the source code. Reproducible builds (when verified) counter this risk, because they can detect this case. Semantic equivalency also counters this risk (not quite as well as reproducible builds).

david-a-wheeler commented 1 year ago

Brandon Mitchell to Everyone (Sep 11, 2023, 12:42 PM) made this great quote:

timestamps in a logfile, it's always timestamps somewhere

MarkLodato commented 1 year ago

My feeling is that reproducible builds don't need a separate track, but rather we ought to document how they are a possible solution to achieve the Build track. That connection is very unclear in the spec now, so it would be good to make that more clear.

In the case that the build is not deterministic, such that multiple builds result in different output bits, then that seems like a challenge for the implementation: how do you know that two different builds are "close enough" to determine that the result still qualifies as SLSA Build Level X? One way would be a claim (either an explicit attestation or an implicit part of the builder) that "benign" changes are ignored. I like the OSS Gadget approach of talking about levels of similarity, which is effectively a "confidence" about how benign the changes are.
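One way such a claim could be carried is as a small in-toto-style statement. Everything below the statement wrapper is purely hypothetical: the predicate type URL and field names are illustrative assumptions, not part of any SLSA or in-toto schema:

```python
import json

# Hypothetical "benign differences were ignored" claim. The predicateType
# and predicate fields are invented for illustration only.
claim = {
    "_type": "https://in-toto.io/Statement/v1",
    "subject": [{"name": "left-pad-1.3.0.tgz", "digest": {"sha256": "<digest>"}}],
    "predicateType": "https://example.com/semantic-equivalence/v0.1",
    "predicate": {
        "verdict": "semantically-equivalent",
        "ignoredDifferences": ["timestamps", ".gitignore"],
        "comparisonTool": "oss-reproducible",
    },
}

print(json.dumps(claim, indent=2))
```

A verifier could then decide policy questions ("are timestamp-only differences acceptable for Level X?") against the `ignoredDifferences` list rather than raw bits.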

david-a-wheeler commented 1 year ago

Note: In meeting today 2023-09-11, Mark L thought this might be better within the current build track instead of a separate track. Marcela was inclined the same way as well. So we end up with more levels.

arewm commented 1 year ago

I wasn't able to attend the call today again unfortunately. Would properties related to this be added in L4 and above or would there be "properties of reproducibility" that are associated with levels 1-3 as they exist now as well?

david-a-wheeler commented 11 months ago

Per the meeting today, it was decided to craft some text in a Google doc. Here's the Google doc: https://docs.google.com/document/d/1Jk0yZnkTC3dfp8G5dmO8K9r1Kc7TRX2QVOwcFSKw1OQ/edit

arewm commented 11 months ago

I wasn't able to attend the call today again unfortunately. Would properties related to this be added in L4 and above or would there be "properties of reproducibility" that are associated with levels 1-3 as they exist now as well?

I added my proposal and rationale to keep reproducibility separate from the build track to the document above.