Project maintainer package vs. Distribution package

mlieberman85 commented 2 years ago

SLSA currently doesn't provide guidance or elaborate on the distinction between the things that are being built and packaged and how the packaging itself is being maintained.

For example: A python package that is included in a linux distribution will have two sets of "source" to deal with. First is the actual python code, and second is the packaging code as maintained by the linux distribution itself.

This differs from how a package built and distributed by the package maintainers themselves might work since that will have just a single set of maintainers and the code is for all intents and purposes coming from one place. When you have a distribution package there are often patches and other things that modify the upstream code.

How do we want to manage this sort of thing?

david-a-wheeler commented 2 years ago

This was discussed at the 2021-12-01 SLSA meeting. Thanks for raising this! This is a challenging issue.

Here are a few notes from that meeting:

Distributors often take a source tree, make additions/changes, build something, & release that
Mark Lodato: The same issue happens when a company creates a local distribution
David A. Wheeler: At the least (“floor”), if a distribution package claims SLSA compliance, the distribution package development/release process must implement the SLSA requirements.
Problem: If a distribution package has a high SLSA level but is creating a package of a source code from a no/low SLSA level, this could be really confusing to users.
1. Idea: Maybe we require that distribution packages say their level AND “on” the SLSA level of their source, e.g., “SLSA level 3 on 1” means this distributor does SLSA 3, but they’re packaging source at SLSA 1. Could recurse if it’s re-repackaged “3 on 1 on 1”.
2. Idea: Take the minimum, simpler to understand [but gives less insight]
3. Idea: Consider the distribution packagers + upstream for that package as if they were a single organization, and measure SLSA that way
Question: How transitive should this be?
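To make the first two labeling ideas concrete, here is a minimal Python sketch. The function names and the convention of listing the distributor's level first are assumptions made for illustration; nothing here is defined by SLSA.

```python
from typing import List


def layered_label(levels: List[int]) -> str:
    """Idea 1: distributor level first, then each upstream layer, joined by 'on'."""
    if not levels:
        raise ValueError("at least one level is required")
    return "SLSA level " + " on ".join(str(level) for level in levels)


def minimum_label(levels: List[int]) -> str:
    """Idea 2: report only the weakest level found anywhere in the chain."""
    if not levels:
        raise ValueError("at least one level is required")
    return f"SLSA level {min(levels)}"


# Hypothetical chain: a re-repackager at 3, a distributor at 1, upstream at 1.
chain = [3, 1, 1]
print(layered_label(chain))  # SLSA level 3 on 1 on 1
print(minimum_label(chain))  # SLSA level 1
```

The layered label keeps the insight into who did what, while the minimum collapses the same chain to a single number, which is the trade-off noted in idea 2.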

Others, please add your thoughts.

david-a-wheeler commented 2 years ago

I think the underlying challenge is to be clear and NOT mislead.

As was discussed on the call, many distribution packages take the source code, make some minor changes, & then release it. The distributors often don't review every line of code (though that does happen, especially when they're the same people). The distributors add code that the upstream didn't see. That makes describing this challenging.

TomHennen commented 2 years ago

So it sounds like the first question to answer is "how should artifacts be labeled?" Once we have an idea of what we want there, we can think about the technical approach.

mlieberman85 commented 2 years ago

Some additional thoughts. I think we should not fall down the transitive rabbit hole right now, just want to make sure that we're clear what folks are attesting to.

I think one key question is do we have a way of clarifying what type of artifact the SLSA attestation is attesting to?

Even among distributors whether it's linux distro, stuff like homebrew/choco or something else, there's different approaches on how they might modify things and package them up.

david-a-wheeler commented 2 years ago

This is an important & challenging problem. Here are my current thoughts, but I very much welcome alternative thoughts.

We need to carefully distinguish between (1) bringing in (transitive) dependencies of various reused sub-components and (2) repackaging an existing program. We've already agreed that in case (1), reused sub-components would have separate SLSA levels (otherwise it's almost impossible to have meaningful SLSA levels for most software). Here we are focused on case (2), repackaging an existing program (where all agree it's mostly the "same" program but with some changes).
After thinking for a few hours after our meeting today, I'm currently leaning towards requiring that the SLSA level be applied to distribution packages as if the entire project was the upstream maintainer project + the distribution combined. The upstream maintainer project might have a SLSA level for just itself when people download its package directly, while the distribution package might have a different SLSA level because it represents the combination. This "systems thinking" approach is at least intellectually defensible.
I'm not sure how easy it would be to apply this "systems thinking" approach, especially since the devil is in the detail. Maybe we could pick two distro packages & try to do the analysis. I suspect the SLSA text will need some subtle changes so that it'll be clearer how to do this analysis, & I suspect the only way to identify them is to try in the first place. It may be too hard, but I think we'll only know by trying to do the analysis, and I have hopes that we can work the problems out.

What do others think?

TomHennen commented 2 years ago

I suppose I don't know enough about how these distributions work. Is it fair to say that the materials used look something like:

a. upstream 'official' project source repo @
b. distribution patches & build instructions source repo @

? (I think this is what @mlieberman85 was describing above...)

The way I'd been planning to evaluate the level involves computing the SLSA source level as the MIN of all source repos listed in the materials. So in the case I describe here the package produced by the distribution couldn't get a higher SLSA level than whatever its source repo qualifies for in the first place. This approach doesn't require solving any of the transitivity problems.
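A minimal sketch of the MIN rule described above, in Python. The `Material` type, its `source_level` field, the example URIs, and the choice to return 0 when no materials are listed are all assumptions for illustration, not part of any SLSA schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Material:
    uri: str           # hypothetical identifier of a source repo
    source_level: int  # assumed to have been assessed separately


def package_source_level(materials: List[Material]) -> int:
    """Return the lowest source level among all listed source repos."""
    if not materials:
        return 0  # assumption: no listed sources means no level can be claimed
    return min(m.source_level for m in materials)


materials = [
    Material("https://example.com/upstream/project.git", 1),
    Material("https://example.com/distro/project-packaging.git", 3),
]
print(package_source_level(materials))  # 1
```

Under this rule the distribution's own level 3 repo cannot lift the package above the upstream's level 1.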

Now, this all breaks down if the distribution gets the upstream source from some tarball...

david-a-wheeler commented 2 years ago

Now, this all breaks down if the distribution gets the upstream source from some tarball...

I don't think it "all breaks down" - not at all. Tarballs are easy to sign & verify. Sure, there's the risk that a posted tarball doesn't match the source downloaded via git, but that's also true for built packages from a maintainer, for exactly the same reasons. If you want to counter those kinds of attacks, you need verified reproducible builds (including tarballs, because they are generated files). I think I've said that before :-). The git tool happily supports tarballs (it can generate them with git archive). GitHub directly supports tarball generation. GitLab also directly supports tarball generation.
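As a rough illustration of the integrity half of "sign & verify", the following Python sketch checks a downloaded tarball against a published SHA-256 digest. The function names are hypothetical, and a real workflow would additionally verify a signature over the tarball or its digest, which this sketch does not do.

```python
import hashlib
import hmac
import os
import tempfile


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large tarballs are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tarball_matches(path: str, expected_hex: str) -> bool:
    """Compare the computed digest with the digest published by upstream."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.lower())


# Demo with a stand-in file; a real check would use the downloaded tarball.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"pretend tarball contents")
    demo_path = tmp.name
try:
    published = hashlib.sha256(b"pretend tarball contents").hexdigest()
    print(tarball_matches(demo_path, published))  # True
    print(tarball_matches(demo_path, "0" * 64))   # False
finally:
    os.remove(demo_path)
```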

In any case, tarballs are the normal case, at least for rpm-based (Fedora, CentOS, Red Hat Enterprise Linux (RHEL), etc.) and deb-based (Debian, Ubuntu) packages. I've created multiple packages for Fedora, & I doubt they've changed that recently.

For example, the Fedora instructions for creating an rpm package require specification of:

URL: The full URL for more information about the program. For example, the project website.
Source0: ... the full URL for the compressed archive [tarball] that contains the original, pristine source code, as upstream released it. “Source” is synonymous with “Source0”... Preserve the timestamps when downloading source files. If there is more than one source, name them Source1, Source2.
Patch0: Enter the name of the first patch to apply to the source code. ... Patches must make only one logical change each, so it’s quite possible to have multiple patch files.

Here's how to create packages for Debian. As noted in the intro to Debian packaging, the key construct is the upstream tarball; "A tarball is the .tar.gz or .tgz file upstream makes, (can also be in other compression formats like .tar.bz2,.tb2 or .tar.xz). Contains the software upstream developer has written."

As far as I know these work quite happily with git. Historically there were some challenges integrating GitLab tarballs with Fedora packaging, but those seem to have been fixed years ago.

axelsimon commented 2 years ago

For a start, i think the most important part here is for the level information to be transparent, or as @david-a-wheeler put it, not misleading. This is really the core of it, for as long as users know what they are getting, they can keep trusting SLSA levels to mean something.

Without wanting to go too far down the transitivity rabbit hole, two analogies came to mind since the meeting (car analogy time!). A car that passes a certain safety standard (SLSA level 4) is understood to be built from components that themselves pass safety standards. No one expects a SLSA level 4 car to be built from parts of unknown safety levels. For those who don't like car analogies, this also works with a house: a level 4 house made of level 0 or 1 doors and walls is not really a level 4 house, in common understanding. This is to say: the non-transitivity of SLSA levels might be defensible, but it needs to be stated much more clearly. Currently, it's mentioned on the levels page, but i think the warning should be bigger.

This is also a big difference between a/ Google being knowledgeable of the components they build their systems from and knowing they are, say, SLSA 1, but determining that their build process is rigorous enough that they get SLSA 4 artefacts in the end, and b/ end users of, say, RHEL, who trust Red Hat to provide a fully supported product and would assume (rightly, i would argue) that if that product is SLSA level 4, then just like a car that passes safety standards, its components are also level 4.

I think it boils down to how the levels of due diligence, or the onus of due diligence (real or assumed) are different when using SLSA for one's own organisation or when using it to determine safety / security properties of a vendor's product.

david-a-wheeler commented 2 years ago

@axelsimon - I agree that non-transitivity needs to be made very clear. If you think it's not clear enough, would you please create a separate issue about that, & explain why? Or create a pull request to fix it? I think this issue (#235), which is focused on "Project maintainer package vs. Distribution package", is potentially quite complex. I'm concerned that your different point, about making non-transitivity clear, may get lost in the discussion.

MarkLodato commented 2 years ago

IMO, the core problem is that our supply chain model is too ambiguous: it does not provide enough instruction to the reader to know how to label something as "source" vs "dependency". The current wording says:

Source: Artifact that was directly authored or reviewed by persons, without modification. It is the beginning of the supply chain; we do not trace the provenance back any further.

Dependency: Artifact that is an input to a build process but that is not a source.

Let's use the Debian "curl" package as an example. Here's the set of artifacts:

Upstream repo: https://github.com/curl/curl/
Tarball: https://curl.se/download/curl-7.64.0.tar.gz
Debian's mirror (with patches): https://salsa.debian.org/debian/curl
Debian source package: curl_7.64.0.dsc
Debian source patches: curl_7.64.0-4+deb10u2.debian.tar.xz
Debian binary package: curl_7.64.0*.deb

Two options seem appealing to me:

Remove all transitivity. Every artifact has its own level. So in the example above, there would be 6 independent levels. This is straightforward but less useful in isolation. It would increase the need for an aggregate measure.
Redefine the model so that it is transitive but only on "sources", where the build system identifies what its sources are. That is:
- If the artifact is a version control revision, it must meet level X source reqs (e.g. two party review).
- If the artifact is "built", it must meet level X build reqs + all "sources" recursively meet level X.
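The second option can be read as a recursive check. Below is a minimal Python sketch of that reading; the `Artifact` type, the `kind` values, and every level assigned in the example are invented for illustration and say nothing about the real curl or Debian artifacts. It also assumes the artifact graph has no cycles.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Artifact:
    name: str
    kind: str       # "vcs" for a version control revision, "built" otherwise
    own_level: int  # level met by this artifact's own source or build reqs
    sources: List["Artifact"] = field(default_factory=list)


def meets_level(artifact: Artifact, level: int) -> bool:
    """A vcs revision needs its own level; a built artifact also needs its sources."""
    if artifact.own_level < level:
        return False
    if artifact.kind == "vcs":
        return True
    return all(meets_level(source, level) for source in artifact.sources)


# Hypothetical levels, chosen only to exercise the recursion.
upstream = Artifact("upstream repo", "vcs", 1)
tarball = Artifact("upstream tarball", "built", 3, [upstream])
packaging = Artifact("distribution packaging repo", "vcs", 3)
binary = Artifact("binary package", "built", 3, [tarball, packaging])

print(meets_level(binary, 1))  # True
print(meets_level(binary, 3))  # False
```

In this made-up chain the binary package fails level 3 only because the upstream repo two steps back is at level 1, which is exactly the recursion the second option describes.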

Sorry, this was a poor explanation, but it's the end of the day and I need to go. I'll try to come up with something more concrete.

mlieberman85 commented 2 years ago

I am very much in favor of non-transitive but making sure it's clear to the end user. I think we should leave the claims being made into the hands of the ones making the claims. By that I mean the one signing the attestation should claim as much or as little as they can and it should be up to the end user of those claims to determine how comfortable they are with those claims.

I do think this opens up some questions around what SLSA actually means, but as an end user almost entirely consuming SLSA attestations I'm OK with providers of the attestations claiming very little (e.g. just created a tarball from the source files) to claiming a lot (e.g. patched vulns and generated a hardened compiled artifact) as long as it's clear to the end user.

bureado commented 2 years ago

To add to @MarkLodato's excellent example, I suggest thinking about the ingestion time scenario. When you apt install nano, where and how does SLSA kick in? Let's say I have some form of client-side policy that says "only allow install for SLSA 4 artifacts". How does that materialize for the user? The sky is the limit, but I think the reasonable thing to expect is that the SLSA of the .deb will be inspected. There will be a provenance attestation in a store somewhere, made by e.g. the Debian maintainers of the package, that basically says - look, that .deb came from this .dsc. And we did/did not check SLSA for the upstream tarball. Or we did check SLSA for the upstream tarball, but maybe not for a particular patch we had to import from Fedora where we couldn't validate the "strong authentication" requirement. So we only attest to the SLSA of the .deb and to the patches we've put in debian/patches. Something like that.
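A minimal Python sketch of such a client-side policy might look like the following. The `PackageAttestation` fields and the `allow_install` function are hypothetical; they do not correspond to any existing apt feature or SLSA provenance schema, and the levels in the example are invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PackageAttestation:
    package: str            # e.g. the .deb being installed
    level: int              # level the distributor attests for the .deb itself
    built_from: str         # e.g. the .dsc the .deb came from
    upstream_checked: bool  # whether the upstream tarball's SLSA was checked


def allow_install(att: Optional[PackageAttestation], required_level: int = 4) -> bool:
    """Non-transitive policy: only the attested level of the package is inspected."""
    if att is None:
        return False  # assumption: a missing attestation is treated as a failure
    return att.level >= required_level


nano = PackageAttestation("nano.deb", 4, "nano.dsc", upstream_checked=False)
print(allow_install(nano))  # True
print(allow_install(None))  # False
```

Here `upstream_checked` is carried as information only; a stricter policy could also require it to be true, which moves toward the transitive reading.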

I guess I'm agreeing with non-transitive, and adding that I don't think each artifact type (package type) can possibly carry the SLSA of everything upstream from it. That sounds like it's an optional metadata attribute of the provenance attestations in the store, or you just leave it as an exercise for organizations to go dereference independently. For example, when I apt install nano, I want to check for things like:

Which repository is it pulled from? (e.g., no PPAs)
Is that source package reproducible?
If so, is there hash agreement across rebuilders?
Where was it built and how? (Can I read .changes, .buildinfo and build logs?)
Where can I browse the "materialized" distro sources? (e.g., salsa.debian.org, sources.debian.org)
Where does the package differ from upstream tarball? (e.g., inspect .diff.gz, Debian-specific patches)
What does the package do? (Declared metadata, as in debtags; as well as anything you can get from a sandbox; can you run the test suite on the source package with a syscall recorder?)
What do I know about its dependencies and build dependencies?
Does the package have information about its upstream (terminal) sources? (VCS URL, a watch regex, etc.)
With that information, can I go to OpenSSF Metrics and get criticality score and other security facts about the project? (If not, can I assemble a purl or a CPE and get information from public sources on this.)
In the upstream (terminal) sources, are there vendored dependencies that I don't know of and aren't declared anywhere else?

I don't think SLSA can codify all of this, which is going to be slightly nuanced by package ecosystem. And in some cases it will mix sources and binaries. And in some cases it will cross different artifact types. Or even proprietary and open source software. So in summary I'm in favor of non-transitive. But even that can be a bit difficult to implement. Granted, it has only been a thread or two on Twitter, but sounds like people naturally are expecting to say things like "Distro foo is SLSA 3". How do you compute the level for a distro that might have tens of thousands of packages? You can't possibly compute the transitive set of all the packages and their upstreams and their patches. The lowest common denominator might also be quite underwhelming. So here's where I think we might end up with a simple per-package attribute. Today, in general, you can tell if a certain package in a certain release (channel) of a distro is/is not reproducible. Maybe distros can have a program where they say: look, for our next release, we think we can aim for SLSA 4. 50% of our packages are there already. We can bring that to 85%. We'll announce our release and say 85% of it is SLSA 4. And packages that aren't, don't get that attribute for now. It's just an idea - and if the group thinks it's useful, hopefully a seed to discuss with the actual distros.
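The per-package attribute idea reduces to simple arithmetic over a release. A minimal Python sketch, with invented package names and levels:

```python
from typing import Dict


def share_at_level(package_levels: Dict[str, int], target: int) -> float:
    """Fraction of packages in a release whose own level is at least `target`."""
    if not package_levels:
        return 0.0
    meeting = sum(1 for level in package_levels.values() if level >= target)
    return meeting / len(package_levels)


# Hypothetical release with four packages; names and levels are made up.
release = {"pkg-a": 4, "pkg-b": 4, "pkg-c": 2, "pkg-d": 1}
print(f"{share_at_level(release, 4):.0%} of packages are SLSA 4")  # 50% of packages are SLSA 4
```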

Possibly useful reference for the Debian ecosystem: https://trends.debian.net/#version-control-system and https://trends.debian.net/#source-formats-and-patch-systems

mlieberman85 commented 2 years ago

I by and large agree. I think we're on the same page but want to elaborate a little bit. I agree that any individual SLSA attestation can't be transitive but we still need a good way of referencing where possible. e.g. if I have a docker image with a golang app and it's based on Debian. It might look something like:

SLSA attestation for image is in OCI registry.
That SLSA attestation would refer to elements used to build the image like the base Debian image, the golang binary, etc.
You should then be able to look up SLSA attestations for base Debian package either in its OCI registry or another data store you trust.
You should be able to look up golang binary attestations in artifact storage, transparency log or other data store you trust.
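The lookup chain above can be sketched as a walk over materials across several stores. In this Python sketch each store is modelled as a plain dictionary from subject name to attestation, and an attestation is just a dictionary with a `materials` list; these shapes and all names are assumptions for illustration, not the in-toto or SLSA provenance format.

```python
from typing import Dict, List, Optional

Attestation = Dict[str, List[str]]
Store = Dict[str, Attestation]


def lookup(subject: str, stores: List[Store]) -> Optional[Attestation]:
    """Return the attestation from the first trusted store that has one."""
    for store in stores:
        if subject in store:
            return store[subject]
    return None


def walk(subject: str, stores: List[Store]) -> Dict[str, Optional[Attestation]]:
    """Follow materials from one subject, recording missing attestations as None."""
    found: Dict[str, Optional[Attestation]] = {}
    pending = [subject]
    while pending:
        current = pending.pop()
        if current in found:
            continue  # already visited; also guards against cycles
        attestation = lookup(current, stores)
        found[current] = attestation
        if attestation is not None:
            pending.extend(attestation.get("materials", []))
    return found


# Hypothetical stores: an OCI registry and a transparency log.
oci_registry = {
    "app-image": {"materials": ["debian-base", "golang-binary"]},
    "debian-base": {"materials": []},
}
transparency_log = {"golang-binary": {"materials": ["app-source"]}}

result = walk("app-image", [oci_registry, transparency_log])
print(sorted(name for name, att in result.items() if att is None))  # ['app-source']
```

Each attestation stays non-transitive; the walk only tells the consumer which referenced materials have attestations they can go and evaluate.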

One of the key things I think we need to make sure we address though is it's important to know what is being referred to in the subject and materials in a particular attestation so I can both better understand what is being attested (subject) and the materials associated as well as being able to infer how to further query for attestations.

To go back to the previous example, if I know one of the materials is a container image I can query to see if attestations live in the same repo. I can't do that if it's a jar file living in maven central right now. I would currently need to look for the attestation in a transparency log or other data store.

bureado commented 2 years ago

Possibly related, #107

MarkLodato commented 1 year ago

Still think this is valuable, but given the tight deadline for 1.0, it should be OK to fix after the release of 1.0
