release-engineering / dist-git

DistGit provides a home for Linux distribution packages.

[RFE] Identify more reliably upstream archive changes that have potential security implications #18

Closed nim-nim closed 6 years ago

nim-nim commented 6 years ago

Fedora has long tried to detect mismatches between the upstream archives uploaded by packagers in its build system, and the current state of the same archives published by upstream.

A mismatch can indicate a security event Fedora-side, an upstream silently "fixing" its code or, worse, fixing the result of an intrusion, or some other problem. There is no guarantee whatsoever that the archive currently uploaded is the correct archive for Fedora to use, only that it looked good to the packager at the time.

Historical Fedora change detection mechanisms are centered around storing full-archive hashes.

Unfortunately those mechanisms are now invalidated by the move of many upstreams to hosting platforms, where archives are dynamically generated on demand from an SCM state. In such a system there is no guarantee whatsoever that the archive hash will stay constant over time. The hosting platform can upgrade, for example, one of the components used to create archives (tar, gzip, bz2, xz), producing archives with different checksums from the same SCM content. It can change the way it names its archive or the topdir within, and so on.
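To illustrate the mechanism, here is a minimal Python sketch. It packs the same files twice and only varies the timestamp stored in the gzip header, which is one of several ways a regenerated archive can differ byte for byte. The helper name `make_archive` and the sample file names are hypothetical, and nothing here claims that any particular hosting platform behaves exactly this way.

```python
import gzip
import hashlib
import io
import tarfile


def make_archive(files, topdir, mtime, compresslevel=9):
    """Build a .tar.gz in memory from a {relative path: bytes} mapping."""
    raw = io.BytesIO()
    with tarfile.open(fileobj=raw, mode="w") as tar:
        for path in sorted(files):
            info = tarfile.TarInfo(name=f"{topdir}/{path}")
            info.size = len(files[path])
            tar.addfile(info, io.BytesIO(files[path]))
    out = io.BytesIO()
    # mtime is written into the gzip header, so it changes the archive bytes
    # without changing anything inside the tar stream.
    with gzip.GzipFile(fileobj=out, mode="wb",
                       compresslevel=compresslevel, mtime=mtime) as gz:
        gz.write(raw.getvalue())
    return out.getvalue()


files = {"README": b"hello\n", "src/main.c": b"int main(void) { return 0; }\n"}
first = make_archive(files, "project-1.0", mtime=0)
second = make_archive(files, "project-1.0", mtime=86400)  # "regenerated" a day later

same_hash = hashlib.sha256(first).digest() == hashlib.sha256(second).digest()
same_content = gzip.decompress(first) == gzip.decompress(second)
print("same archive hash:", same_hash)
print("same uncompressed content:", same_content)
```

The full-archive hashes differ although the uncompressed tar streams are identical, which is the kind of false positive described above.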

As a result, continuing to use full-archive hashes as a change detection mechanism results in many false positives, deterring human packagers from actually investigating change events. This is bad since some of those events are indications of security problems upstream or Fedora-side.

Therefore it would be nice to upgrade the mechanism to something reliable in the face of dynamic archive generation, for example:

As long as the spec file is unchanged, only warn people, if the hash of the content above the topdir in one of the sourceX files changed:

clime commented 6 years ago

Hello nim-nim,

this is interesting but can you tell me where these checks for mismatches are actually being done? Is it in python-rpkg lib, rpmlint, or maybe rpmgrill? I am asking because:

1) I am not aware of these checks and would like to know more
2) this package does not contain anything like that

As far as I know, DistGit uses hashes as a dist-git source identifier, meaning that if you perform e.g. fedpkg sources and some sources are downloaded, fedpkg then can verify that the downloaded files have the requested checksums. That should verify nobody was able to pass something completely different to you during the download process.
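A rough sketch of that kind of download verification, in Python. It assumes a `sources` entry in the BSD-style form `SHA512 (name) = digest`; the function name `verify_source` and the sample file are hypothetical, and this is not the code fedpkg or rpkg actually runs.

```python
import hashlib
import re

# Assumed entry format: "<ALGO> (<filename>) = <hex digest>"
SOURCES_LINE = re.compile(
    r"^(?P<algo>[A-Z0-9]+) \((?P<name>.+)\) = (?P<digest>[0-9a-f]+)$"
)


def verify_source(line, data):
    """Return True if data hashes to the digest recorded in a sources entry."""
    match = SOURCES_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized sources entry: {line!r}")
    actual = hashlib.new(match["algo"].lower(), data).hexdigest()
    return actual == match["digest"]


downloaded = b"pretend this is a tarball\n"
entry = f"SHA512 (project-1.0.tar.gz) = {hashlib.sha512(downloaded).hexdigest()}"

print(verify_source(entry, downloaded))
print(verify_source(entry, downloaded + b"tampered"))
```

This only proves that the downloaded file is the one the packager uploaded; it says nothing about what upstream currently publishes.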

I am not aware that those checksums (present in the sources file) are supposed to have a relation to the upstream tarball checksums (even though it makes sense) or that this is checked somewhere. Can you tell me more about it, especially which tools are responsible for the checks you have mentioned?

nim-nim commented 6 years ago

Hello clime,

Right now you have various integrity checks in rpmlint and the lookaside cache (at least those were the checks people knew of).

I asked about the whole thing on #fedora-admin and people asked me to report there.

The problem is that when upstreams use hosting with dynamic generation, everything is mutable except the project files above the topdir. The filename can change (and we now have tooling to recompute the new filename without changing spec files, which means the same unchanged spec can point to a new SourceX filename after a few months if the hosting site relayouts), the topdir can change, the compression algorithm can change, and so on. So a simple filename/hash comparison does not work.

clime commented 6 years ago

Right now you have various integrity checks in rpmlint and the lookaside cache (at least those were the checks people knew of).

In rpmlint: https://github.com/rpm-software-management/rpmlint/blob/master/SpecCheck.py#L615
In fedora-review: https://pagure.io/FedoraReview/blob/master/f/src/FedoraReview/source.py#_111

Yes, there are some checks like this here and there. Good to know :).

The dist-git package itself does not contain such checks however, so I would suggest reporting the bug against individual tools.

The problem is that when upstreams use hosting with dynamic generation, everything is mutable except the project files above the topdir. The filename can change (and we now have tooling to recompute the new filename without changing spec files, which means the same unchanged spec can point to a new SourceX filename after a few months if the hosting site relayouts), the topdir can change, the compression algorithm can change, and so on. So a simple filename/hash comparison does not work.

I think it would actually be nice if the upstream provided a packed source always with the same checksum for a particular commit. Because we could then start relying on those sources much more.

If a source checksum changes for some reason, then the upstream sources should be re-uploaded into DistGit by a packager to avoid a warning. The goal here should be that this isn't happening because then we can automatically put more trust into the upstream "tarballs".

clime commented 6 years ago

I closed the issue because I cannot change anything here in the dist-git package itself. But if you have anything to discuss, feel free to contact me in the #fedora-admin channel on freenode IRC.

nim-nim commented 6 years ago

@clime You're basically writing “upstreams should provide archive with stable hashes” and “when that is not the case packagers should re-upload manually”. But the whole point of the issue is:

So please reconsider: the model where archives are immutable most of the time and manual reupload is sufficient is being invalidated by the evolution of hosting services. You need to move to in-archive content checksumming.

praiskup commented 6 years ago

So please reconsider: the model where archives are immutable most of the time and manual reupload is sufficient is being invalidated by the evolution of hosting services.

It used to be a good policy to gpg-sign release tarballs, and never touch them again after release. Then, if you trusted your packager, and your packager trusted the upstream maintainer, everybody was fine. Nowadays we have to trust the tarball issuer (GitHub and friends) ... and this request sort of looks like it means we want to invalidate the chain of trust by definition. The natural question coming to my mind is whether we "evolve" in a good direction.

You need to move to in-archive content checksumming.

Btw. by "in-archive" do you mean uncompressed content, or really the content of an archive? Shouldn't this rather be checked by rpm (and the %setup macro, somehow, on demand), if at all? In my opinion, we should stop using the word "archive", since we are not using it for "archiving" purposes anymore, and it is totally redundant ... we should concentrate on using git content directly (in the RPM world) and checksum the (cloned?) content somehow.

nim-nim commented 6 years ago

So please reconsider: the model where archives are immutable most of the time and manual reupload is sufficient is being invalidated by the evolution of hosting services.

It used to be a good policy to gpg-sign release tarballs, and never touch them again after release.

Nowadays release tarballs don't exist in many cases, devs got addicted to git and are not interested in working from tarballs. So you have tarballs-on-demand, regenerated from git with the current archiver tools deployed by the hoster.

Then, if you trusted your packager, and your packager trusted the upstream maintainer, everybody was fine. Nowadays we have to trust the tarball issuer (GitHub and friends) ... and this request sort of looks like it means we want to invalidate the chain of trust by definition. The natural question coming to my mind is whether we "evolve" in a good direction.

The chain of trust does not need to be changed at all. What changes is that its focus is narrowed to "the files in the archive above topdir" (whatever git would hash to identify a whole-project code state) instead of "the archive file as a whole". I'm not asking to trust a third party with the checksum, just to change what dist-git computes and records as an indicator of source integrity.

You need to move to in-archive content checksumming.

Btw. by "in-archive" do you mean uncompressed content, or really the content of an archive?

I mean that to get a reliable indicator of whether a project has been tampered with or not, you need to uncompress the archive, switch to the topdir and checksum the project files above the topdir. Then accept everything that matches that checksum as functionality- and security-wise identical, even if the file or topdir has a new name, even if the archive switched from tar.gz to tar.bz2 to tar.xz, even if it kept the old tarball method but compressed to a new binary due to changes in the gz, bz2 or xz compressor.
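A minimal Python sketch of that idea, under stated assumptions: every archive member sits under a single top directory, only regular files are hashed (symlinks, permissions and empty directories are ignored), and the names `content_digest` and `make_archive` are hypothetical. It is an illustration of the proposal, not something dist-git implements.

```python
import hashlib
import io
import tarfile


def content_digest(archive_bytes):
    """Hash the regular files of a tar archive, ignoring topdir name,
    member order and compression format."""
    entries = []
    # "r:*" lets tarfile detect gzip, bz2 or xz compression on its own.
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            # Drop the first path component (the topdir).
            _, _, relpath = member.name.partition("/")
            data = tar.extractfile(member).read()
            entries.append((relpath, hashlib.sha256(data).hexdigest()))
    digest = hashlib.sha256()
    for relpath, file_hash in sorted(entries):
        digest.update(f"{file_hash}  {relpath}\n".encode("utf-8"))
    return digest.hexdigest()


def make_archive(files, topdir, mode):
    """Demo helper: pack {relative path: bytes} under topdir.
    mode is a tarfile write mode such as "w:gz" or "w:bz2"."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode=mode) as tar:
        for path, data in files.items():
            info = tarfile.TarInfo(name=f"{topdir}/{path}")
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


files = {"README": b"hello\n", "src/main.c": b"int main(void) { return 0; }\n"}
old = make_archive(files, "project-1.0", "w:gz")
new = make_archive(files, "project-v1.0", "w:bz2")  # renamed topdir, new compressor
tampered = make_archive(
    {**files, "src/main.c": b"int main(void) { return 1; }\n"},
    "project-1.0", "w:gz",
)

print(hashlib.sha256(old).digest() == hashlib.sha256(new).digest())  # False
print(content_digest(old) == content_digest(new))                    # True
print(content_digest(old) == content_digest(tampered))               # False
```

The whole-archive hash changes with the topdir rename and the compressor switch, while the content digest only changes when a file under the topdir changes.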

Shouldn't this rather be checked by rpm (and the %setup macro, somehow, on demand), if at all?

Not possible unless something feeds it a content checksum to test against. That something being dist-git, since that's the layer that's supposed to record what is authoritative for Fedora packages.

In my opinion, we should stop using the word "archive", since we are not using it for "archiving" purposes anymore, and it is totally redundant ... we should concentrate on using git content directly (in the RPM world) and checksum the (cloned?) content somehow.

It would be functionally identical to checksumming a clone, without pulling in the whole .git metadata and history (which is a whole other can of worms due to the use of ssh as transport in many cases and the fact that the .git metadata and history can contain files under earlier licenses not approved by Fedora), and without needing to distinguish between tarballs-generated-from-git and tarballs-generated-manually.