Package manager lockfiles

malt3 commented 10 months ago

Many of the projects around image-based Linux would benefit from having standardized package manager dependency lockfiles. I just created a proposal for the rpm / dnf ecosystem here: https://github.com/rpm-software-management/dnf5/issues/833

Benefits

Incremental builds

OS image builders like mkosi could read a lockfile as an input to decide if a (layer of an) image needs to be rebuilt. This makes incremental builds possible and would work really well to generate systemd sysext and similar formats.
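A minimal sketch of how a builder could use the lockfile as a rebuild trigger. This is not how mkosi works today. The stamp file and the function names are hypothetical, and the only assumption about the lockfile is that its bytes change whenever the pinned package set changes.

```python
import hashlib
from pathlib import Path


def lockfile_digest(lockfile: Path) -> str:
    """Return the SHA-256 hex digest of the lockfile contents."""
    return hashlib.sha256(lockfile.read_bytes()).hexdigest()


def needs_rebuild(lockfile: Path, stamp: Path) -> bool:
    """A layer needs a rebuild if it was never built or the lockfile changed."""
    if not stamp.exists():
        return True
    return stamp.read_text().strip() != lockfile_digest(lockfile)


def record_build(lockfile: Path, stamp: Path) -> None:
    """Remember which lockfile digest the layer was last built from."""
    stamp.write_text(lockfile_digest(lockfile) + "\n")
```

A builder would call `needs_rebuild` before building a layer (or a sysext) and `record_build` after a successful build.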

Reproducible builds

The dependency lockfile would be an input to the image build. This allows tools like mkosi to always use the same set of pinned packages (rpms, debs, ...) instead of using the latest packages available via package repositories. If you want to perform reproducible OS image builds based on traditional package managers, having a lockfile or manifest is basically a requirement.

Bootstrapping a healthy dependency management ecosystem

As soon as you start pinning package manager packages using a lockfile, you are responsible for updating the locked dependencies if a vulnerability is found. A lot of tooling and support is required for this to work well in practice. If we set standards for package manager lockfiles, this allows the whole ecosystem to build tools on top of that.

Supply chain security

This is basically a result of the other points: if you build image-based Linux distributions based on existing package manager systems, you'll want to know exactly what packages go into an image. Having lockfiles makes this process a lot simpler.

Possible implementations

This section is vague intentionally and should only give you a rough idea. I think the basic options are:

1. try to standardize on a single lockfile format that works for all package managers
2. try to standardize on one lockfile format for each package management system (deb, rpm, ...)

My feeling is that the second option is easier to implement in practice.

I'd be happy to receive feedback. Is this something the UAPI group is interested in tackling / standardizing?

alatiera commented 10 months ago

Having a manifest/lockfile as an output is great for having some idea of what the image contains indeed, but I am not so sure it's possible to standardize.

If it's just an output file you compare, having a list of components/sources/patches is not enough as you need a lot more things for reproducibility, like what compiler flags were used, configure arguments and prefixes, what the environment of the build process was and so on. And that's only about having an output file.

> The dependency lockfile would be an input to the image build.

If you want to also make the lockfile the input, then it would mean that any given system using it would have identical input and produce identical output, but do it in its own way, at which point there would kinda be no point at all tbh. You would basically end up reimplementing the exact same build orchestration/package manager system in different ways, for little to no clear benefit. What advantage would that get you? (And the output could be reproducible anyway with a single instance.)

Like it would be basically:

input.lock -> rpmbuild orchestration thingy -> output binaries -> assert_eq(output_lock, input_lock) -> idk shove it into .rpms, OCI layers, w/e
input.lock -> debbuild orchestration thingy -> output binaries -> assert_eq(output_lock, input_lock) -> .debs or other format

Which would raise the question of why do we have (actually implement from scratch) N number of systems with identical input and output and what's the point of repackaging things afterwards since they are identical anyway?

If you know the code sources (git repos, patches), the orchestration system definitions used (.spec files, debian/, w/e), and the version (or have the sources/binaries) of your build toolchain (rpm, dpkg), that's enough to reproduce a build. What extra advantages would there be in having rpm and deb be able to use and output the same format?

malt3 commented 10 months ago

Let me rephrase what I want:

The lockfile consists of a set of allowed packages. Let's say a set of rpm files. What I want is an extension for package managers where the package resolution is deterministic.

So given dnf install --lockfile packages.lock <expression>, I want the command to always install the same set of packages.

To make this more concrete, let's split up the different phases a package manager performs:

1. parse the expression
2. optionally update the package index using remote repositories
3. find all requested packages using the expression and the package index
4. recursively find all transitive dependencies of the requested packages in the package index
5. perform the actual installation

In those phases, I want to ensure that resolved packages are also checked against the allowed set of packages in the lockfile. So the new algorithm would look like this:

1. parse the expression
2. optionally update the package index using remote repositories
3. find all requested packages using the expression and the package index. If any selected package is not in the lockfile, return error
4. recursively find all transitive dependencies of the requested packages in the package index. If any selected package is not in the lockfile, return error
5. perform the actual installation
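The lockfile-checked resolution phases can be sketched roughly as follows. This is a toy model, not dnf's resolver: the package index is assumed to be a plain mapping from a package name to one package identifier plus its dependency names, the lockfile is assumed to be a set of allowed identifiers, and `resolve` and `LockfileError` are hypothetical names.

```python
class LockfileError(Exception):
    """Raised when resolution selects a package that the lockfile does not allow."""


def resolve(requested, index, locked):
    """Resolve requested names and their transitive dependencies.

    index:  name -> (package_id, [dependency names])   (assumed toy format)
    locked: set of package ids allowed by the lockfile (assumed toy format)
    Returns the sorted list of selected package ids, or raises on any mismatch.
    """
    selected = {}
    pending = list(requested)
    while pending:
        name = pending.pop()
        if name in selected:
            continue  # already resolved, also guards against dependency cycles
        if name not in index:
            raise LookupError(f"no package provides {name!r}")
        package_id, dependencies = index[name]
        # The extra check: every selected package must be in the lockfile.
        if package_id not in locked:
            raise LockfileError(f"{package_id} is not in the lockfile")
        selected[name] = package_id
        pending.extend(dependencies)
    return sorted(selected.values())
```

Installation only starts after `resolve` returns, so a package index that has moved on to a newer version yields an error instead of a different package set.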

DaanDeMeyer commented 10 months ago

The problem here is that official repositories generally only include the latest few versions of packages. So anything using a lock file and the official repositories would eventually stop building as the requested versions would not be available anymore. Why not keep around mirror snapshots and use those instead of the official repositories?

malt3 commented 10 months ago

I think there are many ways to preserve and access old packages, including keeping your own snapshot mirrors, using the ones provided by Debian, Arch, Red Hat (RHEL, Fedora), vendoring and providing packages locally, or using a form of content-addressable storage to get all packages listed in a lockfile. So using the lockfile allows you to make simplified statements about the determinism of the package selection:

Either the install succeeds and the selected packages are chosen deterministically or the install fails. This can be very useful for correct caching / cache invalidation, supply chain security and reproducibility.

What I want to get at is that we should decouple the source of the packages from the benefits a lockfile can provide.
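One way to picture that decoupling, as a sketch only: assume each lockfile entry carries a SHA-256 digest of the package file (an assumption, no format has been proposed here), and treat every package source as a callable that returns the file bytes or `None`. The names `Source` and `fetch_locked` are hypothetical.

```python
import hashlib
from typing import Callable, Iterable, Optional

# A source is anything that may return the bytes of a package file:
# a snapshot mirror, a vendored directory, a content-addressable store, ...
Source = Callable[[str], Optional[bytes]]


def fetch_locked(filename: str, expected_sha256: str, sources: Iterable[Source]) -> bytes:
    """Return the package bytes from the first source whose content matches the lockfile digest."""
    for source in sources:
        data = source(filename)
        if data is None:
            continue  # this source does not have the package (anymore)
        if hashlib.sha256(data).hexdigest() == expected_sha256:
            return data
        # Content differs from what the lockfile pinned: ignore it, try the next source.
    raise LookupError(f"no source provides {filename} with digest {expected_sha256}")
```

Where the bytes come from does not matter: either some source yields exactly the pinned file, or the install fails.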
