cgwalters commented 3 years ago

Optimizing ostree-native containers into layers

This project started by creating a simple and straightforward mechanism to bidirectionally map between an ostree commit object and an OCI/Docker container - we serialize the whole commit/image into a single tarball, which is then wrapped as a container layer.

In other words, we generate a base image with one big layer. This was simple and straightforward to implement, relatively speaking. (Note that derived containers naturally do have distinct layers. But this issue is about the "base image".)

The core problem is that any change to the base image (e.g. kernel security update) requires each client to redownload the entire base image which could be 1GB or more. See e.g. this spreadsheet for Fedora CoreOS content size.

Prior art and related projects

See https://grahamc.com/blog/nix-and-layered-docker-images for a Nix based build system that knows how to intelligently generate a container image with multiple layers.

There is also work in the baseline container ecosystem for optimizing transport of containers.

Copying in bits of this comment:

https://www.balena.io/docs/learn/deploy/delta/ is a custom fork of docker with custom deltas that has been around a while
https://github.com/containerd/stargz-snapshotter/blob/main/docs/estargz.md is a bit more recent
https://github.com/containers/image/pull/902 is bsdiff for layers
https://www.scrivano.org/posts/2021-10-26-compose-fs/ and zstd chunked
https://github.com/linuxkit/linuxkit changes the default OS to be a pile of containers (very interesting direction)

estargz has some design points worth looking at, but largely speaking I think few people want to have their base operating system lazily fetched. (Although, there is clearly an intersection with estargz's per-file digests and ostree)

Proposed initial solution: support for splitting layers

I think right now, we cannot rely on any container-level deltas. As a baseline, we should support splitting ostree commits into distinct layers, because that works with every container runtime and registry today. Further, splitting layers optimizes both pulls and pushes.

In addition, split layers means that any other "per layer" deltas actually just work more efficiently and better, whether that's zstd-chunked or layer-bsdiff.

Aside: today with Dockerfile, one cannot easily do this. Discussion: https://github.com/containers/podman/discussions/12605

Implementing split layers

There is an initial PR for this here: https://github.com/ostreedev/ostree-rs-ext/pull/123

Conceptually there are two parts:

Generating split layers

There are multiple competing things we need to optimize. A first theoretically obvious thing to do is have a single layer per e.g. RPM/deb package. But we can't do this because there are simply too many packages and container limits are around 100. See the above nix-related blog.

And even if we wanted to go close to a theroetical limit of 100, because we want to support people generating derived images, we must reserve space for that. At least 30 layers. Conservatively, let's say we shouldn't go beyond 50 layers.

In the general case, because ostree is a low-level library, we need to support higher level software telling us how to split things - or at a minimum, which files are likely to change together.

Principles/constraints:

Content which is large and/or frequently changing should be in a layer
Conversely, small and infrequently changing content could be grouped together
We shouldn't get more than N layers, for N=50 or so

The "over time" problem

We should also be robust to changes in the content set itself. For example, if a component which was previously small grew large, we want to avoid that change "cascading" into affecting how we change multiple layers. Another way to say this is that any major change in how we chunk things also implies clients will need to redownload many layers.

Initial proposal

In the initial code, ostree has some hardcoded knowledge of things like /usr/lib/firmware from linux-firmware, which is by far the largest single chunk of the OS. It is large, and while it doesn't change too often, it clearly makes sense to chunk by itself.

The current auto-chunking logic also handles the kernel/initramfs in /usr/lib/modules.

Past that, we have code which then tries to cherry pick large files (as a percentage of the total remaining size) which captures things like large statically linked Go binaries.

Supporting injected mappings

What would clearly aid this is support for having e.g. rpm-ostree inject the mapping between file paths and RPM name - or really most generically, assign a stable identifier to a set of file paths. Also supporting some sort of indication of relative change frequency would be useful.

Something like:

struct ContentSource {
    change_frequency: u32, // Percentage between 1-100 or so?
}
// Maps from e.g. "bash" or "kernel" to metadata about that content
type ContentSet = BTreeMap<String, ContentSource>;
// Maps from an ostree content object digest to e.g. "bash"
type ContentIdentification = BTreeMap<Checksum, &str>;

Some interesting subtleties here around e.g. "what happens if a file from different packages/components was deduplicated by ostree to one object?". Probably it needs to be promoted to the source with the greatest change frequency.

Current status

See https://quay.io/repository/cgwalters/fcos-chunked for an image that was generated this way. Specifically, try viewing the manifest and you can see the separate chunks - for example, there's a ~220MB chunk just for the kernel. If a kernel security update happens, you just download that chunk and the rpm database chunk.

cgwalters commented 3 years ago

There's obviously nothing ostree specific about this really. It should be possible to write a tool which accepts an arbitrary OCI/Docker image and reworks the layering such that e.g. large binaries are moved into separate content-addressed layers - right?

Maybe it's a bit ostree specific in that it would be most effective when applied to a "fat" operating system that has lots of embedded binaries from multiple sources.

In contrast a simple standard multi-stage build that links to a base image is already fairly optimal. It's doubtful that it'd be valuable to split up most standard distribution base images. (But that'd be interesting to verify)

mheon commented 3 years ago

There are issues with splitting into too many layers. I recall hearing about performance suffering as layer count increased, but I can't recall details and that's likely solvable. If it does increase too much, however, you run into some fairly fundamental limits on registries and in many image-handling libraries. Some of our source image work produced images with >256 layers, and those were effectively unusable - the kernel parameter for the Overlay mount was simply too large to pass to the kernel, for example.

cgwalters commented 3 years ago

Thinking about this more, a notable wrinkle here is things like the dpkg/rpm database, which today are single-file databases. That means that when one writes such a tool to split out e.g. the openssl or glibc libraries into their own content addressed layer, the final image layer will still need to contain the updated package metadata.

This is another way of noting that single file package databases defeat the ability to do true dynamic linking - i.e. when just glibc changes we still need to regenerate the top layer to update the database, even if we had container-level dynamic linking. I think NixOS doesn't have such a single file database? But I haven't been able to figure it out.

mkenigs commented 3 years ago

NixOS does have a database: https://nixos.org/guides/nix-pills/install-on-your-running-system.html#idm140737320795232

I've discussed trying to work around the layers issue with @giuseppe in order to allow sharing packages between host and guest as well as between containers.

My best current shot at a solution is to have packages installed somewhere central (like /nix/store), and then all the necessary packages could be reflinked into a single container layer's /nix/store. To maintain compatibility, everything in the container's /nix/store could then get hardlinked to the usual paths on the root filesystem. To see how much space that would use, I tried reflinking and hardlinking on XFS, and it only takes 3% of the blocks of the original files to create the reflinked/hardlinked tree. Did that here: https://github.com/mkenigs/reflinks

cgwalters commented 3 years ago

I've discussed trying to work around the layers issue

Can you elaborate on "layers issue"?

with @giuseppe in order to allow sharing packages between host and guest as well as between containers.

Let's avoid the use of the term "guest" in the context of containers as it's kind of owned by virtualization. The term "host" is still quite relevant to both, but it's also IMO important that containers really are just processes from the perspective of the host.

Unless here you do mean "host" and "guest" in the virtualization sense?

Now I think let's be a bit more precise here and when you say "containers" you really mean "container image layers", correct? (e.g. "image layers" for short).

My best current shot at a solution is to have packages installed somewhere central (like /nix/store), and then all the necessary packages could be reflinked into a single container layer's /nix/store.

This seems to be predicated on the idea of a single "package" system, whether that's nix or RPM or dpkg or whatever inside the containers. I don't think that's going to realistically happen.

mkenigs commented 3 years ago

Can you elaborate on "layers issue"?

Sorry the issue of having too many container layers leading to decreased performance.

Unless here you do mean "host" and "guest" in the virtualization sense? Now I think let's be a bit more precise here and when you say "containers" you really mean "container image layers", correct? (e.g. "image layers" for short).

Yep hopefully this is more precise: I mean sharing between host and container image layers and between different container image layers (if some sort of linking within the layer is used) or between containers (if the entire layer is shared)

This seems to be predicated on the idea of a single "package" system, whether that's nix or RPM or dpkg or whatever inside the containers. I don't think that's going to realistically happen.

I don't think it has to - just think of it as subdividing OCI layers into noncolliding chunks with checksums. For our purposes, it would probably be easiest if each chunk was a single RPM, and that would make sharing with rpm-based hosts easier. But the single packaging system is already provided by containerization - we already have /var/lib/containers/storage, and we already store unique layers based on checksums there. This would just be an optimization marking part of a layer as a chunk that can probably be deduplicated and shared with the host. I don't fully understand it, but I think @giuseppe's current approach for zstd:chunked requires a lot more searching for files to deduplicate (https://github.com/containers/storage/blob/48e2ca0ba67b5e7b8c04542690ac4e5d6ec73809/pkg/chunked/storage_linux.go#L310). Whereas this would just involve checking if /var/lib/containers/storage/some_chunk_checksum exists

cgwalters commented 3 years ago

I am still very wary of intersecting host and container content and updates. That's not opposition exactly, but I think we need to do it very carefully and thoughtfully. I touched on this in https://github.com/containers/image/pull/1084#issuecomment-829576636

But the single packaging system is already provided by containerization - we already have /var/lib/containers/storage, and we already store unique layers based on checksums there.

(I wouldn't quite call this a "packaging system"; when one says "package" I think more of apt/yum)

Ultimately I think a key question here is:

Does the container stack attempt to dynamically (client side) perform deduplication, or do we try to create a "postprocessing tool" that transforms an image as I'm suggesting? Or, a variant of this is we do something like teach yum to write the pristine input RPM files to /usr/containers/layers/$digest, and then podman build --auto-layers looks at the final image generated, and then re-works the layering to have one blob for each entry in /usr/containers/layers/ or something?

edit1: Actually I think operating under the assumption there's no really good use case for having higher levels fully remove lower levels, podman build --auto-layers could actually do this incrementally as a build progresses too.

edit2: Actually we can't sanely make a tarball inside the container without duplicating all the space (barring reflink support). So it'd probably need to be /usr/containers/layers/<fstree>.

mkenigs commented 3 years ago

I am still very wary of intersecting host and container content and updates. That's not opposition exactly, but I think we need to do it very carefully and thoughtfully. I touched on this in containers/image#1084 (comment)

That makes a lot of sense. Would you say there's already a hole in the firewall between containers because of shared layers? Would doing this make that hole worse, or is it the same hole?

Ultimately I think a key question here is:

...

Agreed

mkenigs commented 3 years ago

Returning to the issue of too many layers decreasing performance: @mtrmac I can't remember for sure if it was you who brought up performance overhead of too many layers when we discussed it this summer, but am I correct that you don't think it's solvable? Just checking because of @mheon saying that it's likely solvable

mtrmac commented 3 years ago

I’m afraid I really can’t remember if I brought up something like that or if so, what it was. I’m not aware of layer count scaling horribly, at least.

Do note I read @mheon ’s comment above as performance being solvable up to the ~128-layer limit for a container. Right now, I wouldn’t recommend any design that can get anywhere close to that value, without having a way forward if the limit is hit.

(That 128-layer limit is also why I don’t think it’s too likely we would get horrible performance: there would have to be an exponential algorithm or something like that for O(f(N)) to be unacceptable with N≤128; at these scales, I’d be much more worried about scaling to large file counts, in hundreds of thousands or whatever.)

mkenigs commented 3 years ago

@mtrmac thanks!

Do note I read @mheon ’s comment above as performance being solvable up to the ~128-layer limit for a container. Right now, I wouldn’t recommend any design that can get anywhere close to that value, without having a way forward if the limit is hit.

The article @cgwalters linked discusses how to combine layers once that limit is hit

cgwalters commented 3 years ago

Some initial code in https://github.com/ostreedev/ostree-rs-ext/pull/123

cgwalters commented 3 years ago

Some of our source image work produced images with >256 layers, and those were effectively unusable - the kernel parameter for the Overlay mount was simply too large to pass to the kernel, for example.

One thing I'd note here: I think we should be able to separate logical layers from physical layers. Not every layer needs to be a derivation source, i.e. a source for overlayfs.

I think it significantly simplifies things if we remove support for whiteouts etc. in these layers. IOW, the layers are "pure union" sources. That means that given layers L1, L2, L3 that are known to be union-able and also not a source, instead of using overlayfs at runtime, the layers are physically re-unioned by hardlink/reflinking. (Perhaps a corollary here really is that these union-only layers should also not care about their hardlink count)

cgwalters commented 3 years ago

I wanted to comment on https://www.scrivano.org/posts/2021-10-26-compose-fs/ and https://github.com/giuseppe/composefs briefly.

Broadly speaking, I think we're going to get a lot of mileage out of better splitting up container blobs as is discussed here. For one thing, such an approach benefits container runtimes today without any kernel/etc changes.

(I still have a TODO item to more deeply investigate buildkit because I think they may already do this)

One major downside of ostree (and this proposal) is that garbage collection is much more expensive. I think what we probably want is to at least only do deduplication inside related images to start. It seems very unlikely to me that the benefits of e.g. sharing files across a rhel8 image and a fedora34 image are worth the overhead. (But, it'd be interesting to be proven wrong)

mkenigs commented 3 years ago

The only issue that post identifies with overlayfs is that it can't deduplicate the same file present in multiple layers, right? Would there be any advantage to composefs compared to overlayfs with files split into separate layers? Would it scale better for large numbers of layers?

cgwalters commented 3 years ago

I once saw an overlayfs maintainer describe it as a "just in time" cp -a. So I wouldn't say this is a limitation of overlayfs - it's how the container stack is doing things (using overlayfs). Which in turn I think is somewhat driven by the compatibility constraint of having st_nlink inside a container image apparently match only files inside that container image as Guiseppe's post touches on.

That said, I suspect the vast majority of images and use cases would be completely fine seeing a higher st_nlink for e.g. glibc. I mean, ostree has been doing this for a long time now. If it's just tools like e.g. rpm -V saying "hmm glibc has a higher st_nlink than I expected", I think we should just fix those to accept it (optionally detecting that they're inside a container).

mkenigs commented 3 years ago

Isn't another st_nlink concern security? e.g. being able to detect what version of glibc the host is running.

cgwalters commented 3 years ago

Isn't another st_nlink concern security? e.g. being able to detect what version of glibc the host is running.

Well...containers (by default) know which version of the kernel the host is running, which I would say is far more security sensitive. But OTOH, generalizing this into leaking any arbitrary file (executable/shared library) does seems like a potentially valid concern.

mkenigs commented 2 years ago

Not sure if comments about sharing layers between the host and container image layers belong on a separate issue, but if that was possible would experimental-image-proxy (https://github.com/containers/skopeo/pull/1476) be unnecessary? Since we would want containers/image involved for anything we pull for the host

Wondering since we'll be using experimental-image-proxy for https://github.com/ostreedev/ostree-rs-ext/issues/121

cgwalters commented 2 years ago

Not sure if comments about sharing layers between the host and container image layers belong on a separate issue, but if that was possible

I think that is probably a separate issue. It's certainly related to this, but it would be a rather profound change from the current system architecture.

One thing I do want to make more erogonomic though is pulling from to containers-storage: - that relates to https://github.com/ostreedev/ostree-rs-ext/issues/153 And we should clearly support writing via the proxy too (no issue tracks that yet).

cgwalters commented 2 years ago

Only tangentially related, I came across https://github.com/google/go-containerregistry/issues/895#issuecomment-753521526 which is a good post.

ostreedev / ostree-rs-ext

container: support splitting inputs #69

Optimizing ostree-native containers into layers

Prior art and related projects

Proposed initial solution: support for splitting layers

Implementing split layers

Generating split layers

The "over time" problem

Initial proposal

Supporting injected mappings

Current status