osbuild / bootc-image-builder

A container for deploying bootable container images.
https://osbuild.org
Apache License 2.0

Support custom mountpoints #579

Open achilleas-k opened 3 months ago

achilleas-k commented 3 months ago

Opening this issue to track support for custom mountpoints.

@mvo5 described the issue with custom mountpoints in 06e1b2a67abea54425ea4d36cb64a5d2d988af1e.

Short version: bootc needs an empty root tree to install to when running bootc install to-filesystem. With our current pipelines, when we build an image, we format the disk with all the partitions, mount every mountpoint to its location under a root tree, and call bootc install to put its files in the fully mounted root tree, which will be non-empty if it contains directories for custom mountpoints.

I tested the idea in the commit message

> After "install-to-filesystem" ran we need a "org.osbuild.mkdir" stage for the extra mount points that also only mounts the "essential" mounts.

and it works as expected, with some caveats:

  1. Custom mountpoints cannot be created on / (the deployed root, not the physical disk root), since after bootc install, it is marked immutable.
  2. Custom mountpoints cannot contain any data on first boot.
  3. If a custom mountpoint is configured for a path that exists in the base image and is non-empty, the mountpoint will cover the data.

Important note: Some of my tests were "simulated", meaning I scripted or manually intervened to do what osbuild would be doing without actually using a stage, but the behaviour should be the same.

For example, I'm considering the following scenario:

Containerfile:

FROM quay.io/centos-bootc/centos-bootc:stream9

RUN mkdir -p /opt/myapplication/log
RUN date > /opt/myapplication/log/build-time

Build config:

[[customizations.filesystem]]
mountpoint = "/opt/myapplication/log"
minsize = "20 GiB"

With our proposed solution, bootc will create a filesystem that contains /opt/myapplication/log/build-time, but on boot the path /opt/myapplication/log will be shadowed by the new mountpoint. If there's no way to support this scenario, we should probably inspect the image (which we already do when preparing the manifest) and error out.

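Questions (cc @cgwalters):

  1. I'm told we have users that would like to create mountpoints on /. Can we do this? If we flip the immutable flag, create a directory for the mountpoint under /, and flip it back on, will that be enough or will there be unwanted side effects?
  2. Do we want to support a scenario where a base image contains data destined for a custom mountpoint?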

cgwalters commented 3 months ago

Aside but an important one for me: It'd be really, really, really, really nice if we can share code with Anaconda too. All of these things need to be supported there too.


For toplevel mountpoints, the expectation is that the directory is part of the container image; we should simply fail if a target mountpoint does not exist in the container image.

> I'm told we have users that would like to create mountpoints on /. Can we do this? If we flip the immutable flag, create a directory for the mountpoint under /, and flip it back on, will that be enough or will there be unwanted side effects?

The immutable flag isn't used with composefs, and there's currently no "hack" to mutate the rootfs at runtime (without making it globally writable). See also https://gitlab.com/fedora/bootc/tracker/-/issues/26, which is tracking some support for that. But I don't think we want that for disk image defaults; again, I think it's basically that the directory should be owned in the container.

> Do we want to support a scenario where a base image contains data destined for a custom mountpoint?

It has never been supported to "split" ostree/bootc content across multiple filesystems; I think if you try, you'll get EXDEV when ostree tries to hardlink. The only thing that will work is to have /var or a subdirectory thereof on a separate filesystem, and doing all of /var today is unfortunately a bit tricky to handle. I think what we could probably do is add bootc install to-filesystem --var=<mount spec> or so.

Tools like dpkg/rpm/etc support splitting their content in this way, but they only have one copy of content. As ostree/bootc want to support multiple versions (and are based around a shared/deduplicating backing store), it's a lot harder. https://github.com/containers/composefs/issues/125 touched on some of this, but basically it's not going to happen anytime soon.

So what we should do (to combine these two things) is:

  • Require the mountpoint exist in the container
  • Error if it's not empty

achilleas-k commented 3 months ago

> So what we should do (to combine these two things) is:
>
>   • Require the mountpoint exist in the container
>   • Error if it's not empty

I think this simplifies the initial implementation enough to make it work quickly. It does sort of tie the base image to the configuration (mountpoint in base image + bib build config), but that's probably fine.
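To make the two rules concrete, here is a minimal sketch of what the check could look like. It is not taken from bib or osbuild: the function name, the exception type, and the assumption that the container's root filesystem is already unpacked or mounted at `image_root` are all hypothetical.

```python
from pathlib import Path


class MountpointError(Exception):
    """Raised when a custom mountpoint violates one of the two rules."""


def check_mountpoint(image_root, mountpoint):
    """Check one custom mountpoint against an unpacked image tree.

    image_root: directory holding the container's root filesystem
                (assumption: unpacked or mounted on the build host).
    mountpoint: absolute path from the build config,
                e.g. "/opt/myapplication/log".

    Note: symlinks are followed relative to the host, so a real
    implementation would need to resolve them inside the image root.
    """
    if not mountpoint.startswith("/"):
        raise MountpointError(f"{mountpoint}: mountpoint must be an absolute path")
    target = Path(image_root) / mountpoint.lstrip("/")
    # Rule 1: the directory must already be part of the container image.
    if not target.is_dir():
        raise MountpointError(f"{mountpoint}: does not exist in the image")
    # Rule 2: it must be empty, otherwise the mount would shadow the data.
    if any(target.iterdir()):
        raise MountpointError(f"{mountpoint}: is not empty in the image")
```

With the example scenario from the first comment, `check_mountpoint(root, "/opt/myapplication/log")` would raise because of the build-time file, which is the "error out" behaviour suggested there.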

Does it make sense though to relax the first rule for some paths? IIUC, there's no harm in creating a /var/data directory after running bootc install and mounting a separate filesystem there without /var/data existing in the base image, is there?

achilleas-k commented 3 months ago

> It'd be really, really, really, really nice if we can share code with Anaconda too.

I'm sorry if I'm failing to see something obvious but this keeps coming up and I'm still not clear what code we could share. Anaconda operates in a very different environment than osbuild. Also, osbuild stages are (usually) thin wrappers around system utilities. We're currently talking about adding 2-3 stages that essentially do:

mkdir <mountpoint>  # or maybe we won't do this
sfdisk <device> <long sequence of partitioning commands>

and then write a line in fstab for mounting the filesystem to the mountpoint.

What is there to share? Are we talking about importing python modules shared with anaconda for shelling out to binaries in a consistent way?

The fstab stage is a 50-line Python script.

The disk partitioning is very different in the two cases. While the osbuild sfdisk stage is quite large, most of that is transforming the partition table description to an sfdisk script to run against the disk, because we need a precomputed description of the partition table before we start building.
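To illustrate what "transforming the partition table description to an sfdisk script" means, here is a rough sketch. It is not the osbuild stage: the function name and the dictionary layout of the description are made up for this example, and only the `label:` header and the `size=`, `type=` and `name=` fields of the sfdisk script format are used.

```python
# GPT partition type GUIDs (from the UEFI / Discoverable Partitions specs).
ESP_GUID = "C12A7328-F81F-11D2-BA4B-00A0C93EC93B"
LINUX_FS_GUID = "0FC63DAF-8483-4772-8E79-3D69D8477DE4"


def sfdisk_script(partitions, label="gpt"):
    """Render a precomputed partition table description as sfdisk input.

    partitions: list of dicts with "size" and "type", optionally "name"
                (hypothetical layout, chosen for this sketch only).
    """
    lines = [f"label: {label}", ""]
    for part in partitions:
        fields = [f"size={part['size']}", f"type={part['type']}"]
        if "name" in part:
            name = part["name"]
            fields.append(f'name="{name}"')
        lines.append(", ".join(fields))
    return "\n".join(lines) + "\n"


table = [
    {"size": "501MiB", "type": ESP_GUID, "name": "EFI-SYSTEM"},
    {"size": "20GiB", "type": LINUX_FS_GUID, "name": "data"},
]
script = sfdisk_script(table)
```

The resulting text would then be fed to `sfdisk <device>` on standard input. The point is that the whole layout is known before the build starts, which is quite different from an interactive installer.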

cgwalters commented 3 months ago

> It does sort of tie the base image to the configuration (mountpoint in base image + bib build config), but that's probably fine.

Yeah; this is one reason why I was arguing to support embedding partitioning information in the image itself (it's also what systemd-repart is aiming for, though we have use cases beyond what that tool does).

> Does it make sense though to relax the first rule for some paths? IIUC, there's no harm in creating a /var/data directory and mounting a separate filesystem there without /var/data existing in the base image, is there?

Yep, subdirectories of /var are totally fine. Though note that the default for .mount units is to create the directory, so all that osbuild (or anaconda) needs to do here is set up the desired filesystem. (Also of note: there's also e.g. x-systemd.makefs, so for these types of use cases it can even suffice to just reserve the block device space at disk image generation time. IMO this can even be a best practice, because doing things that way better supports a "factory reset" that blows away these external filesystems too.)
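As a rough illustration of the x-systemd.makefs idea, the fstab entry could be generated along these lines. This is only a sketch: the helper name is hypothetical, and it assumes the reserved partition is identified by something that exists before any filesystem does (for example a PARTLABEL= or PARTUUID= spec), since a filesystem label cannot be looked up on an unformatted partition.

```python
def fstab_line(spec, mountpoint, fstype, makefs=False):
    """Format one fstab entry for a custom mountpoint.

    spec:   device spec such as "PARTLABEL=data" (assumption: the
            partition is identified by its partition label or UUID).
    makefs: add x-systemd.makefs so systemd creates the filesystem on
            first boot if the partition is still empty.
    """
    options = ["defaults"]
    if makefs:
        options.append("x-systemd.makefs")
    # Dump field 0, fsck pass 2 (the usual value for non-root filesystems).
    return f"{spec} {mountpoint} {fstype} {','.join(options)} 0 2"


line = fstab_line("PARTLABEL=data", "/var/data", "ext4", makefs=True)
```

With this approach the disk image only needs the partition itself; the filesystem and the /var/data directory both appear at boot.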

achilleas-k commented 3 months ago

>> It does sort of tie the base image to the configuration (mountpoint in base image + bib build config), but that's probably fine.

> Yeah; this is one reason why I was arguing to support embedding partitioning information in the image itself (it's also what systemd-repart is aiming for, though we have use cases beyond what that tool does).

We should get back to this. The idea was good, but I think we kept getting lost in some details. Partitioning descriptions and configurations aren't simple (they can be, but people quickly want to do more when you give them a little), so I don't think we should start adding config keys to a file arbitrarily without thinking about what it might look like when it grows.

>> Does it make sense though to relax the first rule for some paths? IIUC, there's no harm in creating a /var/data directory and mounting a separate filesystem there without /var/data existing in the base image, is there?

> Yep, subdirectories of /var are totally fine. Though note that the default for .mount units is to create the directory, so all that osbuild (or anaconda) needs to do here is set up the desired filesystem. (Also of note: there's also e.g. x-systemd.makefs, so for these types of use cases it can even suffice to just reserve the block device space at disk image generation time. IMO this can even be a best practice, because doing things that way better supports a "factory reset" that blows away these external filesystems too.)

Are we thinking about standardising on something like this? This seems similar to the user creation issues, where we would love for useradd (or some wrapper) to do all the bits that are required for the bootc world. Are we considering doing something that:

  1. Puts a config somewhere in the image, telling bib and anaconda that "a partition of size N and type T is required for M".
  2. Creates a .mount unit for creating the mountpoint and mounting the filesystem at boot, possibly using some predictable attribute (label??).
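For point 2, a sketch of what generating such a unit might look like, assuming the filesystem label is the predictable attribute. Both helper names are hypothetical, and the name escaping is a simplified version of what `systemd-escape --path --suffix=mount` does (it does not collapse repeated slashes, for instance).

```python
def mount_unit_name(where):
    """Derive the unit file name from the mount path (simplified escaping)."""
    trimmed = where.strip("/")
    if not trimmed:
        return "-.mount"  # the root filesystem
    out = []
    for i, ch in enumerate(trimmed):
        if ch == "/":
            out.append("-")
        elif ch.isascii() and (ch.isalnum() or ch in ":_" or (ch == "." and i > 0)):
            out.append(ch)
        else:
            # Everything else, including "-", becomes a \xNN escape per byte.
            out.extend(f"\\x{b:02x}" for b in ch.encode("utf-8"))
    return "".join(out) + ".mount"


def mount_unit(label, where, fstype):
    """Render a .mount unit that mounts the filesystem with the given label."""
    return "\n".join([
        "[Unit]",
        f"Description=Custom mountpoint {where}",
        "",
        "[Mount]",
        f"What=/dev/disk/by-label/{label}",
        f"Where={where}",
        f"Type={fstype}",
        "",
        "[Install]",
        "WantedBy=local-fs.target",
        "",
    ])


name = mount_unit_name("/var/data")
unit = mount_unit("data", "/var/data", "ext4")
```

The unit file name has to match the Where= path, which is why the name is derived from the mountpoint rather than chosen freely.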