osbuild / bootc-image-builder

A container for deploying bootable container images.
https://osbuild.org
Apache License 2.0

using bootc install-to-filesystem #18

Open cgwalters opened 9 months ago

cgwalters commented 9 months ago

This relates to https://github.com/osbuild/osbuild-deploy-container/issues/4

cgwalters commented 9 months ago

bootc install-to-filesystem should also grow support for being provided the base container image externally

Digging in, this is messier than I thought. Still possible, but @ondrejbudai can you state more precisely the concern you had with having bootc install from the running container?

ISTM that in general going forward we'll want to support running images cached in the infrastructure, which will drive us towards using containers-storage most likely, as opposed to e.g. the dir transport. And if we do that, ISTM it's just simpler to keep bootc doing exactly what it's doing today in fetching from the underlying store as opposed to having something else push content in, right?
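
(For reference, the two transports being contrasted here, in skopeo terms; the image reference is reused from the example further down:)

# "dir" writes an unpacked copy to a directory; "containers-storage" is the local podman store
skopeo copy docker://quay.io/centos-bootc/fedora-bootc:eln dir:/var/tmp/fedora-bootc
skopeo copy docker://quay.io/centos-bootc/fedora-bootc:eln containers-storage:quay.io/centos-bootc/fedora-bootc:eln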

achilleas-k commented 9 months ago

Just to clarify, because there are two ideas here that sound very similar but are probably unrelated:

If I'm understanding everything correctly (and if I'm remembering everything from yesterday's conversation), @ondrejbudai's idea to mount the container and run it in bwrap is the alternative to this, but like you said, bootc won't like that as it makes some container-specific assumptions.

ondrejbudai commented 9 months ago

I would actually combine #1 with mounting the container.

1) Mount the container and chroot into it (in osbuild terms, construct a buildroot by "exploding" the container, and use this as the build pipeline for the following steps)
2) Partition a disk file using tools from inside the container
3) Mount the disk file to /target
4) Somehow get the container image in the oci format to e.g. /source/container.tar
5) Run bootc install-to-filesystem --source oci-archive:/source/container.tar --target /target

Note that I do have a slight preference for passing a whole container storage instead of an oci archive.

cgwalters commented 9 months ago

Just to level set, this today is sufficient to generate a disk image:

$ truncate -s 20G /var/tmp/foo.disk
$ losetup -P -f /var/tmp/foo.disk
$ podman run --rm --privileged --pid=host --security-opt label=type:unconfined_t quay.io/centos-bootc/fedora-bootc:eln bootc install --target-no-signature-verification /dev/loop0
$ losetup -d /dev/loop0
cgwalters commented 9 months ago

Backing up to a higher level, I think there are basically two important cases:

* Generating a disk image from a container image stored in `containers-storage`: notably this is the most obvious flow in podman-desktop on Mac/Windows. Copying that into a `dir` or `oci-archive` is just an unnecessary performance hit.

* Generating a disk image from a container in a remote registry: this will happen in many production build flows. It seems simplest then if we try to unify this with the first case by always pulling into `containers-storage`, right?

cgwalters commented 9 months ago

Also https://github.com/containers/bootc/pull/215 can't work until bootc-image-builder starts using bootc.

achilleas-k commented 9 months ago

Backing up to a higher level, I think there are basically two important cases:

* Generating a disk image from a container image stored in `containers-storage`: notably this is the most obvious flow in podman-desktop on Mac/Windows.  Copying that into a `dir` or `oci-archive` is just an unnecessary performance hit.

Which phase of the build is this referring to? If it's about having the stage in osbuild use the host containers-storage directly, I think the performance hit isn't entirely unnecessary but gives us the caching and reproducibility guarantees that we get with osbuild. These aren't directly relevant to the current use case (running it all in an ephemeral container), but I'm also thinking about the whole disk image build use case more generally (using the same code and flow in the service). Or is this just about having the osbuild containers cache be itself a containers-storage? That's definitely an idea I'd like to explore. If we're talking about having a convenient way of using the host's containers-storage in the bootc-image-builder container, I think that's a lot simpler.

* Generating a disk image from a container in a remote registry: this will happen in many production build flows.  It seems simplest then if we try to unify this with the first case by always pulling into `containers-storage`, right?

Generalising any solution to both cases would be preferable, I agree.
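
As a rough illustration of the first case (the invocation is a sketch, not a settled interface, and the image names are illustrative), the host's root storage could simply be bind-mounted into the bootc-image-builder container so the input image is read in place:

sudo podman run --rm -it --privileged \
  --security-opt label=type:unconfined_t \
  -v /var/lib/containers/storage:/var/lib/containers/storage \
  -v ./output:/output \
  quay.io/osbuild/bootc-image-builder:latest \
  quay.io/example/my-bootc-image:latest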

achilleas-k commented 9 months ago

the caching and reproducibility guarantees that we get with osbuild

Thinking about this a bit more, I realise my hesitation is mostly around modifying the caching model substantially but now I'm thinking there's a good way to do this with a new, different kind of source. A containers-storage source could use the host container storage as its backend directly and pass it through to the stage.

The one "unusual" side effect would be that osbuild would then have to pull a container into the host machine's containers-storage, which I guess is fine (?). But what happens if osbuild, running as root, needs to access the user's storage? What if it writes to it?

cgwalters commented 9 months ago

But what happens if osbuild, running as root, needs to access the user's storage? What if it writes to it?

One thing that can occur here is that a user might be doing their container builds with rootless podman; so when they want to go make a disk image from it we'd need to copy it to the root storage. Things would seem to get messy if we gave a root process even read access to a user's storage, because there's locking involved at least.
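
(One straightforward, if not the most efficient, way to do that copy without root ever touching the user's store directly; the image name is a placeholder:)

# export from the user's rootless storage and import into root's storage
podman save quay.io/example/my-bootc-image:latest | sudo podman load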

achilleas-k commented 9 months ago

so when they want to go make a disk image from it we'd need to copy it to the root storage

I think this makes sense. I'd want to make it explicit somehow that osbuild is doing this. It's one thing to write stuff to a system's cache when building images with osbuild (or any of IB-related projects), it's another thing to discover that your root container store now has a dozen images in it from a tool that some might think of as unrelated to "container stuff".

achilleas-k commented 9 months ago

Pinging @kingsleyzissou here since he's working on this.

cgwalters commented 9 months ago

Which phase of the build is this referring to? If it's about having the stage in osbuild use the host containers-storage directly, I think the performance hit isn't entirely unnecessary but gives us the caching and reproducibility guarantees that we get with osbuild.

I'm not quite parsing this (maybe we should do another realtime sync?) - are you saying using containers-storage is OK or not?

Backing up to a higher level, I think everyone understands this but I do want to state clearly the high level tension here because we're coming from a place where osbuild/IB was "The Build System" to one where it's a component of a larger system and where containers are a major source of input.

I understand the reasons why osbuild does the things it does, but at the same time if those things are a serious impediment to us operating on and executing containers (as intended via podman) then I think it's worth reconsidering the architecture.

These aren't directly relevant to the current use case (running it all in an ephemeral container), but I'm also thinking about the whole disk image build use case more generally (using the same code and flow in the service).

It's not totally clear to me that in a service flow there'd be significant advantage to doing something different here; as far as "cache" goes, I'd expect that fetching images from the remote registry each time wouldn't be seriously problematic. For any cases where it matters one can use a "pull-through registry cache" model.
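
(A sketch of that "pull-through registry cache" model: a local registry:2 instance configured to proxy and cache an upstream registry, per the registry's proxy configuration; the upstream URL is just an example.)

podman run -d --name registry-cache -p 5000:5000 \
  -e REGISTRY_PROXY_REMOTEURL=https://quay.io \
  docker.io/library/registry:2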

Or is this just about having the osbuild containers cache be itself a containers-storage? That's definitely an idea I'd like to explore.

That seems related but I wouldn't try to scope that in as a requirement here. Tangentially related, I happened to come across https://earthly.dev/ recently which deeply leans into that idea. At first my reaction to "Makefile and Dockerfile had a baby" was kind of "eek", but OTOH, digging in more, I get it.

LorbusChris commented 9 months ago

Backing up to a higher level, I think there are basically two important cases:

* Generating a disk image from a container image stored in `containers-storage`: notably this is the most obvious flow in podman-desktop on Mac/Windows.  Copying that into a `dir` or `oci-archive` is just an unnecessary performance hit.

* Generating a disk image from a container in a remote registry: this will happen in many production build flows.  It seems simplest then if we try to unify this with the first case by always pulling into `containers-storage`, right?

Coming from the OpenShift/OKD side, I think ideally the tool for ostree container to disk image conversion can be run independently of osbuild, i.e. it can also be wrapped by other pipeline frameworks such as prow, tekton, argo workflows, and even jenkins for any kind of CI/CD or production build.

Agreeing on keeping the container images in containers-storage everywhere seems fine to me.

LorbusChris commented 9 months ago

@achilleas-k it sounds like, by using an alternative root for the ostree container storage (with https://github.com/containers/bootc/pull/215), your concerns regarding all the images getting pulled into the machine's main container-storage might be addressed? IIUC, the ostree container-storage could be kept completely separate and e.g. live on a volume that gets mounted during the pipelinerun.
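
(A sketch of that idea, assuming a volume mounted at /mnt/pipeline-volume; podman's --root flag switches the storage location away from the machine's main store, and the image name is a placeholder:)

podman --root /mnt/pipeline-volume/containers-storage pull quay.io/example/my-bootc-image:latest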

achilleas-k commented 9 months ago

Sounds like a good solution yes.

achilleas-k commented 9 months ago

Which phase of the build is this referring to? If it's about having the stage in osbuild use the host containers-storage directly, I think the performance hit isn't entirely unnecessary but gives us the caching and reproducibility guarantees that we get with osbuild.

I'm not quite parsing this (maybe we should do another realtime sync?) - are you saying using containers-storage is OK or not?

Well, at the time when I wrote this I was thinking it might be a problem but in my follow-up message (admittedly, just 5 minutes later) I thought about it a bit more and changed my mind.

Backing up to a higher level, I think everyone understands this but I do want to state clearly the high level tension here because we're coming from a place where osbuild/IB was "The Build System" to one where it's a component of a larger system and where containers are a major source of input.

I agree that this tension exists and it's definitely good to be explicit about it. I don't think the containers being a source of input is that big of an issue though. The containers-store conversation aside (which I now think is probably a non-issue), I think a lot of the tension comes from osbuild making certain decisions and assumptions about its runtime environment that are now changing. There was an explicit choice to isolate/containerise stages that are (mostly) wrappers around system utilities. Now we need to use utilities (podman, bootc) that need to do the same and it's not straightforward to just wrap one in the other. For example, right now, our tool is started from (1) podman, to call osbuild which runs (2) bwrap to run rpm-ostree container image deploy .... Replacing that with bootc requires starting from (1) podman to call osbuild which will run (2) bwrap to call (3) podman to run (4) bootc, and bootc will need to "take over" a filesystem and environment that is running outside of (3) podman.

I understand the reasons why osbuild does the things it does, but at the same time if those things are a serious impediment to us operating on and executing containers (as intended via podman) then I think it's worth reconsidering the architecture.

At the end of the day we can do whatever's necessary. The architecture is the way it is for reasons but those reasons change or get superseded. I think a big part of the tension is coming from me (personally) trying to find the balance between "change everything in osbuild" and "change everything else to fit into osbuild" (and usually leaning towards the latter because of personal experience and biases). Practically, though, the calculation I'm trying to make is which point between those two gets us to a good solution faster.

This is all to say, the source of the containers is, in my mind, a smaller issue compared to the (potentially necessary) rearchitecting of some of the layers I described above. We already discussed (and prototyped) part of this layer-shaving for another issue, and I think this is where we might end up going now (essentially dropping the (2) bwrap boundary).

These aren't directly relevant to the current use case (running it all in an ephemeral container), but I'm also thinking about the whole disk image build use case more generally (using the same code and flow in the service).

It's not totally clear to me that in a service flow there'd be significant advantage to doing something different here; I'd expect as far as "cache" fetching images from the remote registry each time wouldn't be seriously problematic. For any cases where it matters one can use a "pull-through registry cache" model.

I wasn't trying to suggest we wouldn't cache in the service. I just meant to say that, if we tightly couple this particular build scenario to having a container store, we'd also have to think about how that works with our current service setup. But I might be overthinking it.

Or is this just about having the osbuild containers cache be itself a containers-storage? That's definitely an idea I'd like to explore.

That seems related but I wouldn't try to scope that in as a requirement here.

Given the comments that came later in this thread, I think I have a much clearer picture of what a good solution looks like here.

cgwalters commented 9 months ago

I'm working on https://github.com/ostreedev/ostree/pull/3114 and technically for the feature to work it requires the ostree binary performing an installation to be updated. With the current osbuild model, that requires updating the ostree inside this container image in addition to being in the target image. With bootc install-to-filesystem, it only requires updating the target container.

achilleas-k commented 9 months ago

@ondrejbudai and I (mostly Ondrej) made a lot of progress on this today. There's a lot of cleaning up needed and we need to look into some edge cases, but we should have something to show (and talk about) on Monday.

ondrejbudai commented 9 months ago
podman run --rm --privileged --pid=host --security-opt label=type:unconfined_t quay.io/centos-bootc/fedora-bootc:eln bootc install --target-no-signature-verification /dev/loop0

While running this command in osbuild should be possible, it means that we have a container inside a container, which seems needlessly complex. Thus, we tried to decouple bootc from podman. The result is in this branch: https://github.com/containers/bootc/compare/main...ondrejbudai:bootc:source

I was afraid that it would be hard, but it actually ended up being quite simple and straightforward. We also have a PoC with required changes to osbuild, new stages and a manifest. Note that this also needs https://github.com/osbuild/osbuild/pull/1501, otherwise bootupd fails on grub2-install.

The most important thing that this branch does is that it adds a --source CONTAINER_IMAGE_REF argument. When this argument is used, bootc no longer assumes that it runs inside a podman container. Instead, it uses the given reference to fetch the container image. It's important to note that bootc still needs to run inside a container created from the given image, however that's super-simple to achieve in osbuild.
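
(For illustration, a minimal sketch of that shape; the --source flag and this invocation come from the linked branch and are not upstream at this point, and the paths are placeholders:)

# run bootc from a buildroot made from the same image it installs
mnt=$(podman image mount quay.io/centos-bootc/fedora-bootc:eln)
chroot "$mnt" bootc install-to-filesystem \
  --source oci-archive:/source/container.tar \
  --target /target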

If we decide to go this way, using bootc install-to-filesystem in bootc-image-builder seems quite straightforward. We are happy to work on cleaning-up the changes required in bootc and adding some tests to the bootc's CI in order to ensure that --source doesn't break in the future.


We think the method above is acceptable for osbuild. However, it's a bit weird, because all the existing osbuild manifests build images in these steps:

1) Prepare the file tree
2) Create a partitioned disk
3) Mount it
4) Copy the file tree into the disk
5) Install the bootloader

Whereas with bootc install-to-filesystem --source, it becomes:

1) Create a partitioned disk
2) Mount it
3) Install everything

This has pros and cons: there's less I/O involved (you don't need to do the copy step), but the copy stage isn't actually something that takes much time in comparison with other steps. The disadvantage is that you cannot easily inspect the file tree, because osbuild outputs just the finished image. This hurts our developer experience, because when debugging an image, you usually want to see the file tree, which osbuild can easily output if we use the former flow.

Upon inspecting bootc, it might not be that hard to split bootc install-to-filesystem into two commands:

1) Prepare the file tree
2) Install the bootloader and finalize the partitions

Then the osbuild flow might just become:

1) Call bootc prepare-tree
2) Create a partitioned disk
3) Mount it
4) Copy the file tree into the disk
5) Call bootc finish-disk

This would probably mean some extra code in bootc, but it might be worth just doing that instead of paying the price in osbuild and harming its usability. Note that nothing changes about the way bootc is currently used in the wild.

@cgwalters wdyt?

dustymabe commented 9 months ago

Note that this also needs osbuild/osbuild#1501, otherwise bootupd fails on grub2-install.

glad I could help, and at the right time too :)

mvo5 commented 9 months ago

Fwiw, I am working on extracting the "container as buildroot" parts of https://github.com/osbuild/osbuild/compare/main...ondrejbudai:osbuild:bootc in https://github.com/osbuild/images/compare/main...mvo5:add-container-buildroot-support?expand=1 so that it can be used in bootc-image-builder (still a bit rough in there). It would also fix the issue that we cannot build stream9 images right now (which is the main intention of this work, but it's nice to see that it seems to be generally useful).

cgwalters commented 9 months ago

The result is in this branch: https://github.com/containers/bootc/compare/main...ondrejbudai:bootc:source

First patch is an orthogonal cleanup, mind doing a PR with just that to start?

Then another PR with the rest?

This hurts our developer experience, because when debugging an image, you usually want to see the file tree, which osbuild can easily output if use the former flow.

But...the file tree is already a container which you can inspect with podman run etc. right?

cgwalters commented 9 months ago

bootc install-to-filesystem --source

BTW just a note, this approach will require https://github.com/ostreedev/ostree/pull/3094 in the future because we already have problems with the fact that ostree (and in the future, bootc) really want to own the real filesystem writes and osbuild is today not propagating fsverity.

cgwalters commented 9 months ago

re https://github.com/containers/bootc/commit/a3c559300a2b7e30681fc05e4edfe2b064c6947b I wrote https://github.com/containers/bootc/pull/225 (totally not tested though) that I think will be a cleaner fix here.

cgwalters commented 9 months ago

I've been thinking about this more and in the end, I am definitely not opposed to the approach proposed - the changes would probably indeed be maintainable.

And I agree that it's very important to make the systems we design "introspectable/debuggable/visualizable/cacheable" etc. - and ultimately "filesystem trees" and their properties make up a lot of that.

However...I hope everyone would agree that for what we're doing here, 95% of the content comes from the container image, which we already have tooling to do all those things with. But yes, for injecting other filesystem-level state (whether that's users, secrets, etc.) it is important to be able to introspect it and so on.

Here's a counter proposal which basically builds on top of https://github.com/containers/bootc/issues/190 - bootc-image-builder accepts things like blueprints as input etc. (maybe in the future kickstarts, whatever) and ultimately the result of that operation is always a "layer".

(Hmm incidentally it'd be a really good idea to be sure we treat the semantics of blueprint execution in the same manner as we do for the host system, i.e. disallow writes to /usr for example; I'm not sure we do that today?)

So it would actually make sense, I think, to implement things the same way container stacks do, using overlayfs and serializing the result of that (in a clearly distinct fashion from the base image) - then the connection with the above bootc proposal is that I can choose to push that filesystem tree (layer) to a registry too - versioning, mirroring, managing, and signing it the same way I do other container content - and moving the "blueprint" -> "filesystem layer" step to more of a build step.

cgwalters commented 9 months ago

Also @ondrejbudai based on that code I've invited you to be a bootc committer fwiw :smile:

ondrejbudai commented 9 months ago

But...the file tree is already a container which you can inspect with podman run etc. right?

Well, if the tree inside the bootable image was the same as in the container image, we would just need to run cp -a instead of bootc. :upside_down_face:

bootc install-to-filesystem --source

BTW just a note, this approach will require ostreedev/ostree#3094 in the future because we already have problems with the fact that ostree (and in the future, bootc) really want to own the real filesystem writes and osbuild is today not propagating fsverity.

Haven't seen this one before. I agree that this is slightly annoying in osbuild, but it can be solved by the postprocess step that Alex implemented.

Here's a counter proposal which basically builds on top of containers/bootc#190 - bootc-image-builder accepts things like blueprints as input etc. (maybe in the future kickstarts, whatever) and ultimately the result of that operation is always a "layer".

Is it a counter proposal? I have a feeling that these proposals support each other, but I might be misinterpreting your proposal.

Btw, I'm not fully opposed to just dropping tree-level customizations (=adding users, files, enabling services, ...) from bootc-image-builder. However, I definitely see a great value in them. The ability to take a random bootable container image, inject a user using bootc-image-builder, boot the image and be immediately able to log in and tinker is very nice. All other methods (ignition/overlays/extra layer) AFAIK require an additional step.

(Hmm incidentally it'd be a really good idea to be sure we treat the semantics of blueprint execution in the same manner as we do for the host system, i.e. disallow writes to /usr for example; I'm not sure we do that today?)

Yup! :)

So It would actually make sense I think to implement things the same way container stacks do, using overlayfs and serialize the result of that (in a clear distinct fashion from the base image) - then the connection with the above bootc proposal is I can choose to push that filesystem tree (layer) to a registry too - versioning, mirroring, managing, signing it the same way I do other container content - and moving the "blueprint" -> "filesystem layer" to more of a build step.

I need your help understanding this paragraph. My final proposal was this one:

1) Call bootc prepare-tree
2) Create a partitioned disk
3) Mount it
4) Copy the file tree into the disk
5) Call bootc finish-disk

Do you want bootc-image-builder to be able to push the result of the first step 1 as a single layer OCI image? And if customizations are involved, this would become:

1) Call bootc-prepare-tree
2) Create an overlayfs over the tree
3) Perform any customization from a blueprint
4) Push this tree as two layers

I'm happy to implement this, but I'm not sure about use cases for this workflow. Is this mainly about debugging? It has the potential to introduce more complexity. If I get it right, this would be completely optional, but still - it's kinda hard to explain what the resulting artifact is. I guess we can solve this by explicitly marking this artifact as useful for debugging only.

Do you expect bootc-image-builder to be able to consume such an artifact as an input? Basically:

1) Pull the container image
2) Create a partitioned disk
3) Mount it
4) Copy the content of the container image into the disk
5) Call bootc finish-disk

This means that bootc finish-disk needs to do one final round of selinux relabeling, because AFAIK selinux labels aren't available in OCI images. Not a big deal I think, just something we must not forget.

Anyway, I might have completely misunderstood your idea, so feel free to correct me on everything I'm wrong on. :)

cgwalters commented 9 months ago

Well, if the tree inside the bootable image was the same as in the container image, we would just need to run cp -a instead of bootc. 🙃

True. However, I hope you'd agree that this is again a corner case; < 5% of debugging cases would need to dig into this distinction - the ostree stuff is a background thing. It's a very similar thing to looking at a container image versus how containers/storage represents it on disk in /var/lib/containers.

(But yes in the bootc/ostree case there are some interesting things there like how we set up the /boot filesystem and kernel arguments)

Haven't seen this one before. I agree that this is slightly annoying in osbuild, but it can be solved by the postprocess step that Alex implemented.

(This is somewhat tangential but) another case I just realized will break with this is reflinks; ostree uses them today for /etc (if available, as a minor optimization) but we've talked about just using them (if available) across the board as a "resilience against accidental mutation" for the deployment root. But cp -a doesn't "preserve" reflinks in this way - if the source is on a separate filesystem then nothing will be linked, but if they're on the same filesystem we'll get two independent files reflinked to the source, not to each other, and hence not shared after the cache is deleted.

To be clear this isn't a serious problem today because for the /etc case it will just fix itself on the first upgrade (as ostree takes over and performs the writes) and the sizes are small. But if we did reflinks for the deployment root, that wouldn't be true today (unless we also fix up that in the post-copy bit).

Also backing up closer to the topic here it's notable there isn't a way to represent reflinks in OCI - because they're not represented in tarballs, and tarballs are a "lowest common denominator" thing.

Btw, I'm not fully opposed to just dropping tree-level customizations (=adding users, files, enabling services, ...) from bootc-image-builder.

I'm not saying that at all! I think everyone agrees that we need functionality like this. But, that bootc issue is also arguing to support that step at the bootc install time phase, which is orthogonal to generating a disk image. To elaborate on this: if we supported that in addition (not arguing for dropping the ability to inject files at disk image generation time!) then it'd also work the same way in anaconda.

This means that bootc finish-disk needs to do one final round of selinux relabeling, because AFAIK selinux labels aren't available in OCI images.

This is a messy topic...a lot of related discussion in https://github.com/containers/storage/pull/1608 - and I may have been wrong there actually and we could just write the labels into the OCI archive. I was perhaps too chicken to be sure that'd work across the ecosystem.

Anyways though...hmmm...I would say that "materialize intermediate steps as OCI" is potentially interesting just for introspection/debugging but we shouldn't try to support "push them to a registry" unless the use case becomes obvious. (That said I linked this in a different discussion but I came across https://earthly.dev/ recently which leans heavily into the idea of caching general build artifacts in OCI)

I need your help understanding this paragraph. My final proposal was this one:

1. Call bootc prepare-tree
2. Create a partitioned disk
3. Mount it
4. Copy the file tree into the disk
5. Call bootc finish-disk

(Mechanically let's prefix this with bootc install as this is all sub-functionality of that...hmm, if we're going to grow more here it'd look better as bootc install to-disk and bootc install to-filesystem and then we have bootc install prepare-tree too.)

Hmmmm. So I'd say short term I am not opposed to this proposal and I think we can get your patches in. However...let me try to re-describe how I'm thinking of things.

Actually here's the key bit in what I'm proposing: the file content that osbuild injects is built as a container layer, using the input container as a base image. So the flow would look like:

  1. Take blueprint (or kickstart, or whatever) high level description of extra system state and "build" it. Let's start with a very crude implementation:
    FROM <input container image>
    COPY osbuild-render-blueprint /tmp
    COPY blueprint.json /tmp/
    RUN /tmp/osbuild-blueprint-execute /tmp/blueprint.json && rm /tmp/* -rf

We can then take just the final layer from this process (i.e. a tarball) and save it as image-overlay.oci (i.e. just wrap that final layer tarball as its own OCI "image"). This filesystem tree would include things like a modified /etc/passwd and /var/home/someuser/.ssh/authorized_keys.
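
(As a rough sketch of that "take just the final layer" step, assuming the image built from the crude Containerfile above was tagged localhost/customized; nothing here is settled tooling:)

# copy the built image into an OCI layout on disk
skopeo copy containers-storage:localhost/customized oci:/tmp/customized-oci
# index.json points at the image manifest; the last entry in its layers[] list is the
# layer added by the RUN step, i.e. the candidate "image-overlay" content
manifest=$(jq -r '.manifests[0].digest' /tmp/customized-oci/index.json | cut -d: -f2)
jq -r '.layers[-1].digest' "/tmp/customized-oci/blobs/sha256/$manifest"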

Now this bit could even happen in parallel

  1. Inspect container image to fetch partitioning information (or use default embedded in the image, or use externally specified partitioning in e.g. anaconda cases)
  2. Create partitioned disk
  3. Mount it

Then finally, we put things together:

  1. bootc install-to-filesystem (and crucially, if available we also pass --with-overlay=oci:/path/to/cache/image-overlay.oci)
  2. Clean things up by unmounting, closing up loopback device if appropriate etc
  3. Perform all final transformations on the target disk image (e.g. convert to VMDK, etc.)

This means that bootc finish-disk needs to do one final round of selinux relabeling, because AFAIK selinux labels aren't available in OCI images. Not a big deal I think, just something we must not forget.

So in my proposal (bootc install --with-overlay), it'd probably be bootc which does the SELinux labeling on the final filesystem tree, as it does for the container image it takes as input.

cgwalters commented 9 months ago

BTW I just put up https://github.com/containers/bootc/pull/226 which will clear the way for having other bootc install sub-commands.

ondrejbudai commented 9 months ago

bootc vol.2

Thanks, Colin, this makes much more sense than my interpretation.

I think that your idea might actually play very well with both bootc and osbuild. Let me present how I understand the flow in a pseudo-bash script. In the end, this would be a single osbuild manifest, but let's make it high-level for now.

SOURCE=quay.io/centos-bootc

# "Deploy" the container
container=$(podman image mount $SOURCE)
cp -a $container/. /tmp/container

# Build the overlay OCI image
mkdir /tmp/osbuild-customizations /tmp/osbuild-work /tmp/merged
# note: overlayfs needs a workdir (assumed /tmp/osbuild-work here) on the same filesystem as upperdir
mount -t overlay overlay -o lowerdir=/tmp/container,upperdir=/tmp/osbuild-customizations,workdir=/tmp/osbuild-work /tmp/merged
osbuild-apply-customizations /tmp/blueprint.toml /tmp/merged
umount /tmp/merged
osbuild-create-a-single-layer-oci-archive /tmp/osbuild-customizations /tmp/container/overlay.tar

# Partition a disk
truncate -s 10G /tmp/disk
fdisk [...] /tmp/disk
losetup -Pf /tmp/disk
mkfs.* /dev/loop0p{1,2,3}
mount /dev/loop0p{1,2,3} /tmp/container/tree{/,/boot,/boot/efi}

# Fetch the container image
skopeo copy docker://$SOURCE oci-archive:///tmp/container/container.tar

# Bootc install (this is of course bubblewrap in osbuild, but let's keep it simple)
chroot /tmp/container \
  bootc install to-filesystem \
    --source oci-archive:///container.tar \
    --with-overlay oci-archive:///overlay.tar \
    --generic-image --[...] \
    /tree

# Umount and unloop everything
umount -R /tmp/container/tree
losetup -d /dev/loop0

# Convert to qcow2
qemu-img convert -f raw -O qcow2 /tmp/disk /tmp/final-image.qcow2

I think we need the following things for this to happen:

1) bootc gains the --source argument (and some minor compatibility patches, see my branch)
2) bootc gains the --with-overlay argument
3) osbuild gains the podman image mount capabilities
4) osbuild gains the capabilities to work with overlayfs
5) osbuild gains the capabilities to run bootc install to-filesystem
6) bootc gets bootc-image-builder running in its CI (rev-dep tests)

Our team does 3, 4, and 5; @cgwalters does 2; 1 and 6 are a shared effort. I think we can write the code.

Does this sound plausible? Can we commit to this?


Note that you also wrote about the capability of splitting the build process into two steps: first an overlay is built and pushed into a registry, then someone else pulls the overlay and applies it. I think that's absolutely doable, but I would focus on the proposed flow first, because it's just a single build, thus simpler. I'm definitely happy to work on splitting the thing (optionally) afterwards.

cgwalters commented 9 months ago
# Fetch the container image
skopeo copy docker://$SOURCE oci-archive:///tmp/container/container.tar

In order to run podman image mount above we had to fetch the image and store it in containers-storage, so why are we fetching again?

Working backwards then it opens up the question for

# Build the overlay OCI image

Why couldn't this just be a podman build?

And then going back down I still find myself wondering about the need for bootc install --source and using bwrap instead of just running the original container via podman, and just passing in the overlay layer via bind mount or so.


Some other half-baked thoughts that didn't work out, hidden but kept for reference here:

Thinking about this a bit more, maybe it would significantly simplify everything if bootc itself learned to support "peeling" the top layer in this scenario. IOW b-i-b would actually just do:

```
FROM <input container image>
COPY blueprint.toml /tmp
RUN osbuild-apply-customizations --consume /tmp/blueprint.toml
```

(`--consume` here would unlink the file so we don't leak it into the container's `/tmp`)

Then we `podman run bootc install --peel ...`, and bootc does its current thing where it goes out to the container storage and installs the underlying `<input container image>`, and then takes that top layer and just dumps it into the deployment root. Where here `--peel` means to apply this semantic of dumping the top layer to the target persistent storage. Hmm well...actually I think it's messy because we'd really need the true *original* image reference including its manifest/config etc. So perhaps we do need to just explicitly split things. So nevermind...

cgwalters commented 8 months ago

Anything I can do to help progress this?

cgwalters commented 8 months ago

Yet another thing this would fix is I just noticed that this project hardcodes ext4 ...but we expect the default filesystem to be xfs (from https://github.com/CentOS/centos-bootc/blob/9e48ca73c43f56da14cbf66f72293f82df616371/tier-0/bootc-config.yaml#L9 )

ondrejbudai commented 8 months ago

Are you sure about this? Even with bootc install to-filesystem, bootc-image-builder will be responsible for partitioning, not bootc. We need to talk about partitioning fairly soon; it is something pretty high on my todo-list.

cgwalters commented 8 months ago

OK...yes, we can live with duplication between the hardcoded partition table layout (sizing, GUIDs, etc.) in bootc and the hardcoded partition layout(s) here (and the partition layouts in anaconda and various kickstarts etc.)

But...how about, even if bib uses install to-filesystem, I think the default filesystem type should come from the container image. We could add bootc install --introspect which would output a merged JSON file from all the bootc install configuration files that live in the image.

And right now that set is really small, just rootfs and kargs (but using bootc install to-filesystem would handle kargs itself).

I mean...how could it even work otherwise? Because different products have different needs (e.g. fedora desktops want btrfs), and it doesn't make sense to hardcode things in bib, or to force everyone to specify it in bib.

Or hmm...we could add a new hybrid in bootc install to-partitioned-disk that would still leave the creation of the filesystems in the partition to the bootc config.

We would also appear more coordinated if the at-disk-time configuration looked the same as the in-container case; and on that topic I'd reiterate that I'm totally open to change for what the bootc install configuration looks like. It feels like TOML is more friendly than JSON, but...dunno, if someone felt really strongly that the existing bib JSON is better, we can definitely figure out how to add/support that in bootc's baseline. Or maybe wholesale switch, I dunno.
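
(For context, a hedged sketch of what reading that configuration out of the image could look like today, without a dedicated --introspect verb; the drop-in directory is the one mentioned later in this thread and the exact paths/names may differ:)

# peek at the install config drop-ins shipped in the image
podman run --rm quay.io/centos-bootc/fedora-bootc:eln \
  sh -c 'cat /usr/lib/bootc/install.d/*.toml'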

cgwalters commented 8 months ago

I would also like to know...what problems you see with using bootc install to-disk by default? Is it about sharing code with things like the IoT and cloud guest images? I can understand that a bit but...

Honestly, my far bigger worry is drift with the anaconda configuration - I feel like we are glossing over that, and for example this bug is really bad - it ruins transactional updates! And I mean, man, if I hadn't happened to just see that PR go by, we could have easily had a system deployed in the field on bare metal that lost power during an update, their /boot partition just scrambled, and we'd have to painfully find out that the filesystem was ext2. No one would likely have noticed it before then...I guess there's some call here for "conformance" testing.

cgwalters commented 8 months ago

PR in https://github.com/containers/bootc/pull/272

mvo5 commented 8 months ago

The osbuild side of using bootc install to-filesystem got started in https://github.com/osbuild/osbuild/pull/1547

achilleas-k commented 8 months ago

OK...yes, we can live with duplication between the hardcoded partition table layout (sizing, GUIDs, etc.) in bootc and the hardcoded partition layout(s) here (and the partition layouts in anaconda and various kickstarts etc.)

But...how about, even if bib uses install to-filesystem, I think the default filesystem type should come from the container image. We could add bootc install --introspect which would output a merged JSON file from all the bootc install configuration files that live in the image.

And right now that set is really small, just rootfs and kargs (but using bootc install to-filesystem would handle kargs itself).

I mean...how could it even work otherwise? Because different products have different needs (e.g. fedora desktops want btrfs), and it doesn't make sense to hardcode things in bib, or to force everyone to specify it in bib.

Or hmm...we could add a new hybrid in bootc install to-partitioned-disk that would still leave the creation of the filesystems in the partition to the bootc config.

We would also appear more coordinated if the at-disk-time configuration looked the same as the in-container case; and on that topic I'd reiterate that I'm totally open to change for what the bootc install configuration looks like. It feels like TOML is more friendly than JSON, but...dunno, if someone felt really strongly that the existing bib JSON is better, we can definitely figure out how to add/support that in bootc's baseline. Or maybe wholesale switch, I dunno.

I think it makes sense to have an agreed-upon default partition table at each level of the image building process. That is, if a container doesn't specify a PT, BIB should still be able to build a disk image, or generate a kickstart, with the default partition table for the distro. That's why we have partition tables hard-coded in BIB. They can be fallbacks for base containers with no information.

Now, if we're guaranteed to always get a valid PT when querying a bootc container, maybe BIB can drop the hardcoded tables, but we still need to represent PTs internally to generate the stages for osbuild to create the disk. So regardless of where the default PT is defined, at some point that PT will have to make its way into BIB.

The existing BIB JSON representation for filesystem customizations and the blueprint TOML are identical (serialised to a different format), but neither is enough to represent a full table, just customizations on top of an existing one. The internal representation of the partition tables, that is, the disk.PartitionTable type from osbuild/images, is closer to the output of sfdisk -J than anything else (with obvious differences like the addition of Payload for filesystems). The hardcoded tables aren't fully described either; they're not aligned, because it's expected that customizations will be applied and they will be re-aligned at manifest generation time.
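
(For reference, that sfdisk output can be dumped from any assembled disk, e.g. the loop device from the earlier examples:)

# emit an existing disk's partition table as JSON; roughly the shape being compared to here
sfdisk -J /dev/loop0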

So in the end if bootc will be able to produce a partition table for BIB to consume in order to generate the osbuild manifest, the representation of that partition table will effectively be an "input configuration" for BIB, one that could, theoretically, be supplied separately from a container. I think there's no reason to force a PT description to always come from the container. It can in the default case, but if a user wants to build a disk from a container and override the PT, I think they should be able to, and it makes sense to reuse whatever representation we come up with. And we can support flexibility, because it would be useful if a user-supplied configuration (or even a container-supplied one) was valid even when incomplete, meaning it could lack UUIDs, start sectors, even sizes (e.g., a partial PT with a total disk size, one sized partition, and one unsized partition that implies it will take up the remaining space).

cgwalters commented 8 months ago

Right now, the scope of what's in bootc as far as partitioning is tiny - just the root filesystem type. And I think that's by far the most important thing.

Hmm, though actually we should expose what is currently bootc install to-disk --block-setup tpm2-luks in the bootc install config too, though in the generic image case we need to re-encrypt at firstboot instead.

Stuff beyond that...it would be nice to deduplicate, but less critical. Things like the UUIDs are tricky, because when we're making a generic image, we should reset them on firstboot (we do this in FCOS derivatives and systemd-repart has logic for it too). In other words:

So in the end if bootc will be able to produce a partition table for BIB

Let's not try to do that in any kind of near term; it sounds complex.

I guess in some order of priority I'm thinking:

  • Root filesystem type
  • Root filesystem max size (edit: although actually, because we don't ship a default growpart unit in the base image, anyone deriving from that will have extra space by default)
  • tpm2-bound LUKS
  • NBDE/tang

An example of something that would make sense, I think, is support for not creating a separate /boot (aka xbootldr) for the non-LUKS cases, but that's not at all critical in the short term. And that type of configuration should really come from the container; we want to be able to configure it per input OS version.

cgwalters commented 8 months ago

with the default partition table for the distro.

Right but we aren't detecting the distro here, which is a fundamental shift from how osbuild/images worked before. I think it probably would make sense to do so (right?) but a part of the goal of accepting containers as input is to allow this type of configuration to be managed in a distributed gitops fashion in a containerfile alongside other settings (as opposed to being centralized), and that I think drives towards containers holding defaults.

On the flip side though, there are currently projects not using bootc (and hence lacking a /usr/lib/bootc/install.d config) that would probably benefit from some default distribution detection. However, even there...the idea that we'd have something simplistic like e.g. "if fedora, btrfs" is, I think, a somewhat controversial one, and again drives back towards container configuration, only using hardcoded defaults as a last-ditch thing.

achilleas-k commented 8 months ago

drives back towards container configuration, only using hardcoded defaults as a last-ditch thing.

Sure, it being a "last-ditch thing" is fine, but it still has to exist somewhere.

So in the end if bootc will be able to produce a partition table for BIB

Let's not try to do that in any kind of near term; it sounds complex.

I guess in some order of priority I'm thinking:

  • Root filesystem type
  • Root filesystem max size (edit: although actually, because we don't ship a default growpart unit in the base image, anyone deriving from that will have extra space by default)
  • tpm2-bound LUKS
  • NBDE/tang

That's all good but that's not enough information to create a bootable disk.

If the container isn't giving us a fully described partition table, we'll have to fill in the blanks for things like the size of the efi, or size and type for boot if we make one (which we would need to for LVM for example). Do we need a BIOS boot? Should we add a bigger offset for some specific devices or use msdos partitions because we want to support the Raspberry Pi 3? This is all information that needs to live somewhere and be used as sane defaults and sometimes enforced so that we're not creating unusable images.

We know all this and unless we're planning on telling users to figure it out themselves, we'll need some part of the tooling to make these decisions. So if the container/bootc is currently only giving us the root filesystem type, then the rest is a "hardcoded partition table" and the fs type for / can be xfs, or ext4, or null (with a dependence on the container filling in that gap).

cgwalters commented 8 months ago

Yes, I think we're in agreement, right? We will keep most of the partitioning copies in bib (and in anaconda's reqpart, and one or two strange places in Anaconda, in FCOS, in bootc (mostly derived from FCOS), the fedora-cloud-base kickstart, and...) for now, and especially for bib just try to use the / type from the container via the bootc configuration.

(I would also say, maybe we can change the bootc config to just look like the blueprint toml exactly, i.e. the container could have...wait a second, is changing the root filesystem type not exposed in blueprints today? So I guess we can't unify that yet)

achilleas-k commented 8 months ago

is changing the root filesystem type not exposed in blueprints today? So I guess we can't unify that yet

It's trivial to do. Even if we don't want it in osbuild-composer, we can extend the bootc version of the blueprint to include it and bring it to other projects later.

cgwalters commented 8 months ago

Or to summarize and rephrase clearly (I hope): let's narrow in on supporting configuring the rootfs in the container image, because that was already extant before and is IMO by far the most obvious one to configure and most impactful to the user/admin experience.

The technical implementation of that will eventually hopefully drive more unification/configuration down the line. Did https://github.com/containers/bootc/pull/272 look OK? It seems like it wouldn't be hard to use in bib, but again we can create whatever protocol we want here.

achilleas-k commented 8 months ago

Or to summarize and rephrase clearly (I hope): let's narrow in on supporting configuring the rootfs in the container image, because that was already extant before and is IMO by far the most obvious one to configure and most impactful to the user/admin experience.

Yes that's clear. But I also think we shouldn't oversimplify and should think ahead a bit. At some point, our configuration structures will have to become stable. I don't think we're there yet, but we will be soon. So I'd like for us to have a good, high-level view of where we're going. If we plan for the container partitioning metadata to grow, a simple flat JSON file won't cut it, so we might have to start thinking about the structure of that data now.

I feel like I'm overthinking this a bit, or maybe it's just premature optimisation, but I don't want to end up with an incomprehensible and unmaintainable config structure.

The technical implementation of that will eventually hopefully drive more unification/configuration down the line. Did containers/bootc#272 look OK? It seems like it wouldn't be hard to use in bib, but again we can create whatever protocol we want here.

It's not hard to use in bib, no. There's one thing to note here though: If we can retrieve the information from the container without downloading the whole image, that's a lot easier to integrate in how we do things currently. This is what I was hoping we could have if we put this kind of information in labels, because then one can inspect the container and prepare the manifest before the data needs to be downloaded, which fits the current osbuild flow. If we don't go that route and instead need to run the container to get this information, we'll have to do a bit of work for the manifest generation phase. It's also a lot less efficient because the image builder flow in general separates manifest generation (and metadata collection) from building (for example, manifest generation and building don't even happen on the same machine in the service). If any of this ever ends up in the image builder service, we might have to rethink some things.
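
(A sketch of the label idea: skopeo inspect only needs the manifest and config, so a label could be read before any layers are downloaded. The label name here is purely illustrative, not an agreed convention.)

skopeo inspect --format '{{ index .Labels "bootc.rootfs" }}' \
  docker://quay.io/centos-bootc/fedora-bootc:eln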

achilleas-k commented 8 months ago

I think we should start a separate issue (or discussion) about partitioning.

achilleas-k commented 7 months ago

Partition table and filesystem discussion started at https://github.com/osbuild/bootc-image-builder/discussions/147

cgwalters commented 7 months ago

If we can retrieve the information from the container without downloading the whole image, that's a lot easier to integrate in how we do things currently. This is what I was hoping we could have if we put this kind of information in labels, because then one can inspect the container and prepare the manifest before the data needs to be downloaded, which fits the current osbuild flow.

It's possible, and I'm OK to change to a label. However...and this does relate to the partition table one; there seems to be a fundamental clash going on here between "is the container image canonical/source-of-truth" versus "is the manifest the source of truth".

At a technical level today, osbuild wants the "make a filesystem" stage to be fully fleshed out, and there's no way for stages to pass information between each other, so the filesystem stage would have to grow something like a "bootc-detect: true" option? Or we'd have to define an information-passing mechanism.

I feel like a much simpler and container-native architecture that largely continues the goals of the idea of a "manifest" would be basically achieved directly via what we have already done here:

Then, I have a high degree of reproducibility by simply version locking (or pulling via @sha256 digest) the three components of a disk image build: the bib image, the bib config (optional), and my target bootc image.
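
(A hedged sketch of that "version lock by digest" idea; the image names, flags, and digests below are placeholders, not a fixed interface:)

podman run --rm -it --privileged \
  --security-opt label=type:unconfined_t \
  -v ./config.toml:/config.toml -v ./output:/output \
  quay.io/osbuild/bootc-image-builder@sha256:<bib-digest> \
  --config /config.toml \
  quay.io/example/my-bootc-image@sha256:<target-digest>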

I know there's a lot of history here, and it's quite possible I'm missing something; and AIUI running as a container would likely require changes to the current hosted service, which is not trivial and probably argues for trying to find shorter-term and easier changes to achieve container-native goals. But it's also important to try to have those short steps be aligned with a longer-term roadmap.

achilleas-k commented 7 months ago

there seems to be a fundamental clash going on here between "is the container image canonical/source-of-truth" versus "is the manifest the source of truth".

Unless the container itself is doing the disk creation, I don't see where the clash is. I don't view the manifest as the source of truth (generally). In some respects it is, but only as far as osbuild is concerned. Meaning, it's the instructions for what osbuild is supposed to do. If osbuild isn't going to do it (create the disk), then that probably won't appear in the manifest. If it will, then that info has to flow into osbuild somehow... that's the manifest.

At a technical level today, osbuild wants the "make a filesystem" stage to be fully fleshed out

I don't see how this would work any other way. What's the alternative? Again, if it's going to be creating the disk, it needs to know what to create. If there's some other tool that's part of the container that's going to do it, then osbuild doesn't need to know about it.