ostreedev / ostree

Operating system and container binary deployment and upgrades
https://ostreedev.github.io/ostree/
1.26k stars 290 forks

ex-integrity.composefs: Tracking issue #2867

Open cgwalters opened 1 year ago

cgwalters commented 1 year ago

composefs/ostree (and beyond)

Background

A key design goal of ostree at its creation was to not require any new functionality in the Linux kernel. The baseline mechanisms of hard links and read-only bind mounts suffice to manage views of read-only filesystem trees.

However, for Docker and later podman, overlayfs was created to more efficiently support copy-on-write semantics. Crucially, overlayfs is a layered filesystem: it can work with any (modern) underlying Linux filesystem as a backend.

More recently, composefs was created which builds on overlayfs with more integrity features. This tracking issue is for the integration of composefs and ostree.

System integrity

ostree does not provide significant support for truly immutable system state; a simple mount -o remount,rw /usr will allow direct persistent modification of the underlying files.

There is ostree fsck, but it is inefficient and manual, and today it still does not cover the checked-out deployment roots (so e.g. newly added binaries in a deployment root aren't detected).

Accidental damage protection

It is important to ostree to support "user owns machine" scenarios, where the user is root on their own computer and must have the ability to make persistent changes.

But it's still useful to have stronger protection against accidental damage. Due to the way composefs works using fs-verity, a simple mount -o remount,rw can no longer silently modify files. First, the mounted composefs is always read-only; there is no write support in composefs. Access to the distinct underlying persistent root filesystem can be more strongly separated and isolated.
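The tamper-evidence comes from fs-verity hashing file contents into a Merkle tree: flipping any bit changes the measured root digest, so modification can be detected on read. A much-simplified illustrative sketch of the idea (real fs-verity uses a descriptor structure and per-level layout this does not reproduce):

```python
import hashlib

BLOCK_SIZE = 4096  # fs-verity's default block size

def merkle_root(data: bytes) -> str:
    """Hash each block, then repeatedly hash pairs of digests
    until a single root digest remains (simplified)."""
    level = [
        hashlib.sha256(data[i:i + BLOCK_SIZE]).digest()
        for i in range(0, max(len(data), 1), BLOCK_SIZE)
    ]
    while len(level) > 1:
        level = [
            hashlib.sha256(b"".join(level[i:i + 2])).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()

original = b"A" * 10000
tampered = b"A" * 9999 + b"B"  # a single byte flipped
# Any modification of the backing data changes the root digest:
assert merkle_root(original) != merkle_root(tampered)
```

Because the kernel verifies blocks against this tree on every read, silently editing the backing objects after a remount,rw produces read errors rather than silently changed content.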

Support for "sealed" systems

It's however also desirable to support a scenario where an organization wants to produce computing devices that are "sealed" to run only code produced (or signed) by that organization. These devices should not support persistent unsigned code.

ostree does not have strong support for this model today, and composefs should fix it.

Phase 0: Basic integration (experimental)

In this phase, we will land an outstanding pull request which adds basic integration that enables booting a system using composefs as a root filesystem. In this phase, a composefs image is dynamically created on the client using the ostree metadata.
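For experimentation, this integration is gated behind a repo config flag (the ostree config command for it appears later in this thread). To a first approximation, setting ex-integrity.composefs adds a keyfile entry like the following; the exact section layout is an assumption based on ostree's group.key config naming:

```ini
# Hypothetical excerpt of /ostree/repo/config after running
# "ostree config --repo=/ostree/repo set ex-integrity.composefs true".
# Requires a libostree built with composefs support.
[ex-integrity]
composefs=true
```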

This has already led us to multiple systems integration issues. So far, all tractable.

A good milestone to mark completion of this phase is landing a CI configuration to ostree which builds and deploys a system using composefs, and verifies it can be upgraded.

In this phase, there is no direct claimed support for "sealed" systems (i.e. files are not necessarily signed).

Phase 1: Basic rootfs sealing (experimental)

In this phase, support for signatures covering the composefs image is added. A key question is when the composefs file format will be declared stable: because the PR up to this point defaults to re-synthesizing the composefs on the client, the client must reproduce exactly what was generated and signed server-side.
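The reproducibility requirement boils down to a bit-for-bit gate: the digest of the client-synthesized image must equal the digest the server signed, otherwise the signature is useless. A hypothetical sketch of that check (function names are illustrative, not ostree API, and a plain SHA-256 stands in for the real fs-verity digest):

```python
import hashlib

def fsverity_style_digest(image: bytes) -> str:
    # Stand-in for the real fs-verity digest computation.
    return hashlib.sha256(image).hexdigest()

def verify_reproduced(client_image: bytes, signed_digest: str) -> bool:
    """Reject the deployment unless the locally re-synthesized
    composefs image matches what was generated and signed server-side."""
    return fsverity_style_digest(client_image) == signed_digest

server_image = b"composefs-image-v1"               # built and signed server-side
signed_digest = fsverity_style_digest(server_image)
assert verify_reproduced(b"composefs-image-v1", signed_digest)
# Any format drift between the server and client mkcomposefs breaks this:
assert not verify_reproduced(b"composefs-image-v1-drift", signed_digest)
```

This is why format stability matters: if the client's composefs writer produces even slightly different bytes than the server's, the digests diverge and signature verification fails.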

Phase 2: Secure Boot chaining (experimental)

This phase will document how to create a complete system using Secure Boot which chains to a root filesystem signature using composefs.

This may also depend on https://github.com/ostreedev/ostree/issues/2753 and https://github.com/ostreedev/ostree/issues/1951

Here is a sketch for how we can support trusted boot using composefs and fs-verity signatures.

During build:

  1. Generate a public/private key pair
  2. Copy the public key into the new rootfs for the commit (e.g. /etc/pki/fsverity/cfs.pub)
  3. During initrd generation in the rootfs, pass --install /etc/pki/fsverity/cfs.pub to dracut, which will copy the public key into the initrd.
  4. Add a module to dracut that loads the public key into the fs-verity keyring (see https://gitlab.com/CentOS/automotive/rpms/dracut-fsverity for an example)
  5. Generate a UKI or aboot image containing the above initrd, the kernel and kernel command line. The kernel command line uses a generic ostree=latest argument, because at this point we don't know the final deployment id. See also discussion in https://github.com/ostreedev/ostree/pull/2844
  6. Sign the UKI with a private key that is trusted by your secureboot keyring.
  7. Stuff the UKI into the rootfs next to the normal kernel
  8. Save the rootfs as objects in the ostree repo, giving the digest of the rootdir
  9. ostree commit normally just stores the above digest in the metadata. But now we also take the private key from step 1 (passed as argument to ostree commit) and generate a composefs image file based on the rootdir digest. We sign this file with the private key and store the signature as extra metadata in the commit object.
  10. The entire commit object is GPG signed and pushed to a repo.
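Steps 3 and 4 above can also be expressed as a dracut configuration drop-in instead of command-line flags. A sketch, where the key path follows the example above and the module name "fsverity" is an assumption modeled on the dracut-fsverity example linked in step 4:

```shell
# /etc/dracut.conf.d/90-fsverity.conf (illustrative)
# Copy the fs-verity public key into the initrd
# (equivalent to passing --install on the dracut command line):
install_items+=" /etc/pki/fsverity/cfs.pub "
# Enable a module that loads the key into the fs-verity keyring,
# e.g. the dracut-fsverity example linked above (name assumed):
add_dracutmodules+=" fsverity "
```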

During install:

  1. Pull a new commit
  2. Verify the GPG signature
  3. When deploying we look at the metadata from the commit, in particular the rootdir digest and the signature. The rootdir digest (and the repo objects) is used to construct a new composefs file, and the signature is used to sign the composefs image file when enabling fs-verity on it.
  4. The BLS files are created in /boot/loader.[01], pointing to the deploy dir with the composefs file, and the /boot/loader symlink is atomically switched to the new loader.[01] dir. This BLS file contains the deploy id we deployed into, in the kernel ostree=... arg.
  5. The UKI is put somewhere where the boot loader can find it. (EFI partition, aboot partition, etc)
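The atomic switch in step 4 is the classic symlink-swap pattern: prepare the new directory, create a temporary symlink to it, then rename(2) over the live symlink so readers see either the old or the new target, never a partial state. A sketch of the pattern (illustrative; the real work is done by ostree's deployment code):

```python
import os
import tempfile

def atomic_symlink_swap(link_path: str, new_target: str) -> None:
    """Atomically repoint link_path at new_target: rename(2) over an
    existing symlink is atomic, so readers never see a missing link."""
    tmp_link = link_path + ".tmp"
    if os.path.lexists(tmp_link):
        os.unlink(tmp_link)
    os.symlink(new_target, tmp_link)
    os.rename(tmp_link, link_path)

# Demo in a throwaway directory standing in for /boot:
boot = tempfile.mkdtemp()
os.mkdir(os.path.join(boot, "loader.0"))
os.mkdir(os.path.join(boot, "loader.1"))
link = os.path.join(boot, "loader")
atomic_symlink_swap(link, "loader.0")   # initial deployment
atomic_symlink_swap(link, "loader.1")   # upgrade: flip to the new dir
assert os.readlink(link) == "loader.1"
```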

During boot:

  1. The firmware loads the UKI, verifies it according to the secureboot keyring and boots kernel+initrd.
  2. The initrd mounts the disk partitions.
  3. The initrd notices the kernel arg "ostree=latest" and looks for the BLS file in /boot/loader with index=1 (i.e. most recent deploy, or index=2 if we're in fallback mode).
  4. The initrd parses the BLS file, which contains the full ostree=... argument. This lets us find the deploy directory (like /ostree/deploy/fedora-coreos/deploy/443ae0cd86a7dd4c6f5486a2283471b3c8f76fc5dcc4766cf935faa24a9e3d34.0). (Note at this point that we can't trust either the BLS file or the deploy dir.)
  5. The initrd loads /etc/pki/fsverity/cfs.pub into the kernel keyring for fs-verity. (Trusted, as it's in the signed initrd.)
  6. The initrd mounts the composefs with the LCFS_MOUNT_FLAGS_REQUIRE_SIGNATURE flag. This ensures that the file to be mounted has a signature, and thus can only be read if the matching public key is loaded in the keyring.
  7. On top of the composefs we bind mount writable things like /var and /etc.
  8. Pivot-root into the new composefs mount, which will now verify that all further reads from the read-only parts of the rootfs are valid.
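Steps 3 and 4 of the boot flow amount to parsing a Boot Loader Specification entry and extracting the ostree= argument from its options line. A minimal sketch (the entry contents and paths below are hypothetical):

```python
def ostree_arg_from_bls(entry_text: str) -> str:
    """Extract the ostree=... karg from a BLS entry's 'options' line."""
    for line in entry_text.splitlines():
        key, _, value = line.strip().partition(" ")
        if key == "options":
            for karg in value.split():
                if karg.startswith("ostree="):
                    return karg
    raise ValueError("no ostree= argument in BLS entry")

# Hypothetical /boot/loader/entries/ostree-1.conf contents:
entry = """\
title Fedora CoreOS
linux /ostree/vmlinuz
initrd /ostree/initramfs.img
options root=UUID=abcd rw ostree=/ostree/boot.1/fedora-coreos/443ae0/0
"""
assert ostree_arg_from_bls(entry) == "ostree=/ostree/boot.1/fedora-coreos/443ae0/0"
```

As the sketch makes obvious, nothing here is authenticated: the BLS file and the deploy dir it points at are untrusted input, which is exactly why the composefs mount in step 6 must require a signature.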

Beyond

At this point, we should have gained significant experience with the system. We will determine when to mark this as officially stabilized after this.

Phase 3: "Native composefs"

Instead of "ostree using composefs", this proposes to flip things around, so that more code lives underneath the composefs project. A simple strawman proposal is that the equivalent of ostree-prepare-root.service would actually be composefs-prepare-root.service and live in github.com/containers/composefs.

Related issues:

Phase 4: Unified container and host systems

This phase builds on the native composefs for hosts and ensures that containers (e.g. podman) share backing storage with the host system and as much code as possible.

alexlarsson commented 1 year ago

About the composefs file format stability: the plan is to guarantee stability in general, and there is a way to change the format by specifying a version when you generate the file. However, I don't want to give any stability guarantees until the overlay xattr changes have landed in the upstream kernel, because only then do we know they will not change.

alexlarsson commented 1 year ago

Ok, I ran into a snag with this approach:

When doing an update, the new deployment is written, and when we enable fs-verity on it with the signature, fs-verity fails. The reason is that the new signature is made with the new certificate, and the corresponding public key is not in the kernel keyring at deploy time.

We have a similar issue at image building time, where we would need to load the public key into the keyring of the host (i.e. build) machine.

It doesn't feel right to load keys like this into the keyring at any time other than boot (and the keyrings are bound to be sealed anyway). So I think we need to delay the application of the signature to the first boot, as we can then guarantee that the right keys are loaded.

cgwalters commented 1 year ago

This issue will be discussed this Friday at 9:00am EST in https://meet.jit.si/moderated/2e9be89e0e9ee06647b4719784578a6251f72eec9a07829bc9212e57c4883816

alexlarsson commented 1 year ago

I wrote down some random ramblings about the Phase 3 approach to kickstart the meeting/discussion:

Basic assumptions:

Points that need consideration:

travier commented 1 year ago

More recently, composefs was created which builds on overlayfs with more integrity features. This design document describes the high

Looks like this sentence is cut before the end (from the first comment)

travier commented 1 year ago

The initrd notices the kernel arg "ostree=latest" and looks for the BLS file in /boot/loader with index=1 (i.e. most recent deploy, or index=2 if we're in fallback mode).

How do we know that we are in fallback mode?

travier commented 1 year ago

But who will be responsible for the boot part of the deployment, like generating BLS files, putting initrds from the images in the right place, merging /etc, rollback, etc.? It's not clear where the border between composefs and ostree lies.

This is done by ostree & rpm-ostree in Fedora CoreOS for example.

travier commented 1 year ago

You can't even look at the contents for any composefs image other than the one you booted (at least when using per-build keys).

Isn't it possible to validate fs-verity signatures from userspace without requiring the key to be loaded in the kernel?

travier commented 1 year ago

+1 from me for this approach in general. Thanks for writing it up!

travier commented 1 year ago

The BLS files are created in /boot/loader.[01], pointing to the deploy dir with the composefs file, and the /boot/loader symlink is atomically switched to the new loader.[01] dir. This BLS file contains the deploy id we deployed into, in the kernel ostree=... arg.

Note that in some cases (direct UEFI boot, with or without systemd-boot, with UKIs), there won't be BLS configs or they won't be used. See https://github.com/ostreedev/ostree/issues/2753#issuecomment-1488587533 for an alternative proposal: let the initrd figure out which entry was booted and which ostree deployment should be used by storing the ostree deployment hash in the filename of the UKI and then reading it from the EFI variables in the initrd.

alexlarsson commented 1 year ago

You can't even look at the contents for any composefs image other than the one you booted (at least when using per-build keys).

Isn't it possible to validate fs-verity signatures from userspace without requiring the key to be loaded in the kernel?

The way fs-verity signatures work right now is that they are verified by the kernel automatically when you open the file.

If we had a standalone signature file paired with the non-signed composefs file we could do the validation in userspace like this. But if the composefs file was signed we can't even look at it until we've loaded the right key into the kernel.

osalbahr commented 1 year ago

Was the meeting recorded? I wanted to join but accidentally overslept.

cgwalters commented 1 year ago

Was the meeting recorded? I wanted to join but accidentally overslept.

Sorry, it wasn't. Probably should have. We decided to make this a recurring meeting, so there will be another one on Friday June 16 at the same time (9:30am EST).

I may also argue at some point that this should be a composefs meeting and not an ostree meeting and we'd do it alongside or in the github.com/containers context.

alexlarsson commented 1 year ago

So, there is a keyctl_pkey_verify() syscall: https://man7.org/linux/man-pages/man3/keyctl_pkey_sign.3.html I think using this during mount to verify a signature file is much better and more flexible than using the built-in fs-verity signatures, because you can then both access the composefs image file without the key, and enable fs-verity on it without knowing the public key. I will have a look at adding support for this to libcomposefs.

ericcurtin commented 1 year ago

This issue will be discussed this Friday at 9:00am EST in https://meet.jit.si/moderated/2e9be89e0e9ee06647b4719784578a6251f72eec9a07829bc9212e57c4883816

Is there an .ics file, etc. for this meeting so I can add to my calendar?

cgwalters commented 1 year ago

Is there an .ics file, etc. for this meeting so I can add to my calendar?

I spent about 10 minutes doing a web search trying to figure out how to craft an ics file by hand with no luck...

ericcurtin commented 1 year ago

I got one anyway, thanks. Pity it's not a really simple right-click "export calendar event" type of thing.

osalbahr commented 1 year ago

I’d assume most email providers have that built-in? At least Gmail does. Or at least a workaround https://webapps.stackexchange.com/questions/114322/google-calendar-share-single-event-as-ics

cgwalters commented 1 year ago

OK I know I'm bouncing things around here; I want to take the subthread related to ostree from here https://github.com/containers/composefs/issues/151 back here because it's about reusing ostree's existing signature infrastructure.

Here's my strawman...

Does this make sense?

cgwalters commented 1 year ago

OK, I started drafting a change which cleans up the metadata bits slightly (I think we want an API for this and not a repo flag, it's more flexible), and also starts dropping the current custom signature bits https://github.com/ostreedev/ostree/pull/2891 in preparation for switching to the above plan.

cgwalters commented 1 year ago

OK #2891 merged - the big next step is "change ostree-prepare-root to link to libostree" and I have some WIP on that but I think we need to do a release first.

Beyond that, with this signature plan some concerns were raised about dragging GPG into the initramfs. I think there are multiple answers to this:

Finally, we can also look at supporting a custom basic signing flow (as was originally done here), although personally I would lean a lot more towards supporting a "standard" one in composefs over a new custom one for ostree.

alexlarsson commented 1 year ago

I think we should try just linking prepare-root to glib, and then compile the required headers and C files from libostree/ directly into the binary. Seems like the least work and most likely to work.

ericcurtin commented 1 year ago

Sometimes I wish initramfs was like a really simple persistent filesystem (like ext4/vfat or even smaller) with a throwaway overlayfs rather than a tmpfs... That way minimising initramfs would be less of a consideration...

alexlarsson commented 1 year ago

It's not just GPG btw; other dependencies we may not want are curl, libarchive, and libssh. Also, I don't think we can just disable GPG in the build. You may need GPG for e.g. flatpak in the system libostree, even though you don't use it in the initrd.

dbnicholson commented 1 year ago

I think we should try just linking prepare-root to glib, and then compile the required headers and C files from libostree/ directly into the binary. Seems like the least work and most likely to work.

A less ad-hoc way to do that would be to split part of libostree into a static libotcore (or something) like:

noinst_LTLIBRARIES += libotcore.la
libotcore_la_SOURCES = <files split out from libostree_1_la_SOURCES>
libotcore_la_LIBADD = libglnx.la $(OT_INTERNAL_GIO_UNIX_LIBS)
libostree_1_la_LIBADD += libotcore.la
ostree_prepare_root_LDADD += libotcore.la

cgwalters commented 1 year ago

Sometimes I wish initramfs was like a really simple persistent filesystem (like ext4/vfat or even smaller) with a throwaway overlayfs rather than a tmpfs... That way minimising initramfs would be less of a consideration...

Nothing stops one from doing that at all; the initramfs can easily mount the target disk and load data from some other partition (or files stored in the ESP), and use that to mount the root. It just brings into question what tool manages that state...I think the inherent tradeoffs in adding a new thing aren't worth it.

tkfu commented 1 year ago

@cgwalters we at Toradex are working on this on embedded systems (and by "this", I mean HAB with integrity checks through to the rootfs using ostree+composefs). Are you guys still having regular sync calls on this topic? We'd like to join if possible, and I think we can provide some insight into embedded use cases that could be helpful.

iho commented 1 year ago

Hi everyone! I used "ostree config --repo=/ostree/repo set ex-integrity.composefs true", but now I receive this kind of error on every command, even when I try "ostree config --repo=/ostree/repo set ex-integrity.composefs false": "error: opening repo: composefs required, but libostree compiled without support". Is there any way to fix it? I am new to immutable Fedora.

The page https://ostreedev.github.io/ostree/composefs/ links to this issue.

cgwalters commented 1 year ago

@iho Just run ostree config --repo=/ostree/repo set ex-integrity.composefs false ...

cgwalters commented 1 year ago

@tkfu

@cgwalters we at Toradex are working on this on embedded systems (and by "this", I mean HAB with integrity checks through to the rootfs using ostree+composefs). Are you guys still having regular sync calls on this topic? We'd like to join if possible, and I think we can provide some insight into embedded use cases that could be helpful.

Definitely, though I think it's usually most efficient to have high level discussions asynchronously and reserve realtime discussion for resolving "contentious" or difficult issues, as we did with the signature support thread.

If you see something not covered, feel free to just comment here; if it's big enough we can open a dedicated tracking issue for it too.

cgwalters commented 1 year ago

Reminder the meeting is today at 3:30pm CEST, 9:30am EST in https://meet.jit.si/moderated/2e9be89e0e9ee06647b4719784578a6251f72eec9a07829bc9212e57c4883816 cc @giuseppe - can you attend?

giuseppe commented 1 year ago

yes, I will attend

hsiangkao commented 1 year ago

(Sorry I missed the last meeting since I'm not quite good at oral english and missed the accurate time. I could attend on demand if something is on me if needed.)

ericcurtin commented 1 year ago

Sometimes I wish initramfs was like a really simple persistent filesystem (like ext4/vfat or even smaller) with a throwaway overlayfs rather than a tmpfs... That way minimising initramfs would be less of a consideration...

Nothing stops one from doing that at all; the initramfs can easily mount the target disk and load data from some other partition (or files stored in the ESP), and use that to mount the root. It just brings into question what tool manages that state...I think the inherent tradeoffs in adding a new thing aren't worth it.

So I couldn't help but try this; dunno if it's a wider thing for Fedora/CentOS Stream etc., but this could certainly prove useful for Automotive, where boot time requirements are strict and more and more is being asked to be added to the initramfs for early boot:

https://github.com/ericcurtin/initoverlayfs

It's not fully booting yet (I must fix up some switch-root stuff), but the left column shows monotonic times for an initrd/initramfs and the right shows monotonic times for initoverlayfs:

(image: boot time comparison)

iho commented 1 year ago

Any progress here?

ericcurtin commented 10 months ago

I got initoverlayfs booted without issue on Fedora (without ostree integration as of yet):

https://github.com/ericcurtin/initoverlayfs

The way it's implemented, you can get faster boot times if you strip the initramfs down to just storage drivers, udev, and a small C binary called pre-init (in the repo), whose only role is to switch to initoverlayfs once the storage devices and drivers have been initialized and exec systemd. initoverlayfs contains all the contents of a "fat" initramfs. It means you can scale initoverlayfs to any size without degrading boot time:

(image: boot time comparison)

If we did integrate this with ostree, it means you could consider even higher-level abstractions in prepare-root than glib, like Rust and/or C++, without degrading boot time because of increased binary size.

cgwalters commented 10 months ago

@ericcurtin Personally I'd actually have tried to write this as one statically linked Rust binary, reimplementing the bit of udev we need, but anyways...let's track this as a separate issue?

Filed https://github.com/ostreedev/ostree/issues/3066

cgwalters commented 10 months ago

EDIT: to be clear, thanks for working on this!

ldts commented 3 months ago

@cgwalters Since the ostree repo can still be remounted read-write, and therefore objects deleted or altered (yes, fs-verity will trigger a fault when accessing those files), aren't systems with ostree+composefs/fs-verity enabled still susceptible to denial-of-service attacks? Moreover, if the functional flow of such a system depends on certain files being present (which usually means those files being accessible), then by corrupting them after a deployment and rebooting we could actually be changing the system behavior. Just checking if I am missing something fundamental.

I am testing my deployments and it all works as expected btw (ostree+composefs+fs-verity)

travier commented 3 months ago

aren't systems with ostree+composefs/fs-verity enabled still susceptible to denial of service attacks?

This is not an attack specific to composefs. See for example for Android using dm-verity:

An approach would be to read / access the entire filesystem tree on boot (maybe in the background) to completely verify the content of the image. This however has a cost.

It will always be up to distributions to make sure that applications fail securely if a file is missing / an IO error is returned.

ldts commented 3 months ago

An approach would be to read / access the entire filesystem tree on boot (maybe in the background) to completely verify the content of the image. This however has a cost.

It will always be up to distributions to make sure that applications fail securely if a file is missing / an IO error is returned.

thanks @travier , yes that makes sense.

cgwalters commented 3 months ago

Since the ostree repo can still be remounted r,w and therefore objects deleted or altered - and yes fs-verity will trigger a fault when accessing those files, aren't systems with ostree+composefs/fs-verity enabled still susceptible to denial of service attacks?

Yes, but any sufficiently privileged code can also just open the raw disk device and delete the partition table too. This is a bit of a nuanced issue and it depends on the threat model. In the case of "container runtime breakout" it's possible that the container may have sufficient privileges to do "remount in host mount namespace, + rm", but not to open raw block devices - which can actually be gated more strictly by a targeted LSM/SELinux policy.

But if the case is "container breakout got kernel-mode code execution", there's just no inherent way to stop a DoS in that scenario.

ericcurtin commented 3 months ago

@cgwalters Since the ostree repo can still be remounted read-write, and therefore objects deleted or altered (yes, fs-verity will trigger a fault when accessing those files), aren't systems with ostree+composefs/fs-verity enabled still susceptible to denial-of-service attacks? Moreover, if the functional flow of such a system depends on certain files being present (which usually means those files being accessible), then by corrupting them after a deployment and rebooting we could actually be changing the system behavior. Just checking if I am missing something fundamental.

I am testing my deployments and it all works as expected btw (ostree+composefs+fs-verity)

Note also that if this results in boot failure, greenboot will roll back to the last known healthy boot. That boot could also get corrupted, but pretty much all verity techniques today are susceptible to this sort of denial-of-service attack.

Even verity techniques that don't rely on dm-verity/fs-verity, such as Android Verified Boot and UKIs, are also susceptible to this: change the bits in a UKI or Android Boot Image and the machine won't read/load it.

ricardosalveti commented 3 months ago

Yeah, even rolling back won't necessarily work in cases where you also want to protect against rollback attacks, so the only way would be to do something similar to Android and reboot into a recovery mode of some sort (application/product specific).

ericcurtin commented 3 months ago

Yeah, even rolling back won't necessarily work in cases where you also want to protect against rollback attacks, so the only way would be to do something similar to Android and reboot into a recovery mode of some sort (application/product specific).

Even recovery mode can get corrupted, it's kinda a never ending chain.

One feature OSTree has is that if you want more than A/B rollbacks, you can in theory have as many rollback deployments as you want (A/B/C/D...), but just A/B is common.

ldts commented 3 months ago

WRT fs-verity triggering on detected issues: does anyone know why the kernel doesn't implement a config option to just trigger a reboot on detection? Something along these lines (it might be a bit more complex, but this would be the idea), in a configurable way:

diff --git a/fs/overlayfs/namei.c b/fs/overlayfs/namei.c
index 80391c687c2ad..dbbec0a9c862c 100644
--- a/fs/overlayfs/namei.c
+++ b/fs/overlayfs/namei.c
@@ -991,8 +991,11 @@ int ovl_verify_lowerdata(struct dentry *dentry)
        int err;

        err = ovl_maybe_lookup_lowerdata(dentry);
-       if (err)
+       if (err) {
+               if (err == -ENOENT)
+                       BUG_ON(1);
                return err;
+       }

        return ovl_maybe_validate_verity(dentry);
 }
diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c
index 89e0d60d35b6c..fd039df0851d9 100644
--- a/fs/overlayfs/util.c
+++ b/fs/overlayfs/util.c
@@ -1309,6 +1309,7 @@ int ovl_validate_verity(struct ovl_fs *ofs,
                                          &verity_algo, NULL);
        if (digest_size == 0) {
                pr_warn_ratelimited("lower file '%pd' has no fs-verity digest\n", datapath->dentry);
+               BUG_ON(1);
                return -EIO;
        }

@@ -1317,6 +1318,7 @@ int ovl_validate_verity(struct ovl_fs *ofs,
            memcmp(metacopy_data.digest, actual_digest, xattr_digest_size) != 0) {
                pr_warn_ratelimited("lower file '%pd' has the wrong fs-verity digest\n",
                                    datapath->dentry);
+               BUG_ON(1);
                return -EIO;
        }

ericcurtin commented 3 months ago

@ldts could be a useful feature. Something you are interested in hacking on?

I think this would have to be dynamically switchable on/off, it's also dangerous to randomly power off sometimes.

Maybe it's the right thing to do in the boot path, for example (even then it may not be; it depends on your use case), but not after that?

How to alert the user of this problem is another question.

ldts commented 3 months ago

Sure, I wouldn't mind. Perhaps we should get @alexlarsson's input first, since he extended overlayfs with this support. I will follow up, but it seems to me that for the use case where the full filesystem demands integrity, it would be the right thing to do (so it must be configurable).

ericcurtin commented 3 months ago

Might be worth considering if this would integrate with:

systemd-bsod

alexlarsson commented 3 months ago

I don't really think that is a good approach. For example, in a safety situation, you might have some really important process running, and then some unimportant process hits an fs-verity issue, rebooting the system and stopping the important process. You might also be able to misuse this as a form of attack, i.e. loopback-mount a file with known-incorrect fs-verity data to reboot the system.

What might be more useful is to have the option of having the process issuing the failing operation get a signal that kills the process, say SIGBUS or something like that.