kairos-io / kairos

The immutable Linux meta-distribution for edge Kubernetes.
https://kairos.io
Apache License 2.0

Implement systemd-boot boot assessment #2864

Open jimmykarily opened 2 months ago

jimmykarily commented 2 months ago

systemd-boot has a way to perform boot assessment and fall back to other entries if booting fails. It is described in detail here and here. It's not very complicated: it only requires us to name the conf/efi files in a certain way and to make sure we order entries properly (so that the right one is picked as a fallback).
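
As a rough illustration of the naming convention mentioned above, here is a minimal Python sketch that parses the boot counter out of an entry file name. It assumes the `NAME+LEFT-DONE.conf` / `NAME+LEFT.conf` scheme from the systemd documentation; the `Entry` type and `parse_entry` helper are hypothetical names used only for this example and are not part of Kairos or systemd.

```python
import re
from typing import NamedTuple, Optional

# NAME, optionally followed by "+LEFT" or "+LEFT-DONE", then the extension.
_ENTRY_RE = re.compile(
    r"^(?P<name>.+?)"
    r"(?:\+(?P<left>[0-9]+)(?:-(?P<done>[0-9]+))?)?"
    r"(?P<ext>\.conf|\.efi)$"
)


class Entry(NamedTuple):
    name: str
    tries_left: Optional[int]  # None means boot counting is off for this entry
    tries_done: Optional[int]
    ext: str


def parse_entry(filename: str) -> Entry:
    """Split an entry file name into its name, counters and extension."""
    m = _ENTRY_RE.match(filename)
    if m is None:
        raise ValueError(f"not a boot entry file name: {filename!r}")
    left = m.group("left")
    done = m.group("done")
    if left is None:
        return Entry(m.group("name"), None, None, m.group("ext"))
    # A missing "-DONE" part is treated as zero attempts so far.
    return Entry(m.group("name"), int(left), int(done or 0), m.group("ext"))


def is_bad(entry: Entry) -> bool:
    """An entry with counting enabled and no tries left is considered bad."""
    return entry.tries_left == 0
```

For example, `parse_entry("active+3.conf")` yields an entry named `active` with three tries left, while `parse_entry("active.conf")` reports that counting is disabled.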

Note: Originally investigated while documenting how Kairos does boot assessment.

bencorrado commented 1 month ago

I can help test this when someone is ready for testing.

I was also thinking about how the system moves from a failed active AND passive into recovery or reset.

Right now recovery requires human intervention and doesn't load any sysext options, so it has to be pretty bare bones as we are keeping UKI images small. I was thinking about building an auto-update script for recovery that runs and tries to fix active/passive by running an upgrade and/or checking an HTTPS website for instructions. It would then not automatically update the systemd-boot count for recovery, and would instead let a successful active/passive boot reset the count for recovery. This would make sure that if recovery fails to recover the system after X attempts, a reset is triggered, which hopefully can do a better job of setting everything right and blowing away filesystems to clean things up.

jimmykarily commented 1 month ago

Planning decision:

Let's implement the default fallback mechanism of systemd first and then see if we can implement the auto-reset feature using stages and such (extract to a different ticket when the first part is done).

Being able to auto-reset a system that doesn't boot makes sense, especially in cases like:

Itxaka commented 4 days ago

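with the given patch it seems to work BUT

* we are missing the systemd-bless-boot service and binary which changes the tries left/used so after 3 boots the entries are marked as bad

* even if we make that work, it will not work because we mount the efi partition RO

2 possible outcomes:

* Mount EFI as RW during initramfs, remount it as RO at the end of the UKI boot process

* Create our own service that remounts as RW, changes the current entry (mark good basically) and remounts RO

thoughts @kairos-io/maintainers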

Itxaka commented 4 days ago

Basically this is the expected workflow of the boot assessment, for reference: https://systemd.io/AUTOMATIC_BOOT_ASSESSMENT/

Important part below

Let’s say the second boot succeeds. The kernel initializes properly, systemd is started and invokes all generators.

One of the generators started is systemd-bless-boot-generator which detects that boot counting is used.
It hence pulls systemd-bless-boot.service into the initial transaction.

systemd-bless-boot.service is ordered after and Requires= the generic boot-complete.target unit. This unit is hence also pulled into the initial transaction.

The boot-complete.target unit is ordered after and pulls in various units that are required to succeed for the boot process to be considered successful. 
One such unit is systemd-boot-check-no-failures.service.

systemd-boot-check-no-failures.service is run after all its own dependencies completed, and assesses that the boot completed successfully. It hence exits cleanly.

This allows boot-complete.target to be reached. This signifies to the system that this boot attempt shall be considered successful.

Which in turn permits systemd-bless-boot.service to run. It now determines which boot loader entry file was used to boot the system, and renames it dropping the counter tag. Thus 4.14.11-300.fc27.x86_64+1-2.conf is renamed to 4.14.11-300.fc27.x86_64.conf. From this moment boot counting is turned off for this entry.
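
The counter life cycle described in the quoted steps can be sketched in a few lines of Python. This is only a model of the file renames, not how systemd implements them: `after_boot_attempt` and `bless` are hypothetical helper names, and leaving an entry with zero tries left untouched is an assumption of this sketch.

```python
import re

# Matches a trailing "+LEFT" or "+LEFT-DONE" counter tag plus the extension.
_COUNTER = re.compile(r"\+([0-9]+)(?:-([0-9]+))?(\.conf|\.efi)$")


def after_boot_attempt(filename: str) -> str:
    """Model the rename done when an entry with a counter is booted:
    one try moves from 'left' to 'done'."""
    m = _COUNTER.search(filename)
    if m is None:
        return filename  # boot counting is off for this entry
    left = int(m.group(1))
    done = int(m.group(2) or 0)
    if left == 0:
        return filename  # assumption: an exhausted entry is left untouched
    stem = filename[: m.start()]
    return f"{stem}+{left - 1}-{done + 1}{m.group(3)}"


def bless(filename: str) -> str:
    """Model what marking an entry as good does: drop the counter tag."""
    return _COUNTER.sub(lambda m: m.group(3), filename)
```

With these helpers, `after_boot_attempt("4.14.11-300.fc27.x86_64+2-1.conf")` gives the `+1-2` name from the quoted text, and `bless` on that result gives `4.14.11-300.fc27.x86_64.conf`.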

Itxaka commented 4 days ago

Mount EFI as RW during initramfs, remount it as RO at the end of the UKI boot process

I don't think this works for us, as we need to wait for boot-complete.target, which is reached in userspace rather than in the initramfs.

We could also have a manual service that runs after systemd multi-user.target

mudler commented 4 days ago

with the given patch it seems to work BUT

* we are missing the systemd-bless-boot service and binary which changes the tries left/used so after 3 boots the entries are marked as bad

* even if we make that work, it will not work because we mount the efi partition RO

2 possible outcomes:

* Mount EFI as RW during initramfs, remount it as RO at the end of the UKI boot process

mmh complex but doable, the only challenge I see there is to fire the systemd services exactly in that timeframe, not sure if possible if not by calling systemd-bless-boot inside immucore

* Create our own service that remounts as RW, changes the current entry (mark good basically) and remounts RO

That looks like the sanest solution at this point. However, my only concern here is whether systemd-bless-boot will get more business logic from systemd that we might miss. Wouldn't it at this point be equivalent to calling systemd-bless-boot from immucore directly?

Itxaka commented 4 days ago

mmh complex but doable, the only challenge I see there is to fire the systemd services exactly in that timeframe, not sure if possible if not by calling systemd-bless-boot inside immucore

Yeah, after checking more deeply this won't work, as the bless happens once the system is fully up, so in userspace once systemctl reports everything as running. That is out of immucore's control, unfortunately.

* Create our own service that remounts as RW, changes the current entry (mark good basically) and remounts RO

That looks like the sanest solution at this point. However, my only concern here is whether systemd-bless-boot will get more business logic from systemd that we might miss. Wouldn't it at this point be equivalent to calling systemd-bless-boot from immucore directly?

Seems like we may be able to do it ourselves by just calling the binary, so mimicking the bless service but with extra steps. Maybe even with a simple override that runs pre and post commands for the mounts, so we don't need to reimplement the whole thing.

bencorrado commented 4 days ago

Maybe even with a simple override to run pre and post for the mounts. So we don't need to reimplement the whole thing

That was exactly what I was thinking. We need to modify the path for systemd-bless-boot anyway since we don't use /boot

Maybe changing systemd-bless-boot.service with an override file to have something like:

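[Service]
# Remount /efi as read-write before starting the main service
ExecStartPre=/usr/bin/mount -o remount,rw /efi

# Clear the ExecStart= inherited from the stock unit first; without the empty
# assignment a drop-in appends a second command instead of replacing the original
ExecStart=
# Modify ExecStart to include --path=/efi (the stock unit normally runs the
# binary from the systemd libexec directory, so adjust the path to your image)
ExecStart=/usr/lib/systemd/systemd-bless-boot good --path=/efi

# Remount /efi as read-only after the service completes
ExecStartPost=/usr/bin/mount -o remount,ro /efi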

Itxaka commented 4 days ago

Maybe even with a simple override to run pre and post for the mounts. So we don't need to reimplement the whole thing

That was exactly what I was thinking. We need to modify the path for systemd-bless-boot anyway since we don't use /boot

Maybe changing systemd-bless-boot.service with an override file to have something like:

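[Service]
# Remount /efi as read-write before starting the main service
ExecStartPre=/usr/bin/mount -o remount,rw /efi

# Clear the ExecStart= inherited from the stock unit first; without the empty
# assignment a drop-in appends a second command instead of replacing the original
ExecStart=
# Modify ExecStart to include --path=/efi (the stock unit normally runs the
# binary from the systemd libexec directory, so adjust the path to your image)
ExecStart=/usr/lib/systemd/systemd-bless-boot good --path=/efi

# Remount /efi as read-only after the service completes
ExecStartPost=/usr/bin/mount -o remount,ro /efi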

I actually tested this with overrides for remounting the partition and it worked as expected. I think it gets the path automatically, either by identifying the partition type or from the systemd-boot efivars, but it does actually work as expected.

Itxaka commented 3 days ago

With this override the bless-boot service works:

### /etc/systemd/system/systemd-bless-boot.service.d/override.conf

[Service]

ExecStartPre=mount -o remount,rw /efi
ExecStartPost=mount -o remount,ro /efi

Note that we also need to override another service, systemd-boot-random-seed, as it is pulled in automatically and needs write access to the EFI partition:

### /etc/systemd/system/systemd-boot-random-seed.service.d/override.conf

[Service]

ExecStartPre=mount -o remount,rw /efi
ExecStartPost=mount -o remount,ro /efi

Itxaka commented 3 days ago

There is still an issue, but we can work around it with this:

[Service]

ExecStartPre=mount -o remount,rw /efi
ExecStartPost=sed -i -E 's/(default\s+)*\+[0-9]+(-[0-9]+)?(\.conf)/\1\3/' /efi/loader/loader.conf
ExecStartPost=mount -o remount,ro /efi

In our loader.conf we set the specific config that we want to boot, for example active.conf. With boot assessment this is automatically set to something like active+3.conf.

The main problem is that when bless-boot marks a config as good after booting, it renames it to remove the boot assessment counter, so active+3.conf turns into active.conf. But loader.conf is not updated, so it still points to active+3.conf, which no longer matches the actual config file. There is glob support in the default stanza, but that is not good enough when we have extra EFIs with different cmdlines: we want to match either the name or the name plus boot assessment counter, not a greedy match which could lead to picking activeBad.conf.
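
To make the glob concern concrete, here is a small Python sketch comparing a greedy glob with a stricter match. It uses Python's `fnmatch` only as a stand-in for shell-style globbing in general, so it is not a statement about exactly how systemd-boot evaluates the `default` pattern. The `matches_entry` helper is a hypothetical name for this example.

```python
import fnmatch
import re


def matches_entry(filename: str, name: str) -> bool:
    """True only for NAME.conf, NAME+LEFT.conf or NAME+LEFT-DONE.conf."""
    pattern = rf"^{re.escape(name)}(\+[0-9]+(-[0-9]+)?)?\.conf$"
    return re.match(pattern, filename) is not None


candidates = ["active.conf", "active+3.conf", "active+2-1.conf", "activeBad.conf"]

# A glob such as "active*.conf" also picks up the unrelated entry.
greedy = [f for f in candidates if fnmatch.fnmatch(f, "active*.conf")]

# The stricter match keeps only the entry with or without its counter tag.
strict = [f for f in candidates if matches_entry(f, "active")]
```

Here `greedy` contains all four names, including `activeBad.conf`, while `strict` contains only the three `active` variants.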

So to fix that, we can use the service itself to remove any mentions of the boot assessment part in the loader.conf with sed :D

I tested this with an active+3.conf, which turns into active+2-1.conf on the first boot due to how assessment works. Then bless-boot triggered and marked it as good, changing the conf to active.conf. Then sed removed the +3 part from the loader.conf entry correctly.
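
For anyone who wants to check what the sed expression does without touching a real ESP, here is a Python sketch that applies the same pattern to the text of a loader.conf. It is only an approximation of the sed call (first match per line, like sed without the `g` flag); `strip_boot_counters` is a hypothetical helper name and the sample contents are made up for illustration.

```python
import re

# Same pattern as the sed expression in the override above.
_PATTERN = re.compile(r"(default\s+)*\+[0-9]+(-[0-9]+)?(\.conf)")


def _replacement(match: re.Match) -> str:
    # Equivalent of sed's \1\3; group 1 may not have participated in the match.
    return (match.group(1) or "") + match.group(3)


def strip_boot_counters(loader_conf: str) -> str:
    """Remove '+LEFT' / '+LEFT-DONE' tags from .conf names, line by line."""
    lines = loader_conf.splitlines(keepends=True)
    return "".join(_PATTERN.sub(_replacement, line, count=1) for line in lines)


# Made-up sample loader.conf contents.
sample = "default active+3.conf\ntimeout 5\n"
cleaned = strip_boot_counters(sample)
```

After this runs, `cleaned` has `default active.conf` on its first line and the `timeout` line is unchanged.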

I think we can work with this. I will test it further but seems to work as expected.

Moving pieces needed to fully implement this: