kairos-io / kairos

The immutable Linux meta-distribution for edge Kubernetes.
https://kairos.io
Apache License 2.0

[Spike] Make UKI systemd sysext solution more robust #2630

Closed jimmykarily closed 1 month ago

jimmykarily commented 5 months ago

Instead of XBOOTLDR, we decided to go further down the path of systemd sysext.

This ticket is to investigate the limits of this solution and find ways to support extensions that are as big as possible. We have reason to believe that the system extension size is limited to a fairly low value (~500MB), and we want to see if it's possible to load bigger ones.

Itxaka commented 5 months ago

Good test for this.

Hint: A way of skipping this would be to make it so the agent does not copy the extensions into the /EFI/kairos/active.efi.extra.d/ dir, since the stub autoloads them from there. We could just store them under /EFI/kairos/sysext and make immucore move those sysexts into /run/extensions on boot directly. That would skip systemd-stub loading the files into memory to measure them. As we are currently NOT using those measurements for anything, it would be a quick workaround.
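
A minimal sketch of the copy step in that workaround, written as a shell function. It assumes the extensions are stored as `*.raw` files in a directory on the EFI partition and are copied into `/run/extensions` at boot. The function name `stage_sysexts` is hypothetical, and the real logic would live in immucore rather than in a script.

```shell
#!/bin/sh
# Hypothetical helper: copy every *.raw sysext image from a source dir
# (e.g. /efi/EFI/kairos/sysext) into a target dir (e.g. /run/extensions).
# Prints the number of images copied.
stage_sysexts() {
    src=$1
    dst=$2
    count=0
    mkdir -p "$dst" || return 1
    for img in "$src"/*.raw; do
        # If the glob matched nothing, the literal pattern is left over; skip it.
        [ -f "$img" ] || continue
        cp "$img" "$dst/" || return 1
        count=$((count + 1))
    done
    echo "$count"
}

# Example (on a real system, as root):
#   stage_sysexts /efi/EFI/kairos/sysext /run/extensions
#   systemd-sysext refresh
```

Because the images never go through active.efi.extra.d/, the stub does not load them into memory, which is the point of the workaround. The trade-off is that they are no longer measured by the stub.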

Itxaka commented 5 months ago

FYI I can confirm that the workaround works.

Manually copied the firmware.raw extension into /run/extensions and refreshed and magically all the firmware appeared correctly :D

Itxaka commented 5 months ago

lol, nice, this works on systemd 256. I was able to auto-copy the firmware sysext and have it autoload on boot :D

jimmykarily commented 4 months ago

Before we close this, let's bump the sizes until things break, just to know what our limit is.

mudler commented 4 months ago

Depends on https://github.com/kairos-io/kairos/issues/2632

See also: #2595 https://github.com/kairos-io/packages/pull/916 https://github.com/kairos-io/packages/pull/919

jimmykarily commented 2 months ago

Just to clarify (because I was confused), the idea here is to test how big a file systemd 256 can manage before it breaks. If that limit is good enough for us, we don't need the workaround described by Itxaka in the Hint above. Or at least, we can advise people to implement it (with cloud config?) only if the 256 limits are not enough.

Let's find out what the systemd 256 limits are first (this ticket).

jimmykarily commented 2 months ago

For the following tests, I will be creating a sysext as described in the docs and installing with a config like this (I'm bundling the sysext in the ISO as described here):

#cloud-config

users:
  - name: kairos
    passwd: kairos

stages:
  after-install:
    - name: "Copy sysext"
      commands:
        - mkdir -p /tmp/efi
        - mount -L COS_GRUB -o rw /tmp/efi
        - cp /run/initramfs/live/my-extension.sysext.raw /tmp/efi/EFI/kairos/active.efi.extra.d/my-extension.sysext.raw

Update: turns out we've implemented the SysExtPostInstallHook that automatically copies from the livecd to the right location. The stages: part is not needed in the config above.
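
For anyone reproducing these tests, here is a rough sketch of how the contents of a test sysext with a filler file could be prepared. This is an assumption about the setup, not the exact procedure from the docs: the helper name `make_sysext_tree` is hypothetical, and the tree still has to be packed into a filesystem image afterwards (for example with squashfs tooling), following the docs.

```shell
#!/bin/sh
# Hypothetical helper: create the directory tree for a test sysext named $1
# under $2, containing a filler file of $3 bytes at usr/local/bin/bigfile.
make_sysext_tree() {
    name=$1
    root=$2
    bytes=$3
    mkdir -p "$root/usr/lib/extension-release.d" "$root/usr/local/bin" || return 1
    # ID=_any tells systemd-sysext not to match against the host's os-release ID.
    printf 'ID=_any\n' > "$root/usr/lib/extension-release.d/extension-release.$name"
    # Random data, so a compressing image format cannot shrink the payload.
    head -c "$bytes" /dev/urandom > "$root/usr/local/bin/bigfile"
}

# Example: a 500 MiB payload for an extension called "my-extension".
#   make_sysext_tree my-extension ./tree $((500 * 1024 * 1024))
```

Using incompressible filler matters here: a file of zeros would compress to almost nothing in a squashfs image, so the image size would not reflect the payload size being tested.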

Test 1

# kairos-core-ubuntu-24.04 v3.1.2-6-g3f9c2330
root@localhost:/home/kairos# du -h /efi/EFI/kairos/active.efi.extra.d/my-extension.sysext.raw
512M    /efi/EFI/kairos/active.efi.extra.d/my-extension.sysext.raw
root@localhost:/home/kairos# free -h
               total        used        free      shared  buff/cache   available
Mem:           7.6Gi       2.9Gi       4.7Gi       2.6Gi       2.7Gi       4.6Gi
Swap:             0B          0B          0B
root@localhost:/home/kairos# du -h /usr/local/bin/bigfile
500M    /usr/local/bin/bigfile
root@localhost:/home/kairos# systemctl --version
systemd 255 (255.4-1ubuntu8.4)
+PAM +AUDIT +SELINUX +APPARMOR +IMA +SMACK +SECCOMP +GCRYPT -GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN +IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 -PWQUALITY +P11KIT +QRENCODE +TPM2 +BZIP2 +LZ4 +XZ +ZLIB +ZSTD -BPF_FRAMEWORK -XKBCOMMON +UTMP +SYSVINIT default-hierarchy=unified

(500Mb sysext loads fine with 8Gb RAM - systemd 255)

For the following tests I will remount the efi partition as rw, replace the sysext with a bigger one and reboot.

jimmykarily commented 2 months ago

Test 2:

Same setup, sysext size 1G, RAM 8G:

[image: screenshot not preserved]

It seems to be a limit of the firmware in qemu (or a systemd limitation) and not the RAM size.

jimmykarily commented 2 months ago

Test 3:

Installation on tumbleweed (systemd 256) shows some failures:

2024-09-11T14:53:04Z INF Encrypting COS_OEM
2024-09-11T14:53:07Z DBG running command args="--tpm2-public-key=/run/systemd/tpm2-pcr-public-key.pem --tpm2-public-key-pcrs=11 --tpm2-pcrs= --tpm2-signature=/run/systemd/tpm2-pcr-signature.json --tpm2-device-key=/run/systemd/tpm2-srk-public-key.tpm2b_public /dev/vda2"
2024-09-11T14:53:07Z DBG debug from cryptenroll output="Failed to find TPM2 pcrlock policy file 'pcrlock.json': No such file or directory\nLoaded 'libcryptsetup.so.12' via dlopen()\nAllocating context for crypt device /dev/vda2.\nTrying to open and read device /dev/vda2 with direct-io.\nInitialising device-mapper backend library.\nTrying to load LUKS2 crypt type from device /dev/vda2.\nCrypto backend (OpenSSL 3.1.4 24 Oct 2023 [default][legacy] [external libargon2]) initialized in cryptsetup library version 2.7.4.\nDetected kernel Linux 6.10.5-1-default x86_64.\nLoading LUKS2 header (repair disabled).\nAcquiring read lock for device /dev/vda2.\nOpening lock resource file /run/cryptsetup/L_254:2\nVerifying lock handle for /dev/vda2.\nDevice /dev/vda2 READ lock taken.\nTrying to read primary LUKS2 header at offset 0x0.\nOpening locked device /dev/vda2\nVerifying locked device handle (bdev)\nLUKS2 header version 2 of size 16384 bytes, checksum sha256.\nChecksum:c1a4a6a662b1dffa6e53624aa80f2efdc6a1410b015482e84c464d34ecc08bea (on-disk)\nChecksum:c1a4a6a662b1dffa6e53624aa80f2efdc6a1410b015482e84c464d34ecc08bea (in-memory)\nTrying to read secondary LUKS2 header at offset 0x4000.\nReusing open ro fd on device /dev/vda2\nLUKS2 header version 2 of size 16384 bytes, checksum sha256.\nChecksum:483c2aa287ca4771cd6b6c84480cc56cdab8b177182ad8345599a85ce64e886d (on-disk)\nChecksum:483c2aa287ca4771cd6b6c84480cc56cdab8b177182ad8345599a85ce64e886d (in-memory)\nDevice size 67108864, offset 16777216.\nDevice /dev/vda2 READ lock released.\nOnly 3 active CPUs detected, PBKDF threads decreased from 4 to 3.\nNot enough physical memory detected, PBKDF max memory decreased from 1048576kB to 576620kB.\nPBKDF argon2id, time_ms 2000 (iterations 0), max_memory_kb 576620, parallel_threads 3.\nRequesting JSON for token 0.\nRequesting JSON for token 1.\nRequesting JSON for token 2.\nRequesting JSON for token 3.\nRequesting JSON for token 4.\nRequesting JSON for token 5.\nRequesting JSON for token 
6.\nRequesting JSON for token 7.\nRequesting JSON for token 8.\nRequesting JSON for token 9.\nRequesting JSON for token 10.\nRequesting JSON for token 11.\nRequesting JSON for token 12.\nRequesting JSON for token 13.\nRequesting JSON for token 14.\nRequesting JSON for token 15.\nRequesting JSON for token 16.\nRequesting JSON for token 17.\nRequesting JSON for token 18.\nRequesting JSON for token 19.\nRequesting JSON for token 20.\nRequesting JSON for token 21.\nRequesting JSON for token 22.\nRequesting JSON for token 23.\nRequesting JSON for token 24.\nRequesting JSON for token 25.\nRequesting JSON for token 26.\nRequesting JSON for token 27.\nRequesting JSON for token 28.\nRequesting JSON for token 29.\nRequesting JSON for token 30.\nRequesting JSON for token 31.\nKeyslot 0 priority 1 != 2 (required), skipped.\nTrying to open LUKS2 keyslot 0.\nRunning keyslot key derivation.\nReading keyslot area [0x8000].\nAcquiring read lock for device /dev/vda2.\nOpening lock resource file /run/cryptsetup/L_254:2\nVerifying lock handle for /dev/vda2.\nDevice /dev/vda2 READ lock taken.\nReusing open ro fd on device /dev/vda2\nDevice /dev/vda2 READ lock released.\nVerifying key from keyslot 0, digest 0.\nLoaded 'libtss2-esys.so.0' via dlopen()\nLoaded 'libtss2-rc.so.0' via dlopen()\nLoaded 'libtss2-mu.so.0' via dlopen()\nFailed to read device key from file '/run/systemd/tpm2-srk-public-key.tpm2b_public': No such file or directory\nReleasing crypt device /dev/vda2 context.\nReleasing device-mapper backend.\nClosing read only fd for /dev/vda2.\n"
2024-09-11T14:53:07Z ERR Enrolling measurements error="exit status 1"
2024-09-11T14:53:07Z ERR could not encrypt partition: exit status 1

but the installation finishes. Eventually I get the same systemd-boot error with 10G of RAM and a 1G sysext.

Update: I used one of our released images to build the uki ISO and the result is a 2.9GB ISO. That's probably the issue and not the sysext. I will try to build something with systemd 256 that is small enough to boot.

Itxaka commented 2 months ago

Tumbleweed is a bit iffy if I recall my tests. I would recommend not testing in there. Even fedora 40 has some bugs so Ubuntu is the safe choice for uki

jimmykarily commented 2 months ago

I was trying to get a system with systemd 256 but then I realised we now have a package 🤦 I will try again with Ubuntu. I'm not sure how I ended up with systemd 255; I probably used the wrong image to build the uki iso.

jimmykarily commented 2 months ago

I ended up with 255 because we don't consume the package yet. I will do hacks to consume it and see.

jimmykarily commented 2 months ago

nope, we just build the systemd-boot files. @Itxaka what part of systemd 256 was expected to improve things and in which way? The only reliable flavor for uki isos is Ubuntu and that one doesn't have systemd 256, so I'm wondering, why is version 256 considered important here?

My tests so far have shown that with systemd 255 in qemu, we hit the firmware limit somewhere around 500Mb.

Itxaka commented 2 months ago

The loading I guess but maybe I screwed up and mixed versions? Or increased ram somehow and that made it work?

Maybe we should just ignore the systemd version and bump memory to see if the loading is linear (i.e. with double the RAM we can load a sysext double the size) so we can write that down?

jimmykarily commented 2 months ago

My example above shows that 500Mb sysext fit in 8G of RAM but 1G of sysext couldn't run in a 18Gb RAM vm, so it's not linear.

Itxaka commented 2 months ago

Oh well, that sucks. So maybe there is a max allocated to the EFI implementation that we can't go over.

Did you try an 800MB one? IIRC we had the same with EFI files, a max of 1GB, so it makes sense that it would be similar.

jimmykarily commented 2 months ago

I will try more sizes and record the findings here. In any case, these are all based on qemu so the actual numbers don't really matter. If there is a limit, that limit comes from the firmware, so its value will vary. We can only gather some data points here to know what to expect (e.g. is the limit the same as for the OS image?). Let's see.

jimmykarily commented 2 months ago

800Mb sysext loads fine (18Gb RAM, ubuntu 24.04 VM with systemd 255):

kairos@localhost:~$ du -h /usr/local/bin/bigfile
800M    /usr/local/bin/bigfile

I reduced the RAM to 8G and rebooted and it still loads. RAM doesn't seem to be the limiting factor here.

Update: 900Mb sysext doesn't load (fails with the usual error).
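
With one size known to load (800) and one known to fail (900), the remaining gap could be narrowed by bisection instead of guessing sizes. A minimal sketch follows; the function name `find_sysext_limit` and the probe command are hypothetical. The probe stands in for "build a sysext of this size, replace the old one, reboot the VM and report whether it came up", which is not automated in this thread.

```shell
#!/bin/sh
# Bisect between a size known to load ($1) and a size known to fail ($2),
# both in MB, calling the command named in $3 with each candidate size.
# The command must exit 0 if a sysext of that size boots, non-zero otherwise.
# Prints the largest size found to load, to within $4 MB (default 10).
find_sysext_limit() {
    good=$1
    bad=$2
    probe=$3
    step=${4:-10}
    while [ $((bad - good)) -gt "$step" ]; do
        mid=$(( (good + bad) / 2 ))
        if "$probe" "$mid"; then
            good=$mid
        else
            bad=$mid
        fi
    done
    echo "$good"
}

# Example, with a hypothetical boots_with command that runs one VM test:
#   find_sysext_limit 800 900 boots_with
```

Each probe costs a full reboot, so halving the range keeps the number of VM runs small. Any number this produces is only valid for the firmware and VM setup it was measured on.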

jimmykarily commented 2 months ago

Another test:

I created 2 system extensions 700Mb each. As shown above, each one would otherwise load fine. I gave the VM 10G of RAM (although 8G was enough to load a sysext of 800Mb). The VM didn't boot (got the regular error).

So the extension sizes add up. Splitting the files we want into separate extensions won't do the trick.

Itxaka commented 2 months ago

Umm, that makes sense. It generates a cpio archive on the fly, if I remember the code correctly, and passes it to the kernel/initramfs, which then unpacks it into the initramfs /.extra/ dir. So yeah, it can run out of memory if the sum of the sysext files is over the max allocated pages limit. Oh well, not much we can do other than the workaround if we reach that point. That one would be easy to implement, but we lose the measurements (or we do it manually?)

Actually we may be able to do the measurements of the sysextensions directly on immucore while we copy them from the EFI partition into the /run/systemd/extensions as we have a way of extending PCRs directly from immucore.....something to have in mind.
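
Since the sizes add up, a pre-flight check on the combined size of all extensions is more useful than a per-file check. A small sketch, assuming the images are `*.raw` files in one directory; the function names are hypothetical, and the budget is whatever limit was observed on the target firmware (this thread only shows 800 loading and 900 failing in one qemu setup).

```shell
#!/bin/sh
# Sum the sizes (in bytes) of all *.raw images in directory $1 and print it.
total_sysext_bytes() {
    dir=$1
    total=0
    for img in "$dir"/*.raw; do
        # Skip the literal pattern when the glob matches nothing.
        [ -f "$img" ] || continue
        size=$(wc -c < "$img")
        total=$((total + size))
    done
    echo "$total"
}

# Succeeds only if the combined size stays within a budget of $2 bytes.
sysexts_fit() {
    [ "$(total_sysext_bytes "$1")" -le "$2" ]
}

# Example: warn if everything in the stub's directory exceeds ~800 MiB.
#   sysexts_fit /efi/EFI/kairos/active.efi.extra.d $((800 * 1024 * 1024)) \
#       || echo "sysexts exceed the observed limit; consider the workaround"
```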

jimmykarily commented 1 month ago

closing as spike is done