kairos-io / kairos

The immutable Linux meta-distribution for edge Kubernetes.
https://kairos.io
Apache License 2.0
v2.4.0 grub error out of memory #1842

Closed Ognian closed 11 months ago

Ognian commented 1 year ago

After installing from kairos-standard-opensuse-leap-amd64-generic-v2.4.0-k3sv1.26.6+k3s1.iso onto /dev/mmcblk1 on an x86_64 board (latte panda 3 d), I immediately get the following grub error:

image
Ognian commented 1 year ago

same for kairos-standard-opensuse-tumbleweed-amd64-generic-v2.4.0-k3sv1.26.6+k3s1.iso

Itxaka commented 1 year ago

umm, this could be related to the gfx mode set by grub. You may need to set it lower manually, as we now set the gfxterm terminal to auto and it will try to get the highest mode available.

Maybe you can check with different gfxmode values?

Itxaka commented 1 year ago

https://www.gnu.org/software/grub/manual/grub/html_node/gfxmode.html

Itxaka commented 1 year ago

Seems like elementary also hit this at one point, which seems to confirm that this is a gfx issue: a really high gfx mode gets set, but the framebuffer is not big enough to display it. See https://github.com/elementary/installer/issues/542

Ognian commented 1 year ago

Trying to change gfxmode from auto to 640x480, but it is weird:

image

It does change something, but where exactly should the change be made? Or is it needed in multiple places?

And actually why does it work from the usb stick and not after installing? I thought that the grub config is identical...

Ognian commented 1 year ago

@Itxaka any news on this, any chance to be fixed in 2.4.1?

Itxaka commented 1 year ago

@Ognian unfortunately no. As this requires a change to grub default values, we needed to push 2.4.1 to fix some other issues before starting work on this, as it requires extensive testing to find a good default.

Ognian commented 1 year ago

Tested with 2.4.1 same issue! Noticed the following:

image
Itxaka commented 1 year ago

Wait, so this means you are able to boot by manually setting the gfxmode, right? But then on reboot it ignores it unless you set it manually?

Seems like we need to look for a safe default for the resolution

Those are just warnings being exposed. It happened before, but we were not logging them properly. It should not matter much; it's just nicer to have those fonts bundled :)

Ognian commented 1 year ago

I'll describe the process from the beginning:

  1. I'm downloading kairos-standard-opensuse-leap-amd64-generic-v2.4.1-k3sv1.26.6+k3s1.iso and burning it to a USB stick.
  2. I'm inserting the stick and booting from it (latte panda delta 3 -> x86_64 with built-in eMMC). The stick boots and I get the QR code.
  3. I'm using the webui (ip:8080) to install on the built-in eMMC (/dev/mmcblk1), pasting my cloud_config and checking reboot.
  4. When it restarts, I remove the USB stick so it tries to boot from the eMMC (sd card). Here the grub out of memory error appears.

The grub.cfg on the USB stick is much shorter than the one written by the installer on the eMMC (= sd card). The grub configuration on the USB stick always works; the one on the sd card never does.

I tried to modify the one on the sd card by inserting set gfxmode=640x480 at different places, it changes the behavior BUT none of the attempts lead to booting kairos...

AndreyNikiforov commented 1 year ago

Also faced out of memory (OOM) issues when trying to install on an old Acer Aspire1 laptop (4G RAM & mmc). Bisected to the loopback command, which causes the OOM. Suggestions on SO are to copy kernel & initrd from image to disk. No progress yet, as I am still learning grub...

AndreyNikiforov commented 1 year ago

Enabling debugging with set debug=all let me get past loopback, with a different error (not OOM). In the debug output I noticed that the tpm module is used, so I turned off TPM in BIOS and kairos started successfully. Although I am unblocked, it is not clear what the root cause was. If it was indeed the lack of memory and TPM use just pushed it over the limit, then reducing the memory footprint makes sense: use text mode by default, test with large images, etc.

Itxaka commented 1 year ago

I'll describe the process from the beginning:

1. I'm downloading `kairos-standard-opensuse-leap-amd64-generic-v2.4.1-k3sv1.26.6+k3s1.iso` and burning it to an usb stick

2. Im inserting the stick and booting from it (latte panda delta 3 -> x86_64 with build in eMMC). Stick is booting and I'm getting the qr code.

3. I'm using the webui (ip:8080) to install on the build in eMMC (/dev/mmcblk1), pasting my cloud_config and checking reboot

4. When it restarts, I remove the usb stick so it tries to boot from the eMMC (sd card). Here the out of memory error of grub appears

the grub.cfg on the USB stick is much shorter than the one written by the installer on the eMMC (= sd card). the grub configuration on the usb stick always works the one on the sd card never.

I tried to modify the one on the sd card by inserting set gfxmode=640x480 at different places, it changes the behavior BUT none of the attempts lead to booting kairos...

yep, this makes sense. Our grub.cfg for livecd does not set gfxmode, so it makes sense that you do not hit this in livecd/usb/live mode. It's only once you restart into the installed system that you hit this issue, as there we set gfxmode=auto
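A minimal sketch of pinning the mode instead of leaving it on auto, assuming the installed grub.cfg contains a line of the form set gfxmode=auto as described here. The helper name pin_gfxmode and the scratch file are hypothetical, the script relies on GNU sed's -i option, and whether a fixed mode actually avoids the error on a given machine is not established in this thread.

```shell
#!/bin/sh
# pin_gfxmode FILE MODE: rewrite every "set gfxmode=..." line in FILE to use MODE.
# Hypothetical helper for illustration; not part of Kairos or GRUB.
pin_gfxmode() {
    sed -i "s/^\([[:space:]]*\)set gfxmode=.*/\1set gfxmode=$2/" "$1"
}

# Demo on a throwaway file with made-up contents, not on a real grub.cfg.
cfg=$(mktemp)
printf 'set gfxmode=auto\ninsmod gfxterm\n' > "$cfg"
pin_gfxmode "$cfg" 640x480
cat "$cfg"
```

Running it against a copy of the real file first makes it easy to diff the result before touching the installed system.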

Let me test this somehow. Maybe I can make virtualbox reproduce it by setting the video card to a very low amount of ram or something similar....

Itxaka commented 1 year ago

Also faced out of memory (OOM) issues when trying to install on old Acer Aspire1 laptop (4G ram & mmc). Bisected to loopback command that cases OOM. Suggestions on SO are to copy kernel & initrc from image to disk. Don't have progress as I am still learning grub...

Enabling debugging with set debug=all let me pass through loopback - different error (not OOM). In debug I noticed that tpm module is used, so I turned off TPM in BIOS and kairos started successfully. Although I am unblocked, it is not clear what was the root cause. If it was indeed the lack of memory and TPM use just crossed a bar, then reducing memory foot print makes sense: use text mode by default, test with large images etc

very weird, 4GB of RAM should be more than enough for everything to load with no issues; after all, the kernel and initrd can't be more than 200MB in any of the flavors....

Wondering if it's due to the modules or the gfx stuff in your case as well....

Ognian commented 1 year ago

So I disabled TPM from BIOS (Thanks @AndreyNikiforov !) I did a clean install of 2.4.1 from USB. On first boot of the internal eMMC:

image image

pressed a key, booting continues

image image image

The above errors don't look scary to see... After this it looks like it works...

Itxaka commented 1 year ago

Some comments found going through the grub bugtracker:

Finally I found a comment regarding the screen size and GRUB. Apparently the 4k graphics size eats half the available 200MB RAM from GRUBs allotment. Thus any initrd.img larger than 100MB won't load.

Looks like TPM module is indeed involved! https://github.com/rhboot/grub2/pull/102

So https://github.com/rhboot/grub2/commit/635f85b016839b9aaecdecee69a2ee98edb3e0ab was supposed to allow initrds to be allocated over 4GB. However, initrds are also being verified by the verifiers framework, or rather the tpm "verifier" measures them this way.

This causes the verifiers framework to read the entire file into memory first using standard memory allocation to verify it and then release it again before our allocator gets a chance to load the size and allocate it. This is um bad.

So it makes sense that disabling tpm makes it work, as grub then doesn't try to fully load the initrd into memory to measure it.

So it seems to be a mix of several things:

Have to think about this and check further in upstream grub to see if this has been fixed somewhere, but good catch folks.

Thanks @Ognian for reporting this and @AndreyNikiforov for the hint with the TPM. This would have been a nightmare to track down otherwise!

Itxaka commented 1 year ago

Our kernel on core images is around 13MB and our initrd on core images is around 92/96MB.

It kind of makes sense that we go over that mentioned 100MB by setting the gfx mode to auto if it chooses a very high resolution....
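The arithmetic behind that can be sketched as follows. The 200MB allotment and the "half" taken by a 4k mode come from the quoted bug-tracker comment, and the kernel and initrd sizes from the figures above; none of them are measured here. Counting the kernel against the same remaining budget as the initrd is an assumption made for illustration, and the fits helper is hypothetical.

```shell
#!/bin/sh
# All figures are in MB and are taken from the thread, not measured.
grub_budget=200                  # GRUB allotment from the quoted bug-tracker comment
gfx_share=$((grub_budget / 2))   # "half" used by a 4k graphics mode, per that comment
kernel=13                        # approximate core image kernel size
initrd=96                        # upper figure given for the core image initrd

remaining=$((grub_budget - gfx_share))
needed=$((kernel + initrd))      # assumption: both count against the same budget

# fits NEEDED AVAILABLE: succeeds when NEEDED is at most AVAILABLE.
fits() {
    [ "$1" -le "$2" ]
}

if fits "$needed" "$remaining"; then
    echo "fits: ${needed}MB needed, ${remaining}MB left"
else
    echo "too big: ${needed}MB needed, ${remaining}MB left"
fi
```

With these numbers the initrd alone (96MB) would still fit under 100MB, which is consistent with the problem only showing up in some configurations.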

Itxaka commented 1 year ago

Moving to compressing the initramfs with zstd would gain us 4 extra MB, which is not much, but it's good enough to breathe I guess

@Ognian does this happen with a non-k3s build? If it also happens, are you able to build a custom image with the --zstd flag on initrd creation to see if it alleviates the issue?

The patch is as follows, it's just 1 line:

diff --git a/Earthfile b/Earthfile
index b22b8c8..61eb545 100644
--- a/Earthfile
+++ b/Earthfile
@@ -441,7 +441,7 @@ base-image:
       IF [ -e "/usr/bin/dracut" ]
           # Regenerate initrd if necessary
           RUN --no-cache kernel=$(ls /lib/modules | head -n1) && depmod -a "${kernel}"
-          RUN --no-cache kernel=$(ls /lib/modules | head -n1) && dracut -f "/boot/initrd-${kernel}" "${kernel}" && ln -sf "initrd-${kernel}" /boot/initrd
+          RUN --no-cache kernel=$(ls /lib/modules | head -n1) && dracut --zstd -f "/boot/initrd-${kernel}" "${kernel}" && ln -sf "initrd-${kernel}" /boot/initrd
       END
     END

And then simply run earthly +iso --FLAVOR=opensuse-leap --VARIANT=standard --K3S_VERSION=v1.26.6 to generate an iso under build

Itxaka commented 1 year ago

umm booting from master in 4k doesn't reproduce the issue, even with tpm. I'm wondering if it's a tpm implementation issue rather than a grub one. We don't ship the tpm module with grub as a module, so I'm not sure if it's integrated into grub directly.

I think we need to rework the grub.cfg to not load gfxterm for now unless it's needed, as it's giving us a lot of headaches.

jimmykarily commented 1 year ago

We dropped gfxterm here: https://github.com/kairos-io/packages/pull/473 . Please give it a try. If the problem still occurs, feel free to re-open.

jeffmhastings commented 1 year ago

I'm running into the same problem using kairos-standard-ubuntu-22-lts-amd64-generic-v2.4.1-k3sv1.27.3+k3s1.iso. I also built from master, thinking that would pull in the changes from https://github.com/kairos-io/packages/pull/473 (and I think it did, because my grub.cfg is now missing all the gfx stuff), but I get the same result. I didn't have success disabling TPM either.

Edit: Disabling TPM and reinstalling gave me the same results as @Ognian (can't find regexp, boots after pressing a key). Anyway I'd definitely like to see this issue resolved (ideally without disabling TPM) so let me know if there's anything I can do to help.

mevatron commented 1 year ago

Just as another data point, I'm testing on a Microsoft Surface Pro 7+ with the latest ubuntu-20.04-v2.4.2 and am still seeing the grub OOM. If there's anything I can help test, I'd be happy to!

jimmykarily commented 1 year ago

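Up to now it seems that to reproduce this issue one needs:

  • gfxmode set to auto
  • a 4k monitor (to make the above use a high resolution). Maybe 2k will also trigger it, not sure
  • a TPM chip on the machine
  • uefi booting

We are still missing something, though, because @Itxaka tried the above combination and couldn't reproduce it. His test was on qemu with virtual monitors, so maybe that's the reason (although grub thought the resolution was 4k).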
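Two of the conditions discussed in this thread, UEFI booting and the presence of a TPM chip, can be checked from a running Linux system (for example the live USB). This is a rough sketch under the assumption that the kernel exposes /sys/firmware/efi when booted via UEFI and /sys/class/tpm/tpm0 when a TPM driver is bound. The function names and the optional ROOT argument are hypothetical, and a TPM that is disabled in firmware will simply not show up.

```shell
#!/bin/sh
# Both helpers take an optional ROOT prefix so they can be pointed at a fake
# directory tree; with no argument they inspect the real /sys.

# booted_uefi [ROOT]: succeeds when the EFI firmware directory exists.
booted_uefi() {
    [ -d "${1:-}/sys/firmware/efi" ]
}

# has_tpm [ROOT]: succeeds when the first TPM device node is present in sysfs.
has_tpm() {
    [ -e "${1:-}/sys/class/tpm/tpm0" ]
}

if booted_uefi; then echo "uefi: yes"; else echo "uefi: no"; fi
if has_tpm; then echo "tpm: yes"; else echo "tpm: no"; fi
```

Including the two lines of output alongside a report would make it easier to compare affected and unaffected machines.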

mevatron commented 1 year ago

Up to now it seems that to reproduce this issue one needs:

  • gfxmode set to auto
  • a 4k monitor (to make the above use a high resolution). Maybe 2k will also trigger it, not sure
  • a TPM chip on the machine
  • uefi booting

I've looked for a way to disable TPM on the Surface Pro, but I don't think that is an available setting in its boot menu. What's the best way to test setting the gfxmode to a lower resolution in Kairos?

jimmykarily commented 1 year ago

I would try this (warning: not tested):

Hopefully that should set the gfxmode on the installed system's grub. You can of course check after installation by editing the grub menu again and looking for that option.

tyzbit commented 1 year ago

I know you said to use the live CD but I rebooted a node and tried running videoinfo in the GRUB prompt, it said the command was not found. I tried different combinations of set gfxmode= and set gfxpayload= in the custom one-time GRUB options and none of them prevented the error. It also seemed like none of them changed the video. For what it's worth, here's my config

mevatron commented 12 months ago

I noticed that videoinfo wasn't on the Kairos grub menu as well, but I downloaded the Ubuntu Server 22.04 ISO and that seemed to do the trick.

Unfortunately, lowering the resolution didn't work for me either =/

image

jimmykarily commented 12 months ago

@santhoshdaivajna sent me on Slack that they are seeing the same issue on Intel NUC with 8 cpu/32G mem/>500G disk . We may be able to get access to a NUC to debug.

mudler commented 12 months ago

this reminds me of https://bugs.launchpad.net/oem-priority/+bug/1842320/comments/125 - did we try setting gfxmode to 640x480?

mudler commented 12 months ago

also: https://bugs.launchpad.net/oem-priority/+bug/1842320

mudler commented 12 months ago

maybe it's just the GRUB version causing issues here? @Ognian is that new to 2.4? we could cross check the GRUB versions to see if that's causing it

Itxaka commented 12 months ago

We think that the tumbleweed grub efi binary is responsible for this, and have reverted the change to use the leap one in https://github.com/kairos-io/packages/pull/553

mevatron commented 11 months ago

@Itxaka Thanks for looking into this! Will this also help the ubuntu flavors, or is this specific to opensuse?

Itxaka commented 11 months ago

Should be for all, as we use the same grub artifacts for all of them

Ognian commented 11 months ago

maybe it's just the GRUB version causing issues here? @Ognian is that new to 2.4? we could cross check the GRUB versions to see if that's causing it

yes, this was newly introduced with 2.4. I just tested with 2.4.1 and upgraded to 2.4.2, and the result with 2.4.2 is the same as with 2.4.1 and 2.4.0:

  • with TPM -> out of memory error
  • without TPM -> boots OK

The last version I tested where it worked was v2.2.1.

The version I have now is KAIROS_PRETTY_NAME="kairos-standard-opensuse-leap-15.5 v2.4.2-k3sv1.28.2+k3s1", and sudo grub2-install --version reports grub2-install (GRUB2) 2.06.

Hope this helps. Ognian
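
When comparing reports across releases, the Kairos version can be pulled out of a KAIROS_PRETTY_NAME string like the one quoted above. This is only a sketch: the helper name kairos_version is hypothetical, and it assumes the string has the shape shown in this comment (a space, then vX.Y.Z, then a dash).

```shell
#!/bin/sh
# kairos_version STRING: print the vX.Y.Z component that follows a space and
# precedes a dash in a pretty-name string. Prints nothing if the shape differs.
kairos_version() {
    printf '%s\n' "$1" | sed -n 's/.* \(v[0-9][0-9.]*\)-.*/\1/p'
}

# Value copied from the comment above.
pretty='kairos-standard-opensuse-leap-15.5 v2.4.2-k3sv1.28.2+k3s1'
kairos_version "$pretty"
```

The GRUB version still has to be read separately, for example from the grub2-install --version output mentioned in the same comment.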

mevatron commented 11 months ago

Unfortunately, the Surface Pro 7+ doesn't allow disabling TPM 😕 Is my next option to switch dracut to hostonly=yes maybe? @Ognian are you running the grub2-install inside the new container there?

Thanks!

Ognian commented 11 months ago

@Ognian are you running the grub2-install inside the new container there?

Yes

mevatron commented 11 months ago

this reminds me https://bugs.launchpad.net/oem-priority/+bug/1842320/comments/125 - did we tried setting up gfxmode to 640x480 ?

@mudler I've tried gfxmode=640x480x32 and gfxpayload=640x480x32, but unfortunately it didn't alleviate the OOM errors. I've also tried building from source with @Itxaka's zstd recommendation, which apparently wasn't enough either. However, my builds from source + Auroraboot do not seem to change the resolution when I adjust grub settings via cloud_init, the way the official Kairos images do. So maybe a combination will work if I can get the source builds working 🤔

mevatron commented 11 months ago

Just tested @alexander-bauer 's workaround of rmmod tpm on ubuntu-20.04 and it does indeed allow my system to boot, so seems to be related to TPM for me as well.

mevatron commented 11 months ago

@alexander-bauer I found an option that is a bit more robust to remove the tpm module from the grub.cfg.

Create a Dockerfile:

Pick your favorite Kairos image (e.g., ubuntu:20.04).

FROM quay.io/kairos/ubuntu:20.04-standard-amd64-generic-v2.4.2-k3sv1.28.2-k3s1

RUN sed -i '/insmod regexp/a rmmod tpm' /etc/cos/grub.cfg

Build the image:

docker build -t tpm2workaround -f Dockerfile .

Deploy with auroraboot:

For example, generate an ISO:

docker run --rm -ti \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v $(pwd)/config.yaml:/config.yaml \
  -v $(pwd)/build:/tmp/auroraboot \
  quay.io/kairos/auroraboot \
  --set "container_image=docker://tpm2workaround" \
  --set "disable_http_server=true" \
  --set "disable_netboot=true" \
  --set "state_dir=/tmp/auroraboot" \
  --cloud-config /config.yaml

@Itxaka or @mudler might know of an easier way to override this using one of the cloud-init stages. I tried after-install-chroot and before-install, but neither of those seemed to work.

Hope that helps until we get a more permanent fix!
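
Before baking it into an image, the sed expression from the Dockerfile can be tried on a scratch file to see what it does. This is a sketch only: the file contents below are made up and merely stand in for /etc/cos/grub.cfg, and the one-line form of the a (append) command relies on GNU sed.

```shell
#!/bin/sh
# Scratch file standing in for /etc/cos/grub.cfg; the contents are invented.
cfg=$(mktemp)
printf 'insmod regexp\nset timeout=5\n' > "$cfg"

# Same expression as in the Dockerfile: append a "rmmod tpm" line after every
# line that matches "insmod regexp" (GNU sed syntax).
sed -i '/insmod regexp/a rmmod tpm' "$cfg"

cat "$cfg"
```

Note that if "insmod regexp" appears more than once in the real file, the rmmod line is added after each occurrence.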

Itxaka commented 11 months ago

Could also try the rc3 that we released yesterday to see if it fixes it, as we reverted the grub.efi to a different one which used to work!

mudler commented 11 months ago

image

Here I can reproduce it as well with rc3 and VirtualBox (ubuntu image: kairos-ubuntu-22.04-standard-amd64-generic-v2.4.3-rc3-k3sv1.28.2+k3s1.iso)

mudler commented 11 months ago

seems it was just me - recreating the VM with more RAM did the trick

mevatron commented 11 months ago

Could also try the rc3 that we released yesterday to see if it fixes it, as we reverted the grub.efi to a different one which used to work!

I tested with quay.io/kairos/ubuntu:20.04-standard-amd64-generic-v2.4.3-rc3-k3s1.28.2-1, and that worked for the Surface Pro 7+! Many thanks @Itxaka!