kairos-io / kairos

:penguin: The immutable Linux meta-distribution for edge Kubernetes.
https://kairos.io
Apache License 2.0

debian: Unable to boot Kairos installer #2522

Open 6ixfalls opened 1 month ago

6ixfalls commented 1 month ago

Kairos version:

Fails to boot on kairos-debian-bookworm-standard-amd64-generic-v3.0.8-k3sv1.29.3+k3s1, success on kairos-debian-bookworm-standard-amd64-generic-v3.0.0-k3sv1.29.0+k3s1

CPU architecture, OS, and Version:

Linux localhost 6.1.0-18-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.76-1 (2024-02-01) x86_64 GNU/Linux

(output from v3.0.0)

Describe the bug

The Kairos ISO is unable to boot and I'm unable to install Kairos (manually and automatically).

To Reproduce

Try to install Kairos on the latest version, not sure if this is reproducible. This is running in a KVM VM.

Expected behavior

I should be able to boot into the Kairos install ISO.

Logs

[screenshot]

Additional context

This bug looks exactly like #2467, but trying the fix there and adding that to a Dockerfile doesn't resolve the issue.

ci-robbot commented 1 month ago

Hello, 6ixfalls! I'm an automated bot assisting with GitHub issue audits in the kairos project. I've added the 'question' label to your issue (#2522) because it appears we need more information to properly investigate your report.

To enhance our understanding and help us better address your problem, please provide:

Please ensure that your description, steps to reproduce, and version details are explicitly mentioned in your issue. We appreciate your efforts to help us improve Kairos, and don't hesitate to reach out if you have any questions. Note that I am a bot, an experiment of @mudler and @jimmykarily.

Thanks! kairos-io Githubbot

6ixfalls commented 1 month ago

This could be related, but I'm using a custom docker image with auroraboot to generate an ISO. The Dockerfile is here: https://github.com/6ixfalls/taonet-cloud/blob/main/containers/kairos-debian/Dockerfile

It also appears this issue was introduced between 3.0.0 and 3.0.3 - this appears to be a fix to the issue: https://github.com/tyzbit/kairos-distros/commit/e11addab610b5e01f2c81c6610b62841fbf1a20f

tyzbit commented 1 month ago

A note: that was an attempted fix. It didn't fix it for me on 3.0.3 but I didn't try other versions.

jimmykarily commented 1 month ago

Maybe this is relevant? https://github.com/kairos-io/packages/blob/718aaa27e4688559433cd889513f1944a7679ef4/packages/static/kairos-overlay-files/files/system/oem/12_nvidia.yaml#L10

oh wait, you are not on nvidia. On the other hand, maybe that module needs to be included somehow (?).

maybe not that irrelevant after all: https://forums.fedoraforum.org/showthread.php?325865-dracut-FATAL-iscsi-requested-but-kernel-initrd-does-not-support-iscsi

you could try to omit iscsi in dracut to see if this helps
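
For anyone wanting to try this suggestion, here is a minimal sketch of one way to omit the iscsi module, assuming the image regenerates its initramfs with dracut. The helper name `write_omit_iscsi_conf` and the drop-in file name `99-omit-iscsi.conf` are made up for illustration; `omit_dracutmodules` is the standard dracut configuration key.

```shell
# Hypothetical helper: write a dracut drop-in that omits the iscsi module.
# $1 is the dracut config directory (normally /etc/dracut.conf.d).
write_omit_iscsi_conf() {
    conf_dir="$1"
    mkdir -p "$conf_dir" || return 1
    # dracut expects the module list to be space-padded inside the quotes.
    printf 'omit_dracutmodules+=" iscsi "\n' > "$conf_dir/99-omit-iscsi.conf"
}

# Inside the image build (as root) one would then run something like:
#   write_omit_iscsi_conf /etc/dracut.conf.d
#   dracut --force --regenerate-all
```

Omitting the module only hides the dracut check. It does not add iSCSI support to the initramfs, so anything that needs iSCSI at boot would still be affected.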

6ixfalls commented 1 month ago

> Maybe this is relevant? https://github.com/kairos-io/packages/blob/718aaa27e4688559433cd889513f1944a7679ef4/packages/static/kairos-overlay-files/files/system/oem/12_nvidia.yaml#L10
>
> oh wait, you are not on nvidia. On the other hand, maybe that module needs to be included somehow (?).
>
> maybe not that irrelevant after all: https://forums.fedoraforum.org/showthread.php?325865-dracut-FATAL-iscsi-requested-but-kernel-initrd-does-not-support-iscsi
>
> you could try to omit iscsi in dracut to see if this helps

I'm not sure if this is how to correctly do it, but I tried this configuration and it did not fix the issue.

tyzbit commented 1 month ago

If this does what I suspect, this would break compatibility with at least Longhorn. Can we see what it takes for the kernel to support iscsi?

mudler commented 3 weeks ago

It looks like we need to disable iscsi as we do already for nvidia: https://github.com/kairos-io/kairos/blob/f5c105009a4df27ee3843bc49167eebc29f19bc7/images/Dockerfile.nvidia#L101

Itxaka commented 3 weeks ago

It looks like the iscsi modules are not properly set up in the initramfs: the dracut failure indicates that it is checking for the iscsi_tcp module to be available.

You could try installing iscsiuio alongside and regenerating the initramfs, as that seems to bring in the iscsi_tcp module that dracut needs.

I'm going to try a quick test here, but I can already see that once that package is installed, the modules are available and iscsi is added to the dracut modules.

What cmdline are you using?
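
A quick way to check whether the module is on disk at all is to look for it under the kernel's modules directory. This is only a sketch: the helper name `has_kernel_module` is made up, and the kernel version in the example is the one reported at the top of this issue, which may not match the kernel in a custom image.

```shell
# Return 0 if a kernel module named $3 exists for kernel version $2
# under the modules root $1 (normally /lib/modules).
# Module files may be compressed (.ko, .ko.xz, .ko.zst), hence the glob.
has_kernel_module() {
    mod_dir="$1/$2"
    [ -d "$mod_dir" ] || return 1
    found=$(find "$mod_dir" -type f -name "$3.ko*" | head -n 1)
    [ -n "$found" ]
}

# Example, run inside the image:
#   has_kernel_module /lib/modules 6.1.0-18-amd64 iscsi_tcp && echo "iscsi_tcp found"
```

Inside a container, `uname -r` reports the host kernel, so it is safer to pass the version of the kernel actually installed in the image.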

Itxaka commented 3 weeks ago

With a quick patch to install the package alongside Kairos and let dracut regenerate the initramfs, the proper module is available and loaded:

[screenshot]

athnoc-dev commented 3 weeks ago

I can confirm that customizing the Debian image (the only one I tested) from v3.0.0 and up produces the dracut "iscsi error". At first I followed this doc: https://kairos.io/docs/advanced/customizing/. Then I used this Dockerfile (https://github.com/kairos-io/kairos/blob/master/images/Dockerfile.kairos-debian) to rebuild the image from scratch while adding the packages I needed. The iscsi error from dracut still appeared. After that I added the "iscsiuio" package, and net booting with AuroraBoot worked... the first time.

The second time I launched AuroraBoot and tried to net boot the server, it gave me the same error. I inspected the temp directory to which AuroraBoot extracts the ISO; the /netboot directory contains all the net boot artifacts. I inspected the kernel file there and compared it to the kernel files in the ISO (which are unpacked in the temp directory).

I found that the net boot kernel (kairos-kernel) was the oldest kernel file rather than the most recent one, which is why it did not contain the iscsi module that dracut complains is missing during net boot. I copied the latest kernel over it, used the other artifacts in /tmp/netboot to start pixiecore, and everything worked as expected.

It looks like AuroraBoot is (sometimes) picking the wrong kernel for booting. Can you confirm?
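
To illustrate the "oldest instead of newest" observation, here is a small sketch of picking the highest-versioned kernel image from a directory. It assumes kernels are named `vmlinuz-<version>`, which may not match the layout of the unpacked ISO. The helper name `newest_kernel` is made up, and this is not how AuroraBoot itself selects a kernel.

```shell
# Print the path of the highest-versioned kernel image in directory $1.
# Assumes files are named vmlinuz-<version>; uses `sort -V` (version sort)
# so that, for example, 6.1.0-18 ranks above 6.1.0-9.
newest_kernel() {
    for f in "$1"/vmlinuz-*; do
        # Skip the unexpanded pattern when nothing matches.
        [ -e "$f" ] && printf '%s\n' "$f"
    done | sort -V | tail -n 1
}

# Example (the directory is a placeholder):
#   newest_kernel /path/to/unpacked-iso/boot
```

A plain lexical sort would rank `6.1.0-9` above `6.1.0-18`, which is one way a tool could end up with an older kernel than intended.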

jimmykarily commented 1 week ago

Let's install iscsiuio by default (all flavors?) so that it makes it to the initramfs.
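
Whether the module actually makes it into the initramfs can be checked from a file listing, for example the output of dracut's `lsinitrd` tool. The sketch below only filters such a listing. The helper name `listing_has_iscsi_tcp` and the initrd path in the example are assumptions.

```shell
# Read an initramfs file listing on stdin (for example from `lsinitrd <initrd>`)
# and succeed if it mentions the iscsi_tcp kernel module.
listing_has_iscsi_tcp() {
    grep -q 'iscsi_tcp\.ko'
}

# Example (the initrd path is a placeholder):
#   lsinitrd /boot/initrd.img-6.1.0-18-amd64 | listing_has_iscsi_tcp \
#       && echo "iscsi_tcp is in the initramfs"
```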

tyzbit commented 1 week ago

I tried that and it did not seem to help: https://github.com/tyzbit/kairos-distros/commit/e11addab610b5e01f2c81c6610b62841fbf1a20f. It strongly seems to be an AuroraBoot issue.

mauromorales commented 1 week ago

let's try to replicate in auroraboot and see if we can detect what the issue actually is

athnoc-dev commented 1 week ago

Check which kernel AuroraBoot is using in /tmp/netboot

In my case the errors persisted because an older kernel was used, instead of the latest that had the supporting iscsi modules.

I copied the latest kernel from the temp directory (the unpacked ISO) and replaced the kernel file and all worked fine.
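
The manual workaround described here could be scripted roughly as follows. This is a sketch only: the helper name `refresh_netboot_kernel` is made up, and the source path in the example is a placeholder for wherever the newest kernel sits in the unpacked ISO.

```shell
# Replace the netboot kernel ($2) with the given kernel ($1),
# but only when their contents differ.
refresh_netboot_kernel() {
    src="$1"
    dst="$2"
    if cmp -s "$src" "$dst"; then
        echo "up to date"
    else
        cp "$src" "$dst" && echo "replaced"
    fi
}

# Example (the source path is a placeholder):
#   refresh_netboot_kernel /path/to/unpacked-iso/boot/vmlinuz /tmp/netboot/kairos-kernel
```

If the function prints "replaced", the netboot directory held a different kernel than the one expected, which matches the stale-kernel symptom described above.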

jimmykarily commented 3 days ago

~/workspace/kairos/kairos (master)*$ git diff
diff --git a/images/Dockerfile.debian b/images/Dockerfile.debian
index 39d94482..07862509 100644
--- a/images/Dockerfile.debian
+++ b/images/Dockerfile.debian
@@ -64,6 +64,7 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
     iputils-ping \
     isc-dhcp-common \
     isc-dhcp-client \
+    iscsiuio \
     jq \
     krb5-locales \
     less \
@@ -162,4 +163,4 @@ RUN systemctl enable systemd-networkd
 RUN systemctl enable ssh

 # Fixup sudo perms
-RUN chown root:root /usr/bin/sudo && chmod 4755 /usr/bin/sudo
\ No newline at end of file
+RUN chown root:root /usr/bin/sudo && chmod 4755 /usr/bin/sudo
diff --git a/images/Dockerfile.kairos-debian b/images/Dockerfile.kairos-debian
index 60c85c1d..3391363c 100644
--- a/images/Dockerfile.kairos-debian
+++ b/images/Dockerfile.kairos-debian
@@ -63,6 +63,7 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
     iputils-ping \
     isc-dhcp-common \
     isc-dhcp-client \
+    iscsiuio \
     jq \
     krb5-locales \
     less \

docker run --rm -ti -v /var/run/docker.sock:/var/run/docker.sock --net host quay.io/kairos/auroraboot --set "container_image=docker://quay.io/kairos/debian:bookworm-slim-core-amd64-generic-v3.0.4-73-g8ddb9092-dirty"

It successfully boots debian.

Since the docker command I used to run AuroraBoot didn't mount any volumes, it can't have cached any data between runs. @tyzbit, how are you running AuroraBoot? @athnoc-dev's suggestion makes me think that some people might be using a command (from our docs?) that mounts a volume and caches things. Is that the case?

6ixfalls commented 2 days ago

> Since the docker command I used to run Auroraboot didn't mount any volumes, it's not possible to have cached any data between runs. tyzbit how are you running Auroraboot? athnoc-dev suggestion makes me think that some people might be using some command (from our docs?) that is mounting a volume and caches things. Is that the case?

This is true in my case: I use AuroraBoot to generate ISOs to upload to my Kairos nodes, so I have a mount in order to access the completed ISO. I don't think it should be expected behavior for AuroraBoot to skip generating a new kernel when an existing one is present, but I'm also not sure whether reusing the same directory for building has any effect on the speed of the builds.

I'm actually not too sure this is a kernel issue, because as far as I remember the issue also occurs with a fresh AuroraBoot install. However, another thing that appears to be common among everyone who has the issue is that the Kairos Dockerfile is modified. (Is it possible that GitHub Actions caching the Docker build steps leads to this issue?)
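
One way to rule out stale artifacts from a mounted directory is to empty it before every run. The sketch below assumes the host directory is only used for AuroraBoot output. The helper name `reset_build_dir` and the container-side mount path in the example are assumptions, not something confirmed in this thread.

```shell
# Empty (or create) the host directory that gets mounted into the
# AuroraBoot container, so no kernel or initrd from a previous run survives.
reset_build_dir() {
    dir="$1"
    # Refuse obviously dangerous arguments.
    [ -n "$dir" ] && [ "$dir" != "/" ] || return 1
    rm -rf "$dir" && mkdir -p "$dir"
}

# Example (the mount target inside the container is an assumption):
#   reset_build_dir "$PWD/build"
#   docker run --rm -ti -v /var/run/docker.sock:/var/run/docker.sock \
#       -v "$PWD/build":/tmp/auroraboot quay.io/kairos/auroraboot ...
```

For the layer-cache question, `docker build --no-cache` forces every build step to run again, which would show whether a cached layer is involved.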