kairos-io / kairos

:penguin: The immutable Linux meta-distribution for edge Kubernetes.
https://kairos.io
Apache License 2.0

Fedora 38 gets stuck in boot when provided through auroraboot netboot #1867

Open unusualfor opened 11 months ago

unusualfor commented 11 months ago

Kairos version:
NAME="Fedora Linux"
VERSION="38 (Container Image)"
ID=fedora
VERSION_ID=38
VERSION_CODENAME=""
PLATFORM_ID="platform:f38"
PRETTY_NAME="Fedora Linux 38 (Container Image)"
ANSI_COLOR="0;38;2;60;110;180"
LOGO=fedora-logo-icon
CPE_NAME="cpe:/o:fedoraproject:fedora:38"
DEFAULT_HOSTNAME="fedora"
HOME_URL="https://fedoraproject.org/"
DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora/f38/system-administrators-guide/"
SUPPORT_URL="https://ask.fedoraproject.org/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Fedora"
REDHAT_BUGZILLA_PRODUCT_VERSION=38
REDHAT_SUPPORT_PRODUCT="Fedora"
REDHAT_SUPPORT_PRODUCT_VERSION=38
SUPPORT_END=2024-05-14
VARIANT="Container Image"
VARIANT_ID=container
KAIROS_NAME="kairos-kairos-fedora"
KAIROS_VERSION="v2.4.0-k3sv1.27.3+k3s1"
KAIROS_ID="kairos"
KAIROS_ID_LIKE="kairos-kairos-fedora"
KAIROS_VERSION_ID="v2.4.0-k3sv1.27.3+k3s1"
KAIROS_PRETTY_NAME="kairos-kairos-fedora v2.4.0-k3sv1.27.3+k3s1"
KAIROS_BUG_REPORT_URL="https://github.com/kairos-io/kairos/issues"
KAIROS_HOME_URL="https://github.com/kairos-io/kairos"
KAIROS_IMAGE_REPO="quay.io/kairos/kairos-fedora"
KAIROS_IMAGE_LABEL=""
KAIROS_GITHUB_REPO="kairos-io/provider-kairos"
KAIROS_VARIANT="kairos"
KAIROS_FLAVOR="fedora"

CPU architecture, OS, and Version: Linux immutable-01c2 6.4.15-200.fc38.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Sep 7 00:25:01 UTC 2023 x86_64 GNU/Linux

Describe the bug When netbooting the official Fedora 38 image (Kairos 2.4.0) with auroraboot and a custom cloud-config, the final host gets stuck during boot. Keywords from the console: "Unable to fix SELinux security context of /run", "Failed to mount cgroup at /sys/fs/cgroup/systemd", "Failed to mount API filesystems".

To Reproduce

docker run -v "$PWD"/cloud-config.yaml:/cloud-config.yaml --rm -ti --net host quay.io/kairos/auroraboot \
        --set "artifact_version=v2.4.0-k3sv1.27.3+k3s1" \
        --set "release_version=v2.4.0" \
        --set "flavor=fedora-amd64-generic" \
        --set "repository=kairos-io/provider-kairos" \
        --cloud-config /cloud-config.yaml

cloud-config.yaml (already includes the workaround for https://github.com/kairos-io/kairos/issues/1841)

#cloud-config

hostname: immutable-{{ trunc 4 .MachineID }}
install:
    partitions:
        recovery:
            size: 6000
        state:
            size: 9000
    passive:
        size: 2500
    recovery-system:
        size: 2500
    system:
        size: 2500
k3s:
    args:
        - --cluster-cidr="192.168.0.0/16"
        - --service-cidr="10.43.0.0/16"
        - --bind-address="10.3.12.19"
        - --node-ip="10.3.12.19"
        - --disable=traefik,servicelb
        - --disable-network-policy
        - --flannel-backend=none
        - --secrets-encryption
        - --cluster-init
        - --default-local-storage-path=/var/log/local_path
    enabled: true
reset:
    partitions:
        recovery:
            size: 6000
        state:
            size: 9000
    passive:
        size: 2500
    recovery-system:
        size: 2500
    system:
        size: 2500
stages:
    boot:
        - files:
            - content: |
                chmod go-r /etc/rancher/k3s/k3s.yaml
                source <(kubectl completion bash)
                alias k=kubectl
                complete -F __start_kubectl k
                export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
                export ETCDCTL_ENDPOINTS='https://127.0.0.1:2379/'
                export ETCDCTL_CACERT='/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt'
                export ETCDCTL_CERT='/var/lib/rancher/k3s/server/tls/etcd/server-client.crt'
                export ETCDCTL_KEY='/var/lib/rancher/k3s/server/tls/etcd/server-client.key'
                export ETCDCTL_API=3
              group: 0
              owner: 0
              path: /etc/bashrc.d/k3s_bashrc
              permissions: 440
        - commands:
            - echo 'source /etc/bashrc.d/k3s_bashrc' >> /etc/bashrc
          name: Setup bashrc
    initramfs:
        - files:
            - content: "[Match]\nName={{ k3s_interface }}\n\n[Network]\nAddress=10.3.12.19/22\nGateway=10.3.12.1\nDNS=10.150.120.74        \n"
              path: /etc/systemd/network/01-man.network
              permissions: 420
        - files:
            - content: |
                mirrors:
                    docker.io:
                        endpoint:
                            - "https://registry.local/"
                        rewrite:
                            "^rancher/(.*)": "xran-images/k3s/$1"
                configs:
                    "registry.local":
                        auth:
                            username: repouser
                            password: REDACTED
                        tls:
                            ca_file: /etc/ssl/certs/nginx-selfsigned.crt
                            insecure_skip_verify: true
        - commands:
            - echo "10.9.202.12 registry.local" >> /etc/hosts
    network:
        - commands:
            - 'echo | openssl s_client -showcerts -servername 10.9.202.12 -connect 10.9.202.12:443 > /etc/ssl/certs/nginx-selfsigned.crt'
upgrade:
    partitions:
        recovery:
            size: 6000
        state:
            size: 9000
    passive:
        size: 2500
    recovery-system:
        size: 2500
    system:
        size: 2500
users:
    - name: kairos
      passwd: kairos
      ssh_authorized_keys:
        - XXX
    - groups:
        - admin
      name: smo
      passwd: smo
      ssh_authorized_keys:
        - XXX

Expected behavior The final server boots correctly as per auroraboot documentation.

Logs

[root@aurora ~]# docker run -v "$PWD"/cloud-config.yaml:/cloud-config.yaml --rm -ti --net host quay.io/kairos/auroraboot         --set "artifact_version=v2.4.0-k3sv1.27.3+k3s1"         --set "release_version=v2.4.0"         --set "flavor=standard-fedora-amd64-generic"         --set "repository=kairos-io/kairos"         --cloud-config /cloud-config.yaml
4:01PM INF Downloading https://github.com/kairos-io/kairos/releases/download/v2.4.0/kairos-standard-fedora-amd64-generic-v2.4.0-k3sv1.27.3+k3s1.iso...
4:02PM INF Downloading https://github.com/kairos-io/kairos/releases/download/v2.4.0/kairos-standard-fedora-amd64-generic-v2.4.0-k3sv1.27.3+k3s1-initrd...
4:02PM INF Downloading https://github.com/kairos-io/kairos/releases/download/v2.4.0/kairos-standard-fedora-amd64-generic-v2.4.0-k3sv1.27.3+k3s1.squashfs...
4:02PM INF Downloading https://github.com/kairos-io/kairos/releases/download/v2.4.0/kairos-standard-fedora-amd64-generic-v2.4.0-k3sv1.27.3+k3s1-kernel...
4:02PM INF Start pixiecore
4:02PM INF Adding cloud config file to '/tmp/iso/kairos.iso'
4:02PM INF Wrote '/tmp/iso/kairos.iso.custom.iso'
4:02PM INF Listening on :8080...
2023/09/28 16:06:56 DHCP: Offering to boot 08:92:04:9b:c6:6a
2023/09/28 16:07:01 TFTP: Send of "08:92:04:9b:c6:6a/2" to 10.3.12.35:1722 failed: "10.3.12.35:1722": sending OACK: client aborted transfer: User aborted the transfer
2023/09/28 16:07:01 TFTP: Sent "08:92:04:9b:c6:6a/2" to 10.3.12.35:1723
2023/09/28 16:08:34 DHCP: Offering to boot 08:92:04:9b:c6:6a
2023/09/28 16:08:34 HTTP: Sending ipxe boot script to 10.3.12.35:24659
2023/09/28 16:08:34 HTTP: Sent file "kernel" to 10.3.12.35:24659
2023/09/28 16:08:34 HTTP: Sent file "initrd-0" to 10.3.12.35:24659
2023/09/28 16:10:54 HTTP: Sent file "other-0" to 10.3.12.35:48624

(image)

Additional context
Host 1: VM running auroraboot, directly connected to the same L2 segment as the final server.
Final server: either a bare-metal server (I tested a Dell R740) or a VM (in my case KVM, with no NAT involved).

jimmykarily commented 11 months ago

Previous take on this issue: https://github.com/kairos-io/AuroraBoot/pull/36

Itxaka commented 11 months ago

For a quick workaround, you can try passing the following option to auroraboot to see if it fixes it:

--set netboot.cmdline="rd.neednet=1 ip=dhcp rd.cos.disable netboot nodepair.enable console=tty0 selinux=0"
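
For reference, here is a minimal sketch of how that option could be combined with the command from "To Reproduce". The `--set` values are copied from the original report; the `NETBOOT_CMDLINE` variable and the `run_auroraboot` function name are only illustrative, and this exact combination has not been verified beyond what is reported in this thread.

```shell
# Kernel command line suggested in this thread; selinux=0 turns SELinux off
# for the netbooted system.
NETBOOT_CMDLINE="rd.neednet=1 ip=dhcp rd.cos.disable netboot nodepair.enable console=tty0 selinux=0"

# Hypothetical wrapper: same invocation as in "To Reproduce", plus netboot.cmdline.
run_auroraboot() {
    docker run -v "$PWD"/cloud-config.yaml:/cloud-config.yaml --rm -ti --net host quay.io/kairos/auroraboot \
        --set "artifact_version=v2.4.0-k3sv1.27.3+k3s1" \
        --set "release_version=v2.4.0" \
        --set "flavor=fedora-amd64-generic" \
        --set "repository=kairos-io/provider-kairos" \
        --set "netboot.cmdline=${NETBOOT_CMDLINE}" \
        --cloud-config /cloud-config.yaml
}
```

Calling `run_auroraboot` from the directory that contains cloud-config.yaml starts the netboot server with the extra kernel parameters. Quoting the whole `netboot.cmdline=...` argument keeps the space-separated parameters together as a single value.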

Itxaka commented 10 months ago

@unusualfor were you able to work around this by setting the netboot.cmdline that I suggested?

unusualfor commented 10 months ago

Yes, thanks! The workaround worked fine.
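
To confirm that the workaround actually took effect, one option is to check on the netbooted host whether `selinux=0` reached the kernel command line. The sketch below assumes a standard Linux `/proc/cmdline`; the helper name `cmdline_disables_selinux` is made up for illustration.

```shell
# Returns 0 if the given kernel command line contains the selinux=0 parameter.
cmdline_disables_selinux() {
    # Pad with spaces so the parameter only matches as a whole word.
    case " $1 " in
        *" selinux=0 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# On the netbooted host (command substitution strips the trailing newline):
# cmdline_disables_selinux "$(cat /proc/cmdline)" && echo "selinux=0 is active"
```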

Itxaka commented 10 months ago

Hint for whoever works on a fix for this: