rancher / elemental

Elemental is a software stack enabling centralized, full cloud-native OS management with Kubernetes.
https://elemental.docs.rancher.com/
Apache License 2.0

Persistent paths not mounted at boot #732

Closed davidcassany closed 1 year ago

davidcassany commented 1 year ago

What steps did you take and what happened:

The issue was raised in a 50-node deployment on the CI: one node did not join the cluster. This happened because the immutable-rootfs layout failed to mount the persistent partition and all of the state RW paths that depend on it. Once booted, the bootstrap.sh script failed to install RKE2 because /opt was not writable.

What did you expect to happen: The node bootstraps RKE2 and joins the cluster. Probably we should just fail at boot time. Instead, immutable-rootfs simply skipped creating the persistent mountpoints and continued to boot normally. It is probably better to include all of them in /etc/fstab and let systemd fail at a later stage if the underlying devices and mountpoints do not appear.

It is worth mentioning that on boot /usr/local got mounted based on the /etc/fstab contents after switching root, when systemd took control of the new root /.

Anything else you would like to add:

Journalctl logs of the initrd boot stage contain most of the relevant info:

Mar 13 09:14:12 localhost elemental[784]: INFO[2023-03-13T09:14:12Z] Executing /system/oem/01_elemental-rootfs.yaml
Mar 13 09:14:12 localhost elemental[784]: INFO[2023-03-13T09:14:12Z] Applying 'Elemental Rootfs Layout Settings' for stage 'rootfs.after'. Total stages: 1
Mar 13 09:14:12 localhost elemental[784]: INFO[2023-03-13T09:14:12Z] Processing stage step 'Grow persistent'. ( commands: 0, files: 0, ... )
Mar 13 09:14:12 localhost elemental[784]: INFO[2023-03-13T09:14:12Z] Extending last partition up to 0 MiB
Mar 13 09:14:13 localhost systemd-udevd[601]: sda5: Failed to add device '/dev/sda5' to watch: Operation not permitted
Mar 13 09:14:13 localhost elemental[784]: INFO[2023-03-13T09:14:13Z] Stage 'rootfs.after'. Defined stages: 1. Errors: false
Mar 13 09:14:13 localhost elemental[784]: INFO[2023-03-13T09:14:13Z] Executing /system/oem/04_accounting.yaml
Mar 13 09:14:13 localhost elemental[784]: INFO[2023-03-13T09:14:13Z] Executing /system/oem/05_motd_and_autologin.yaml
Mar 13 09:14:13 localhost elemental[784]: INFO[2023-03-13T09:14:13Z] Executing /system/oem/05_network.yaml
Mar 13 09:14:13 localhost elemental[784]: INFO[2023-03-13T09:14:13Z] Executing /system/oem/06_recovery.yaml
Mar 13 09:14:13 localhost elemental[784]: INFO[2023-03-13T09:14:13Z] Executing /system/oem/07_live.yaml
Mar 13 09:14:13 localhost elemental[784]: INFO[2023-03-13T09:14:13Z] Executing /system/oem/08_boot_assessment.yaml
Mar 13 09:14:13 localhost elemental[784]: INFO[2023-03-13T09:14:13Z] Executing /system/oem/09_services.yaml
Mar 13 09:14:13 localhost elemental[784]: INFO[2023-03-13T09:14:13Z] Executing /system/oem/99_elemental-register.yaml
Mar 13 09:14:13 localhost elemental[784]: INFO[2023-03-13T09:14:13Z] Executing /system/oem/99_elemental_system_agent.yaml
Mar 13 09:14:13 localhost elemental[784]: INFO[2023-03-13T09:14:13Z] Executing /oem/90_custom.yaml
Mar 13 09:14:13 localhost elemental[784]: INFO[2023-03-13T09:14:13Z] Executing /oem/91_custom.yaml
Mar 13 09:14:13 localhost elemental[784]: INFO[2023-03-13T09:14:13Z] Executing /oem/registration/config.yaml
Mar 13 09:14:13 localhost elemental[784]: INFO[2023-03-13T09:14:13Z] Done executing stage 'rootfs.after'
Mar 13 09:14:13 localhost elemental[784]: INFO[2023-03-13T09:14:13Z] Running stage: rootfs.before
Mar 13 09:14:13 localhost elemental[784]: INFO[2023-03-13T09:14:13Z] Executing BOOT_IMAGE=(loop0)/boot/vmlinuz console=tty1 console=ttyS0 root=LABEL=COS_STATE cos-img/filename=/cOS/active.img panic=5 rd.neednet=0 rd.cos.oemlabel=COS_OEM fsck.>
Mar 13 09:14:13 localhost elemental[784]: INFO[2023-03-13T09:14:13Z] Done executing stage 'rootfs.before'
Mar 13 09:14:13 localhost elemental[784]: INFO[2023-03-13T09:14:13Z] Running stage: rootfs
Mar 13 09:14:13 localhost elemental[784]: INFO[2023-03-13T09:14:13Z] Executing BOOT_IMAGE=(loop0)/boot/vmlinuz console=tty1 console=ttyS0 root=LABEL=COS_STATE cos-img/filename=/cOS/active.img panic=5 rd.neednet=0 rd.cos.oemlabel=COS_OEM fsck.>
Mar 13 09:14:13 localhost elemental[784]: INFO[2023-03-13T09:14:13Z] Done executing stage 'rootfs'
Mar 13 09:14:13 localhost elemental[784]: INFO[2023-03-13T09:14:13Z] Running stage: rootfs.after
Mar 13 09:14:13 localhost elemental[784]: INFO[2023-03-13T09:14:13Z] Executing BOOT_IMAGE=(loop0)/boot/vmlinuz console=tty1 console=ttyS0 root=LABEL=COS_STATE cos-img/filename=/cOS/active.img panic=5 rd.neednet=0 rd.cos.oemlabel=COS_OEM fsck.>
Mar 13 09:14:13 localhost elemental[784]: INFO[2023-03-13T09:14:13Z] Done executing stage 'rootfs.after'
Mar 13 09:14:13 localhost systemd[1]: Finished cOS system early rootfs setup.
Mar 13 09:14:13 localhost systemd[1]: Starting cOS system immutable rootfs mounts...
Mar 13 09:14:13 localhost systemd[1]: Unmounting /oem...
Mar 13 09:14:13 localhost systemd[1]: initrd-parse-etc.service: Deactivated successfully.
Mar 13 09:14:13 localhost systemd[1]: Finished Reload Configuration from the Real Root.
Mar 13 09:14:13 localhost systemd[1]: Starting dracut mount hook...
Mar 13 09:14:13 localhost systemd[1]: oem.mount: Deactivated successfully.
Mar 13 09:14:13 localhost systemd[1]: Unmounted /oem.
Mar 13 09:14:13 localhost systemd[1]: Finished dracut mount hook.
Mar 13 09:14:13 localhost kernel: EXT4-fs (sda2): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none.
Mar 13 09:14:13 localhost systemd-fsck[978]: Failed to stat /dev/disk/by-label/COS_PERSISTENT: No such file or directory
Mar 13 09:14:13 localhost cos-mount-layout[979]: Warning: /dev/disk/by-label/COS_PERSISTENT already mounted or device not found
Mar 13 09:14:13 localhost cos-mount-layout[989]: mkdir: cannot create directory '/sysroot/usr/local/.state': Read-only file system
Mar 13 09:14:13 localhost cos-mount-layout[987]: failed creating '/sysroot/etc/systemd' or '/sysroot/usr/local/.state/etc-systemd.bind'. Ignoring '/sysroot/etc/systemd' mount
Mar 13 09:14:13 localhost cos-mount-layout[992]: mkdir: cannot create directory '/sysroot/usr/local/.state': Read-only file system
Mar 13 09:14:13 localhost cos-mount-layout[990]: failed creating '/sysroot/etc/rancher' or '/sysroot/usr/local/.state/etc-rancher.bind'. Ignoring '/sysroot/etc/rancher' mount
Mar 13 09:14:13 localhost cos-mount-layout[995]: mkdir: cannot create directory '/sysroot/usr/local/.state': Read-only file system
Mar 13 09:14:13 localhost cos-mount-layout[993]: failed creating '/sysroot/etc/ssh' or '/sysroot/usr/local/.state/etc-ssh.bind'. Ignoring '/sysroot/etc/ssh' mount
Mar 13 09:14:13 localhost cos-mount-layout[998]: mkdir: cannot create directory '/sysroot/usr/local/.state': Read-only file system
Mar 13 09:14:13 localhost cos-mount-layout[996]: failed creating '/sysroot/etc/iscsi' or '/sysroot/usr/local/.state/etc-iscsi.bind'. Ignoring '/sysroot/etc/iscsi' mount
Mar 13 09:14:13 localhost cos-mount-layout[1001]: mkdir: cannot create directory '/sysroot/usr/local/.state': Read-only file system
Mar 13 09:14:13 localhost cos-mount-layout[999]: failed creating '/sysroot/etc/cni' or '/sysroot/usr/local/.state/etc-cni.bind'. Ignoring '/sysroot/etc/cni' mount
Mar 13 09:14:13 localhost cos-mount-layout[1004]: mkdir: cannot create directory '/sysroot/usr/local/.state': Read-only file system
Mar 13 09:14:13 localhost cos-mount-layout[1002]: failed creating '/sysroot/home' or '/sysroot/usr/local/.state/home.bind'. Ignoring '/sysroot/home' mount
Mar 13 09:14:13 localhost cos-mount-layout[1007]: mkdir: cannot create directory '/sysroot/usr/local/.state': Read-only file system
Mar 13 09:14:13 localhost cos-mount-layout[1005]: failed creating '/sysroot/opt' or '/sysroot/usr/local/.state/opt.bind'. Ignoring '/sysroot/opt' mount
Mar 13 09:14:13 localhost cos-mount-layout[1010]: mkdir: cannot create directory '/sysroot/usr/local/.state': Read-only file system
Mar 13 09:14:13 localhost cos-mount-layout[1008]: failed creating '/sysroot/root' or '/sysroot/usr/local/.state/root.bind'. Ignoring '/sysroot/root' mount
Mar 13 09:14:13 localhost cos-mount-layout[1013]: mkdir: cannot create directory '/sysroot/usr/local/.state': Read-only file system
Mar 13 09:14:13 localhost cos-mount-layout[1011]: failed creating '/sysroot/usr/libexec' or '/sysroot/usr/local/.state/usr-libexec.bind'. Ignoring '/sysroot/usr/libexec' mount
Mar 13 09:14:13 localhost cos-mount-layout[1016]: mkdir: cannot create directory '/sysroot/usr/local/.state': Read-only file system
Mar 13 09:14:13 localhost cos-mount-layout[1014]: failed creating '/sysroot/var/log' or '/sysroot/usr/local/.state/var-log.bind'. Ignoring '/sysroot/var/log' mount
Mar 13 09:14:13 localhost cos-mount-layout[1019]: mkdir: cannot create directory '/sysroot/usr/local/.state': Read-only file system
Mar 13 09:14:13 localhost cos-mount-layout[1017]: failed creating '/sysroot/var/lib/elemental' or '/sysroot/usr/local/.state/var-lib-elemental.bind'. Ignoring '/sysroot/var/lib/elemental' mount
Mar 13 09:14:13 localhost cos-mount-layout[1022]: mkdir: cannot create directory '/sysroot/usr/local/.state': Read-only file system
Mar 13 09:14:13 localhost cos-mount-layout[1020]: failed creating '/sysroot/var/lib/rancher' or '/sysroot/usr/local/.state/var-lib-rancher.bind'. Ignoring '/sysroot/var/lib/rancher' mount
Mar 13 09:14:13 localhost cos-mount-layout[1025]: mkdir: cannot create directory '/sysroot/usr/local/.state': Read-only file system
Mar 13 09:14:13 localhost cos-mount-layout[1023]: failed creating '/sysroot/var/lib/kubelet' or '/sysroot/usr/local/.state/var-lib-kubelet.bind'. Ignoring '/sysroot/var/lib/kubelet' mount
Mar 13 09:14:13 localhost cos-mount-layout[1028]: mkdir: cannot create directory '/sysroot/usr/local/.state': Read-only file system
Mar 13 09:14:13 localhost cos-mount-layout[1026]: failed creating '/sysroot/var/lib/NetworkManager' or '/sysroot/usr/local/.state/var-lib-NetworkManager.bind'. Ignoring '/sysroot/var/lib/NetworkManager' mount
Mar 13 09:14:13 localhost cos-mount-layout[1031]: mkdir: cannot create directory '/sysroot/usr/local/.state': Read-only file system
Mar 13 09:14:13 localhost cos-mount-layout[1029]: failed creating '/sysroot/var/lib/longhorn' or '/sysroot/usr/local/.state/var-lib-longhorn.bind'. Ignoring '/sysroot/var/lib/longhorn' mount
Mar 13 09:14:13 localhost cos-mount-layout[1034]: mkdir: cannot create directory '/sysroot/usr/local/.state': Read-only file system
Mar 13 09:14:13 localhost cos-mount-layout[1032]: failed creating '/sysroot/var/lib/cni' or '/sysroot/usr/local/.state/var-lib-cni.bind'. Ignoring '/sysroot/var/lib/cni' mount
Mar 13 09:14:13 localhost cos-mount-layout[1037]: mkdir: cannot create directory '/sysroot/usr/local/.state': Read-only file system
Mar 13 09:14:13 localhost cos-mount-layout[1035]: failed creating '/sysroot/var/lib/calico' or '/sysroot/usr/local/.state/var-lib-calico.bind'. Ignoring '/sysroot/var/lib/calico' mount
Mar 13 09:14:13 localhost systemd[1]: Finished cOS system immutable rootfs mounts.
Mar 13 09:14:13 localhost systemd[1]: Reached target Initrd File Systems.

Environment:

$ rpm -qa | grep elemental
elemental-register-1.1.3+git20230310.3f0e357-150400.111.1.x86_64
elemental-support-1.1.3+git20230310.3f0e357-150400.111.1.x86_64
elemental-system-agent-0.3.1-150400.2.1.x86_64
elemental-updater-1.1.1+git20230311.9525ad7-150400.208.1.noarch
elemental-cli-0.2.1+git20230308.cb8b7d0-150400.70.1.x86_64
elemental-toolkit-0.10.1+git20230303.3b57d5d-150400.91.1.noarch
elemental-1.1.1+git20230311.9525ad7-150400.208.1.noarch
...

kkaempf commented 1 year ago

systemd-udevd[601]: sda5: Failed to add device '/dev/sda5' to watch: Operation not permitted

What's the root cause of this failure ?

davidcassany commented 1 year ago

I have no clue... 🤷🏽‍♂️ It happened in only one node out of 50, so I believe it has to be a race condition somewhere. We can attempt to be more strict on mounting all the special paths at boot and wait for the devices if they are not there yet.
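As a rough illustration of "wait for the devices if they are not there yet", here is a minimal sketch of a polling helper. The function name wait_for_path and its arguments are hypothetical and not taken from cos-mount-layout.sh; the only external call is udevadm settle with its --timeout option, and it is skipped when udevadm is not installed.

```shell
#!/bin/sh
# Hypothetical helper (not the elemental-toolkit code): wait until a path,
# e.g. /dev/disk/by-label/COS_PERSISTENT, exists. Polls once per second and
# gives up after the timeout (in seconds, default 60).
wait_for_path() {
    wfp_path="$1"
    wfp_timeout="${2:-60}"
    wfp_elapsed=0
    while [ ! -e "$wfp_path" ]; do
        if [ "$wfp_elapsed" -ge "$wfp_timeout" ]; then
            echo "timed out waiting for $wfp_path" >&2
            return 1
        fi
        # Give udev a chance to process queued events before the next check.
        if command -v udevadm >/dev/null 2>&1; then
            udevadm settle --timeout=1 >/dev/null 2>&1 || true
        fi
        sleep 1
        wfp_elapsed=$((wfp_elapsed + 1))
    done
    return 0
}

# Example (assumed label path, as seen in the logs):
#   wait_for_path /dev/disk/by-label/COS_PERSISTENT 60 || echo "still missing"
```

A non-zero return lets the caller decide whether to abort the boot or to carry on and leave the failure to systemd.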

kkaempf commented 1 year ago

This is also strange:

Extending last partition up to 0 MiB

davidcassany commented 1 year ago

Yes, the message here is misleading. The expansion of the COS_PERSISTENT partition is enabled by default (which I doubt is actually valuable for our current workflow). If the size is set to 0, it expands to all available space; we did not properly cover this case in logging. Also, it did not expand, as the partition already uses all available space.
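To make the logging gap concrete, here is a small sketch of how a size of 0 could be reported differently from an explicit size. The function name and the wording of the first message are assumptions for illustration only; this is not how elemental-cli builds its log line.

```shell
# Hypothetical helper: build the log message for a requested partition size
# in MiB, treating 0 as "use all available space".
describe_partition_target() {
    dpt_size_mib="$1"
    if [ "$dpt_size_mib" -eq 0 ]; then
        echo "Extending last partition to all available space"
    else
        echo "Extending last partition up to ${dpt_size_mib} MiB"
    fi
}

# Example:
#   describe_partition_target 0
#   describe_partition_target 512
```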

I have been wondering whether the attempt to expand the partition could cause some sort of collision with udev at boot: when parted accesses the partition (even without applying any change), that might cause some glitch with the udev daemon.

ldevulder commented 1 year ago

I had the issue in a test: the rancher-system-agent service failed because its binary was not found in /opt, as this filesystem was in a read-only state.

I have this in /etc/fstab:

/dev/loop0 / auto ro 0 0
tmpfs /run/overlay tmpfs defaults,size=25% 0 0
overlay /etc overlay defaults,lowerdir=/etc,upperdir=/run/overlay/etc.overlay/upper,workdir=/run/overlay/etc.overlay/work,x-systemd.requires-mounts-for=/run/overlay
/dev/disk/by-label/COS_OEM /oem auto defaults 0 0
overlay /srv overlay defaults,lowerdir=/srv,upperdir=/run/overlay/srv.overlay/upper,workdir=/run/overlay/srv.overlay/work,x-systemd.requires-mounts-for=/run/overlay
/dev/disk/by-label/COS_PERSISTENT /usr/local auto defaults 0 0
overlay /var overlay defaults,lowerdir=/var,upperdir=/run/overlay/var.overlay/upper,workdir=/run/overlay/var.overlay/work,x-systemd.requires-mounts-for=/run/overlay

Elemental packages:

# rpm -qa | grep elemental
elemental-dracut-config-0.10.3+git20230315.63e8c41-150400.94.1.noarch
elemental-grub-config-0.10.3+git20230315.63e8c41-150400.94.1.noarch
elemental-immutable-rootfs-0.10.3+git20230315.63e8c41-150400.94.1.noarch
elemental-register-1.1.4+git20230315.63c99d7-150400.115.1.x86_64
elemental-support-1.1.4+git20230315.63c99d7-150400.115.1.x86_64
elemental-system-agent-0.3.1-150400.2.2.x86_64
elemental-updater-1.1.2+git20230315.f7048f1-150400.212.1.noarch
elemental-cli-0.2.2+git20230310.c214054-150400.72.2.x86_64
elemental-init-setup-0.10.3+git20230315.63e8c41-150400.94.1.noarch
elemental-init-services-0.10.3+git20230315.63e8c41-150400.94.1.noarch
elemental-init-recovery-0.10.3+git20230315.63e8c41-150400.94.1.noarch
elemental-init-network-0.10.3+git20230315.63e8c41-150400.94.1.noarch
elemental-init-live-0.10.3+git20230315.63e8c41-150400.94.1.noarch
elemental-init-boot-assessment-0.10.3+git20230315.63e8c41-150400.94.1.noarch
elemental-init-config-0.10.3+git20230315.63e8c41-150400.94.1.noarch
elemental-toolkit-0.10.3+git20230315.63e8c41-150400.94.1.noarch
elemental-1.1.2+git20230315.f7048f1-150400.212.1.noarch

Here is the full journalctl log: boot.log.gz

A reboot of the node fixed the issue, so maybe there is an issue when the /etc/fstab file is generated during boot?

kkaempf commented 1 year ago

This looks like a missing udevadm settle 🤔

Mar 15 21:15:51 localhost elemental[784]: INFO[2023-03-15T21:15:51Z] Extending last partition up to 0 MiB
Mar 15 21:15:51 localhost systemd-udevd[617]: sda5: Failed to add device '/dev/sda5' to watch: Operation not permitted

kkaempf commented 1 year ago

And here's the first error about a read-only filesystem

Mar 15 21:15:52 localhost elemental[1039]: INFO[2023-03-15T21:15:52Z] Executing /system/oem/01_elemental-rootfs.yaml
Mar 15 21:15:52 localhost elemental[1039]: INFO[2023-03-15T21:15:52Z] Applying 'Elemental Rootfs Layout Settings' for stage 'initramfs'. Total stages: 3
Mar 15 21:15:52 localhost elemental[1039]: INFO[2023-03-15T21:15:52Z] Processing stage step ''. ( commands: 1, files: 0, ... )
Mar 15 21:15:52 localhost elemental[1039]: INFO[2023-03-15T21:15:52Z] Command output: mkdir: cannot create directory ‘/usr/local/etc’: Read-only file system
Mar 15 21:15:52 localhost elemental[1039]: sh: line 2: /usr/local/etc/hostname: No such file or directory

kkaempf commented 1 year ago

@ldevulder is that Dev, Staging or Stable ?

kkaempf commented 1 year ago

Mar 15 21:15:51 localhost systemd-fsck[978]: Failed to stat /dev/disk/by-label/COS_PERSISTENT: No such file or directory

😮

davidcassany commented 1 year ago

I'd say we have a few things to consider here, all actionable:

ldevulder commented 1 year ago

@ldevulder is that Dev, Staging or Stable ?

It's on Dev. Staging should be the same, as it seems to have happened before the last sync between Dev and Staging. Not tested with Stable.

davidcassany commented 1 year ago

In rancher/elemental-toolkit#1743, instead of explicitly erroring out when preparing or mounting a mountpoint fails, I just changed the logic to not ignore the mountpoint and still include it in fstab. The idea is that we keep creating the fstab as requested but do not fail if we can't mount everything; systemd will fail later on if it can't mount all fstab entries.
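The idea can be sketched roughly as follows. This is not the code from the PR: the function names are hypothetical, the fstab field layout and the defaults,bind options are assumptions, and only the .state directory and the naming of the .bind directories (for example /var/lib/rancher becoming var-lib-rancher.bind) are taken from the logs above.

```shell
# Derive the state directory name for a persistent path, following the naming
# visible in the logs (/var/lib/rancher -> var-lib-rancher.bind).
bind_name() {
    printf '%s.bind\n' "$(printf '%s' "${1#/}" | tr '/' '-')"
}

# Print one fstab line per persistent path. If the backing directory cannot be
# prepared, only warn: the entry is still emitted so that systemd can fail
# later if the mount really is impossible.
#   $1: root prefix (e.g. /sysroot), $2: state dir (e.g. /usr/local/.state),
#   remaining arguments: persistent paths.
generate_entries() {
    ge_root="$1"
    ge_state="$2"
    shift 2
    for ge_target in "$@"; do
        ge_name="$(bind_name "$ge_target")"
        if ! mkdir -p "${ge_root}${ge_state}/${ge_name}" 2>/dev/null; then
            echo "warning: could not prepare ${ge_target}, keeping fstab entry" >&2
        fi
        # Assumed fstab layout for a bind mount; adjust to the real generator.
        printf '%s/%s %s none defaults,bind 0 0\n' "$ge_state" "$ge_name" "$ge_target"
    done
}

# Example:
#   generate_entries /sysroot /usr/local/.state /opt /var/lib/rancher
```

The point of the design is that a transient failure in the initrd no longer silently drops a mountpoint from the generated file.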

davidcassany commented 1 year ago

Moved to needs review, as verifying the fix requires running tests on big clusters (the issue turned out to be more or less consistent with clusters of more than 100 nodes).

To fully verify the fix, we should see the following udev error message in some node's boot logs:

systemd-udevd[601]: sda5: Failed to add device '/dev/sda5' to watch: Operation not permitted

and despite this, the node successfully booted and joined the cluster. I guess that means parsing the boot logs of every single node.
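Parsing every node's boot log could be scripted along these lines. This is only a sketch: it assumes each node's journal has been exported to a plain-text file, the function name and the three result labels are made up for illustration, and the grep patterns are derived from the log lines quoted in this issue.

```shell
# Hypothetical classifier for one exported boot log file.
# Prints one of:
#   no-udev-error             the udev watch error was not logged
#   udev-error-layout-failed  the error was logged and cos-mount-layout failed
#   udev-error-layout-ok      the error was logged but the layout was prepared
classify_boot_log() {
    cbl_log="$1"
    if ! grep -q "Failed to add device .* to watch: Operation not permitted" "$cbl_log"; then
        echo "no-udev-error"
    elif grep -q "cos-mount-layout\[[0-9]*]: failed creating" "$cbl_log"; then
        echo "udev-error-layout-failed"
    else
        echo "udev-error-layout-ok"
    fi
}

# Example over a hypothetical directory of exported logs:
#   for f in logs/*.log; do
#       printf '%s %s\n' "$f" "$(classify_boot_log "$f")"
#   done
```

Nodes reported as udev-error-layout-ok would be the ones that show the fix working; whether they also joined the cluster still has to be checked separately.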

davidcassany commented 1 year ago

Apparently the fix was not enough... 😞 It looks like there might be some sort of issue with udev. In the logs of a failed node that included the fix, we noticed a slightly different behavior: the Operation not permitted error on device /dev/sda5 was not there, but cos-mount-layout.sh still failed to properly prepare the layout.

Mar 20 10:23:05 localhost systemd[1]: Starting cOS system immutable rootfs mounts...
Mar 20 10:23:05 localhost systemd[1]: Unmounting /oem...
Mar 20 10:23:05 localhost systemd[1]: initrd-parse-etc.service: Deactivated successfully.
Mar 20 10:23:05 localhost systemd[1]: Finished Reload Configuration from the Real Root.
Mar 20 10:23:05 localhost systemd[1]: Starting dracut mount hook...
Mar 20 10:23:05 localhost systemd[1]: oem.mount: Deactivated successfully.
Mar 20 10:23:05 localhost systemd[1]: Unmounted /oem.
Mar 20 10:23:05 localhost systemd[1]: Finished dracut mount hook.
Mar 20 10:23:05 localhost kernel: EXT4-fs (sda2): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none.
Mar 20 10:24:06 localhost systemd-fsck[1096]: Failed to stat /dev/disk/by-label/COS_PERSISTENT: No such file or directory
Mar 20 10:24:06 localhost cos-mount-layout[1097]: Warning: /dev/disk/by-label/COS_PERSISTENT already mounted or device not found
Mar 20 10:24:06 localhost cos-mount-layout[1107]: mkdir: cannot create directory '/sysroot/usr/local/.state': Read-only file system
Mar 20 10:24:06 localhost cos-mount-layout[1105]: failed creating '/sysroot/etc/systemd' or '/sysroot/usr/local/.state/etc-systemd.bind'
Mar 20 10:24:06 localhost cos-mount-layout[1110]: mkdir: cannot create directory '/sysroot/usr/local/.state': Read-only file system
Mar 20 10:24:06 localhost cos-mount-layout[1108]: failed creating '/sysroot/etc/rancher' or '/sysroot/usr/local/.state/etc-rancher.bind'
Mar 20 10:24:06 localhost cos-mount-layout[1113]: mkdir: cannot create directory '/sysroot/usr/local/.state': Read-only file system
Mar 20 10:24:06 localhost cos-mount-layout[1111]: failed creating '/sysroot/etc/ssh' or '/sysroot/usr/local/.state/etc-ssh.bind'
Mar 20 10:24:06 localhost cos-mount-layout[1116]: mkdir: cannot create directory '/sysroot/usr/local/.state': Read-only file system
Mar 20 10:24:06 localhost cos-mount-layout[1114]: failed creating '/sysroot/etc/iscsi' or '/sysroot/usr/local/.state/etc-iscsi.bind'
Mar 20 10:24:06 localhost cos-mount-layout[1119]: mkdir: cannot create directory '/sysroot/usr/local/.state': Read-only file system
Mar 20 10:24:06 localhost cos-mount-layout[1117]: failed creating '/sysroot/etc/cni' or '/sysroot/usr/local/.state/etc-cni.bind'
Mar 20 10:24:06 localhost cos-mount-layout[1122]: mkdir: cannot create directory '/sysroot/usr/local/.state': Read-only file system
Mar 20 10:24:06 localhost cos-mount-layout[1120]: failed creating '/sysroot/home' or '/sysroot/usr/local/.state/home.bind'
Mar 20 10:24:06 localhost cos-mount-layout[1125]: mkdir: cannot create directory '/sysroot/usr/local/.state': Read-only file system
Mar 20 10:24:06 localhost cos-mount-layout[1123]: failed creating '/sysroot/opt' or '/sysroot/usr/local/.state/opt.bind'
Mar 20 10:24:06 localhost cos-mount-layout[1128]: mkdir: cannot create directory '/sysroot/usr/local/.state': Read-only file system
Mar 20 10:24:06 localhost cos-mount-layout[1126]: failed creating '/sysroot/root' or '/sysroot/usr/local/.state/root.bind'
Mar 20 10:24:06 localhost cos-mount-layout[1131]: mkdir: cannot create directory '/sysroot/usr/local/.state': Read-only file system
Mar 20 10:24:06 localhost cos-mount-layout[1129]: failed creating '/sysroot/usr/libexec' or '/sysroot/usr/local/.state/usr-libexec.bind'
Mar 20 10:24:06 localhost cos-mount-layout[1134]: mkdir: cannot create directory '/sysroot/usr/local/.state': Read-only file system
Mar 20 10:24:06 localhost cos-mount-layout[1132]: failed creating '/sysroot/var/log' or '/sysroot/usr/local/.state/var-log.bind'
Mar 20 10:24:06 localhost cos-mount-layout[1137]: mkdir: cannot create directory '/sysroot/usr/local/.state': Read-only file system
Mar 20 10:24:06 localhost cos-mount-layout[1135]: failed creating '/sysroot/var/lib/elemental' or '/sysroot/usr/local/.state/var-lib-elemental.bind'
Mar 20 10:24:06 localhost cos-mount-layout[1140]: mkdir: cannot create directory '/sysroot/usr/local/.state': Read-only file system
Mar 20 10:24:06 localhost cos-mount-layout[1138]: failed creating '/sysroot/var/lib/rancher' or '/sysroot/usr/local/.state/var-lib-rancher.bind'
Mar 20 10:24:06 localhost cos-mount-layout[1143]: mkdir: cannot create directory '/sysroot/usr/local/.state': Read-only file system
Mar 20 10:24:06 localhost cos-mount-layout[1141]: failed creating '/sysroot/var/lib/kubelet' or '/sysroot/usr/local/.state/var-lib-kubelet.bind'
Mar 20 10:24:06 localhost cos-mount-layout[1146]: mkdir: cannot create directory '/sysroot/usr/local/.state': Read-only file system
Mar 20 10:24:06 localhost cos-mount-layout[1144]: failed creating '/sysroot/var/lib/NetworkManager' or '/sysroot/usr/local/.state/var-lib-NetworkManager.bind'
Mar 20 10:24:06 localhost cos-mount-layout[1149]: mkdir: cannot create directory '/sysroot/usr/local/.state': Read-only file system
Mar 20 10:24:06 localhost cos-mount-layout[1147]: failed creating '/sysroot/var/lib/longhorn' or '/sysroot/usr/local/.state/var-lib-longhorn.bind'
Mar 20 10:24:06 localhost cos-mount-layout[1152]: mkdir: cannot create directory '/sysroot/usr/local/.state': Read-only file system
Mar 20 10:24:06 localhost cos-mount-layout[1150]: failed creating '/sysroot/var/lib/cni' or '/sysroot/usr/local/.state/var-lib-cni.bind'
Mar 20 10:24:06 localhost cos-mount-layout[1155]: mkdir: cannot create directory '/sysroot/usr/local/.state': Read-only file system
Mar 20 10:24:06 localhost cos-mount-layout[1153]: failed creating '/sysroot/var/lib/calico' or '/sysroot/usr/local/.state/var-lib-calico.bind'
Mar 20 10:24:06 localhost systemd[1]: Finished cOS system immutable rootfs mounts.

As the logs show, cos-mount-layout.sh fails to prepare the mountpoints because the device /dev/disk/by-label/COS_PERSISTENT is missing; it even waits for it for 1 minute, calling udevadm settle every second. The interesting fact is that later on, when switching root, the device is found properly and all mountpoints are created according to the generated /etc/fstab. As before, rebooting fixes the issue.

kkaempf commented 1 year ago

Time to raise a kernel/systemd bug via Bugzilla ?! 🤔

davidcassany commented 1 year ago

We already had a couple of successful big deployments with 150 nodes and more. Closing as fixed; both successful runs included https://github.com/rancher/elemental-toolkit/pull/1744