k3s fails to start on Raspberry Pi

tantalic commented 1 year ago

Kairos version:
NAME="openSUSE Leap"
VERSION="15.5"
ID="opensuse-leap"
ID_LIKE="suse opensuse"
VERSION_ID="15.5"
PRETTY_NAME="openSUSE Leap 15.5"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:opensuse:leap:15.5"
BUG_REPORT_URL="https://bugs.opensuse.org"
HOME_URL="https://www.opensuse.org/"
DOCUMENTATION_URL="https://en.opensuse.org/Portal:Leap"
LOGO="distributor-logo-Leap"
KAIROS_NAME="kairos-opensuse-leap-arm-rpi"
KAIROS_VERSION="v2.3.0-k3sv1.27.3+k3s1"
KAIROS_ID="kairos"
KAIROS_ID_LIKE="kairos-opensuse-leap-arm-rpi"
KAIROS_VERSION_ID="v2.3.0-k3sv1.27.3+k3s1"
KAIROS_PRETTY_NAME="kairos-opensuse-leap-arm-rpi v2.3.0-k3sv1.27.3+k3s1"
KAIROS_BUG_REPORT_URL="https://github.com/kairos-io/kairos/issues/new/choose"
KAIROS_HOME_URL="https://github.com/kairos-io/provider-kairos"
KAIROS_IMAGE_REPO="quay.io/kairos/kairos-opensuse-leap-arm-rpi"
KAIROS_IMAGE_LABEL="latest"
KAIROS_GITHUB_REPO="kairos-io/provider-kairos"
KAIROS_VARIANT="kairos"

CPU architecture, OS, and Version: Linux yak-001 5.14.21-150500.53-default #1 SMP PREEMPT_DYNAMIC Wed May 10 07:56:26 UTC 2023 (b630043) aarch64 aarch64 aarch64 GNU/Linux

Describe the bug: K3s fails to start on Raspberry Pi (I have tried both Ubuntu- and openSUSE-based images) due to an error writing to /var/lib/rancher/k3s/data.

To Reproduce

Install an extracted or custom-created Raspberry Pi image onto an SD card
Enable k3s in a cloud-init file
Power on the Raspberry Pi

Expected behavior: K3s should start successfully.

Logs

sudo systemctl status k3s shows the service is stuck in the activating state, continually in an auto-restart loop:

● k3s.service - Lightweight Kubernetes
     Loaded: loaded (/etc/systemd/system/k3s.service; enabled; vendor preset: disabled)
    Drop-In: /etc/systemd/system/k3s.service.d
             └─override.conf
     Active: activating (auto-restart) (Result: exit-code) since Tue 2023-07-11 02:25:08 UTC; 776ms ago
       Docs: https://k3s.io
    Process: 2096 ExecStartPre=/bin/sh -xc ! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service (code=exited, status=0/SUCCESS)
    Process: 2098 ExecStartPre=/sbin/modprobe br_netfilter (code=exited, status=0/SUCCESS)
    Process: 2099 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
    Process: 2100 ExecStart=/usr/bin/k3s server (code=exited, status=1/FAILURE)
   Main PID: 2100 (code=exited, status=1/FAILURE)

sudo journalctl -u k3s shows a failure loop caused by k3s being unable to extract data into /var/lib/rancher/k3s/data:

systemd[1]: Starting Lightweight Kubernetes...
sh[3096]: + /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service
sh[3097]: Failed to get unit file state for nm-cloud-setup.service: No such file or directory
k3s[3100]: time="2023-07-11T02:34:23Z" level=info msg="Acquiring lock file /var/lib/rancher/k3s/data/.lock"
k3s[3100]: time="2023-07-11T02:34:23Z" level=info msg="Preparing data dir /var/lib/rancher/k3s/data/8147fdcc81517672a3573345f56cc1fc8eb>
k3s[3100]: time="2023-07-11T02:34:23Z" level=info msg="error extracting tarball into /var/lib/rancher/k3s/data/8147fdcc81517672a3573345>
k3s[3100]: time="2023-07-11T02:34:24Z" level=fatal msg="extracting data: error writing to /var/lib/rancher/k3s/data/8147fdcc81517672a35>
systemd[1]: k3s.service: Main process exited, code=exited, status=1/FAILURE
systemd[1]: k3s.service: Failed with result 'exit-code'.
systemd[1]: Failed to start Lightweight Kubernetes.
systemd[1]: k3s.service: Scheduled restart job, restart counter is at 139.
systemd[1]: Stopped Lightweight Kubernetes.
systemd[1]: Starting Lightweight Kubernetes...

Note: I tried this previously in Kairos 2.2.1 images for both openSUSE and Ubuntu. I wanted to try it on the latest release today before filing to make sure it wasn't already addressed.

dutchmillbytes commented 1 year ago

Some parts of the logs seem to have been cut off. The actual error message is "no space left on device", and indeed the /dev/disk/by-label/COS_PERSISTENT partition mounted on /usr/local was at 100% use.
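One way to confirm this kind of failure is to read the journal without the pager (which is what cuts the long lines) and to check how full the persistent mount is. The sketch below is only an illustration: `use_percent` is a hypothetical helper name, and it assumes POSIX-format `df -P` output with no spaces in the filesystem name.

```shell
# Hypothetical helper: print the Capacity column (without "%") for every
# filesystem line of `df -P` output read from stdin.
use_percent() {
    awk 'NR > 1 { sub(/%/, "", $5); print $5 }'
}

# Possible usage on the affected machine (not run here):
#   journalctl -u k3s --no-pager     # full log lines, no pager truncation
#   df -P /usr/local | use_percent   # 100 means the partition is full
```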

tantalic commented 1 year ago

More information on this specific image: it was created following the instructions under Installation > RaspberryPi > Customizing the Disk Image. However, after previously running into this issue to the same effect, I doubled many of the disk-size-related parameters in an attempt to fix it. The command run to make the image was:


docker run -v /path/to/dir:/HERE \
        -v /var/run/docker.sock:/var/run/docker.sock \
        --privileged -i --rm \
        --entrypoint=/build-arm-image.sh quay.io/kairos/osbuilder-tools:latest \
        --use-lvm \
        --model rpi64 \
        --state-partition-size 12400 \
        --recovery-partition-size 4200 \
        --size 30400 \
        --images-size 4000 \
        --local \
        --config /HERE/cloud-config.yaml \
        --docker-image quay.io/kairos/kairos-opensuse-leap-arm-rpi:v2.3.0-k3sv1.27.3-k3s1 /HERE/yak-001.img

mauromorales commented 1 year ago

I think the issue might be related to the size of the persistent partition:

lsblk -o NAME,SIZE,LABEL
NAME                   SIZE LABEL
loop0                    2G COS_ACTIVE
mmcblk0               29.2G 
├─mmcblk0p1             96M COS_GRUB
├─mmcblk0p2            6.1G COS_STATE
├─mmcblk0p3            4.2G 
│ ├─KairosVG-oem        64M COS_OEM
│ └─KairosVG-recovery  4.1G COS_RECOVERY
└─mmcblk0p4             64M COS_PERSISTENT

mauromorales commented 1 year ago

The issue is caused by a failure to resize the persistent partition: https://github.com/kairos-io/kairos/issues/1448#issuecomment-1632497108

mauromorales commented 1 year ago

Until this is fixed, you can run the command manually (annoying, but it only needs to be done once):

In GRUB, add rd.break=initqueue to the kernel command line.
When you drop into the rescue shell, run: sgdisk -g -d=4 -n=4:21635072:+0 -c=4:Linux\ filesystem -t=4:8300 /dev/mmcblk0
- Keep in mind that the start sector depends on the values of the image, so validate it before running the command.
- You might want to run the command with -P as a dry run before running it without.
Restart.
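To validate the start sector before touching the partition table, one option is to read it from sysfs and only print the resulting command. This is a rough sketch, not a tested procedure: `part_start_sector` and `print_sgdisk_cmd` are hypothetical helper names, and it assumes the kernel's `start` value (in 512-byte units) matches the sector size sgdisk uses on this disk.

```shell
# Hypothetical helper: print the first sector of a partition as the kernel
# sees it. $1: partition name (e.g. mmcblk0p4); $2: sysfs root, which can be
# overridden for testing and defaults to /sys.
part_start_sector() {
    cat "${2:-/sys}/class/block/$1/start"
}

# Hypothetical helper: print (do not run) the sgdisk command from the steps
# above, with -P added for a dry run.
# $1: disk device, $2: partition number, $3: start sector
print_sgdisk_cmd() {
    printf 'sgdisk -P -g -d=%s -n=%s:%s:+0 -c=%s:Linux\\ filesystem -t=%s:8300 %s\n' \
        "$2" "$2" "$3" "$2" "$2" "$1"
}

# Possible usage in the rescue shell (not run here):
#   print_sgdisk_cmd /dev/mmcblk0 4 "$(part_start_sector mmcblk0p4)"
```

Printing the command first makes it easy to compare the detected sector with the one in the example before removing -P.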

On next boot you should see something like

lsblk -o NAME,SIZE,LABEL
NAME                   SIZE LABEL
loop0                    2G COS_ACTIVE
mmcblk0               29.2G 
├─mmcblk0p1             96M COS_GRUB
├─mmcblk0p2            6.1G COS_STATE
├─mmcblk0p3            4.2G 
│ ├─KairosVG-oem        64M COS_OEM
│ └─KairosVG-recovery  4.1G COS_RECOVERY
└─mmcblk0p4           18.9G COS_PERSISTENT

mauromorales commented 1 year ago

Forgot to mention you also need to resize the fs:

resize2fs /dev/disk/by-label/COS_PERSISTENT
resize2fs 1.46.4 (18-Aug-2021)
Filesystem at /dev/disk/by-label/COS_PERSISTENT is mounted on /usr/local; on-line resizing required
old_desc_blocks = 1, new_desc_blocks = 152
The filesystem on /dev/disk/by-label/COS_PERSISTENT is now 19795436 (1k) blocks long.

Starting K3s after this should work:

systemctl start k3s
systemctl status k3s
● k3s.service - Lightweight Kubernetes
     Loaded: loaded (/etc/systemd/system/k3s.service; disabled; vendor preset: disabled)
     Active: active (running) since Thu 2023-07-13 07:54:31 UTC; 4s ago

mudler commented 1 year ago

This card is blocked because if we drop support for RPI3 we will switch to GPT as partitioning layout (see rationale in: https://github.com/kairos-io/osbuilder/pull/68#issuecomment-1635834382 ).

See poll in : https://github.com/kairos-io/kairos/discussions/1608

jimmykarily commented 1 year ago

Blocked by this: https://github.com/kairos-io/kairos/issues/1636

Itxaka commented 1 year ago

@mauromorales this should now be fixed on master with model rpi4, if the reason was not enough space under persistent.

mauromorales commented 1 year ago

fantastic! Let's move under review and I'll test

jimmykarily commented 1 year ago

Let's not forget to document the manual expansion of rpi3

jimmykarily commented 1 year ago

There is already an issue for that: https://github.com/kairos-io/kairos/issues/1637 . @mauromorales please close this as soon as you test this and it works.

mauromorales commented 1 year ago

blocked by https://github.com/kairos-io/kairos/issues/1725

mauromorales commented 1 year ago

With the latest changes, the space is there but I still encounter this issue when trying to start k3s:

systemctl is-enabled --quiet nm-cloud-setup.service
Failed to get unit file state for nm-cloud-setup.service: No such file or directory

mauromorales commented 1 year ago

this was related to not having network on the rpi ... so I will add an alert on the docs

jimmykarily commented 1 year ago

One more bug preventing this from working: https://github.com/kairos-io/kairos/issues/1831

mordax7 commented 1 year ago

I am having the same issue with the latest Ubuntu image running on a RPI4.

Is there a workaround for this yet?

jimmykarily commented 1 year ago

> I am having the same issue with the latest Ubuntu image running on a RPI4.
>
> Is there a workaround for this yet?

Automatic expansion of the last partition on Raspberry Pi 4 is fixed in v2.4.1. Creating images for rpi3 has its own ticket: https://github.com/kairos-io/kairos/issues/1816.

We can close this now.

jimmykarily commented 1 year ago

I just realized you said you used the latest Ubuntu image (issue re-opened). Is your last partition expanded to fill the whole disk? If yes, then k3s not starting might be because of something else. Can you send us the journalctl logs (after booting), @mordax7?

mordax7 commented 1 year ago

Hey @jimmykarily, sure here is more information:

Used Raspberry: RPI 4(8GB)
The image used: kairos-standard-ubuntu-arm64-rpi4-v2.4.1-k3sv1.27.3+k3s1.img

The journal:

systemd[1]: Starting k3s.service - Lightweight Kubernetes...
sh[1745]: + /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service
systemctl[1746]: Failed to get unit file state for nm-cloud-setup.service: No such file or directory
k3s[1749]: time="2023-10-10T11:09:17Z" level=info msg="Acquiring lock file /var/lib/rancher/k3s/data/.lock"
k3s[1749]: time="2023-10-10T11:09:17Z" level=info msg="Preparing data dir /var/lib/rancher/k3s/data/8147fdcc81517672a3573345f56cc1fc8eb0b1135e84cd4dbc06eb673620d610"
k3s[1749]: time="2023-10-10T11:09:19Z" level=info msg="error extracting tarball into /var/lib/rancher/k3s/data/8147fdcc81517672a3573345f56cc1fc8eb0b1135e84cd4dbc06eb673620d610-tmp after 5 files, 1 dirs, 1.873067111s: error writing to /var/lib/rancher/k3s/data/8147fdcc81517672a3573345f56cc1fc8eb0b1135e84cd4dbc06eb673620d610-tmp/bin/k3s: write /var/lib/rancher/k3s/data/8147fdcc81517672a3573345f56cc1fc8eb0b1135e84cd4dbc06eb673620d610-tmp/bin/k3s: no space left on device"
k3s[1749]: time="2023-10-10T11:09:20Z" level=fatal msg="extracting data: error writing to /var/lib/rancher/k3s/data/8147fdcc81517672a3573345f56cc1fc8eb0b1135e84cd4dbc06eb673620d610-tmp/bin/k3s: write /var/lib/rancher/k3s/data/8147fdcc81517672a3573345f56cc1fc8eb0b1135e84cd4dbc06eb673620d610-tmp/bin/k3s: no space left on device"
systemd[1]: k3s.service: Main process exited, code=exited, status=1/FAILURE
systemd[1]: k3s.service: Failed with result 'exit-code'.
systemd[1]: Failed to start k3s.service - Lightweight Kubernetes.
systemd[1]: k3s.service: Consumed 1.957s CPU time.

lsblk:

root@localhost:~# lsblk
NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
loop0         7:0    0  2.6G  1 loop /
mmcblk0     179:0    0 29.7G  0 disk
|-mmcblk0p1 179:1    0   96M  0 part
|-mmcblk0p2 179:2    0  7.9G  0 part /run/initramfs/cos-state
|-mmcblk0p3 179:3    0  5.3G  0 part
|-mmcblk0p4 179:4    0   64M  0 part /oem
`-mmcblk0p5 179:5    0 16.4G  0 part /var/lib/wicked
                                 /var/lib/snapd
                                 /var/lib/rancher
                                 /var/lib/longhorn
                                 /var/lib/kubelet
                                 /var/lib/extensions
                                 /var/lib/dbus
                                 /var/lib/containerd
                                 /var/lib/cni
                                 /etc/ssl/certs
                                 /var/lib/ca-certificates
                                 /etc/zfs
                                 /etc/systemd
                                 /etc/sysconfig
                                 /etc/ssh
                                 /var/snap
                                 /etc/runlevels
                                 /etc/rancher
                                 /etc/modprobe.d
                                 /var/log
                                 /usr/libexec
                                 /etc/kubernetes
                                 /etc/iscsi
                                 /etc/cni
                                 /snap
                                 /root
                                 /opt
                                 /home
                                 /usr/local

df -h:

root@localhost:~# df -h
Filesystem                         Size  Used Avail Use% Mounted on
tmpfs                              3.9G     0  3.9G   0% /dev/shm
tmpfs                              1.6G  9.9M  1.6G   1% /run
tmpfs                              5.0M     0  5.0M   0% /run/lock
/dev/loop0                         2.6G  2.6G     0 100% /
/dev/disk/by-label/COS_OEM          55M   15K   51M   1% /oem
tmpfs                              2.0G  492K  2.0G   1% /run/overlay
/dev/disk/by-label/COS_PERSISTENT   55M   34M   17M  67% /usr/local
overlay                            2.0G  492K  2.0G   1% /var
overlay                            2.0G  492K  2.0G   1% /etc
overlay                            2.0G  492K  2.0G   1% /srv
tmpfs                              3.9G     0  3.9G   0% /tmp
/dev/mmcblk0p2                     7.7G  5.2G  2.2G  71% /run/initramfs/cos-state
tmpfs                              786M  4.0K  786M   1% /run/user/0

fdisk -l:

root@localhost:~# fdisk -l
Disk /dev/loop0: 2.64 GiB, 2831155200 bytes, 5529600 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

Disk /dev/mmcblk0: 29.72 GiB, 31914983424 bytes, 62333952 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 7A3E8686-3DD9-9647-902C-A40A44CBF3E3

Device            Start      End  Sectors  Size Type
/dev/mmcblk0p1     8192   204799   196608   96M Microsoft basic data
/dev/mmcblk0p2   204800 16793599 16588800  7.9G Linux filesystem
/dev/mmcblk0p3 16793600 27852799 11059200  5.3G Linux filesystem
/dev/mmcblk0p4 27852800 27983871   131072   64M Linux filesystem
/dev/mmcblk0p5 27983872 62333918 34350047 16.4G Linux filesystem

jimmykarily commented 1 year ago

Hmm, the partition seems expanded:

mmcblk0p5 179:5    0 16.4G  0

but the filesystem does not (?):

/dev/disk/by-label/COS_PERSISTENT   55M   34M   17M  67% /usr/local

Does this still occur after a reboot? I remember that in some cases we needed to run partprobe for the new partition to be read. Here it's the filesystem that doesn't seem to expand, but it's worth checking.
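The mismatch above (16.4G partition, 55M filesystem) can also be checked in a script by comparing the two sizes in bytes. A minimal sketch follows; `fs_fills_partition` is a hypothetical helper name, and the 90% threshold is an arbitrary allowance for filesystem overhead rather than a value taken from this thread.

```shell
# Hypothetical helper: succeed when the filesystem covers at least 90% of
# its partition. $1: partition size in bytes, $2: filesystem size in bytes.
fs_fills_partition() {
    part_bytes=$1
    fs_bytes=$2
    [ $(( fs_bytes * 100 / part_bytes )) -ge 90 ]
}

# Possible usage (not run here; --output assumes GNU coreutils df):
#   part=$(lsblk -bno SIZE /dev/mmcblk0p5)
#   fs=$(df -B1 --output=size /usr/local | tail -n 1)
#   fs_fills_partition "$part" "$fs" || echo "filesystem is smaller than its partition"
```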

mordax7 commented 1 year ago

Sadly the reboot did not help.

I am not so familiar with partitions and volumes, but I noticed that COS_PERSISTENT has the same size on my RPI as it has when I mount the SD card in my laptop.

On RPI:

root@localhost:~# df -h
Filesystem                         Size  Used Avail Use% Mounted on
tmpfs                              3.9G     0  3.9G   0% /dev/shm
tmpfs                              1.6G  9.9M  1.6G   1% /run
tmpfs                              5.0M     0  5.0M   0% /run/lock
/dev/loop0                         2.6G  2.6G     0 100% /
/dev/disk/by-label/COS_OEM          55M   15K   51M   1% /oem
tmpfs                              2.0G  476K  2.0G   1% /run/overlay
/dev/disk/by-label/COS_PERSISTENT   55M  8.7M   42M  18% /usr/local
overlay                            2.0G  476K  2.0G   1% /var
overlay                            2.0G  476K  2.0G   1% /etc
overlay                            2.0G  476K  2.0G   1% /srv
tmpfs                              3.9G     0  3.9G   0% /tmp
/dev/mmcblk0p2                     7.7G  5.2G  2.2G  71% /run/initramfs/cos-state
tmpfs                              786M  4.0K  786M   1% /run/user/0

On my Laptop:

❯ df -h
Filesystem        Size  Used Avail Use% Mounted on
/dev/sdb2         7.7G  5.2G  2.2G  71% /run/media/mordax/COS_STATE
/dev/sdb3         5.2G  2.6G  2.3G  54% /run/media/mordax/COS_RECOVERY
/dev/sdb1          95M  8.5M   87M   9% /run/media/mordax/COS_GRUB
/dev/sdb4          55M   15K   51M   1% /run/media/mordax/COS_OEM
/dev/sdb5          55M  8.7M   42M  18% /run/media/mordax/COS_PERSISTENT

Should it be like that?

jimmykarily commented 1 year ago

It should be the same on any machine; what you see is OK. The question is why the filesystem didn't expand. Are you sure you are installing v2.4.1? I can't reproduce this. On my rpi4 (64GB SD card):

kairos@localhost:~> lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
loop0     7:0    0    2G  1 loop /
mmcblk0 179:0    0 59.7G  0 disk 
├─mmcblk0p1
│       179:1    0   96M  0 part 
├─mmcblk0p2
│       179:2    0  6.1G  0 part /run/initramfs/cos-state
├─mmcblk0p3
│       179:3    0  4.1G  0 part 
├─mmcblk0p4
│       179:4    0   64M  0 part /oem
└─mmcblk0p5
        179:5    0 49.4G  0 part /usr/local/.state/usr-share-pki-trust.bind/anchors
                                 /usr/share/pki/trust/anchors
                                 /usr/share/pki/trust
                                 /var/lib/wicked
                                 /var/lib/snapd
                                 /var/lib/rancher
                                 /var/lib/longhorn
                                 /var/lib/kubelet
                                 /var/lib/extensions
                                 /var/lib/dbus
                                 /var/lib/containerd
                                 /var/lib/cni
                                 /var/lib/ca-certificates
                                 /etc/zfs
                                 /etc/systemd
                                 /etc/sysconfig
                                 /etc/ssh
                                 /var/snap
                                 /etc/runlevels
                                 /etc/rancher
                                 /etc/modprobe.d
                                 /var/log
                                 /usr/libexec
                                 /etc/kubernetes
                                 /etc/iscsi
                                 /etc/cni
                                 /root
                                 /opt
                                 /home
                                 /usr/local

kairos@localhost:~> df -h
Filesystem                         Size  Used Avail Use% Mounted on
devtmpfs                           4.0M     0  4.0M   0% /dev
tmpfs                              3.9G  8.0K  3.9G   1% /dev/shm
tmpfs                              1.6G   26M  1.6G   2% /run
tmpfs                              4.0M     0  4.0M   0% /sys/fs/cgroup
/dev/disk/by-label/COS_ACTIVE      2.0G  1.8G  110M  95% /
/dev/disk/by-label/COS_OEM          55M   15K   51M   1% /oem
tmpfs                              2.0G   12M  2.0G   1% /run/overlay
/dev/disk/by-label/COS_PERSISTENT   47G  951M   43G   3% /usr/local
overlay                            2.0G   12M  2.0G   1% /var
overlay                            2.0G   12M  2.0G   1% /etc
overlay                            2.0G   12M  2.0G   1% /srv
tmpfs                              3.9G  4.0K  3.9G   1% /tmp
/dev/mmcblk0p2                     5.9G  3.5G  2.2G  63% /run/initramfs/cos-state
shm                                 64M     0   64M   0% /run/k3s/containerd/io.containerd.grpc.v1.cri/sandboxes/3c3c2a41ba103b7a6428671c705e1e19734c751417196611422e9761ccf7af0a/shm
shm                                 64M     0   64M   0% /run/k3s/containerd/io.containerd.grpc.v1.cri/sandboxes/e5d3b6c77ea50b78c500a6e621e0af923acecdf8fcd8f414f2e99bc468b1e51d/shm
shm                                 64M     0   64M   0% /run/k3s/containerd/io.containerd.grpc.v1.cri/sandboxes/2865850d67bc8bcc10b30810262d5a6350e4629ee85a8180c21141f0f05315c8/shm
shm                                 64M     0   64M   0% /run/k3s/containerd/io.containerd.grpc.v1.cri/sandboxes/e6cd7b7f868cef127345e13a911cbcfd630b09697f3b336ea9ce5a165d44c6d2/shm
shm                                 64M     0   64M   0% /run/k3s/containerd/io.containerd.grpc.v1.cri/sandboxes/dd67a77b2ccc6841d74adc0520e5805aaa6db45b65bfb30e65c8e4110538128e/shm

My COS_PERSISTENT is the full size of the partition. I didn't even have to reboot or anything. My kairos config is as simple as it gets:

#cloud-config

users:
  - name: kairos
    passwd: kairos

k3s:
  enabled: true

jimmykarily commented 1 year ago

The tool we use to expand the filesystem depends on the filesystem type: https://github.com/kairos-io/kairos-agent/blob/40289af471124e9f3bb58965b9206b80bc01f144/pkg/partitioner/disk.go#L410-L450

I'm using defaults and it works. @mordax7 are you setting the filesystem to something non-default in your kairos config maybe? I'm suspecting that if you do, we are trying to use a tool that is not there or doesn't work for some other reason.

mordax7 commented 1 year ago

The image that I have used:

The command I used to flash the SD card: xzcat kairos-standard-ubuntu-arm64-rpi4-v2.4.1-k3sv1.27.3+k3s1.img.xz | sudo dd of=/dev/sda oflag=sync status=progress bs=10MB

The cloud-config that I have used:

The file-system type of the mounts:

Which RPI Model I use:

jimmykarily commented 1 year ago

My image was

kairos-standard-opensuse-leap-arm64-rpi4-v2.4.1-k3sv1.26.6+k3s1.img

(only the k3s version is different)

My config and the rpi model are the same as yours. My /dev/disk/by-label/COS_PERSISTENT is also ext4.

The process I used to burn the image was the same. I don't see how we could get different results.

You should add rd.debug rd.immucore.debug to the kernel boot line (by hitting e when the grub menu is presented and appending these to the linux line). You'd need a serial cable and a tool like minicom to be able to store the boot logs to a file. I'm not sure if they are accessible after boot with journalctl.

I usually do sudo minicom -b 115200 -o -D /dev/ttyUSB0 2>&1 | tee out and then check the out file.

There should be some error in the logs that will help.
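Serial captures like this tend to contain terminal escape sequences and carriage returns, which make them hard to grep. A small filter along these lines may help; `strip_ansi` is a hypothetical helper name, and it only removes CSI sequences of the form ESC [ ... letter, which is an assumption about what minicom writes to the file.

```shell
# Hypothetical helper: remove CSI escape sequences (ESC [ params letter)
# and carriage returns from the text on stdin.
strip_ansi() {
    esc=$(printf '\033')
    sed "s/${esc}\[[0-9;]*[A-Za-z]//g" | tr -d '\r'
}

# Possible usage on the captured file (not run here):
#   strip_ansi < out | grep immucore
```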

mordax7 commented 1 year ago

I found some immucore logs in journalctl. It seems like a filesystem permission problem: journalctl.txt

I tried multiple SD cards and writing the image to the SD card with multiple different tools, nothing helped.

vadimzharov commented 1 year ago

I have exactly the same issue with kairos-core-ubuntu-22-lts-arm64-rpi4-v2.4.1.img image. I followed the guide:

root@ubuntu:/home/ubuntu# export IMAGE=quay.io/kairos/core-ubuntu-22-lts-arm-rpi-img:v2.4.1
root@ubuntu:/home/ubuntu# docker run -ti --rm -v $PWD:/image quay.io/luet/base util unpack "$IMAGE" /image

then wrote the image using RPI Imager, and after booting my RPI 4 I have:

root@localhost:/home/kairos# df -h
Filesystem                         Size  Used Avail Use% Mounted on
tmpfs                              1.9G     0  1.9G   0% /dev/shm
tmpfs                              767M  9.9M  757M   2% /run
tmpfs                              5.0M     0  5.0M   0% /run/lock
/dev/loop0                         2.6G  2.2G  322M  88% /
/dev/disk/by-label/COS_OEM          55M   15K   51M   1% /oem
tmpfs                              959M  480K  958M   1% /run/overlay
/dev/disk/by-label/COS_PERSISTENT   55M  7.1M   44M  15% /usr/local
overlay                            959M  480K  958M   1% /var
overlay                            959M  480K  958M   1% /etc
overlay                            959M  480K  958M   1% /srv
/dev/sda2                          7.7G  4.4G  3.0G  60% /run/initramfs/cos-state
tmpfs                              1.9G     0  1.9G   0% /tmp

/dev/disk/by-label/COS_PERSISTENT is tiny.

mordax7 commented 1 year ago

I tested it with a 64GB SD card (all SanDisk); it did not help. Maybe it has something to do with the type of SD card?

vadimzharov commented 1 year ago

I'm using a USB Flash drive 64G.

mr-onion-2 commented 1 year ago

I came across this thread a short while ago after having experienced the same issue (with K3S not running due to no space on the disk) with kairos-alpine-arm-rpi-img:v2.4.1-k3sv1.27.3-k3s1.

I have since installed kairos-opensuse-leap-arm-rpi-img:v2.4.1-k3sv1.27.3-k3s1 and the filesystem has been expanded successfully. I would prefer to use the Alpine version though.

/dev/disk/by-label/COS_PERSISTENT 214G 218M 203G 1% /usr/local

By the way, when I run kairos --version, it returns version: 2.4.0, compiled with: go1.20.8. I was expecting it to say 2.4.1 ?

Itxaka commented 1 year ago

Could be that alpine is missing a required binary on the initramfs needed to expand.

Itxaka commented 1 year ago

> By the way, when I run kairos --version, it returns version: 2.4.0, compiled with: go1.20.8. I was expecting it to say 2.4.1 ?

This is expected as you are checking the version of kairos-agent.

For the Kairos version you need to check /etc/os-release

Kairos-agent and Kairos have different release cadences. Their version numbers currently match more or less by coincidence, but they can differ entirely, so the agent could report version 2.2.12 on Kairos version 2.1, for example :)
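To read the Kairos version from a script, something like the following sketch could work. `kairos_version` is a hypothetical helper name; it assumes /etc/os-release is shell-sourceable and carries a KAIROS_VERSION key, as in the os-release dump at the top of this issue.

```shell
# Hypothetical helper: print KAIROS_VERSION from an os-release style file.
# $1: file to read, defaulting to /etc/os-release.
kairos_version() {
    # Source the file in a subshell so its variables do not leak into the caller.
    ( . "${1:-/etc/os-release}" && printf '%s\n' "$KAIROS_VERSION" )
}

# Possible usage (not run here):
#   kairos_version       # the OS image version
#   kairos --version     # the kairos-agent version, released separately
```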

jimmykarily commented 1 year ago

@mordax7 I just realized you used Ubuntu, while the original reporter was using openSUSE. I can reproduce this on Ubuntu. I see these in the immucore boot logs:

[   74.744911] immucore[451]: time="2023-08-18T18:15:37Z" level=debug msg="Using label COS_PERSISTENT for layout expansion"
[   74.768531] immucore[451]: time="2023-08-18T18:15:37Z" level=debug msg="Output of udevadm settle: "
[   74.788476] immucore[451]: time="2023-08-18T18:15:37Z" level=debug msg="Got device /dev/mmcblk0 for label COS_PERSISTENT"
[   74.812510] immucore[451]: time="2023-08-18T18:15:37Z" level=warning msg="Not enough unpartitioned space in disk to operate"
[   74.836583] immucore[451]: time="2023-08-18T18:15:37Z" level=info msg="Done executing stage 'rootfs.after'\n"

There is still something off. It looks as if this fix didn't work: https://github.com/mudler/yip/pull/110/files

Now at least I can debug.

jimmykarily commented 1 year ago

First finding: the partition expansion seems to happen twice:

This is wrong. Also in immucore debug logs, I see the following warning 3 times:

Not enough unpartitioned space in disk to operate

That said, the first time we try to expand the partition, it should succeed in expanding the filesystem as well; otherwise some error should show up (I can't find any). Also, I don't see why Ubuntu would behave differently from openSUSE. We rely on output from lsblk to detect partition device names, so it could be that some lsblk output differs between the two OSes (e.g. due to different versions of it).

Itxaka commented 1 year ago

FYI, I found the same when testing pure Alpine. The partition is expanded (as shown by lsblk) but the filesystem is not (as shown by df), and there are no errors in the logs.

vda    253:0    0   176G  0 disk 
├─vda1 253:1    0     1M  0 part 
├─vda2 253:2    0    64M  0 part /oem
├─vda3 253:3    0   2.3G  0 part 
├─vda4 253:4    0   4.2G  0 part /run/initramfs/cos-state
└─vda5 253:5    0 169.4G  0 part /var/lib/wicked

kairos-i0n7:~$ df -h
Filesystem                         Size  Used Avail Use% Mounted on
/dev/disk/by-label/COS_ACTIVE      1.1G  599M  421M  59% /
/dev/disk/by-label/COS_OEM          55M   50K   51M   1% /oem
/dev/disk/by-label/COS_PERSISTENT   69G  3.8M   66G   1% /usr/local

According to this: https://github.com/mudler/yip/blob/master/pkg/plugins/layout.go#L507 if there is an error it should return it and the stage would show it...I think?

Maybe we need to add even more logging to the yip layout plugin...

EDIT: If you run resize2fs /dev/vda5 directly then it grows the filesystem with no issues....

kairos-i0n7:~$ sudo resize2fs /dev/vda5
resize2fs 1.47.0 (5-Feb-2023)
[  196.880453] EXT4-fs (vda5): resizing filesystem from 18201083 to 44415483 blocks
Filesystem at /dev/vda5 is mounted on /usr/local; on-line resizing required
old_desc_blocks = 9, new_desc_blocks = 22
[  196.935976] EXT4-fs (vda5): resized filesystem to 44415483
The filesystem on /dev/vda5 is now 44415483 (4k) blocks long.

Itxaka commented 1 year ago

This is what I got from yip with extended debug enabled:

2023-10-16T13:52:58Z DBG Output from running e2fsck e2fsck 1.47.0 (5-Feb-2023)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
COS_PERSISTENT: 15/4521984 files (0.0% non-contiguous), 428763/18063872 blocks

2023-10-16T13:52:58Z DBG Output from running resize2fs resize2fs 1.47.0 (5-Feb-2023)
The filesystem is already 18063872 (4k) blocks long.  Nothing to do!

2023-10-16T13:52:58Z DBG Trying to reread the partition table of /dev/vda (try number 1)
2023-10-16T13:52:58Z DBG output of partprobe:
2023-10-16T13:52:58Z DBG Output of sync:

Looks like the partition table is not properly re-read or something, and resize2fs thinks that it's already maxed out?
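The block counts in these logs are consistent with that reading. Converting them to GiB shows that the 18063872 blocks resize2fs considered final correspond to roughly the 69G that df reports, while the 44415483 blocks reached by the manual resize2fs run correspond to the 169.4G that lsblk reports for vda5. A small sketch of the arithmetic (`blocks_to_gib` is a hypothetical helper name):

```shell
# Hypothetical helper: convert a filesystem block count to GiB, one decimal.
# $1: number of blocks, $2: block size in bytes
blocks_to_gib() {
    awk -v blocks="$1" -v bsize="$2" \
        'BEGIN { printf "%.1f\n", blocks * bsize / (1024 * 1024 * 1024) }'
}

blocks_to_gib 18063872 4096   # prints 68.9
blocks_to_gib 44415483 4096   # prints 169.4
```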

Itxaka commented 1 year ago

Expansion might be fixed after https://github.com/mudler/yip/pull/116 gets in and makes its way down into immucore.

mordax7 commented 1 year ago

Did someone find a workaround which can be used in the cloud-config in the meantime?

After the RPI booted, I ssh'ed in and used resize2fs /dev/mmcblk0p5 as described here to manually resize the filesystem, which does work. But I do not know whether the cloud-config actually managed to configure k3s. I think I am missing all the k3s configuration, but I am not 100% sure. Is there a way for me to verify that my cloud-config actually configures something?

One example: I want to use KubeVIP as described here. In theory, the IP I configured under eip should be a network interface on the controller node and k3s should have that IP configured, but it is not. Any ideas?

jimmykarily commented 1 year ago

@mordax7 to see if k3s was configured, I think you should start by looking directly at the k3s service: systemctl status k3s and journalctl -u k3s. If the service started correctly but you still can't tell whether all your options were applied, you should look into how the k3s service is started (flags and such). Knowing the specific option you are missing might help us point you to more exact debugging steps.
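As a rough illustration of checking how the service is started, the effective unit (including drop-ins such as override.conf) can be printed with systemctl cat and filtered down to its start command. `exec_start_lines` is a hypothetical helper name; it assumes the unit uses backslash line continuations for multi-line ExecStart entries.

```shell
# Hypothetical helper: print every ExecStart= entry from unit-file text on
# stdin, including continuation lines that follow a trailing backslash.
exec_start_lines() {
    awk '/^ExecStart=/ { show = 1 }
         show { print; if ($0 !~ /\\$/) show = 0 }'
}

# Possible usage (not run here):
#   systemctl cat k3s | exec_start_lines
```

The flags printed this way can then be compared with what the cloud-config was expected to set.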

jimmykarily commented 1 year ago

This is going to be fixed in v2.4.2, except for Alpine, for which we have created a separate issue: https://github.com/kairos-io/kairos/issues/1995

I will close this and track the Alpine fix on the new issue.

kairos-io / kairos

k3s fails to start on Raspberry Pi #1599