openshift / os


aarch64 multi-arch builds fail due to no disk space left on builder #1554

Open marmijo opened 1 month ago

marmijo commented 1 month ago

We've been hitting storage issues on the aarch64 multi-arch builder lately, and they're causing our builds to fail with messages similar to (but not limited to) the following:

[2024-07-17T16:24:06.840Z] Committing 01fcos: /home/jenkins/agent/workspace/build-arch/src/config/overlay.d/01fcos ... error: Writing content object: min-free-space-percent '3%' would be exceeded, at least 4.1 kB requested

OR

OSError: [Errno 28] No space left on device: 

OR

qemu-img: error while writing at byte 2859466752: No space left on device

I was able to log into the aarch64 builder today as the builder user and I found /sysroot at 100% usage.

core@coreos-aarch64-builder:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p4  200G  200G  2.2M 100% /sysroot
devtmpfs        4.0M     0  4.0M   0% /dev
tmpfs           126G  200K  126G   1% /dev/shm
...
...

I freed up some space today by running podman volume prune after noticing that most of the storage was being consumed by unused podman volumes.

builder@coreos-aarch64-builder:~$ podman volume prune
WARNING! This will remove all volumes not used by at least one container. The following volumes will be removed:
04ca0c2da268f19d45440991aebc0ca9f2518c09f2a0dcdbeae66cccc563a521
11e3d74469587125fd71ce12e2d84cf6210363e1ce50c432e5ac0da098089a2b
164a592f879a706839806895605af1b1e599c82a54d7a7e9cd1b11421f4201bb
f5fa83bd6c333d4e302f180c5aa838217c2cb41e98186b98ddaf2b92d83022bc
Are you sure you want to continue? [y/N] y
04ca0c2da268f19d45440991aebc0ca9f2518c09f2a0dcdbeae66cccc563a521
11e3d74469587125fd71ce12e2d84cf6210363e1ce50c432e5ac0da098089a2b
164a592f879a706839806895605af1b1e599c82a54d7a7e9cd1b11421f4201bb
f5fa83bd6c333d4e302f180c5aa838217c2cb41e98186b98ddaf2b92d83022bc
builder@coreos-aarch64-builder:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p4  200G   92G  109G  46% /sysroot
devtmpfs        4.0M     0  4.0M   0% /dev
tmpfs           126G  400K  126G   1% /dev/shm
efivarfs        512K  4.6K  508K   1% /sys/firmware/efi/efivars
tmpfs            51G  9.9M   51G   1% /run
tmpfs           126G     0  126G   0% /tmp
/dev/nvme0n1p3  350M  265M   62M  82% /boot
tmpfs            26G  452K   26G   1% /run/user/1001
tmpfs            26G   60K   26G   1% /run/user/1002
tmpfs            26G   16K   26G   1% /run/user/1000

Hopefully this will be mitigated once we redeploy the multi-arch builders on AWS and increase the disk size from 200GB to at least 600GB. Landing https://github.com/coreos/fedora-coreos-pipeline/pull/986 isn't strictly required in order to redeploy the builders, but it would make doing so much easier. It might also be worth exploring whether we can reduce or prevent the accumulation of dangling volumes on the builders.
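
In the meantime, for anyone hitting this, one manual workaround (assuming shell access to the builder as the builder user) is to check what podman storage is consuming and then prune dangling volumes non-interactively:

# summarize disk usage by images, containers, and local volumes
podman system df
# remove all volumes not referenced by any container; -f skips the confirmation prompt
podman volume prune -f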

jlebon commented 1 month ago

The volumes get cleaned up by https://github.com/coreos/fedora-coreos-pipeline/blob/ddadc038aa99692b346b422c21ede0436cd55de3/multi-arch-builders/builder-common.bu#L81, which runs daily. But I think what can happen is that if too many jobs fail in quick succession, we blow through the 200G limit before we even make it to the next prune run.
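
If that's the failure mode, one possible mitigation (a rough sketch only; the unit names are hypothetical and this is not what builder-common.bu currently ships) would be to run the prune more often than daily via a systemd timer, e.g. hourly:

# podman-volume-prune.service (hypothetical name)
[Unit]
Description=Prune unused podman volumes

[Service]
Type=oneshot
User=builder
ExecStart=/usr/bin/podman volume prune -f

# podman-volume-prune.timer (hypothetical name)
[Unit]
Description=Run podman volume prune hourly

[Timer]
OnCalendar=hourly
Persistent=true

[Install]
WantedBy=timers.target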