openshift / os

aarch64 multi-arch builds fail due to no disk space left on builder #1554

Open marmijo opened 1 month ago

marmijo commented 1 month ago

We've been hitting storage issues on the aarch64 multi-arch builder lately, and they're causing our builds to fail with messages such as the following:

[2024-07-17T16:24:06.840Z] Committing 01fcos: /home/jenkins/agent/workspace/build-arch/src/config/overlay.d/01fcos ... error: Writing content object: min-free-space-percent '3%' would be exceeded, at least 4.1 kB requested


OSError: [Errno 28] No space left on device: 


qemu-img: error while writing at byte 2859466752: No space left on device

I was able to log into the aarch64 builder today as the builder user and I found /sysroot at 100% usage.

core@coreos-aarch64-builder:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p4  200G  200G  2.2M 100% /sysroot
devtmpfs        4.0M     0  4.0M   0% /dev
tmpfs           126G  200K  126G   1% /dev/shm

I freed up some space today by running podman volume prune after noticing that most of the storage space was being used by those volumes.

builder@coreos-aarch64-builder:~$ podman volume prune
WARNING! This will remove all volumes not used by at least one container. The following volumes will be removed:
Are you sure you want to continue? [y/N] y
builder@coreos-aarch64-builder:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p4  200G   92G  109G  46% /sysroot
devtmpfs        4.0M     0  4.0M   0% /dev
tmpfs           126G  400K  126G   1% /dev/shm
efivarfs        512K  4.6K  508K   1% /sys/firmware/efi/efivars
tmpfs            51G  9.9M   51G   1% /run
tmpfs           126G     0  126G   0% /tmp
/dev/nvme0n1p3  350M  265M   62M  82% /boot
tmpfs            26G  452K   26G   1% /run/user/1001
tmpfs            26G   60K   26G   1% /run/user/1002
tmpfs            26G   16K   26G   1% /run/user/1000

Hopefully this will be mitigated once we redeploy the multi-arch builders on AWS and increase the disk size from 200GB to at least 600GB. While not necessary to redeploy the builder, landing would make it much easier. However, it might be worth exploring whether we can reduce or prevent the accumulation of dangling volumes on the builders.

jlebon commented 1 month ago

The volumes get cleaned up by, which runs daily. But I think what can happen is if too many jobs fail too quickly, we blow through the 200G limit before we even make it to the next prune.
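
One possible way to close the gap between daily prunes is to also prune whenever disk usage crosses a threshold. The following is only a rough sketch, not the existing cleanup job: the function names and the 80% threshold are hypothetical, and it assumes `df -P` (POSIX output format) and `podman volume prune --force` are available on the builder.

```shell
#!/bin/sh
# Hypothetical helper, not part of the existing pipeline: prune unused
# podman volumes once a filesystem reaches a usage threshold.

# Print the usage percentage (without the % sign) of the filesystem holding $1.
usage_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed (exit 0) when usage $1 is at or above threshold $2.
over_threshold() {
    [ "$1" -ge "$2" ]
}

# Prune unused volumes if the filesystem holding $1 is at or above $2 percent.
prune_if_full() {
    path=$1
    threshold=$2
    used=$(usage_percent "$path")
    if over_threshold "$used" "$threshold"; then
        echo "usage ${used}% >= ${threshold}% on ${path}: pruning volumes"
        podman volume prune --force
    else
        echo "usage ${used}% < ${threshold}% on ${path}: nothing to do"
    fi
}

# Example invocation (the threshold of 80 is an assumed value):
#   prune_if_full /sysroot 80
```

Run from a timer that fires more often than once a day, something like this would prune before the disk fills. Note that `podman volume prune` only removes volumes not used by any container, so it would not help if the space is held by volumes of jobs that are still running.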