threefoldtech / zos

Autonomous operating system
https://threefold.io/host/
Apache License 2.0

Some nodes rootfs is full #2337

Closed AbdelrahmanElawady closed 2 weeks ago

AbdelrahmanElawady commented 1 month ago

Description

Some nodes on devnet fail to update and to deploy workloads because their rootfs has filled up. After inspecting some of these nodes, it turned out the cause is the way ZOS handles updates. Whenever a node updates its packages, the old files are removed. However, these files are not actually freed from rootfs while other processes still have them open, for example cloud-hypervisor, virtiofsd, containerd, rfs, etc. These processes belong to user workloads, so we can't just stop or restart them. Over time these files accumulate until there is no space left on rootfs.
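
To see which processes are holding such files on a node, one option is to scan `/proc`: on Linux, the target of a `/proc/<pid>/exe` or `/proc/<pid>/fd/<n>` symlink gets a ` (deleted)` suffix once the file has been unlinked. The following is only a minimal diagnostic sketch, not something taken from ZOS; the names `deletedTarget` and `findDeleted` are made up for illustration, and it needs enough privileges to read other processes' `/proc` entries.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// deletedSuffix is what the Linux kernel appends to the target of a
// /proc/<pid>/exe or /proc/<pid>/fd/<n> symlink once the file is unlinked.
const deletedSuffix = " (deleted)"

// deletedTarget (hypothetical helper) reports whether a /proc symlink target
// refers to an unlinked file and returns the path without the suffix.
func deletedTarget(target string) (string, bool) {
	if !strings.HasSuffix(target, deletedSuffix) {
		return "", false
	}
	return strings.TrimSuffix(target, deletedSuffix), true
}

// findDeleted (hypothetical helper) scans procRoot, normally "/proc", and maps
// each PID to the unlinked files it still references: its own executable plus
// any open file descriptors.
func findDeleted(procRoot string) map[int][]string {
	result := make(map[int][]string)
	entries, err := os.ReadDir(procRoot)
	if err != nil {
		return result
	}
	for _, e := range entries {
		pid, err := strconv.Atoi(e.Name())
		if err != nil {
			continue // not a process directory
		}
		links := []string{filepath.Join(procRoot, e.Name(), "exe")}
		fds, _ := filepath.Glob(filepath.Join(procRoot, e.Name(), "fd", "*"))
		links = append(links, fds...)
		for _, l := range links {
			target, err := os.Readlink(l)
			if err != nil {
				continue // process exited or permission denied
			}
			if path, ok := deletedTarget(target); ok {
				result[pid] = append(result[pid], path)
			}
		}
	}
	return result
}

func main() {
	for pid, files := range findDeleted("/proc") {
		fmt.Println(pid, files)
	}
}
```

Note that anonymous in-memory files (for example `memfd` entries) also carry the ` (deleted)` suffix, so the output would still need filtering by path to isolate files that actually live on rootfs.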

Possible Solutions

Since we can't free these old files (because they are held by user workloads), we can try to reduce how often this situation occurs. For example, before writing the contents of a new package, we can check that its version differs from the one already on the node. That way, if it's the same package, we won't create these deleted-but-still-used files.
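
A rough sketch of that check could look like the following. It is purely illustrative: the `Package` type, the `installed` map and the `write` callback are assumptions made up for this example and do not reflect how ZOS actually tracks or installs packages.

```go
package main

import "fmt"

// Package is a hypothetical descriptor for an updatable package.
type Package struct {
	Name    string
	Version string
}

// needsUpdate reports whether pkg is missing on the node or installed with a
// different version than the one being offered.
func needsUpdate(installed map[string]string, pkg Package) bool {
	current, ok := installed[pkg.Name]
	return !ok || current != pkg.Version
}

// applyUpdates calls write only for packages whose version changed, records
// the new version, and returns the names of the packages it actually wrote.
// Unchanged packages are skipped, so their files are never replaced and no
// new deleted-but-still-open copies are created for them.
func applyUpdates(installed map[string]string, pkgs []Package, write func(Package) error) ([]string, error) {
	var written []string
	for _, pkg := range pkgs {
		if !needsUpdate(installed, pkg) {
			continue
		}
		if err := write(pkg); err != nil {
			return written, fmt.Errorf("writing %s: %w", pkg.Name, err)
		}
		installed[pkg.Name] = pkg.Version
		written = append(written, pkg.Name)
	}
	return written, nil
}

func main() {
	installed := map[string]string{"cloud-hypervisor": "v1"}
	pkgs := []Package{
		{Name: "cloud-hypervisor", Version: "v1"}, // unchanged, skipped
		{Name: "rfs", Version: "v2"},              // not installed yet, written
	}
	written, err := applyUpdates(installed, pkgs, func(p Package) error {
		return nil // a real implementation would write the package files here
	})
	fmt.Println(written, err)
}
```

This only reduces how often stale copies are created; a package whose version really changed would still leave its old files pinned by any running workload.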

AbdelrahmanElawady commented 1 month ago

Of course, this is a problem on all networks, but it has only shown up on devnet so far.

muhamadazmy commented 1 month ago

I would like to add that most services are restarted, so all zos binaries have no issue. containerd is also not an issue (it is restarted as well). However, user-related processes (usually managed not by zinit but by one of the zos daemons) are not restarted, because otherwise user workloads would suffer downtime (for example cloud-hypervisor, virtiofsd, rfs, etc.). This is what causes the problem.

If a node has been running for a really long time and has gone through many updates, and it has long-running user workloads, these workloads end up holding the files they were started from (say, the cloud-hypervisor binary).