threefoldtech / zos

Autonomous operating system
https://threefold.io/host/
Apache License 2.0

K8s is not working when zvolumes are used #2327

Open ashraffouda opened 6 months ago

ashraffouda commented 6 months ago

Describe the bug

Deploying a k8s cluster fails when zvolumes are used, while it works properly when zmounts are used. It gives this error:

[+] k3s: time="2024-05-12T12:01:06Z" level=info msg="Waiting to retrieve agent configuration; server is not ready: \"overlayfs\" snapshotter cannot be enabled for \"/mnt/data/agent/containerd\", try using \"fuse-overlayfs\" or \"native\": failed to mount overlay: invalid argument"

To Reproduce

Deploy k8s cluster with zvolumes

muhamadazmy commented 6 months ago

If the k8s flist uses the obsolete raw image, the first "disk" attached to the VM MUST be a zmount, not a volume. Extra "volumes" can be added to the VM, but you can only mount them if the k8s image has the virtiofs module.

The right way to do this now is actually to modify the k8s image to use the preferred flist style with individual files.

AbdelrahmanElawady commented 6 months ago

The k3s image is not a VM image, as it doesn't ship with a kernel. However, it turned out that overlayfs has some issues with virtiofs as the upper layer, and since container runtimes usually use overlayfs, basically all of them won't work with the new Volumes.

There are kernel patches for running virtiofs with overlayfs, but I believe requiring these patches would make it harder for users to create custom images. So, we might need to revise the way Volumes work.

scottyeager commented 3 months ago

So the incompatibility between virtiofs and overlayfs has been understood for a while in the context of running Docker inside micro VMs and trying to use the virtiofs based rootfs for Docker's data dir. Docker tends to automatically fall back to the vfs driver and continue operating, but performance is very bad. Placing Docker's data dir on a disk (raw image type) fixes this (and conforms to the intended design of storing user data on a disk/volume). If we intend to deprecate that form in favor of the new virtiofs based volume, then we won't have this workaround.

As suggested in the error message in the original post, using the fuse-overlayfs driver is another alternative. It probably performs better than vfs but is still going to be a performance hit compared to a non-FUSE driver. This may be acceptable for many use cases, where performance-sensitive data can be stored in a volume attached to the container (since container volumes don't use the same storage driver as the container rootfs).
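To make the trade-off concrete, here is a minimal sketch of how an image's startup script could pick a driver based on what backs the runtime's data dir. It is only an illustration under stated assumptions: the helper names (`fs_type_for_path`, `suggest_snapshotter`) and the sample mount table are hypothetical, not part of zos or k3s. It parses text in the `/proc/mounts` format, finds the mount that contains the data dir, and falls back to fuse-overlayfs when that mount is virtiofs or overlay.

```python
# Hypothetical helpers, not part of zos or k3s.

def fs_type_for_path(path, mounts_text):
    """Return the filesystem type of the mount containing `path`.

    `mounts_text` is text in /proc/mounts format:
    "<source> <mount point> <fs type> <options> <dump> <pass>".
    Escaped characters in mount points (e.g. \\040 for a space) are
    not handled in this sketch.
    """
    best_mount, best_type = "", None
    target = path.rstrip("/") + "/"
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        # Longest matching mount point wins; on ties the later line wins,
        # since later mounts are stacked on top of earlier ones.
        if target.startswith(prefix) and len(mount_point) >= len(best_mount):
            best_mount, best_type = mount_point, fs_type
    return best_type


def suggest_snapshotter(fs_type):
    """Pick a driver name, assuming overlayfs cannot use these as upper layer."""
    if fs_type in ("virtiofs", "overlay"):
        return "fuse-overlayfs"
    return "overlayfs"


# Made-up mount table for illustration only.
SAMPLE_MOUNTS = """\
overlay / overlay rw 0 0
vol1 /mnt/data virtiofs rw,relatime 0 0
/dev/vda /mnt/disk ext4 rw,relatime 0 0
"""

if __name__ == "__main__":
    for data_dir in ("/mnt/data/agent/containerd", "/mnt/disk/docker"):
        fs_type = fs_type_for_path(data_dir, SAMPLE_MOUNTS)
        print(data_dir, fs_type, suggest_snapshotter(fs_type))
```

On a real system the mount table would come from reading `/proc/mounts`, and the result could be passed to k3s through its `--snapshotter` option. Whether fuse-overlayfs is actually available in a given image is a separate question this sketch does not check.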

I reviewed the discussions around improving compatibility for virtiofs and overlayfs. For reference, this issue contains the best overview of the situation.

There are kernel patches for running virtiofs with overlayfs

It seems that these patches were merged into the mainline kernel as of 5.7. What we appear to be missing are the other pieces of the puzzle mentioned in this comment on the issue linked above:

# we absolutely need xattr and sys_admin cap
# allow_direct_io just seems sensible but is not required
# we had been using -o writeback which improved performance however users were reporting problems so removed it
virtio_fs_extra_args = ["-o", "xattr", "-o", "modcaps=+sys_admin", "-o", "allow_direct_io"]

Also:

One thing we've figured out (again with help from RHers above) is that to create an overlayfs in virtiofs your bottom layer must not also be overlay -- (e.g. it needs to be ext4, xfs, etc).

Based on my read of https://github.com/threefoldtech/zos/issues/1564, that suggests that our current implementation of rootfs is ruled out, since it's virtiofs backed by overlayfs, but we should be able to get this working for volumes, assuming they are just btrfs underneath.

One question then is whether it's acceptable to give CAP_SYS_ADMIN to virtiofsd.