loft-sh / vcluster

vCluster - Create fully functional virtual Kubernetes clusters - Each vcluster runs inside a namespace of the underlying k8s cluster. It's cheaper than creating separate full-blown clusters and it offers better multi-tenancy and isolation than regular namespaces.
https://www.vcluster.com
Apache License 2.0

deployment vcluster KO in Kubernetes with noexec for emptyDir #1717

Open antoinetran opened 5 months ago

antoinetran commented 5 months ago

What happened?

In an environment where every emptyDir is backed by a host partition mounted with noexec, vcluster create fails with:

12:07:17 warn Pod my-vcluster-795748b48b-gzbvb: Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "/binaries/vcluster": permission denied: unknown (Failed)

After editing the pod for debug with strace:

/ # /binaries/vcluster
sh: /binaries/vcluster: Permission denied
/ # strace /binaries/vcluster
execve("/binaries/vcluster", ["/binaries/vcluster"], [/* 27 vars */]) = -1 EACCES (Permission denied)
writev(2, [{iov_base="strace: exec: Permission denied", iov_len=31}, {iov_base="\n", iov_len=1}], 2strace: exec: Permission denied
) = 32
writev(2, [{iov_base="", iov_len=0}, {iov_base=NULL, iov_len=0}], 2) = 0
getpid()                                = 18
exit_group(1)                           = ?
+++ exited with 1 +++

If copied to /tmp, vcluster works.
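The reasoning above (the same binary fails from /binaries but runs from /tmp) can be turned into a small check. This is only a sketch: `diagnose_exec` and its return strings are hypothetical names, not part of vcluster, and it assumes a Linux host. Note that it actually launches the file it is given.

```python
import os
import stat
import subprocess


def diagnose_exec(path):
    """Try to run `path` and explain a permission failure.

    Returns "ok", "missing-exec-bit", or "exec-blocked-by-mount-or-policy".
    Hypothetical helper for illustration only.
    """
    try:
        # The exit status is ignored; we only care whether execve() is allowed.
        subprocess.run(
            [path],
            check=False,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return "ok"
    except PermissionError:
        # execve() returned EACCES. If the file has no execute bit at all,
        # that alone explains it; otherwise the file mode is fine and the
        # denial comes from elsewhere, e.g. a noexec mount or a security policy.
        mode = os.stat(path).st_mode
        if not mode & (stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH):
            return "missing-exec-bit"
        return "exec-blocked-by-mount-or-policy"
```

In the situation described here, one would expect `diagnose_exec("/binaries/vcluster")` to report the mount-or-policy case, while a copy under /tmp reports "ok".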

Mount command gives:

# for /tmp
mount | grep "on / "
overlay on / type overlay (rw,seclabel,relatime,lowerdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/26481/fs:/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/26480/fs,upperdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/26558/fs,workdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/26558/work)

# for /binaries
mount | grep "on /binaries "
/dev/sda7 on /binaries type ext4 (rw,seclabel,nosuid,nodev,noexec,relatime,stripe=64)

This shows that /binaries is mounted noexec, while / (which holds /tmp) is not.
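To run the same check across several mount points, the `mount` output can be parsed programmatically. A minimal sketch, assuming the `<source> on <target> type <fstype> (<options>)` line format shown above; `MOUNT_LINE`, `parse_mount_line`, and `is_noexec` are hypothetical names, and mount targets containing spaces are not handled.

```python
import re

# One line of mount(8) output: "<source> on <target> type <fstype> (<options>)"
MOUNT_LINE = re.compile(
    r"^(?P<source>\S+) on (?P<target>\S+) type (?P<fstype>\S+) \((?P<options>[^)]*)\)$"
)


def parse_mount_line(line):
    """Return (target, [options]) for a single line of mount output."""
    match = MOUNT_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized mount line: {line!r}")
    return match.group("target"), match.group("options").split(",")


def is_noexec(line):
    """True if the mount described by `line` carries the noexec option."""
    _, options = parse_mount_line(line)
    return "noexec" in options


if __name__ == "__main__":
    # The /binaries line quoted in this report.
    binaries = "/dev/sda7 on /binaries type ext4 (rw,seclabel,nosuid,nodev,noexec,relatime,stripe=64)"
    print(parse_mount_line(binaries)[0], is_noexec(binaries))
```

Splitting the option list on commas avoids false positives from a plain substring search, for example when "noexec" happens to appear inside a path in the overlay options.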

What did you expect to happen?

vcluster create is OK

How can we reproduce it (as minimally and precisely as possible)?

Deploy a Kubernetes cluster and configure it so that every emptyDir is backed by a partition mounted with noexec. Then deploy vcluster.

Anything else we need to know?

So far, this behavior seems specific to the Kubernetes environment I am deploying into; in general, emptyDir volumes do not seem to be mounted noexec. However, judging by https://github.com/kubernetes/kubernetes/issues/48912, things seem to be moving toward more security, with emptyDir mounted noexec (by default or as an option).

From my understanding of the code (see https://github.com/loft-sh/vcluster/blob/v0.20.0-beta.1/chart/templates/_init-containers.tpl), the initContainers exist to inject the vcluster binary, whose only job there is to run a cp command (because cp is not present in the Kubernetes images) so that the kube-controller-manager and kube-apiserver binaries become available to the vcluster container. This requires the emptyDir to be mounted with exec.

Host cluster Kubernetes version

```console
$ kubectl version
kubectl version
Client Version: v1.28.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.26.4
```

Host cluster Kubernetes distribution

kubespray

vcluster version

```console
$ vcluster --version
vcluster version 0.20.0-beta.1
```

vcluster Kubernetes distribution (k3s (default), k8s, k0s)

```
k8s
```

OS and Arch

```
OS: Linux
Arch: amd64
```
antoinetran commented 5 months ago

I could ask the Kubernetes admins whether they can change the emptyDir behavior so that the partition is not mounted noexec. It might be difficult for them to lower security, though. Moreover, the Kubernetes issue https://github.com/kubernetes/kubernetes/issues/48912 might make this a future problem for vcluster anyway.

What if the vcluster image directly contained the two binaries? I don't know about licensing, but that would avoid this trick, and we could then have noexec both in the image and in the emptyDir.

facchettos commented 5 months ago

@antoinetran Hi, thanks for opening this. To answer your question, "What if vcluster image directly contains the two binaries?": the issue here is that we default to the current k8s version of the host (e.g. if you're on 1.27 in the host cluster, the image will be pulled from k8s 1.27), and this is also configurable. So we would have to have at least 4 different images just for the k8s distro, plus the images would also have to include the scheduler and the controller even if not in use, and BYOI would be harder too.

The issue you linked may indeed be a problem for this approach; I will be taking a look.