kubernetes-sigs / kind

Kubernetes IN Docker - local clusters for testing Kubernetes
https://kind.sigs.k8s.io/
Apache License 2.0

Support Kubernetes UserNamespacesSupport alpha feature gate #3436

Open dgl opened 11 months ago

dgl commented 11 months ago

What happened:

I'm working on parts of the Kubernetes user namespace support (currently an alpha feature). I'd like to use kind for testing it.

I enabled the UserNamespacesSupport feature gate. Pods that set hostUsers: false fail with:

Warning  FailedCreatePodSandBox  6s    kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox "75fd33edcf39433911025ac0e045581bd19688190cd1e5f7166d279056dc592c": failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error mounting "sysfs" to rootfs at "/sys": mount src=sysfs, dst=/sys, dstFD=/proc/self/fd/10, flags=0xf: operation not permitted: unknown

After fixing that (below), I also saw:

Warning  Failed                  5s    kubelet            Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running createContainer hook #0: fork/exec /kind/bin/mount-product-files.sh: permission denied: unknown

What you expected to happen:

Sweet user namespace based isolation.

How to reproduce it (as minimally and precisely as possible):

Update runc to main in the base image, but also set runc_nodmz (because of the bug I reported in https://github.com/opencontainers/runc/issues/4125):

--- a/images/base/Dockerfile
+++ b/images/base/Dockerfile
@@ -135,13 +135,13 @@ RUN git clone --filter=tree:0 "${CONTAINERD_CLONE_URL}" /containerd \
 # stage for building runc
 FROM go-build as build-runc
 ARG TARGETARCH GO_VERSION
-ARG RUNC_VERSION="v1.1.9"
+ARG RUNC_VERSION="main"
 ARG RUNC_CLONE_URL="https://github.com/opencontainers/runc"
 RUN git clone --filter=tree:0 "${RUNC_CLONE_URL}" /runc \
     && cd /runc \
     && git checkout "${RUNC_VERSION}" \
     && eval "$(gimme "${GO_VERSION}")" \
-    && export GOARCH=$TARGETARCH && export CC=$(target-cc) && export CGO_ENABLED=1 \
+    && export GOARCH=$TARGETARCH && export CC=$(target-cc) && export CGO_ENABLED=1 && export EXTRA_BUILDTAGS=runc_nodmz \
     && make runc \
     && GOARCH=$TARGETARCH go-licenses save --save_path=/_LICENSES .

Also use a containerd v2.0.0 pre-release version. Run make quick, then build a node image based on a recent Kubernetes (something like kind build node-image ~/Code/kubernetes --image kindest/node:runc-main --base-image=gcr.io/k8s-staging-kind/base:v20231124-6a461ab5-dirty).

Create a kind cluster with:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
   "UserNamespacesSupport": true
nodes:
- role: control-plane
  image: kindest/node:latest

Run a pod, something like:

apiVersion: v1
kind: Pod
metadata:
  name: userns
spec:
  restartPolicy: Never
  hostUsers: false
  containers:
  - name: debian
    image: debian
    command: ["sh"]
    args: ["-c", "sleep infinity"]

Fixes

sysfs

The first failure (the sysfs mount) can be fixed by running:

docker exec kind-control-plane sh -c "mkdir /mnt/sysfs; mount -t sysfs none /mnt/sysfs"

This is because sysfs is mounted with "masks": kind bind mounts over the /sys/devices/virtual/dmi/id/product_name files, and in that case the kernel does not let us mount a sysfs filesystem in a user namespace, because it sees the existing mount as masked. By (additionally) mounting sysfs elsewhere, we can make the kernel's check succeed.

(This still needs some thought/testing as to whether that mount should be read-only or read-write. I suspect it should be rw; that does seem to go against systemd's container interface, but for good reason.)
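The workaround can be wrapped so that it is safe to run more than once and covers every node of a cluster. This is only a sketch: the function name is hypothetical, it assumes docker is the node provider and that grep is available in the node image, and it keeps the read-write mount from the command above, which may not be the right final choice.

```shell
#!/bin/sh
# Hypothetical helper: mount an extra sysfs at /mnt/sysfs inside each given
# kind node container, skipping nodes where it is already mounted.
mount_extra_sysfs() {
    for node in "$@"; do
        docker exec "$node" sh -c '
            # /proc/mounts lists "<src> <mountpoint> <fstype> ..." per line.
            if grep -q " /mnt/sysfs sysfs " /proc/mounts; then
                echo "sysfs already mounted at /mnt/sysfs"
            else
                mkdir -p /mnt/sysfs && mount -t sysfs none /mnt/sysfs
            fi
        ' || return 1
    done
}

# Usage (node names as printed by `kind get nodes`):
#   mount_extra_sysfs $(kind get nodes)
```

The mount lives only inside the running node container, so it would have to be reapplied after the node is recreated or restarted.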

/kind/bin permissions

This just looks like a Dockerfile mistake: the directory isn't executable. A simple:

docker exec kind-control-plane chmod 755 /kind/bin

fixes it.
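To check whether a given node image is affected before (or after) applying the chmod, the directory mode can be inspected. A small sketch; the helper name is hypothetical, and it assumes GNU stat is available in the node image. The assumption behind the check is that the hook is executed by a process that is not the directory's owner, so the "other" execute (search) bit on /kind/bin is what matters:

```shell
#!/bin/sh
# Hypothetical helper: succeeds if an octal mode such as "755" grants the
# execute (search) bit to "other" users.
other_can_search() {
    mode=$1
    last_digit=${mode#"${mode%?}"}   # last octal digit = "other" permissions
    case "$last_digit" in
        1|3|5|7) return 0 ;;
        *)       return 1 ;;
    esac
}

# Usage against a node (requires a running cluster):
#   mode=$(docker exec kind-control-plane stat -c '%a' /kind/bin)
#   other_can_search "$mode" || docker exec kind-control-plane chmod 755 /kind/bin
```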

Anything else we need to know?:

Mostly filing an issue for tracking and so other people might find this based on errors, if they try to use it. I'll open some PRs.


BenTheElder commented 11 months ago

I need to read up on the runc DMZ option. We avoid non-defaults, since kind is first and foremost for testing the project; the other build options we set elsewhere so far compile out unused snapshotters or things of that nature.

The directory permissions seem like an oversight.

More generally, we intend to upgrade runc + containerd but have to be careful about it. I'm sure we'll get to it eventually, but we normally only move to prerelease versions when we need a critical bug fix.

Andreagit97 commented 2 months ago

I faced the same failure with a similar setup.

  1. I created a custom kind base image with the following changes:
diff --git a/images/base/Dockerfile b/images/base/Dockerfile
index 63060aee..5f1e6832 100644
--- a/images/base/Dockerfile
+++ b/images/base/Dockerfile
@@ -122,7 +122,7 @@ RUN eval "$(gimme "${GO_VERSION}")" \
 # stage for building containerd
 FROM go-build AS build-containerd
 ARG TARGETARCH GO_VERSION
-ARG CONTAINERD_VERSION="v1.7.18"
+ARG CONTAINERD_VERSION="v2.0.0-rc.3"
 ARG CONTAINERD_CLONE_URL="https://github.com/containerd/containerd"
 # we don't build with optional snapshotters, we never select any of these
 # they're not ideal inside kind anyhow, and we save some disk space
@@ -140,7 +140,7 @@ RUN git clone --filter=tree:0 "${CONTAINERD_CLONE_URL}" /containerd \
 # stage for building runc
 FROM go-build AS build-runc
 ARG TARGETARCH GO_VERSION
-ARG RUNC_VERSION="v1.1.13"
+ARG RUNC_VERSION="v1.2.0-rc.2"
 ARG RUNC_CLONE_URL="https://github.com/opencontainers/runc"
 RUN git clone --filter=tree:0 "${RUNC_CLONE_URL}" /runc \
     && cd /runc \
  2. I created a new kind node image based on the above base image and k8s v1.30:
kind build node-image --base-image="..." --type release v1.30.0
  3. I used the following config to create a kind cluster:
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
   "UserNamespacesSupport": true
nodes:
- role: control-plane
  image: <above-built-image>
  4. I created this pod in the cluster:
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  hostUsers: false
  containers:
  - name: nginx
    image: nginx:1.27.0
    ports:
    - containerPort: 80

The kubelet reported the following error (the same one described in the initial issue):

 Warning  FailedCreatePodSandBox  2m53s             kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox "a8e2d0f7722c1bcbe361325dc1c264c6d0fe524d3a3214a387c5494bfd83fccd": failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error mounting "sysfs" to rootfs at "/sys": mount src=sysfs, dst=/sys, dstFd=/proc/thread-self/fd/8, flags=0xf: operation not permitted: unknown

I can confirm that the workaround provided by @dgl fixes the issue:

docker exec kind-control-plane sh -c "mkdir /mnt/sysfs; mount -t sysfs none /mnt/sysfs"

To be honest, after working around this issue I hit another one (exactly https://github.com/containerd/containerd/issues/10598), but that probably has nothing to do with KinD!

Environment