kubernetes-sigs / kind

Kubernetes IN Docker - local clusters for testing Kubernetes
https://kind.sigs.k8s.io/
Apache License 2.0

Kind cluster fails to provision PV when a USB device was removed from the machine #3389

Open adelton opened 10 months ago

adelton commented 10 months ago

What happened:

I'm running Kind (with export KIND_EXPERIMENTAL_PROVIDER=podman) on my laptop. When I start the cluster while a mouse is connected to the machine, I'm able to create a pod with a local volume. Once I remove that mouse, this starts to fail.

The same issue happens when I close the lid to have the laptop go to sleep, and then wake it up again.

What you expected to happen:

Setup of PVCs and PVs continues to work.

How to reproduce it (as minimally and precisely as possible):

  1. export KIND_EXPERIMENTAL_PROVIDER=podman
  2. lsusb returns something like
    Bus 004 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
    Bus 003 Device 003: ID 13d3:5405 IMC Networks Integrated Camera
    Bus 003 Device 044: ID 06cb:00f9 Synaptics, Inc. 
    Bus 003 Device 046: ID 0458:0007 KYE Systems Corp. (Mouse Systems) Trackbar Emotion
    Bus 003 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
    Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
    Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
  3. kind create cluster
  4. Have a YAML file duplicating the standard storageclass under the name local-path; cat storageclass-local-path.yaml shows something like
    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: local-path
      namespace: kube-system
      annotations:
        storageclass.kubernetes.io/is-default-class: "false"
    provisioner: rancher.io/local-path
    volumeBindingMode: WaitForFirstConsumer
    reclaimPolicy: Delete
  5. kubectl apply -f storageclass-local-path.yaml
  6. kubectl apply -k 'https://github.com/rancher/local-path-provisioner/examples/pod-with-local-volume'
  7. After a short while, kubectl get pods -A shows volume-test in the default namespace as Running.
  8. kubectl delete -k 'https://github.com/rancher/local-path-provisioner/examples/pod-with-local-volume'
  9. Disconnect that USB mouse.
  10. Check with lsusb that the device 003/046 (or whatever IDs it had) is no longer there.
  11. kubectl apply -k 'https://github.com/rancher/local-path-provisioner/examples/pod-with-local-volume'
  12. kubectl get pods -A shows
    NAMESPACE            NAME                                                         READY   STATUS       RESTARTS   AGE
    default              volume-test                                                  0/1     Pending      0          9s
    [...]
    local-path-storage   helper-pod-create-pvc-1e7e0729-1ec4-4b0e-91ef-3c41e0495783   0/1     StartError   0          9s
  13. kubectl events -n local-path-storage deployment/local-path-provisioner shows
    42s         Warning   Failed              Pod/helper-pod-create-pvc-1e7e0729-1ec4-4b0e-91ef-3c41e0495783   Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error creating device nodes: mount /dev/bus/usb/003/046:/run/containerd/io.containerd.runtime.v2.task/k8s.io/helper-pod/rootfs/dev/bus/usb/003/046 (via /proc/self/fd/6), flags: 0x1000: no such file or directory: unknown

Anything else we need to know?:

I actually first encountered this when I suspended the laptop, then woke it up and wanted to continue using the Kind cluster. The Bus 003 Device 044: ID 06cb:00f9 Synaptics, Inc. device gets a different device ID upon wakeup.

Environment:

aojea commented 10 months ago

It's not very clear to me from the description ... is this an error from the local-path-provisioner, or is it any pod in kind that does not work?

adelton commented 10 months ago

The error comes from containerd attempting to start the helper-pod-create-pvc-1e7e0729-1ec4-4b0e-91ef-3c41e0495783 that gets initiated by the local-path-provisioner-6bc4bddd6b-rnsqd to fulfill the PVC request that comes from https://github.com/rancher/local-path-provisioner/blob/master/examples/pvc-with-local-volume/pvc.yaml.

aojea commented 10 months ago

Is it a https://github.com/rancher/local-path-provisioner bug then?

adelton commented 10 months ago

I don't think the code in local-path-provisioner does much with setting up the root fs and the mount points for the pod.

This seems to be related to how the "nodes" are created and represented by Kind / init / containerd / something and what they assume and inherit.

aojea commented 10 months ago

It's not very clear to me from the description ... is this an error from the local-path-provisioner, or is it any pod in kind that does not work?

That is why I asked: does this happen with any pod, or only with this specific pod?

adelton commented 10 months ago

Ah, you meant whether there is something wrong with that specific example? Not really; when I turn it into a trivial busybox container with

apiVersion: v1
kind: Pod
metadata:
  name: volume-test-2
spec:
  containers:
  - name: volume-test-2
    image: busybox
    imagePullPolicy: IfNotPresent
    command:
    - mount
    volumeMounts:
    - name: volv2
      mountPath: /data2
  volumes:
  - name: volv2
    persistentVolumeClaim:
      claimName: local-volume-pvc-2
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: local-volume-pvc-2
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Mi

I get the very same error message once the list of USB devices changes.

aojea commented 10 months ago

What I'm trying to understand is whether it is a general problem or one that only happens because of the PersistentVolumes.

adelton commented 10 months ago

I only saw it with that helper pod. When I apply a pod without any volumes

apiVersion: v1
kind: Pod
metadata:
  name: no-volume
spec:
  containers:
  - name: no-volume
    image: busybox
    imagePullPolicy: IfNotPresent
    command:
    - mount

the pod and container get created and run fine. The mount output shows a very limited set of things mounted under /dev/ in that case:

$ kubectl logs pod/no-volume | grep ' on /dev/'
devpts on /dev/pts type devpts (rw,seclabel,nosuid,noexec,relatime,gid=524292,mode=620,ptmxmode=666)
mqueue on /dev/mqueue type mqueue (rw,seclabel,nosuid,nodev,noexec,relatime)
/dev/mapper/vg_machine-lv_containers on /dev/termination-log type ext4 (rw,seclabel,relatime)
shm on /dev/shm type tmpfs (rw,seclabel,nosuid,nodev,noexec,relatime,size=65536k,uid=2000,gid=2000,inode64)
devtmpfs on /dev/null type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/random type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/full type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/tty type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/zero type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/urandom type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)

adelton commented 10 months ago

To debug, when I

kubectl edit -n local-path-storage cm local-path-config

and change image to busybox and add a mount and sleep to setup with

    apiVersion: v1
    kind: Pod
    metadata:
      name: helper-pod
    spec:
      containers:
      - name: helper-pod
        image: busybox
        imagePullPolicy: IfNotPresent
  setup: |-
    #!/bin/sh
    set -eu
    mount
    sleep 30
    mkdir -m 0777 -p "$VOL_DIR"

and

kubectl rollout restart deployment local-path-provisioner -n local-path-storage

provisioning the pod with a PVC shows a huge number of bind (?) mounts:

$ kubectl logs -n local-path-storage helper-pod-create-pvc-59b95912-a254-454b-b26b-889c10b217c6 | grep ' on /dev/'
devpts on /dev/pts type devpts (rw,seclabel,nosuid,noexec,relatime,gid=524292,mode=620,ptmxmode=666)
mqueue on /dev/mqueue type mqueue (rw,seclabel,nosuid,nodev,noexec,relatime)
/dev/mapper/vg_machine-lv_containers on /dev/termination-log type ext4 (rw,seclabel,relatime)
shm on /dev/shm type tmpfs (rw,seclabel,nosuid,nodev,noexec,relatime,size=65536k,uid=2000,gid=2000,inode64)
devtmpfs on /dev/acpi_thermal_rel type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/autofs type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/btrfs-control type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/bus/usb/001/001 type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/bus/usb/002/001 type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/bus/usb/003/001 type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/bus/usb/003/003 type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/bus/usb/003/050 type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/bus/usb/004/001 type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/cpu/0/cpuid type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/cpu/0/msr type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/cpu/1/cpuid type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/cpu/1/msr type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/cpu/2/cpuid type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/cpu/2/msr type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/cpu/3/cpuid type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/cpu/3/msr type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/cpu/4/cpuid type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
[...]
devtmpfs on /dev/watchdog type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/watchdog0 type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/zero type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/zram0 type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)

So something is different between the "normal" pods/containers and the pod/container created as the helper for the local-path provisioner.

BenTheElder commented 10 months ago

We don't control the device mounts being propagated from the host to the "node", that's podman.

The helper pod is privileged which is why it is also seeing all the mounts, unlike your simple test pod. https://github.com/rancher/local-path-provisioner/blob/4d42c70e748fed13cd66f86656e909184a5b08d2/provisioner.go#L553

adelton commented 10 months ago

Thanks for that pointer -- I confirm that when I add

    securityContext:
      privileged: true

to my regular container, I get the same issues as with the local-path helper.
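
For completeness, the full test pod looked roughly like this (a sketch of the earlier no-volume pod with the privileged securityContext added; the name is only for illustration):

apiVersion: v1
kind: Pod
metadata:
  name: no-volume-privileged
spec:
  containers:
  - name: no-volume-privileged
    image: busybox
    imagePullPolicy: IfNotPresent
    command:
    - mount
    securityContext:
      # assumption: privileged, mirroring what the local-path helper pod requests
      privileged: true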

What I'd like to figure out though: you say "we don't control the device mounts being propagated from the host to the "node"". But in this case it is not propagation of the device mounts from the host because on the host the /dev/bus/usb/*/* device is no longer there. So it is being propagated from something else, possibly some parent (?) pod (?) that has a list of devices that it once saw?

BenTheElder commented 10 months ago

IIRC docker/podman will sync all the /dev entries on creating the container, but there is no mount propagation to reflect updated entries. Then the nested containerd/runc will try to create these for the "inner" pod containers.
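
One way to see this (a sketch, assuming the podman provider and the default kind-control-plane node name): after unplugging, the device node is gone on the host while the node container still carries a mount for it.

# on the host, the unplugged device's node is gone
ls /dev/bus/usb/003/
# but the "node" container still shows a mount for the old entry
podman exec kind-control-plane mount | grep ' on /dev/bus/usb/003'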

I don't think there are great solutions here ... maybe we can find a way to detect these "dangling" mounts and remove them from the node or hook the inner runc.

FWIW kind clusters are meant to be disposable and quick to create, so maybe recreate after changing devices :/

BenTheElder commented 10 months ago

The opposite is a known issue with docker: "privileged containers do not reflect newly added host devices" has been a longstanding issue as I recall. We should look at what workarounds people are using for this since it's more or less the same root issue: https://github.com/moby/moby/issues/16160

adelton commented 10 months ago

Well, realistically I'd be OK with just disabling any propagation of /dev/bus/usb to the containers, either at the first one (podman) or at the next layer (containerd?). Is the search for the devices somehow configurable in either of those cases?

BenTheElder commented 10 months ago

Well, realistically I'd be OK with just disabling any propagation of /dev/bus/usb to the containers, either at the first one (podman) or at the next layer (containerd?). Is the search for the devices somehow configurable in either of those cases?

No, we're not even telling podman/docker to pass these through to the node; it's implicit with --privileged, which we need to run Kubernetes/containerd.

Ditto with the privileged pods. Everything under /dev gets passed through IIRC*

* a TTY for the container may be set up specially.

adelton commented 10 months ago

So with some experimentation, I got the setup working with

--- a/images/base/files/etc/containerd/config.toml
+++ b/images/base/files/etc/containerd/config.toml
@@ -19,6 +19,9 @@ version = 2
   runtime_type = "io.containerd.runc.v2"
   # Generated by "ctr oci spec" and modified at base container to mount poduct_uuid
   base_runtime_spec = "/etc/containerd/cri-base.json"
+
+  privileged_without_host_devices = true
+
   [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
     # use systemd cgroup by default
     SystemdCgroup = true

and rebuilding the base and node images.

I tested it with rootless podman, and both pods with PVs and running a privileged pod work, with both the USB-unplug use case and suspending the laptop and waking it up. I did not try any additional tests to see what this might break. If I file this as a pull request, will you allow the tests to run to see what it discovers in the general Kind testing / CI?

Now the question is if / how to make this available in Kind in general, what the default should be, and what mechanism to provide for people to override it.

Given that not having those devices in the privileged containers seems like a safer default, and that with https://github.com/moby/moby/issues/16160 unaddressed, hotplugging of devices does not work with docker anyway, I'd lean towards having true (no host devices) as the default.

But what should people use to override it?

Mounting the config.toml via extraMounts does not work because it gets manipulated at least in https://github.com/kubernetes-sigs/kind/blob/main/images/base/files/usr/local/bin/entrypoint.

We could add another KIND_EXPERIMENTAL_CONTAINERD_ variable and amend that sed -i logic to use it.

We could also use

imports = ["/etc/containerd/config.d/*.toml"]

and document extraMounts-ing any overrides into that directory. In fact, the configure_containerd in https://github.com/kubernetes-sigs/kind/blob/main/images/base/files/usr/local/bin/entrypoint could use that mechanism instead of that sed -i approach as well.
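
From the user side that could look roughly like this (a sketch; it assumes the node image actually imported /etc/containerd/config.d/*.toml, which it does not do today, and the file names are made up):

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  extraMounts:
  - hostPath: ./containerd-overrides.toml
    containerPath: /etc/containerd/config.d/10-overrides.toml

with ./containerd-overrides.toml carrying, for example, the same setting as in the diff above:

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  privileged_without_host_devices = true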

I don't want to make a change like moving from that sed -i to drop-in snippets just for this device-mounting issue ... but I'd be happy to provide a PR to switch to the drop-in snippets approach if it is viewed as useful in general.

BenTheElder commented 10 months ago

Now the question is if / how to make this available in Kind in general, what the default should be, and what mechanism to provide for people to override it.

I suspect this would break a LOT of users doing interesting driver development.

Given that not having those devices in the privileged containers seems like a safer default, and that with https://github.com/moby/moby/issues/16160 unaddressed, hotplugging of devices does not work with docker anyway, I'd lean towards having true (no host devices) as the default.

I'm fairly certain this would break standard kubernetes tests.

You can configure this for your clusters today, though, with the poorly documented containerdConfigPatches: https://kind.sigs.k8s.io/docs/user/private-registries/#use-a-certificate

adelton commented 10 months ago

Ah, great.

I confirm that with

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane

[...]

containerdConfigPatches:
  - |-
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
      privileged_without_host_devices = true

things work just fine.

I'm closing this issue as I have a way to address the problem I've been hitting. If you think that exposing this in some way (possibly in documentation?) might be helpful to others, let me know.

BenTheElder commented 10 months ago

I'd like to reopen this if you don't mind, because I know other users are going to hit this, and requiring the workaround config is still unfortunate.

We should probably add a "known issues" page entry to start, with a pointer to this configuration, and continue to track this while we consider options to mitigate it automatically.

I think it will be pretty involved to implement, but ideally we'd just trim the missing entries.
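
Something like this could detect them from the host (a rough sketch, assuming the podman provider and the default node name; actually removing the entries from the node safely is the involved part):

# flag /dev/bus/usb mounts inside the node whose device node no longer exists on the host
podman exec kind-control-plane sh -c "mount | grep ' on /dev/bus/usb/'" |
while read -r _ _ target _; do
  [ -e "$target" ] || echo "dangling in node: $target"
done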

BenTheElder commented 10 months ago

Actually, in the docker issue there's a suggestion to just bind mount /dev explicitly to avoid this behavior? 👀

https://github.com/moby/moby/issues/16160#issuecomment-551388571

BenTheElder commented 10 months ago

We can test this with extraMounts hostPath: /dev containerPath: /dev

adelton commented 10 months ago

I confirm that with

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  extraMounts:
  - hostPath: /dev
    containerPath: /dev

the problem is gone as well.

After the removal of the USB mouse, the device node gets removed from the host's /dev/bus/usb/003/, and it is no longer shown in

podman exec kind-control-plane mount | grep ' on /dev'

and creating a pod with a privileged container passes as well.

With this approach, I would just be concerned about implications on /dev/tty and similar non-global, per-process devices.

BenTheElder commented 9 months ago

With this approach, I would just be concerned about implications on /dev/tty and similar non-global, per-process devices.

/dev/tty at least I'm pretty sure gets set up specially in runc regardless, but I share that concern. I'd want to carefully investigate before doing this by default, but it seems like this might be sufficient.