adelton opened this issue 1 year ago
It's not clear to me from the description ... is this an error from the local-path-provisioner, or is it any pod in kind that does not work?
The error comes from containerd attempting to start the helper-pod-create-pvc-1e7e0729-1ec4-4b0e-91ef-3c41e0495783, which is initiated by the local-path-provisioner-6bc4bddd6b-rnsqd to fulfill the PVC request from https://github.com/rancher/local-path-provisioner/blob/master/examples/pvc-with-local-volume/pvc.yaml.
Is this a https://github.com/rancher/local-path-provisioner bug then?
I don't think the code in local-path-provisioner does much with setting up the root fs and the mount points for the pod.
This seems to be related to how the "nodes" are created and represented by Kind / init / containerd / something and what they assume and inherit.
It's not clear to me from the description ... is this an error from the local-path-provisioner, or is it any pod in kind that does not work?
That is why I asked: does this happen with any pod, or only with this specific pod?
Ah, you mean whether there is something wrong with that specific example? Not really; when I turn it into a trivial busybox container with
apiVersion: v1
kind: Pod
metadata:
  name: volume-test-2
spec:
  containers:
  - name: volume-test-2
    image: busybox
    imagePullPolicy: IfNotPresent
    command:
    - mount
    volumeMounts:
    - name: volv2
      mountPath: /data2
  volumes:
  - name: volv2
    persistentVolumeClaim:
      claimName: local-volume-pvc-2
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: local-volume-pvc-2
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Mi
I get the very same error message once the list of USB devices changes.
What I'm trying to understand is whether it is a general problem or whether it only happens because of the PersistentVolumes.
I only saw it with that helper pod. When I apply a pod without any volumes
apiVersion: v1
kind: Pod
metadata:
  name: no-volume
spec:
  containers:
  - name: no-volume
    image: busybox
    imagePullPolicy: IfNotPresent
    command:
    - mount
the pod and container get created and run fine. The mount output shows a very limited set of things mounted under /dev/ in that case:
$ kubectl logs pod/no-volume | grep ' on /dev/'
devpts on /dev/pts type devpts (rw,seclabel,nosuid,noexec,relatime,gid=524292,mode=620,ptmxmode=666)
mqueue on /dev/mqueue type mqueue (rw,seclabel,nosuid,nodev,noexec,relatime)
/dev/mapper/vg_machine-lv_containers on /dev/termination-log type ext4 (rw,seclabel,relatime)
shm on /dev/shm type tmpfs (rw,seclabel,nosuid,nodev,noexec,relatime,size=65536k,uid=2000,gid=2000,inode64)
devtmpfs on /dev/null type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/random type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/full type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/tty type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/zero type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/urandom type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
To debug, I kubectl edit -n local-path-storage cm local-path-config, change the image to busybox, and add a mount and a sleep to the setup script, ending up with
apiVersion: v1
kind: Pod
metadata:
  name: helper-pod
spec:
  containers:
  - name: helper-pod
    image: busybox
    imagePullPolicy: IfNotPresent
setup: |-
  #!/bin/sh
  set -eu
  mount
  sleep 30
  mkdir -m 0777 -p "$VOL_DIR"
and then I run kubectl rollout restart deployment local-path-provisioner -n local-path-storage.
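For context, those two fragments are separate keys of that local-path-config ConfigMap; the following is only a sketch of the relevant part after the edit, assuming the stock keys shipped with local-path-provisioner (config.json and teardown left out for brevity):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: local-path-config
  namespace: local-path-storage
data:
  # helperPod.yaml is the template for the helper-pod-create-pvc-* pods;
  # the image is switched to busybox for debugging.
  helperPod.yaml: |-
    apiVersion: v1
    kind: Pod
    metadata:
      name: helper-pod
    spec:
      containers:
      - name: helper-pod
        image: busybox
        imagePullPolicy: IfNotPresent
  # setup runs inside the helper pod; mount + sleep are only there so the
  # pod's mount table can be inspected before it exits.
  setup: |-
    #!/bin/sh
    set -eu
    mount
    sleep 30
    mkdir -m 0777 -p "$VOL_DIR"
```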
Provisioning the pod with a PVC then shows a huge number of bind (?) mounts:
kubectl logs -n local-path-storage helper-pod-create-pvc-59b95912-a254-454b-b26b-889c10b217c6 | grep ' on /dev/'
devpts on /dev/pts type devpts (rw,seclabel,nosuid,noexec,relatime,gid=524292,mode=620,ptmxmode=666)
mqueue on /dev/mqueue type mqueue (rw,seclabel,nosuid,nodev,noexec,relatime)
/dev/mapper/vg_machine-lv_containers on /dev/termination-log type ext4 (rw,seclabel,relatime)
shm on /dev/shm type tmpfs (rw,seclabel,nosuid,nodev,noexec,relatime,size=65536k,uid=2000,gid=2000,inode64)
devtmpfs on /dev/acpi_thermal_rel type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/autofs type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/btrfs-control type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/bus/usb/001/001 type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/bus/usb/002/001 type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/bus/usb/003/001 type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/bus/usb/003/003 type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/bus/usb/003/050 type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/bus/usb/004/001 type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/cpu/0/cpuid type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/cpu/0/msr type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/cpu/1/cpuid type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/cpu/1/msr type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/cpu/2/cpuid type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/cpu/2/msr type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/cpu/3/cpuid type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/cpu/3/msr type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/cpu/4/cpuid type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
[...]
devtmpfs on /dev/watchdog type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/watchdog0 type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/zero type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/zram0 type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
So something is different between the "normal" pods/containers and the pod/container created as the helper for the local-path provisioner.
We don't control the device mounts being propagated from the host to the "node", that's podman.
The helper pod is privileged, which is why it also sees all the mounts, unlike your simple test pod. https://github.com/rancher/local-path-provisioner/blob/4d42c70e748fed13cd66f86656e909184a5b08d2/provisioner.go#L553
Thanks for that pointer -- I confirm that when I add
securityContext:
  privileged: true
to my regular container, I get the same issue as with the local-path helper.
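For anyone reproducing this, a minimal sketch of such a privileged test pod; it is just the no-volume busybox example from above with the securityContext added (the pod name here is made up):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: no-volume-privileged   # hypothetical name for this sketch
spec:
  containers:
  - name: no-volume-privileged
    image: busybox
    imagePullPolicy: IfNotPresent
    command:
    - mount
    securityContext:
      privileged: true   # privileged alone is enough to pull in all the /dev/* mounts
```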
What I'd like to figure out though: you say "we don't control the device mounts being propagated from the host to the "node"". But in this case it is not propagation of the device mounts from the host, because on the host the /dev/bus/usb/*/* device is no longer there. So it is being propagated from something else, possibly some parent (?) pod (?) that has a list of devices that it once saw?
IIRC docker/podman will sync all the /dev entries when creating the container, but there is no mount propagation to reflect updated entries. Then the nested containerd/runc will try to create these for the "inner" pod containers.
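One way to see the mismatch directly, assuming the default node name kind-control-plane and a device path taken from the mount output above:

```sh
# On the host: the unplugged device node is gone from devtmpfs.
ls -l /dev/bus/usb/003/

# Inside the "node" container: the devtmpfs mounts created when podman started it are still there.
podman exec kind-control-plane mount | grep ' on /dev/bus/usb'
```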
I don't think there are great solutions here ... maybe we can find a way to detect these "dangling" mounts and remove them from the node or hook the inner runc.
FWIW kind clusters are meant to be disposable and quick to create so maybe recreate after changing devices :/
The opposite is a known issue with docker: "privileged containers do not reflect newly added host devices" has been a longstanding issue as I recall. We should look at what workarounds people are using for this since it's more or less the same root issue: https://github.com/moby/moby/issues/16160
Well, realistically I'd be OK with just disabling any propagation of /dev/bus/usb to the containers, either at the first layer (podman) or at the next layer (containerd?). Is the search for the devices somehow configurable in either of those cases?
Well, realistically I'd be OK with just disabling any propagation of /dev/bus/usb to the containers, either at the first layer (podman) or at the next layer (containerd?). Is the search for the devices somehow configurable in either of those cases?
No, we're not even telling podman/docker to pass these through to the node; it's implicit with --privileged, which we need to run Kubernetes/containerd.
Ditto with the privileged pods. Everything under /dev gets passed through IIRC*
* a TTY for the container may be set up specially.
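This is easy to observe at the podman level, independent of kind; a quick sketch, assuming a host that has USB devices (the image is just an example):

```sh
podman run --rm docker.io/library/busybox ls /dev/bus/usb                # fails: /dev is a minimal tmpfs, no host devices
podman run --rm --privileged docker.io/library/busybox ls /dev/bus/usb  # lists the device nodes as they existed at container creation
```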
So with some experimentation, I got the setup working with
--- a/images/base/files/etc/containerd/config.toml
+++ b/images/base/files/etc/containerd/config.toml
@@ -19,6 +19,9 @@ version = 2
runtime_type = "io.containerd.runc.v2"
# Generated by "ctr oci spec" and modified at base container to mount poduct_uuid
base_runtime_spec = "/etc/containerd/cri-base.json"
+
+ privileged_without_host_devices = true
+
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
# use systemd cgroup by default
SystemdCgroup = true
and rebuilding the base and node images.
I tested it with rootless podman, and both pods with PVs and running a privileged pod work, both with the USB-unplug use case and with suspending the laptop and waking it up. I did not try any additional tests to see what this might break. If I file this as a pull request, will you allow the tests to run to see what it discovers in the general Kind testing / CI?
Now the question is if / how to make this available in Kind in general, what the default should be, and what mechanism to provide for people to override it.
Given that not having those devices in the privileged containers seems like a safer default, and that with https://github.com/moby/moby/issues/16160 unaddressed hotplugging of devices does not work with docker anyway, I'd lean towards having true (no host devices) as the default.
But what should people use to override it?
Mounting the config.toml via extraMounts does not work because it gets manipulated at least in https://github.com/kubernetes-sigs/kind/blob/main/images/base/files/usr/local/bin/entrypoint.
We could add another KIND_EXPERIMENTAL_CONTAINERD_ variable and amend that sed -i logic to use it.
We could also use imports = ["/etc/containerd/config.d/*.toml"] and document extraMounts-ing any overrides into that directory. In fact, the configure_containerd in https://github.com/kubernetes-sigs/kind/blob/main/images/base/files/usr/local/bin/entrypoint could use that mechanism instead of the sed -i approach as well.
I don't want to make a change like moving from that sed -i to drop-in snippets just for this device-mounting issue ... but I'd be happy to provide a PR to switch to the drop-in snippets approach if it is viewed as a useful approach in general.
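To make the idea concrete, a sketch of what that drop-in approach could look like from the user side, assuming kind's config.toml gained an imports = ["/etc/containerd/config.d/*.toml"] line (containerd itself already supports imports); the host path and file name below are made up:

```yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  extraMounts:
  # hypothetical local file carrying containerd overrides
  - hostPath: /home/me/kind/containerd-overrides.toml
    containerPath: /etc/containerd/config.d/99-overrides.toml
    readOnly: true
```

with the mounted file containing, for example, the same runc override discussed here:

```toml
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  privileged_without_host_devices = true
```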
Now the question is if / how to make this available in Kind in general, what the default should be, and what mechanism to provide for people to override it.
I suspect this would break a LOT of users doing interesting driver development.
Given that not having those devices in the privileged containers seems like a safer default, and that with https://github.com/moby/moby/issues/16160 unaddressed hotplugging of devices does not work with docker anyway, I'd lean towards having true (no host devices) as the default.
I'm fairly certain this would break standard kubernetes tests.
You can configure this for your clusters today though with the poorly documented containerdConfigPatch https://kind.sigs.k8s.io/docs/user/private-registries/#use-a-certificate
Ah, great.
I confirm that with
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
[...]
containerdConfigPatches:
- |-
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
    privileged_without_host_devices = true
things work just fine.
I'm closing this issue as I have a way to address the problem I've been hitting. If you think that exposing this in some way (possibly in documentation?) might be helpful to others, let me know.
I'd like to reopen this if you don't mind because I know other users are going to hit this and requiring the workaround config is still unfortunate.
We should probably add a "known issues" page entry to start, with a pointer to this configuration, and continue to track this while we consider options to automatically mitigate it.
I think it will be pretty involved to implement but ideally we'd just trim missing entries.
Actually, in the docker issue there's a suggestion to just bind mount /dev explicitly to avoid this behavior? 👀
https://github.com/moby/moby/issues/16160#issuecomment-551388571
We can test this with an extraMounts entry: hostPath: /dev, containerPath: /dev.
I confirm that with
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  extraMounts:
  - hostPath: /dev
    containerPath: /dev
the problem is gone as well.
After the removal of the USB mouse, the device node gets removed from the host's /dev/bus/usb/003/, it is no longer shown in podman exec kind-control-plane mount | grep ' on /dev', and creating a pod with a privileged container passes as well.
With this approach, I would just be concerned about implications on /dev/tty and similar non-global, per-process devices.
With this approach, I would just be concerned about implications on /dev/tty and similar non-global, per-process devices.
/dev/tty at least, I'm pretty sure, gets set up specially in runc regardless, but I share that concern. I'd want to carefully investigate before doing this by default, but it seems like this might be sufficient.
I've stumbled upon this problem.
I am also running kind through rootless podman. After seeing a failure event from the local storage provisioner indicating it cannot mount /dev/bus/usb/..., I recreated the cluster with the suggested workaround.
So far, it has solved the problem, and I have not seen any regression elsewhere. I have opened an interactive console in a pod, forwarded ports, etc., and have not seen any negative side effect.
There may be some workloads that attempt to manipulate /dev/... in a way that will be less isolated with this mount, see also https://github.com/moby/moby/issues/16160#issuecomment-254805195
https://github.com/kubernetes-sigs/kind/issues/3389#issuecomment-1781463169 may be a more reasonable workaround for rootless podman.
We'll have to be careful with any default changes around this.
@BenTheElder Thanks for the tip, I had missed this configuration option. So, just to provide feedback: I've tested both workarounds (one based on extraMounts, one on containerdConfigPatches), and both work for me.
What happened:
I'm running Kind (with export KIND_EXPERIMENTAL_PROVIDER=podman) on my laptop. When I start the cluster while a mouse is connected to the machine, I'm able to create a pod with a local volume. Once I remove that mouse, this starts to fail. The same issue happens when I close the lid to have the laptop go to sleep and then wake it up again.
What you expected to happen:
Setup of PVCs and PVs continues to work.
How to reproduce it (as minimally and precisely as possible):
1. export KIND_EXPERIMENTAL_PROVIDER=podman
2. lsusb returns something like …
3. kind create cluster
4. Apply a standard storageclass under the name local-path, something like cat storageclass-local-path.yaml … (see the sketch after this list) and kubectl apply -f storageclass-local-path.yaml
5. kubectl apply -k 'https://github.com/rancher/local-path-provisioner/examples/pod-with-local-volume'
6. kubectl get pods -A shows volume-test in namespace default as Running.
7. kubectl delete -k 'https://github.com/rancher/local-path-provisioner/examples/pod-with-local-volume'
8. Remove the mouse and check with lsusb that the device 003/046, or whatever ids it had, is no longer there.
9. kubectl apply -k 'https://github.com/rancher/local-path-provisioner/examples/pod-with-local-volume'
10. kubectl get pods -A shows …
11. kubectl events -n local-path-storage deployment/local-path-provisioner shows …
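For illustration, a typical local-path StorageClass (modeled on the upstream local-path-provisioner example; your storageclass-local-path.yaml may differ) looks roughly like this:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-path
provisioner: rancher.io/local-path
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete
```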
Anything else we need to know?:
I actually first encountered it when I suspended the laptop and then woke it up and wanted to continue using the Kind cluster. The Bus 003 Device 044: ID 06cb:00f9 Synaptics, Inc. device gets a different device id upon wakeup.
Environment:
- kind version (use kind version): kind v0.20.0 go1.20.4 linux/amd64
- Runtime info (use docker info or podman info): …
- OS (e.g. from /etc/os-release): CPE_NAME="cpe:/o:fedoraproject:fedora:38"
- Kubernetes version (use kubectl version): …
- KIND_EXPERIMENTAL_PROVIDER=podman