opencontainers / runc

CLI tool for spawning and running containers according to the OCI specification
https://www.opencontainers.org/
Apache License 2.0

runc has problems due to leaked mount information #2404

Closed chrischdi closed 1 month ago

chrischdi commented 4 years ago

I'm coming over here while debugging https://github.com/kubernetes/kubernetes/issues/91023 together with containerd v1.3.4, which ships runc:

runc version 1.0.0-rc10
commit: dc9208a3303feef5b3839f4323d9beb36df0a9dd-dirty
spec: 1.0.1-dev

We have identified that there is some kind of leak of cgroup mounts, which results in e.g. the following lines in /proc/self/mountinfo:

root@kube-node01:~# cat /proc/self/mountinfo | grep cgroup
30 21 0:26 / /sys/fs/cgroup ro,nosuid,nodev,noexec shared:9 - tmpfs tmpfs ro,mode=755
31 30 0:27 / /sys/fs/cgroup/unified rw,nosuid,nodev,noexec,relatime shared:10 - cgroup2 cgroup2 rw
32 30 0:28 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime shared:11 - cgroup cgroup rw,xattr,name=systemd
35 30 0:31 / /sys/fs/cgroup/rdma rw,nosuid,nodev,noexec,relatime shared:15 - cgroup cgroup rw,rdma
36 30 0:32 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime shared:16 - cgroup cgroup rw,cpu,cpuacct
37 30 0:33 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime shared:17 - cgroup cgroup rw,freezer
38 30 0:34 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime shared:18 - cgroup cgroup rw,blkio
39 30 0:35 / /sys/fs/cgroup/pids rw,nosuid,nodev,noexec,relatime shared:19 - cgroup cgroup rw,pids
40 30 0:36 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime shared:20 - cgroup cgroup rw,cpuset
41 30 0:37 / /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime shared:21 - cgroup cgroup rw,memory
42 30 0:38 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime shared:22 - cgroup cgroup rw,net_cls,net_prio
43 30 0:39 / /sys/fs/cgroup/hugetlb rw,nosuid,nodev,noexec,relatime shared:23 - cgroup cgroup rw,hugetlb
44 30 0:40 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime shared:24 - cgroup cgroup rw,perf_event
45 30 0:41 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime shared:25 - cgroup cgroup rw,devices
945 25 0:159 / /run/containerd/io.containerd.runtime.v1.linux/k8s.io/2e75a956e9db372ddf40ef4c32d148100010386140238c6173ad6e4d45fd8e1c/rootfs/sys/fs/cgroup rw,nosuid,nodev,noexec,relatime shared:606 - tmpfs tmpfs rw,mode=755
979 945 0:28 /kubepods/pod2eb4976a-5001-4c2b-b5cf-df562549e3d4/2e75a956e9db372ddf40ef4c32d148100010386140238c6173ad6e4d45fd8e1c /run/containerd/io.containerd.runtime.v1.linux/k8s.io/2e75a956e9db372ddf40ef4c32d148100010386140238c6173ad6e4d45fd8e1c/rootfs/sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime shared:11 - cgroup cgroup rw,xattr,name=systemd
1471 945 0:31 / /run/containerd/io.containerd.runtime.v1.linux/k8s.io/2e75a956e9db372ddf40ef4c32d148100010386140238c6173ad6e4d45fd8e1c/rootfs/sys/fs/cgroup/rdma rw,nosuid,nodev,noexec,relatime shared:15 - cgroup cgroup rw,rdma
1696 945 0:32 /kubepods/pod2eb4976a-5001-4c2b-b5cf-df562549e3d4/2e75a956e9db372ddf40ef4c32d148100010386140238c6173ad6e4d45fd8e1c /run/containerd/io.containerd.runtime.v1.linux/k8s.io/2e75a956e9db372ddf40ef4c32d148100010386140238c6173ad6e4d45fd8e1c/rootfs/sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime shared:16 - cgroup cgroup rw,cpu,cpuacct
1714 945 0:33 /kubepods/pod2eb4976a-5001-4c2b-b5cf-df562549e3d4/2e75a956e9db372ddf40ef4c32d148100010386140238c6173ad6e4d45fd8e1c /run/containerd/io.containerd.runtime.v1.linux/k8s.io/2e75a956e9db372ddf40ef4c32d148100010386140238c6173ad6e4d45fd8e1c/rootfs/sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime shared:17 - cgroup cgroup rw,freezer
2099 945 0:34 /kubepods/pod2eb4976a-5001-4c2b-b5cf-df562549e3d4/2e75a956e9db372ddf40ef4c32d148100010386140238c6173ad6e4d45fd8e1c /run/containerd/io.containerd.runtime.v1.linux/k8s.io/2e75a956e9db372ddf40ef4c32d148100010386140238c6173ad6e4d45fd8e1c/rootfs/sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime shared:18 - cgroup cgroup rw,blkio
2345 945 0:35 /kubepods/pod2eb4976a-5001-4c2b-b5cf-df562549e3d4/2e75a956e9db372ddf40ef4c32d148100010386140238c6173ad6e4d45fd8e1c /run/containerd/io.containerd.runtime.v1.linux/k8s.io/2e75a956e9db372ddf40ef4c32d148100010386140238c6173ad6e4d45fd8e1c/rootfs/sys/fs/cgroup/pids rw,nosuid,nodev,noexec,relatime shared:19 - cgroup cgroup rw,pids
2390 945 0:36 /kubepods/pod2eb4976a-5001-4c2b-b5cf-df562549e3d4/2e75a956e9db372ddf40ef4c32d148100010386140238c6173ad6e4d45fd8e1c /run/containerd/io.containerd.runtime.v1.linux/k8s.io/2e75a956e9db372ddf40ef4c32d148100010386140238c6173ad6e4d45fd8e1c/rootfs/sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime shared:20 - cgroup cgroup rw,cpuset
2409 945 0:37 /kubepods/pod2eb4976a-5001-4c2b-b5cf-df562549e3d4/2e75a956e9db372ddf40ef4c32d148100010386140238c6173ad6e4d45fd8e1c /run/containerd/io.containerd.runtime.v1.linux/k8s.io/2e75a956e9db372ddf40ef4c32d148100010386140238c6173ad6e4d45fd8e1c/rootfs/sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime shared:21 - cgroup cgroup rw,memory
2428 945 0:38 /kubepods/pod2eb4976a-5001-4c2b-b5cf-df562549e3d4/2e75a956e9db372ddf40ef4c32d148100010386140238c6173ad6e4d45fd8e1c /run/containerd/io.containerd.runtime.v1.linux/k8s.io/2e75a956e9db372ddf40ef4c32d148100010386140238c6173ad6e4d45fd8e1c/rootfs/sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime shared:22 - cgroup cgroup rw,net_cls,net_prio
2447 945 0:39 /kubepods/pod2eb4976a-5001-4c2b-b5cf-df562549e3d4/2e75a956e9db372ddf40ef4c32d148100010386140238c6173ad6e4d45fd8e1c /run/containerd/io.containerd.runtime.v1.linux/k8s.io/2e75a956e9db372ddf40ef4c32d148100010386140238c6173ad6e4d45fd8e1c/rootfs/sys/fs/cgroup/hugetlb rw,nosuid,nodev,noexec,relatime shared:23 - cgroup cgroup rw,hugetlb
2466 945 0:40 /kubepods/pod2eb4976a-5001-4c2b-b5cf-df562549e3d4/2e75a956e9db372ddf40ef4c32d148100010386140238c6173ad6e4d45fd8e1c /run/containerd/io.containerd.runtime.v1.linux/k8s.io/2e75a956e9db372ddf40ef4c32d148100010386140238c6173ad6e4d45fd8e1c/rootfs/sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime shared:24 - cgroup cgroup rw,perf_event
2485 945 0:41 /kubepods/pod2eb4976a-5001-4c2b-b5cf-df562549e3d4/2e75a956e9db372ddf40ef4c32d148100010386140238c6173ad6e4d45fd8e1c /run/containerd/io.containerd.runtime.v1.linux/k8s.io/2e75a956e9db372ddf40ef4c32d148100010386140238c6173ad6e4d45fd8e1c/rootfs/sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime shared:25 - cgroup cgroup rw,devices
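
A quick way to spot such entries is to list cgroup mounts whose mountpoint is outside /sys/fs/cgroup. Below is a minimal standalone Go sketch using github.com/moby/sys/mountinfo (this is not part of runc, and treating "outside /sys/fs/cgroup" as "leaked" is only a heuristic):

package main

import (
	"fmt"
	"strings"

	"github.com/moby/sys/mountinfo"
)

// List cgroup mounts that live outside /sys/fs/cgroup, i.e. entries
// shaped like the leaked ones above.
func main() {
	mounts, err := mountinfo.GetMounts(mountinfo.FSTypeFilter("cgroup", "cgroup2"))
	if err != nil {
		panic(err)
	}
	for _, m := range mounts {
		if !strings.HasPrefix(m.Mountpoint, "/sys/fs/cgroup") {
			fmt.Printf("suspected leak: %s at %s (root %s)\n", m.FSType, m.Mountpoint, m.Root)
		}
	}
}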

When such a leak exists, runc tries to use the wrong cgroup during libcontainer/rootfs_linux's prepareRootfs:

May 14 07:10:35 c04pc003-kube-node03 containerd[28410]: time="2020-05-14T07:10:35.468879890Z" level=error msg="StartContainer for \"b19c361eda463efba332d6a6a365ea4c48ec461823b4ae01456033c95f663372\" failed" error="failed to create containerd task: OCI runtime create failed: container_linux.go:349: starting container process caused \"process_linux.go:449: container init caused \\\"rootfs_linux.go:58: mounting \\\\\\\"cgroup\\\\\\\" to rootfs \\\\\\\"/run/containerd/io.containerd.runtime.v1.linux/k8s.io/b19c361eda463efba332d6a6a365ea4c48ec461823b4ae01456033c95f663372/rootfs\\\\\\\" at \\\\\\\"/sys/fs/cgroup\\\\\\\" caused \\\\\\\"stat /run/foo/rootfs/sys/fs/pod81924e98-98f5-45b8-9bb7-0e8ee2f77d12/b19c361eda463efba332d6a6a365ea4c48ec461823b4ae01456033c95f663372: no such file or directory\\\\\\\"\\\"\": unknown"

I was able to reproduce the bug by:

This results in the above output.

I was able to debug a bit into runc here and found the following: the function GetCgroupMounts(false) returns the wrong mountpoint for the systemd cgroup in this case (/run/foo/rootfs/sys/fs/cgroup/systemd instead of /sys/fs/cgroup/systemd).

This is because in /proc/self/mountinfo the mount /run/foo/rootfs/sys/fs/cgroup/systemd occurred before /sys/fs/cgroup/systemd (which seems weird to me, because when I look at /proc/self/mountinfo myself and process it, they appear in the other order).

As a POC I added the following patch to runc which fixed it for my test case:

diff --git a/libcontainer/cgroups/utils.go b/libcontainer/cgroups/utils.go
index dbcc58f5..25d57efe 100644
--- a/libcontainer/cgroups/utils.go
+++ b/libcontainer/cgroups/utils.go
@@ -208,6 +208,9 @@ func getCgroupMountsHelper(ss map[string]bool, mi io.Reader, all bool) ([]Mount,
                        Mountpoint: fields[4],
                        Root:       fields[3],
                }
+               if strings.HasPrefix(m.Mountpoint, "/run/foo") {
+                       continue
+               }
                for _, opt := range strings.Split(fields[len(fields)-1], ",") {
                        seen, known := ss[opt]
                        if !known || (!all && seen) {

Of course this does not work for upstream; to fix the original leak I would at least need to match on something like /run/containerd/.

kolyshkin commented 4 years ago

The leaked mounts are most probably the result of wrong mount propagation.

The bug it causes might be worked around by making GetCgroupMounts prefer the standard paths, i.e. /sys/fs/cgroup/$controller. I will look into it.
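
To illustrate the idea (a standalone sketch with a simplified Mount type, not the actual libcontainer code): per controller, prefer a mount under /sys/fs/cgroup/ whenever one exists:

package main

import (
	"fmt"
	"strings"
)

// Mount mirrors the relevant fields of libcontainer/cgroups.Mount;
// this is a sketch of the "prefer /sys/fs/cgroup" idea, not runc code.
type Mount struct {
	Mountpoint string
	Root       string
	Subsystems []string
}

// preferStandard keeps, per controller, a mount under /sys/fs/cgroup/
// when one exists, falling back to whatever was seen otherwise.
func preferStandard(mounts []Mount) []Mount {
	const std = "/sys/fs/cgroup/"
	best := map[string]Mount{} // controller -> preferred mount
	for _, m := range mounts {
		for _, sub := range m.Subsystems {
			cur, seen := best[sub]
			if !seen || (!strings.HasPrefix(cur.Mountpoint, std) &&
				strings.HasPrefix(m.Mountpoint, std)) {
				best[sub] = m
			}
		}
	}
	// Collapse back to a list, deduplicating by mountpoint.
	byMP := map[string]Mount{}
	for _, m := range best {
		byMP[m.Mountpoint] = m
	}
	out := make([]Mount, 0, len(byMP))
	for _, m := range byMP {
		out = append(out, m)
	}
	return out
}

func main() {
	fmt.Println(preferStandard([]Mount{
		{Mountpoint: "/run/foo/rootfs/sys/fs/cgroup/systemd", Root: "/", Subsystems: []string{"name=systemd"}},
		{Mountpoint: "/sys/fs/cgroup/systemd", Root: "/", Subsystems: []string{"name=systemd"}},
	}))
}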

chrischdi commented 4 years ago

Hi @kolyshkin, regarding wrong mount propagation: is there something we can do about it? We are using Kubernetes, and the problem occurs only on the CSI node plugin, which needs Bidirectional mount propagation for /var/lib/kubelet and HostToContainer for /dev.

The first is needed because the CSI plugin performs mounts for other containers (before the other pods start), and I think /dev is needed for CSI to see newly attached disks.

Thank you for looking into it :-)

kolyshkin commented 4 years ago

One other approach to work around it would be to check that the parent ID field (the second field in /proc/self/mountinfo) is the same for all mounts (or is equal to the mount ID field of the /sys/fs/cgroup mount).
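
A sketch of that check with github.com/moby/sys/mountinfo: find the mount ID of the /sys/fs/cgroup tmpfs and flag cgroup mounts whose parent ID differs. This is a standalone illustration assuming the hybrid cgroup v1 layout shown above, not runc code:

package main

import (
	"fmt"

	"github.com/moby/sys/mountinfo"
)

// Flag cgroup mounts whose parent is not the /sys/fs/cgroup tmpfs,
// using the mount ID / parent ID fields of /proc/self/mountinfo.
func main() {
	mounts, err := mountinfo.GetMounts(nil)
	if err != nil {
		panic(err)
	}
	tmpfsID := -1
	for _, m := range mounts {
		if m.Mountpoint == "/sys/fs/cgroup" {
			tmpfsID = m.ID
			break
		}
	}
	for _, m := range mounts {
		if (m.FSType == "cgroup" || m.FSType == "cgroup2") && m.Parent != tmpfsID {
			fmt.Printf("suspicious: %s (parent %d, want %d)\n", m.Mountpoint, m.Parent, tmpfsID)
		}
	}
}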

I still don't understand why GetCgroupMounts is not picking up the first mount. I know there is a race in the kernel when it comes to serving /proc/self/mountinfo (and similar files) -- in particular, if the next entry to be read is deleted (i.e. the mount is unmounted), the rest of mountinfo is never read. But that is not applicable to this case.

chrischdi commented 4 years ago

When I debugged into it, I saw that the entries in /proc/self/mountinfo were ordered differently compared to when I ran a simple cat /proc/self/mountinfo. But I also don't know why the output was not the same.

kolyshkin commented 4 years ago

In fact, we should always use /sys/fs/cgroup; this seems to be the de facto standard these days. It would still be interesting to see a /proc/self/mountinfo where other cgroup entries precede those with the /sys/fs/cgroup mountpoint.

isehgu commented 4 years ago

We are also experiencing the same thing with the csi-rbd plugin. I found @chrischdi's thread and was able to delete the extra mounts so that kubelet could come up. We are on CoreOS -- kernel 4.19.106 (CoreOS 2345.3.0). The extra mounts didn't contain the string /run/containerd/, and they aren't uniform either.

LastNight1997 commented 1 year ago

@kolyshkin We have the same problem too, and I found that not only the cgroup mounts were leaked into the host mount namespace: all mounts in the CSI container, which uses bidirectional mount propagation, were leaked. (In the listing below, most entries appear twice; the duplicates are the leaked copies.)

❯ cat mount|grep 7d0849b82c48
overlay on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs type overlay (rw,relatime,lowerdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/278/fs:/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/244/fs:/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/15/fs,upperdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/279/fs,workdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/279/work,index=off,nfs_export=off)
overlay on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs type overlay (rw,relatime,lowerdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/278/fs:/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/244/fs:/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/15/fs,upperdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/279/fs,workdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/279/work,index=off,nfs_export=off)
proc on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs/proc type proc (rw,nosuid,nodev,noexec,relatime)
proc on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs/proc type proc (rw,nosuid,nodev,noexec,relatime)
sysfs on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs/sys type sysfs (ro,nosuid,nodev,noexec,relatime)
sysfs on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs/sys type sysfs (ro,nosuid,nodev,noexec,relatime)
tmpfs on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs/sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,relatime,mode=755)
tmpfs on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs/sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,relatime,mode=755)
cgroup on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs/sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,name=systemd)
cgroup on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs/sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,name=systemd)
cgroup on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs/sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs/sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs/sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs/sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs/sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs/sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs/sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs/sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs/sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs/sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs/sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs/sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs/sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs/sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs/sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs/sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs/sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs/sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs/sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs/sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
/dev/vdb on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs/csi type ext4 (rw,relatime)
/dev/vdb on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs/csi type ext4 (rw,relatime)
udev on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs/dev type devtmpfs (rw,nosuid,relatime,size=119520900k,nr_inodes=29880225,mode=755)
devpts on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs/dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
tmpfs on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs/dev/shm type tmpfs (rw,nosuid,nodev)
hugetlbfs on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs/dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M)
mqueue on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs/dev/mqueue type mqueue (rw,relatime)
udev on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs/dev type devtmpfs (rw,nosuid,relatime,size=119520900k,nr_inodes=29880225,mode=755)
devpts on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs/dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
tmpfs on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs/dev/shm type tmpfs (rw,nosuid,nodev)
hugetlbfs on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs/dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M)
mqueue on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs/dev/mqueue type mqueue (rw,relatime)
/dev/vdb on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs/etc/resolv.conf type ext4 (rw,relatime)
/dev/vdb on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs/etc/resolv.conf type ext4 (rw,relatime)
/dev/vdb on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs/etc/hosts type ext4 (rw,relatime)
/dev/vdb on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs/etc/hosts type ext4 (rw,relatime)
/dev/vdb on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs/etc/hostname type ext4 (rw,relatime)
/dev/vdb on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs/etc/hostname type ext4 (rw,relatime)
/dev/vdb on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs/var/lib/kubelet type ext4 (rw,relatime)
tmpfs on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs/var/lib/kubelet/pods/5910334a-cfd1-456f-a0e8-188e12b9bd43/volumes/kubernetes.io~secret/kube-proxy-token-74vlm type tmpfs (rw,relatime)
tmpfs on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs/var/lib/kubelet/pods/00fe260c-76ad-45f0-add3-58e4bd3b8981/volumes/kubernetes.io~secret/csi-ebs-token-968j7 type tmpfs (rw,relatime)
tmpfs on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs/var/lib/kubelet/pods/89ed4967-4749-40ef-8e0c-c8dfd598bbe6/volumes/kubernetes.io~secret/csi-nas-token-qrrdb type tmpfs (rw,relatime)
tmpfs on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs/var/lib/kubelet/pods/0f6e4e69-7b22-49a3-a8c1-1976dd377d1c/volumes/kubernetes.io~secret/flannel-token-wc6dz type tmpfs (rw,relatime)
tmpfs on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs/var/lib/kubelet/pods/23c56271-a650-4b3e-82c8-8df5084812f5/volumes/kubernetes.io~secret/default-token-ftxhn type tmpfs (rw,relatime)
tmpfs on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs/var/lib/kubelet/pods/a3f0f8a1-fb25-4fae-b435-4038a9428cd7/volumes/kubernetes.io~secret/default-token-ftxhn type tmpfs (rw,relatime)
tmpfs on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs/var/lib/kubelet/pods/23c56271-a650-4b3e-82c8-8df5084812f5/volumes/kubernetes.io~empty-dir/workdir type tmpfs (rw,relatime)
tmpfs on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs/var/lib/kubelet/pods/a3f0f8a1-fb25-4fae-b435-4038a9428cd7/volumes/kubernetes.io~empty-dir/workdir type tmpfs (rw,relatime)
tmpfs on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs/var/lib/kubelet/pods/c5262037-7baf-457c-8b9c-b1f768b12b4f/volumes/kubernetes.io~secret/default-token-ftxhn type tmpfs (rw,relatime)
tmpfs on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs/var/lib/kubelet/pods/c5262037-7baf-457c-8b9c-b1f768b12b4f/volumes/kubernetes.io~empty-dir/workdir type tmpfs (rw,relatime)
tmpfs on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs/var/lib/kubelet/pods/d87fd265-3bc5-41a2-95d5-a4860115b1d2/volumes/kubernetes.io~secret/cattle-credentials type tmpfs (rw,relatime)
tmpfs on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs/var/lib/kubelet/pods/d87fd265-3bc5-41a2-95d5-a4860115b1d2/volumes/kubernetes.io~secret/cattle-token-7gwj7 type tmpfs (rw,relatime)
tmpfs on /run/containerd/io

Maybe there are some bugs in runc, such as in the prepareRoot or mountToRootfs functions?

LastNight1997 commented 1 year ago

These are some logs of the leaked CSI container 7d0849b82c486573d1. I observed the leak at Aug 11 14:15:59, and the container start failed at Aug 11 14:16:07, which means the leak happened before the CSI container was running, so I suspect the problem is in runc.

Aug 11 14:15:56 ncjat34u33gu8siupfck0 containerd[3828]: time="2023-08-11T14:15:56.070737230+08:00" level=info msg="CreateContainer within sandbox \"05e2bfd2393844bba91801e98c93127bcfac944c41bdb5a977ab327b27d6bee4\" for &ContainerMetadata{Name:csi-ebs-driver,Attempt:0,} returns container id \"7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb\""
Aug 11 14:15:56 ncjat34u33gu8siupfck0 containerd[3828]: time="2023-08-11T14:15:56.071110121+08:00" level=info msg="StartContainer for \"7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb\""
Aug 11 14:15:59 ncjat34u33gu8siupfck0 containerd[3828]: time="2023-08-11T14:15:59.780516679+08:00" level=warning msg="failed to cleanup rootfs mount" error="failed to unmount target /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs: device or resource busy"
Aug 11 14:15:59 ncjat34u33gu8siupfck0 containerd[3828]: time="2023-08-11T14:15:59.781017810+08:00" level=info msg="shim disconnected" id=7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb
Aug 11 14:15:59 ncjat34u33gu8siupfck0 containerd[3828]: time="2023-08-11T14:15:59.781054748+08:00" level=warning msg="cleaning up after shim disconnected" id=7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb namespace=k8s.io
Aug 11 14:15:59 ncjat34u33gu8siupfck0 containerd[3828]: time="2023-08-11T14:15:59.782681956+08:00" level=error msg="collecting metrics for 7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb" error="ttrpc: closed: unknown"
Aug 11 14:16:04 ncjat34u33gu8siupfck0 containerd[3828]: time="2023-08-11T14:16:04.834477825+08:00" level=warning msg="failed to clean up after shim disconnected" error="unmount rootfs /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs: failed to unmount target /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs: device or resource busy" id=7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb namespace=k8s.io
Aug 11 14:16:07 ncjat34u33gu8siupfck0 containerd[3828]: time="2023-08-11T14:16:07.313355781+08:00" level=error msg="failed to delete bundle" error="unmount rootfs /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs: failed to unmount target /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs: device or resource busy" id=7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb
Aug 11 14:16:07 ncjat34u33gu8siupfck0 containerd[3828]: time="2023-08-11T14:16:07.313419607+08:00" level=error msg="failed to delete shim" error="2 errors occurred:\n\t* close wait error: context deadline exceeded\n\t* failed to delete bundle: unmount rootfs /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs: failed to unmount target /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs: device or resource busy\n\n" id=7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb
Aug 11 14:16:07 ncjat34u33gu8siupfck0 containerd[3828]: time="2023-08-11T14:16:07.313587220+08:00" level=error msg="Failed to pipe stdout of container \"7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb\"" error="reading from a closed fifo"
Aug 11 14:16:07 ncjat34u33gu8siupfck0 containerd[3828]: time="2023-08-11T14:16:07.313647067+08:00" level=error msg="Failed to pipe stderr of container \"7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb\"" error="reading from a closed fifo"
Aug 11 14:16:07 ncjat34u33gu8siupfck0 containerd[3828]: time="2023-08-11T14:16:07.314940439+08:00" level=error msg="StartContainer for \"7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb\" failed" error="failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error jailing process inside rootfs: pivot_root .: invalid argument: unknown"
Aug 11 14:16:07 ncjat34u33gu8siupfck0 kubelet[4635]: E0811 14:16:07.315198 4635 remote_runtime.go:251] StartContainer "7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb" from runtime service failed: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error jailing process inside rootfs: pivot_root .: invalid argument: unknown
Aug 11 14:16:07 ncjat34u33gu8siupfck0 kubelet[4635]: I0811 14:16:07.568652 4635 scope.go:111] [topologymanager] RemoveContainer - Container ID: 7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb
Aug 11 14:16:08 ncjat34u33gu8siupfck0 kubelet[4635]: I0811 14:16:08.587152 4635 scope.go:111] [topologymanager] RemoveContainer - Container ID: 7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb
Aug 11 14:16:08 ncjat34u33gu8siupfck0 containerd[3828]: time="2023-08-11T14:16:08.588202901+08:00" level=info msg="RemoveContainer for \"7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb\""
Aug 11 14:16:08 ncjat34u33gu8siupfck0 containerd[3828]: time="2023-08-11T14:16:08.597228076+08:00" level=info msg="RemoveContainer for \"7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb\" returns successfully"

Aug 11 14:15:59 ncjat34u33gu8siupfck0 kubelet[4635]: E0811 14:15:59.510742 4635 pod_workers.go:191] Error syncing pod 150463db-1acb-4296-90d5-1911ff99ad5d ("e9064fed-19360-22319-765df758cf-f875d_aiplay-v2(150463db-1acb-4296-90d5-1911ff99ad5d)"), skipping: failed to "StartContainer" for "e9064fed-19360-22319" with RunContainerError: "failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error mounting \"cgroup\" to rootfs at \"/sys/fs/cgroup\": stat /run/containerd/io.containerd.runtime.v2.task/k8s.io/7d0849b82c486573d1648409c675bf3f67113db630639075ed25ad73a122b2bb/rootfs/sys/pod150463db-1acb-4296-90d5-1911ff99ad5d/53bb057d03426ffee271042471998769259716007bf32a21c04d227e440ef40b: no such file or directory: unknown"

zhaodiaoer commented 7 months ago

I have been experiencing the same leak problem, and I have spent some days trying to find the root cause. I haven't found it yet, but I did find one obvious problem after inspecting my host environment where the mount leak occurred and digging into the runc code:

  • Some processing within the prepareRoot(config *configs.Config) function behaved abnormally: even after the rootfsParentMountPrivate(config.Rootfs) function returned without error, I still found shared rootfs mounts leaked on the host.

After experiencing the same problem four times, I found that all affected containers had a RootPropagation configuration with the value 1064960 (aka rshared), which makes runc configure the root mount in the new mount namespace with the rshared propagation option. By default this option is rslave, so I suspect the "rootPropagation": 1064960 item in config.json is what triggers the issue.
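
(The value decodes exactly to recursively applied shared propagation; a quick check against the mount(2) flag constants in golang.org/x/sys/unix:)

package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	const rootPropagation = 1064960
	// MS_SHARED (0x100000) | MS_REC (0x4000) == 0x104000 == 1064960,
	// i.e. "rshared": shared propagation applied recursively.
	fmt.Println(rootPropagation == unix.MS_SHARED|unix.MS_REC) // true
}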

In addition, not only cgroup mounts leak: in my environment, all mounts from config.json leaked into the host mount namespace, such as server certificates coming from k8s. My runc crashed at pivot_root, and the mounts under rootfs took effect before pivot_root.

Here is some information from my problem case; I hope it is of some use:

lujinda commented 3 months ago

I have been experiencing the same leak problem, and I have spent some days trying to find the root cause. I haven't found it yet, but I did find one obvious problem after inspecting my host environment where the mount leak occurred and digging into the runc code:

  • Some processing within the prepareRoot(config *configs.Config) function behaved abnormally: even after the rootfsParentMountPrivate(config.Rootfs) function returned without error, I still found shared rootfs mounts leaked on the host.

After experiencing the same problem four times, I found that all affected containers had a RootPropagation configuration with the value 1064960 (aka rshared), which makes runc configure the root mount in the new mount namespace with the rshared propagation option. By default this option is rslave, so I suspect the "rootPropagation": 1064960 item in config.json is what triggers the issue.

In addition, not only cgroup mounts leak: in my environment, all mounts from config.json leaked into the host mount namespace, such as server certificates coming from k8s. My runc crashed at pivot_root, and the mounts under rootfs took effect before pivot_root.

Here is some information from my problem case; I hope it is of some use:

  • Environment information:

    • OS: Debian GNU/Linux 9
    • Kernel: 5.4.210 amd64
    • runc version: v1.0.2
  • The leaked container rootfs mount on the host (the second entry should not exist in the host mount namespace):
    3353 84 0:1132 / /var/lib/containerd/state/io.containerd.runtime.v2.task/k8s.io/73eb4f4f0ee2e2ebb66b7db135f8f019550c9111629b898da69bf3053a40af71/rootfs rw,relatime shared:1261
    9248 3353 0:1132 / /var/lib/containerd/state/io.containerd.runtime.v2.task/k8s.io/73eb4f4f0ee2e2ebb66b7db135f8f019550c9111629b898da69bf3053a40af71/rootfs rw,relatime shared:1261
    (there were also hundreds of mount entries over the rootfs mount path, omitted here)
  • Logs of the crashed runc:
    {"level":"error","msg":"container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: rootfs_linux.go:134: jailing process inside rootfs caused: pivot_root invalid argument","time":"2024-03-05T17:05:23+08:00"}

rootfsPropagation is rshared because at least one mount is configured for bidirectional propagation.

I encountered the same problem. I also think something goes wrong in the rootfsParentMountPrivate function even though it does not return an error: it does not actually change the rootfs parent mount to private. The implementation of rootfsParentMountPrivate is not complicated; maybe it hit a race condition?

Did you finally locate the cause?

LastNight1997 commented 3 months ago

I encountered the same problem. I also think something goes wrong in the rootfsParentMountPrivate function even though it does not return an error: it does not actually change the rootfs parent mount to private. Maybe it hit a race condition?

Did you finally locate the cause?

We found that the getParentMount func called from rootfsParentMountPrivate returns the wrong mount point. In the common k8s case it should return the container's overlayfs mount point, but it returned "/run", which is the parent mount of the overlayfs mount point. I suspect the root cause may be a bug in the kernel: sometimes a just-created overlayfs mount point cannot be observed in the new mount namespace? cc @zhaodiaoer
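
To make the failure mode concrete, here is a standalone sketch of the longest-prefix selection that getParentMount effectively performs (paths are placeholders): when the mountinfo snapshot misses the overlayfs entry for rootfs itself, the second-longest match wins:

package main

import (
	"fmt"
	"strings"
)

// longestParent mimics the longest-prefix selection behind getParentMount:
// among all mountpoints that are path prefixes of dir, pick the longest.
// If the mountinfo snapshot missed the overlayfs entry for dir itself, the
// next-longest match (here "/run") wins -- the wrong result described above.
func longestParent(dir string, mountpoints []string) string {
	best := "/"
	for _, mp := range mountpoints {
		if mp != "/" && !strings.HasPrefix(dir+"/", mp+"/") {
			continue
		}
		if len(mp) > len(best) {
			best = mp
		}
	}
	return best
}

func main() {
	rootfs := "/run/containerd/io.containerd.runtime.v2.task/k8s.io/example/rootfs"
	full := []string{"/", "/run", rootfs}
	missed := []string{"/", "/run"} // snapshot missing the overlayfs entry
	fmt.Println(longestParent(rootfs, full))   // the rootfs mount itself
	fmt.Println(longestParent(rootfs, missed)) // "/run" -- wrong parent
}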

zhaodiaoer commented 1 month ago

I think I have found the root cause of this problem. Let me explain the complete picture:

TL;DR: The mechanism currently provided by github.com/moby/sys/mountinfo for obtaining a complete mount list has a bug in its implementation on Linux: the procfs traversal is unsafe, and entries can be missing from the result. I have also raised an issue for this.

Detailed version:

  1. When handling a container configured with bidirectional-propagation mounts, runc does two additional things compared to containers without such configuration: (1) it adjusts the root mount of the container's mount namespace to rshared; (2) it changes the parent mount of the container's rootfs directory to private. The second step is a prerequisite for using pivot_root to switch the container's filesystem root to the container's rootfs, and it is also where the problem occurs.
  2. In a regular Kubernetes container environment, the rootfs of a container is an overlayfs-type mount prepared by containerd. The rootfs directory is itself a mount point, so the parent mount of the rootfs directory should be the rootfs mount itself.
  3. But when the problem occurs, the parent of the rootfs path found by runc is not the rootfs mount itself but the second-longest match. runc then sets the private option on the wrong parent mount, while the actual parent mount stays rshared.
  4. runc begins preparing the additional mounts under rootfs listed in the config; because the effective parent mount of rootfs is still rshared, all of these additional mounts propagate to the host mount namespace.
  5. When preparation is done, runc uses pivot_root to switch the container's filesystem root to the rootfs directory. The pivot_root requirement that the new root must not sit under a shared mount is not met, so runc logs an error like "error jailing process inside rootfs: pivot_root .: invalid argument" and exits, but the leaked mounts remain on the host (see the sketch after this list).
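
A standalone sketch of why step 5 fails (placeholder paths, meant to run inside a fresh mount namespace; this is not runc's actual implementation):

package main

import (
	"log"
	"os"

	"golang.org/x/sys/unix"
)

// pivot_root refuses to operate while the new root sits under a shared
// mount, which is why runc must first make the real parent mount of
// rootfs private.
func pivotInto(rootfs, parentMount string) error {
	// If parentMount is the wrong mount (step 3), the real parent stays
	// rshared, mounts set up under rootfs propagate to the host (step 4)...
	if err := unix.Mount("", parentMount, "", unix.MS_PRIVATE, ""); err != nil {
		return err
	}
	if err := os.Chdir(rootfs); err != nil {
		return err
	}
	// ...and this call fails with EINVAL -- the "pivot_root .: invalid
	// argument" seen in the logs above (step 5).
	return unix.PivotRoot(".", ".")
}

func main() {
	if err := pivotInto("/tmp/example/rootfs", "/tmp/example/rootfs"); err != nil {
		log.Fatal(err)
	}
}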

CC @LastNight1997 @fuweid

kolyshkin commented 1 month ago

I am well aware of the mountinfo reading bug; in fact, I have a whole repo devoted to the issue: https://github.com/kolyshkin/procfs-test.

This is a kernel bug, which is fixed in kernel v5.8 (see the above repo for details). Distro vendors should either upgrade their kernels, or backport the relevant patch (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9f6c61f96f2d97cbb5f7fa85607bc398f843ff0f).

Theoretically, we can add a retry in getParentMount. Practically, this is very bad performance-wise.
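
For concreteness, a conditional-retry variant might look like the following standalone sketch (the helper names are mine, not runc's); it re-reads only when rootfs unexpectedly isn't its own mountpoint:

package main

import (
	"fmt"

	"github.com/moby/sys/mountinfo"
)

// parentMountOf returns the longest mountpoint that is a prefix of dir,
// i.e. the mount that dir lives on.
func parentMountOf(dir string) (string, error) {
	mounts, err := mountinfo.GetMounts(mountinfo.ParentsFilter(dir))
	if err != nil {
		return "", err
	}
	best := ""
	for _, m := range mounts {
		if len(m.Mountpoint) > len(best) {
			best = m.Mountpoint
		}
	}
	return best, nil
}

// parentMountConditionalRetry re-reads mountinfo a bounded number of times,
// but only when the answer looks suspicious: in the containerd setup above,
// rootfs is normally its own mountpoint.
func parentMountConditionalRetry(rootfs string) (string, error) {
	var (
		mp  string
		err error
	)
	for attempt := 0; attempt < 3; attempt++ {
		mp, err = parentMountOf(rootfs)
		if err != nil || mp == rootfs {
			break // error, or the expected answer: no retry needed
		}
		// Possibly a missed entry from the pre-5.8 /proc/self/mountinfo
		// race; re-read and try again.
	}
	return mp, err
}

func main() {
	fmt.Println(parentMountConditionalRetry("/"))
}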

kolyshkin commented 1 month ago

@zhaodiaoer thanks for investigating that. If you can figure out a reliable way to know if/when we should re-try reading mounts in getParentMount (so we can re-read it conditionally, not always), we can do that. But I'm opposed to always re-reading mounts.

The mechanism currently provided by github.com/moby/sys/mountinfo for obtaining a complete mount list has a bug in its implementation on Linux: the procfs traversal is unsafe, and entries can be missing from the result. I have also raised an issue for this.

Alas, this is a kernel bug, not a mountinfo package bug (otherwise we would have fixed it by now).

kolyshkin commented 1 month ago

Can anyone who has seen this issue test the proposed patch in https://github.com/opencontainers/runc/pull/4417 and report (in that PR, not here!) if it fixes the issue?

zhaodiaoer commented 1 month ago

I am well aware of the mountinfo reading bug; in fact, I have a whole repo devoted to the issue: https://github.com/kolyshkin/procfs-test.

This is a kernel bug, which is fixed in kernel v5.8 (see the above repo for details). Distro vendors should either upgrade their kernels, or backport the relevant patch (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9f6c61f96f2d97cbb5f7fa85607bc398f843ff0f).

Theoretically, we can add a retry in getParentMount. Practically, this is very bad performance-wise.

Yes, with this kernel bug fixed, the mount leak issue will be solved. Very important information, thanks!

zhaodiaoer commented 1 month ago

@zhaodiaoer thanks for investigating that. If you can figure out a reliable way to know if/when we should re-try reading mounts in getParentMount (so we can re-read it conditionally, not always), we can do that. But I'm opposed to always re-reading mounts.

I am still thinking of a solution, but I haven't come up with one yet...