opencontainers / runc

CLI tool for spawning and running containers according to the OCI specification
https://www.opencontainers.org/
Apache License 2.0

Can't run containers "error while starting unit" with hidepid=2 + Systemd CGroupV2 + rootlesskit #3124

Open joanbm opened 3 years ago

joanbm commented 3 years ago

Attempting to run a container on Rootless Docker fails when both of the following system settings are active: /proc is mounted with hidepid=2, and cgroup v2 is in use with the systemd cgroup driver.

Reproduced on Ubuntu 20.04 with Docker installed from Ubuntu PPAs (latest version, with Docker 20.10.7 + Containerd 1.4.9 + runc v1.0.1-0-g4144b63), and also on current Arch Linux.

Log of the problem:

~$ cat /proc/cmdline 
BOOT_IMAGE=/vmlinuz-5.4.0-80-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro cgroup_no_v1=all maybe-ubiquity
~$ sudo mount -o remount,hidepid=0 /proc && docker run --rm -it alpine:3.14 echo hello
hello
~$ sudo mount -o remount,hidepid=2 /proc && docker run --rm -it alpine:3.14 echo hello
docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:385: applying cgroup configuration for process caused: error while starting unit "docker-d3d838421288422c6a904bb98f91e209a277e824c82aca3d96d66b2a9cf5b51a.scope" with properties [{Name:Description Value:"libcontainer container d3d838421288422c6a904bb98f91e209a277e824c82aca3d96d66b2a9cf5b51a"} {Name:Slice Value:"user.slice"} {Name:PIDs Value:@au [7918]} {Name:Delegate Value:true} {Name:MemoryAccounting Value:true} {Name:CPUAccounting Value:true} {Name:IOAccounting Value:true} {Name:TasksAccounting Value:true} {Name:DefaultDependencies Value:false}]: read unix @->/run/systemd/private: read: connection reset by peer: unknown.

I can also reproduce the error directly with runc, without Docker. I just need to run the rootful runc example with --systemd-cgroup and inside rootlesskit, just like Docker does (both conditions are necessary; otherwise the container runs fine):

~$ mkdir mycontainer
~$ cd mycontainer
~/mycontainer$ mkdir rootfs
~/mycontainer$ docker export $(docker create alpine:3.14) | tar -C rootfs -xvf - >/dev/null
~/mycontainer$ runc spec
~/mycontainer$ sudo mount -o remount,hidepid=0 /proc && rootlesskit runc --systemd-cgroup run alpine
/ # 
~/mycontainer$ sudo mount -o remount,hidepid=2 /proc && rootlesskit runc --systemd-cgroup run alpine
WARN[0000] unable to get oom kill count                  error="open /sys/fs/cgroup/system.slice/runc-alpine.scope/memory.events: no such file or directory"
ERRO[0000] container_linux.go:380: starting container process caused: process_linux.go:385: applying cgroup configuration for process caused: error while starting unit "runc-alpine.scope" with properties [{Name:Description Value:"libcontainer container alpine"} {Name:Slice Value:"system.slice"} {Name:PIDs Value:@au [8134]} {Name:Delegate Value:true} {Name:MemoryAccounting Value:true} {Name:CPUAccounting Value:true} {Name:IOAccounting Value:true} {Name:TasksAccounting Value:true} {Name:DefaultDependencies Value:false}]: read unix @->/run/systemd/private: read: connection reset by peer 
[rootlesskit:child ] error: command [runc --systemd-cgroup run alpine] exited: exit status 1
[rootlesskit:parent] error: child exited: exit status 1

--

I did some debugging, and it appears that the problem happens because runc runs busctl --user status here in order to get the OwnerUID value from the output: https://github.com/opencontainers/runc/blob/51beb5c436b159ae2d483b219c37ecfde13b006a/libcontainer/cgroups/systemd/user.go#L60

It appears that OwnerUID is not listed when /proc is mounted with hidepid=2:

~/mycontainer$ sudo mount -o remount,hidepid=0 /proc && busctl --user status
BusAddress=unix:path=/run/user/1000/bus
BusScope=user
BusID=153da85c556f02b0a554d6226106f2d4
PID=914
PPID=1
TTY=n/a
UID=1000
EUID=1000
SUID=1000
FSUID=1000
OwnerUID=1000
GID=1000
[...]
~/mycontainer$ sudo mount -o remount,hidepid=2 /proc && busctl --user status
BusAddress=unix:path=/run/user/1000/bus
BusScope=user
BusID=153da85c556f02b0a554d6226106f2d4
Failed to get credentials: No such process

With strace I can see that busctl --user status tries to read /proc/1/cgroup, which it can't because of hidepid=2.
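To illustrate why the missing line matters, here is a minimal sketch of extracting OwnerUID from busctl-style output. It is not runc's actual code: the function name parseOwnerUID is hypothetical, and the two inputs are abbreviated from the logs above.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// parseOwnerUID scans `busctl --user status`-style output for a line of
// the form "OwnerUID=<n>" and returns <n>. The name is hypothetical and
// only meant to illustrate the kind of parsing involved.
func parseOwnerUID(output string) (int, error) {
	scanner := bufio.NewScanner(strings.NewReader(output))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if strings.HasPrefix(line, "OwnerUID=") {
			return strconv.Atoi(strings.TrimPrefix(line, "OwnerUID="))
		}
	}
	if err := scanner.Err(); err != nil {
		return -1, err
	}
	return -1, errors.New("OwnerUID not found in busctl output")
}

func main() {
	// Abbreviated output with hidepid=0: the OwnerUID line is present.
	visible := "BusScope=user\nPID=914\nUID=1000\nOwnerUID=1000\nGID=1000\n"
	// Abbreviated output with hidepid=2: the credentials section is missing.
	hidden := "BusScope=user\nFailed to get credentials: No such process\n"

	uid, err := parseOwnerUID(visible)
	fmt.Println(uid, err)

	uid, err = parseOwnerUID(hidden)
	fmt.Println(uid, err)
}
```

With the hidepid=2 output there is simply no OwnerUID line to find, so any logic of this shape has to fail or fall back to another source for the UID.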

From what I can see, systemd is not a big fan of hidepid=2 (e.g. https://lists.freedesktop.org/archives/systemd-devel/2012-October/006860.html, https://github.com/systemd/systemd/issues/12955), so I guess this could be a NOTOURBUG on runc and a WONTFIX on systemd, but it would be nice if we could have some equivalent logic that does not depend on busctl and avoids this issue.

For now as a workaround, I can create containers if I get the UID from the ROOTLESSKIT_PARENT_EUID environment variable instead:

diff --git a/libcontainer/cgroups/systemd/user.go b/libcontainer/cgroups/systemd/user.go
index 55d97c73..9cd9d6ba 100644
--- a/libcontainer/cgroups/systemd/user.go
+++ b/libcontainer/cgroups/systemd/user.go
@@ -57,6 +57,14 @@ func DetectUID() (int, error) {
    if !userns.RunningInUserNS() {
        return os.Getuid(), nil
    }
+   // START WORKAROUND for Rootless Docker + CgroupV2 + hidepid=2
+   if env := os.Getenv("ROOTLESSKIT_PARENT_EUID"); env != "" {
+       i, err := strconv.Atoi(env)
+       if err == nil {
+           return i, nil
+       }
+   }
+   // END WORKAROUND for Rootless Docker + CgroupV2 + hidepid=2
    b, err := exec.Command("busctl", "--user", "--no-pager", "status").CombinedOutput()
    if err != nil {
        return -1, fmt.Errorf("could not execute `busctl --user --no-pager status` (output: %q): %w", string(b), err)

AkihiroSuda commented 3 years ago

Workaround SGTM, could you open PR?

joanbm commented 3 years ago

I haven't done any serious testing with the workaround yet, and I'd like a more generic solution that is not tied to RootlessKit. If I can free up some time, I'll take a look at making a PR.

kailun-qin commented 3 years ago

It would be great to have a more generic solution, but do we have a better approach?

It looks like the UID needs to be explicitly passed in or shared in this case (though the naming of the environment variable should be aligned). Even removing the dependency on busctl would not bypass the problem, since procfs is not accessible.

joanbm commented 3 years ago

A small update...

One of the two usages of DetectUID seems superfluous. It is used for authentication in DBus with the AUTH EXTERNAL method (https://dbus.freedesktop.org/doc/dbus-specification.html#auth-mechanisms), but as far as I can tell, it's not mandatory to send the UID along with the AUTH EXTERNAL method. This is hinted at in the documentation, and it can be seen in the systemd source: https://github.com/systemd/systemd/blob/e8b08edcdf4e3f22be0a209cacb9e5404fee4b68/src/libsystemd/sd-bus/bus-socket.c#L312 . In fact, there's a recent (yet unreleased) go-dbus commit which does exactly this: https://github.com/godbus/dbus/commit/31b5df72caaf5c68ec5ff414944e8ab8c24f8c52
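As a rough illustration of the two variants, here is a sketch of the client's opening AUTH line. The helper name authExternalLine is hypothetical; it is neither the go-dbus nor the systemd implementation. When a UID is sent, the initial response is the hex encoding of the UID's decimal ASCII digits.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"strconv"
)

// authExternalLine builds the first line a D-Bus client sends for the
// EXTERNAL mechanism. Hypothetical helper for illustration only.
// A negative uid means "send no initial response", leaving the server
// to identify the client by other means (assumed here to be the
// socket's peer credentials).
func authExternalLine(uid int) string {
	if uid < 0 {
		return "AUTH EXTERNAL\r\n"
	}
	// The UID is written as decimal ASCII and then hex-encoded.
	return "AUTH EXTERNAL " + hex.EncodeToString([]byte(strconv.Itoa(uid))) + "\r\n"
}

func main() {
	fmt.Printf("%q\n", authExternalLine(1000)) // "AUTH EXTERNAL 31303030\r\n"
	fmt.Printf("%q\n", authExternalLine(-1))   // "AUTH EXTERNAL\r\n"
}
```

The second form needs no UID at all, so this usage of DetectUID would not have to touch busctl or procfs.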

The other usage is to decide whether to use the regular or rootless systemd cgroups manager, but I don't see any way to avoid the UID detection logic there.

As far as workarounds go, instead of looking for ROOTLESSKIT_PARENT_EUID, it seems cleaner to look at /proc/$$/cgroup for a fragment like /user-NNNN.slice/ to guess the UID; at least this way it's not tied to RootlessKit.
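A minimal sketch of that idea, not tested against runc: the names uidFromCgroup and userSliceRe are hypothetical, and the sketch assumes the process actually sits below a user-NNNN.slice, which is not guaranteed.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"regexp"
	"strconv"
)

// userSliceRe matches a path fragment such as "/user-1000.slice/" and
// captures the numeric UID. Hypothetical name, for illustration only.
var userSliceRe = regexp.MustCompile(`/user-([0-9]+)\.slice/`)

// uidFromCgroup guesses the owning UID from the contents of a
// /proc/<pid>/cgroup file by looking for the first user-NNNN.slice
// fragment. It returns an error if no such fragment exists.
func uidFromCgroup(content string) (int, error) {
	m := userSliceRe.FindStringSubmatch(content)
	if m == nil {
		return -1, errors.New("no /user-NNNN.slice/ fragment found")
	}
	return strconv.Atoi(m[1])
}

func main() {
	// A process can read its own cgroup file even with hidepid=2.
	b, err := os.ReadFile("/proc/self/cgroup")
	if err != nil {
		fmt.Println("could not read /proc/self/cgroup:", err)
		return
	}
	uid, err := uidFromCgroup(string(b))
	if err != nil {
		fmt.Println("could not guess UID:", err)
		return
	}
	fmt.Println("guessed UID:", uid)
}
```

This is only a guess: a process started outside a user session (for example directly under system.slice) has no such fragment, so a fallback would still be needed.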