opencontainers / runc

CLI tool for spawning and running containers according to the OCI specification
https://www.opencontainers.org/
Apache License 2.0

EPERM mounting sysfs with rootless/userns container #3672

Open maleadt opened 1 year ago

maleadt commented 1 year ago

I'm trying out runc to get simple unprivileged containerized execution, but am having issues mounting sysfs:

```json
"mounts": [
    {
        "destination": "/sys",
        "type": "sysfs",
        "source": "sysfs",
        "options": [
            "nosuid",
            "noexec",
            "nodev"
        ]
    }
]
```

```
❯ runc run test
ERRO[0000] runc run failed: unable to start container process: error during container init: error mounting "sysfs" to rootfs at "/sys": mount sysfs:/sys (via /proc/self/fd/7), flags: 0xe: operation not permitted
```
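For readers decoding the error: `flags: 0xe` is the bitwise OR of the mount(2) flags for the three options in the config. A minimal sketch of that arithmetic follows. The helper names `mount_flags` and `describe_flags` are made up for illustration and are not runc functions; the numeric values are the Linux `MS_*` constants.

```python
# Linux mount(2) flag values (MS_* constants from <sys/mount.h>).
MS_NOSUID = 0x2
MS_NODEV = 0x4
MS_NOEXEC = 0x8

# Hypothetical lookup table, limited to the options used in this issue.
OPTION_FLAGS = {"nosuid": MS_NOSUID, "nodev": MS_NODEV, "noexec": MS_NOEXEC}


def mount_flags(options):
    """OR together the flag bits for a list of mount option strings."""
    flags = 0
    for opt in options:
        flags |= OPTION_FLAGS[opt]
    return flags


def describe_flags(flags):
    """List the option names whose bits are set in `flags`."""
    return [name for name, bit in OPTION_FLAGS.items() if flags & bit]


print(hex(mount_flags(["nosuid", "noexec", "nodev"])))  # prints 0xe
```

So the failing call is a plain sysfs mount with `nosuid`, `nodev` and `noexec` and nothing else unusual in the flags.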

Meanwhile, crun manages fine:

```
❯ crun run test
root@test:~# mount | grep sysfs
sys on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
```

Full config

```json
{
    "ociVersion": "1.0.1",
    "platform": { "os": "linux", "arch": "amd64" },
    "root": {
        "path": "/home/tim/Julia/depot/artifacts/4d66e139e0bcfdfa5ec6a8942a938e754e17860f",
        "readonly": true
    },
    "mounts": [
        { "destination": "/proc", "type": "proc", "source": "proc" },
        {
            "destination": "/dev",
            "type": "tmpfs",
            "source": "tmpfs",
            "options": [ "nosuid", "strictatime", "mode=755", "size=65536k" ]
        },
        {
            "destination": "/dev/pts",
            "type": "devpts",
            "source": "devpts",
            "options": [ "nosuid", "noexec", "newinstance", "ptmxmode=0666", "mode=0620" ]
        },
        {
            "destination": "/dev/shm",
            "type": "tmpfs",
            "source": "shm",
            "options": [ "nosuid", "noexec", "nodev", "mode=1777", "size=65536k" ]
        },
        {
            "destination": "/dev/mqueue",
            "type": "mqueue",
            "source": "mqueue",
            "options": [ "nosuid", "noexec", "nodev" ]
        },
        {
            "destination": "/sys",
            "type": "sysfs",
            "source": "sysfs",
            "options": [ "nosuid", "noexec", "nodev" ]
        },
        {
            "destination": "/sys/fs/cgroup",
            "type": "cgroup",
            "source": "cgroup",
            "options": [ "nosuid", "noexec", "nodev", "relatime", "ro" ]
        }
    ],
    "process": {
        "terminal": true,
        "cwd": "/root",
        "env": [
            "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
            "TERM=xterm"
        ],
        "args": [ "/bin/bash", "--login" ],
        "rlimits": [
            { "type": "RLIMIT_NOFILE", "hard": 1024, "soft": 1024 }
        ],
        "capabilities": {
            "bounding": [ "CAP_AUDIT_WRITE", "CAP_KILL", "CAP_NET_BIND_SERVICE" ],
            "permitted": [ "CAP_AUDIT_WRITE", "CAP_KILL", "CAP_NET_BIND_SERVICE" ],
            "inheritable": [ "CAP_AUDIT_WRITE", "CAP_KILL", "CAP_NET_BIND_SERVICE" ],
            "effective": [ "CAP_AUDIT_WRITE", "CAP_KILL" ],
            "ambient": [ "CAP_NET_BIND_SERVICE" ]
        },
        "noNewPrivileges": true
    },
    "user": { "uid": 0, "gid": 0 },
    "hostname": "test",
    "linux": {
        "resources": {
            "devices": [ { "allow": false, "access": "rwm" } ]
        },
        "namespaces": [
            { "type": "pid" },
            { "type": "ipc" },
            { "type": "uts" },
            { "type": "mount" },
            { "type": "user" },
            { "type": "cgroup" }
        ],
        "uidMappings": [ { "containerID": 0, "hostID": 1000, "size": 1 } ],
        "gidMappings": [ { "containerID": 0, "hostID": 1000, "size": 1 } ],
        "devices": null
    }
}
```

Bind-mounting /sys instead works around the issue:

```json
"mounts": [
    {
        "destination": "/sys",
        "type": "none",
        "source": "/sys",
        "options": [
            "rbind",
            "nosuid",
            "noexec",
            "nodev",
            "ro"
        ]
    }
]
```

kolyshkin commented 1 year ago

I vaguely remember that this depends on the kernel version: some kernels (mistakenly) denied this mount.

Two possible solutions are:

  1. Upgrade the kernel
  2. Do not use rootless+userns+sysfs (lack of /sys might be OK for some containers).

I am not sure what the implications of bind-mounting the host /sys are, so I would not recommend doing that (without doing some security analysis first, that is).

kolyshkin commented 1 year ago

Now,

  1. This is not a runc bug (but rather a kernel bug)
  2. There's nothing runc can do about this (there's no easy workaround, and bind-mounting /sys is questionable)

Based on these two points, I am closing this as not-a-bug.

Let me know if you feel differently.

maleadt commented 1 year ago

> There's nothing runc can do about this (there's no easy workaround, and bind-mounting /sys is questionable)

But crun manages fine? I'm unfamiliar with the exact logic taking care of mounting sysfs, but this seems to indicate that there is a way to deal with this from the runtime's side.

Also, I'm happy to upgrade my kernel, but I'm using 5.15 -- the latest LTS -- which isn't exactly ancient. It's still what e.g. Ubuntu 22.04 is using/supporting for the next 5 years or so.

maleadt commented 1 year ago

Also, this reproduces on kernel 6.0.10 (Arch Linux)...

kolyshkin commented 1 year ago

OK, please tell us how to repro this (what is your environment and the steps to repro) and we'll take a look.

maleadt commented 1 year ago

> OK, please tell us how to repro this (what is your environment and the steps to repro) and we'll take a look.

There's not much more to it than what I've reported here:

```
./runc.amd64 run test
ERRO[0000] runc run failed: unable to start container process: error during container init: error mounting "sysfs" to rootfs at "/sys": mount sysfs:/sys (via /proc/self/fd/7), flags: 0xe: operation not permitted
```

g0dA commented 1 year ago

It is right that this is not a runc bug: the kernel denies this mount.

Why can crun mount sysfs?

Because when it runs in a user namespace, crun bind-mounts /sys instead of mounting sysfs:

https://github.com/containers/crun/blob/2700598aa9df55945d09084ca035e1d140bc7f73/src/libcrun/linux.c#L1084
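The behavior described here can be sketched as a small fallback routine. This is an illustration only, not crun's or runc's actual code. The `mount` callable and the `in_userns` argument are hypothetical stand-ins for a real mount(2) wrapper and a user-namespace check. The try-first-then-fall-back ordering is also an assumption, since the comment above only says that crun bind-mounts /sys when in a user namespace.

```python
import errno

# Linux mount(2) flag values (MS_* constants from <sys/mount.h>).
MS_BIND = 0x1000
MS_REC = 0x4000


def mount_sys(mount, in_userns, target="/sys", flags=0xE):
    """Mount sysfs on `target`, falling back to a recursive bind of host /sys.

    `mount(source, target, fstype, flags)` is a hypothetical callable that
    raises OSError on failure, the way a thin mount(2) wrapper would.
    Returns "sysfs" or "bind" to say which path succeeded.
    """
    try:
        mount("sysfs", target, "sysfs", flags)
        return "sysfs"
    except OSError as exc:
        # Fall back only for the case from this issue: EPERM inside a userns.
        if exc.errno != errno.EPERM or not in_userns:
            raise
    # An initial bind mount ignores flags other than MS_REC, so applying
    # nosuid/nodev/noexec would need a separate remount (omitted here).
    mount("/sys", target, None, MS_BIND | MS_REC)
    return "bind"
```

With a real mount wrapper this still exposes the host /sys to the container, so the security concern raised earlier in the thread applies to the fallback path as well.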

maleadt commented 1 year ago

I see; thanks!

kolyshkin commented 1 year ago

https://github.com/containers/crun/commit/6785cefbdf982c97a5552c9ce7017b0e8309c291

We should do the same for runc I guess

kolyshkin commented 1 year ago

Note that runc spec --rootless generates a spec which has /sys as a bind mount. I guess that is why we never saw this error. The code was added by #744 (specifically, commit d04cbc49d2ae4488a566eab86102c398522aaf14).

I think we still have to support replacing a proper /sys mount with a bind mount because crun does it.
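At the config level, replacing a proper /sys mount with a bind mount amounts to rewriting one entry of `mounts` into the form of the workaround shown at the top of this issue. A rough sketch, assuming the OCI config has already been loaded into a Python dict; `bind_sys_instead` is a hypothetical helper, not part of runc or crun:

```python
import copy


def bind_sys_instead(spec):
    """Return a copy of `spec` with the sysfs mount on /sys turned into a
    read-only recursive bind mount of the host /sys."""
    out = copy.deepcopy(spec)
    for m in out.get("mounts", []):
        if m.get("type") == "sysfs" and m.get("destination") == "/sys":
            m["type"] = "none"
            m["source"] = "/sys"
            # Keep the original options, then add rbind and ro exactly once.
            kept = [o for o in m.get("options", []) if o not in ("rbind", "ro")]
            m["options"] = ["rbind"] + kept + ["ro"]
    return out


spec = {
    "mounts": [
        {
            "destination": "/sys",
            "type": "sysfs",
            "source": "sysfs",
            "options": ["nosuid", "noexec", "nodev"],
        }
    ]
}
print(bind_sys_instead(spec)["mounts"][0]["options"])
```

The input dict is left untouched, and the caveat about the unanalyzed security implications of bind-mounting the host /sys still holds.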