process_linux.go:297: applying cgroup configuration for process: read-only file system

kenorb commented 4 years ago

I'm trying to run a busybox container in Colab, but I get the following error:

WARN[0000] signal: killed                               
ERRO[0000] container_linux.go:349: starting container process caused "process_linux.go:297: applying cgroup configuration for process caused \"mkdir /sys/fs/cgroup/cpuset/container1: read-only file system\"" 
container_linux.go:349: starting container process caused "process_linux.go:297: applying cgroup configuration for process caused \"mkdir /sys/fs/cgroup/cpuset/container1: read-only file system\""

Here are the steps:

Install Docker and runc:

%%shell
curl -s https://download.docker.com/linux/static/stable/x86_64/docker-19.03.9.tgz | tar vxz --strip=1 -C /usr/local/bin/
wget -cqO /usr/local/bin/runc https://github.com/opencontainers/runc/releases/download/v1.0.0-rc92/runc.amd64 && chmod +x /usr/local/bin/runc
docker --version
runc --version

Extract Busybox container into busybox/rootfs:

%%shell
dockerd -b none --iptables=0 -l warn &
sleep 1
mkdir -pv busybox/rootfs
docker export $(docker create busybox) | tar -C busybox/rootfs -xf -
kill $(jobs -p)

Run:

%%shell
cd busybox
runc spec --rootless
runc run --no-new-keyring --no-pivot container1

Demo: https://colab.research.google.com/drive/19hVpEODrL8kb7KvyWrA9vE6Pd7ZKMA4G#scrollTo=VhgKc1a6zMTq

Is there any way to run the container with only read-only access to the cgroup configuration?

cyphar commented 4 years ago

That's strange, we should already be allowing this for rootless containers:

// isIgnorableError returns whether err is a permission error (in the loose
// sense of the word). This includes EROFS (which for an unprivileged user is
// basically a permission error) and EACCES (for similar reasons) as well as
// the normal EPERM.
func isIgnorableError(rootless bool, err error) bool {
    // We do not ignore errors if we are root.
    if !rootless {
        return false
    }
    // TODO: rm errors.Cause once we switch to %w everywhere
    err = errors.Cause(err)
    // Is it an ordinary EPERM?
    if errors.Is(err, os.ErrPermission) {
        return true
    }
    // Handle some specific syscall errors.
    var errno unix.Errno
    if errors.As(err, &errno) {
        return errno == unix.EROFS || errno == unix.EPERM || errno == unix.EACCES
    }
    return false
}

(Note the EROFS check.) I wonder if this is due to errors.As not liking us using unix.Errno instead of syscall.Errno...

kolyshkin commented 4 years ago

unix.Errno instead of syscall.Errno

AFAIK those are synonyms. Yes indeed, x/sys/unix defines it as

type Errno = syscall.Errno
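Because the two names refer to the same type, errors.As should be able to pull the errno out of a wrapped mkdir error either way. A minimal sketch of that behaviour, using only the standard library (syscall.Errno instead of unix.Errno). The helper names isReadOnlyOrPermission and sampleMkdirError are made up for this illustration and are not runc's code:

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"syscall"
)

// isReadOnlyOrPermission is a hypothetical helper (not part of runc). It
// reports whether err wraps EROFS, EPERM, or EACCES anywhere in its chain.
func isReadOnlyOrPermission(err error) bool {
	var errno syscall.Errno
	if errors.As(err, &errno) {
		return errno == syscall.EROFS || errno == syscall.EPERM || errno == syscall.EACCES
	}
	return false
}

// sampleMkdirError builds an error shaped like the one in this issue: a
// *os.PathError carrying EROFS, wrapped once more with %w.
func sampleMkdirError() error {
	mkdirErr := &os.PathError{
		Op:   "mkdir",
		Path: "/sys/fs/cgroup/cpuset/container1",
		Err:  syscall.EROFS,
	}
	return fmt.Errorf("applying cgroup configuration for process: %w", mkdirErr)
}

func main() {
	fmt.Println(isReadOnlyOrPermission(sampleMkdirError()))
	fmt.Println(isReadOnlyOrPermission(errors.New("unrelated")))
}
```

If this holds, the type alias is unlikely to be what prevents EROFS from being ignored.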

kolyshkin commented 4 years ago

So, you think you're using rc92, but you're not. From your link:

...
wget -cqO /usr/local/bin/runc https://github.com/opencontainers/runc/releases/download/v1.0.0-rc92/runc.amd64 && chmod +x /usr/local/bin/runc
docker --version
runc --version
....
runc version 1.0.0-rc10
commit: dc9208a3303feef5b3839f4323d9beb36df0a9dd
spec: 1.0.1-dev

I guess you're not using runc from /usr/local/bin.

Anyway, using the same runc version as you used, the code is slightly different, but it still should ignore EROFS.

Might be an actual bug.

For now, I suggest you retry with rc92 or the latest git HEAD.

@AkihiroSuda PTAL?

cyphar commented 4 years ago

I modified their example to correctly use rc92 and it has the same result:

ERRO[0000] container_linux.go:370: starting container process caused: process_linux.go:326: applying cgroup configuration for process caused: mkdir /sys/fs/cgroup/cpuset/container1: read-only file system

I think there might be something wrong with how we implemented the errors.As conversion in b2272b2cba97817be7c0f173bbed9ab1d95e5349, because I'm pretty sure this worked with the original implementation. I'll take a closer look.

kolyshkin commented 4 years ago

I think there might be something wrong with how we implemented the errors.As conversion in b2272b2

That was my primary suspect as well but since it's not working in rc10 either, it is probably not the case.

It works for me on cgroupv2 + systemd (Fedora 32), as well as cgroupv1 + systemd (CentOS 8) -- with both fs[2] and systemd[2] cgroup drivers. It's probably not working on Colab because the host already has cgroupfs mounted read-only.

I think the reason the error is not ignored in this particular setup is that shouldUseRootlessCgroupManager returns false.

Indeed, if I run it as

runc --rootless=true run --no-new-keyring --no-pivot container1

I get a different error:

ERRO[0000] container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: rootfs_linux.go:59: mounting "proc" to rootfs at "/proc" caused: operation not permitted

So, it's a peculiarity of a particular environment -- it reports as if you're root (getuid returns 0, /proc/self/uid_map shows you're root), while in fact you are not.

I'm inclined to close this one.

kolyshkin commented 4 years ago

it reports as if you're root (getuid returns 0, /proc/self/uid_map shows you're root), while in fact you are not

This is what I mean:

id -a
cat /proc/self/uid_map
uid=0(root) gid=0(root) groups=0(root)
         0          0 4294967295

Now, if we look into the implementation of shouldUseRootlessCgroupManager

https://github.com/opencontainers/runc/blob/2b31437caa905b7b944a891aee613e7dd0a1f898/rootless_linux.go#L12

we'll see it will return false in this environment.
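To make the reasoning concrete, here is a simplified sketch of that kind of auto-detection. It assumes the decision boils down to "effective UID is non-zero, or we are inside a user namespace", and that a uid_map consisting of the single identity line shown above means the initial user namespace. The function names are made up for this illustration; this is not runc's actual implementation, and it ignores the explicit --rootless flag:

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// looksLikeInitialUserNS is a hypothetical helper. It reports whether the
// uid_map content is the single identity mapping "0 0 4294967295", which is
// what the initial user namespace shows.
func looksLikeInitialUserNS(uidMap string) bool {
	lines := strings.Split(strings.TrimSpace(uidMap), "\n")
	if len(lines) != 1 {
		return false
	}
	fields := strings.Fields(lines[0])
	if len(fields) != 3 {
		return false
	}
	return fields[0] == "0" && fields[1] == "0" && fields[2] == "4294967295"
}

// autoDetectRootless is a simplified stand-in for the auto-detection
// discussed in this thread (an assumption, not runc's code).
func autoDetectRootless(euid int, uidMap string) bool {
	if euid != 0 {
		return true // unprivileged user: rootless
	}
	// euid 0: rootless only if we are root inside a user namespace.
	return !looksLikeInitialUserNS(uidMap)
}

func main() {
	data, err := os.ReadFile("/proc/self/uid_map")
	if err != nil {
		fmt.Println("cannot read uid_map:", err)
		return
	}
	fmt.Println("rootless (auto-detected):", autoDetectRootless(os.Geteuid(), string(data)))
}
```

With the id and uid_map output shown above, a check of this shape reports "not rootless", so the EROFS error is not ignored.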

kolyshkin commented 4 years ago

One thing I found that I don't really like is the "rootless" flag handling in runc. Filed issue https://github.com/opencontainers/runc/issues/2645.

kenorb commented 4 years ago

Thanks for addressing the issue.

I've checked, and as root it's possible to remount /proc and similar filesystems with write access:

%%shell
mount -vt proc proc /proc -o rw,remount
mount -vt sysfs sysfs /sys -o rw,remount
mount -vt tmpfs tmpfs /sys/fs/cgroup -o rw,remount

After remounting, they are shown as:

proc on /proc type proc (rw,relatime)
sysfs on /sys type sysfs (rw,relatime)
tmpfs on /sys/fs/cgroup type tmpfs (rw,relatime,mode=755)

But I haven't found a way to remount the other cgroup subdirectories such as /sys/fs/cgroup/cpuset (although they can be unmounted fine); the attempt fails because of a bad option, or maybe because the cgroup type doesn't exist, or something similar.

However, after running the above followed by the code below:

%%shell
cd busybox
runc spec --rootless
cat config.json | xargs
runc --rootless=true run --no-new-keyring --no-pivot container1

the error is:

ERRO[0000] container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"rootfs_linux.go:58: mounting \\\"proc\\\" to rootfs \\\"/content/busybox/rootfs\\\" at \\\"/proc\\\" caused \\\"operation not permitted\\\"\"" 
container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"rootfs_linux.go:58: mounting \\\"proc\\\" to rootfs \\\"/content/busybox/rootfs\\\" at \\\"/proc\\\" caused \\\"operation not permitted\\\"\""

I'm not quite sure what the above error is about, as I'm able to mount /proc inside busybox/rootfs manually with:

%%shell
mount -vt proc proc /content/busybox/rootfs/proc -o ro
mount -vt proc proc /content/busybox/rootfs/proc -o rw,remount
stat /content/busybox/rootfs/proc

Output:

mount: /content/busybox/rootfs/proc: proc already mounted on /proc.
mount: proc mounted on /content/busybox/rootfs/proc.
  File: /content/busybox/rootfs/proc
...

So I think the relevant permission to mount proc is there.

kolyshkin commented 4 years ago

mount -vt proc proc /content/busybox/rootfs/proc -o ro
mount -vt proc proc /content/busybox/rootfs/proc -o rw,remount

I have yet to see software that mounts something read-only first and then remounts it read-write. runc is not doing that.

What you are trying to achieve is interesting nevertheless; please keep digging and let us know about your progress.

cyphar commented 4 years ago

@kolyshkin Oh right, I didn't notice they were running as root -- so you're quite right that all of the rootless handling will not be exercised. :man_facepalming: I tested this on my box and if you run as an unprivileged user it works but as root it (as expected) does not. I agree that rootless handling is a bit hairy (and always has been), but I'll comment on the issue you opened to give some more context and hopefully we can improve the situation.

cyphar commented 4 years ago

@kenorb The kernel has several protections against mounting pseudo-filesystems in certain contexts, one of which is that you cannot mount a filesystem like proc inside a user namespace if all of the other visible mountpoints from your process have been "masked" by over-mounts. This is the mount_too_revealing check, and it boils down to "are there mounts on top of subdirectories in /proc?" And when we look, we find quite a few:

% mount | grep 'on /proc'
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
proc on /proc/bus type proc (ro,relatime)
proc on /proc/fs type proc (ro,relatime)
proc on /proc/irq type proc (ro,relatime)
proc on /proc/sys type proc (ro,relatime)
proc on /proc/sysrq-trigger type proc (ro,relatime)
tmpfs on /proc/acpi type tmpfs (ro,relatime)
tmpfs on /proc/kcore type tmpfs (rw,nosuid,size=65536k,mode=755)
tmpfs on /proc/keys type tmpfs (rw,nosuid,size=65536k,mode=755)
tmpfs on /proc/timer_list type tmpfs (rw,nosuid,size=65536k,mode=755)
tmpfs on /proc/scsi type tmpfs (ro,relatime)

(As an aside, this looks very similar to what we do in runc containers.)
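The same inspection can be scripted. The sketch below lists every mountpoint beneath /proc in mount(8)-style output. procOverMounts is a made-up helper name, the parsing assumes the "source on mountpoint type fstype (options)" line shape shown above, and it only approximates the question "is anything mounted on top of /proc?" rather than reproducing the kernel's actual check:

```go
package main

import (
	"fmt"
	"strings"
)

// procOverMounts is a hypothetical helper. It returns the mountpoints located
// beneath /proc found in mount(8)-style output. Mountpoints containing spaces
// are not handled.
func procOverMounts(mountOutput string) []string {
	var found []string
	for _, line := range strings.Split(mountOutput, "\n") {
		// Expected shape: "<source> on <mountpoint> type <fstype> (<options>)".
		fields := strings.Fields(line)
		if len(fields) < 5 || fields[1] != "on" || fields[3] != "type" {
			continue
		}
		if strings.HasPrefix(fields[2], "/proc/") {
			found = append(found, fields[2])
		}
	}
	return found
}

func main() {
	// A few lines taken from the listing above.
	sample := `proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
proc on /proc/bus type proc (ro,relatime)
tmpfs on /proc/kcore type tmpfs (rw,nosuid,size=65536k,mode=755)`
	for _, mountpoint := range procOverMounts(sample) {
		fmt.Println(mountpoint)
	}
}
```

A non-empty result is a hint that a fresh procfs mount inside a user namespace may be refused.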

This means that you are in a situation where you wouldn't be able to mount a proper procfs inside a user namespace. There is work in the kernel to fix this issue (hidepid=4,subset=pid), but unfortunately this probably isn't supported by the Google kernels. However, since you are root, you can do something a little bit more dodgy like this:

% mkdir -p /tmp/.stashed-proc ; umount -f /tmp/.stashed-proc
% unshare -pf -- mount -t proc proc /tmp/.stashed-proc

This will create a procfs mount which is not masked but contains no processes, which allows you to mount another procfs inside a user namespace. And this works! Except now we hit a new issue when trying to switch roots:

WARN[0000] exit status 1                                
ERRO[0000] container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: rootfs_linux.go:117: jailing process inside rootfs caused: no such file or directory

So there's something going on inside the MS_MOVE code (and annoyingly none of the errors are wrapped so we have to use strace). A quick strace later (grepping for ENOENT) we find this:

% strace -fyy -o runc.trace -- runc --rootless=true run --no-new-keyring --no-pivot -b bundle container1
% grep ENOENT runc.trace | tail
[ snip ]
3190  mount("", "/proc/bus", 0xc0001e9a27, MS_REC|MS_SLAVE, NULL) = -1 ENOENT (No such file or directory)
[ snip ]

So it looks like the culprit is the masking code in msMoveRoot (as I suspected). And looking at the code:

func msMoveRoot(rootfs string) error {
    mountinfos, err := mountinfo.GetMounts(func(info *mountinfo.Info) (skip, stop bool) {
        skip = false
        stop = false
        // Collect every sysfs and proc file systems, except those under the container rootfs
        if (info.FSType != "proc" && info.FSType != "sysfs") || strings.HasPrefix(info.Mountpoint, rootfs) {
            skip = true
            return
        }
        return
    })
    if err != nil {
        return err
    }

    for _, info := range mountinfos {
        p := info.Mountpoint
        // Be sure umount events are not propagated to the host.
        if err := unix.Mount("", p, "", unix.MS_SLAVE|unix.MS_REC, ""); err != nil {
            return err
        }
        if err := unix.Unmount(p, unix.MNT_DETACH); err != nil {
            if err != unix.EINVAL && err != unix.EPERM {
                return err
            } else {
                // If we have not privileges for umounting (e.g. rootless), then
                // cover the path.
                if err := unix.Mount("tmpfs", p, "tmpfs", 0, ""); err != nil {
                    return err
                }
            }
        }
    }
    if err := unix.Mount(rootfs, "/", "", unix.MS_MOVE, ""); err != nil {
        return err
    }
    return chroot()
}

The bug is that it will try to umount subdirectories of the filesystem after unmounting the parent (which will obviously result in ENOENT) -- this is happening because subdirectories of procfs are being remounted as ro which looks like a separate procfs mount to this code. So we will need to rework this code so that it doesn't try to do that anymore -- it might just be as simple as checking whether the mount is the root of a procfs.
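A sketch of what "checking whether the mount is the root of a procfs" could look like as a filter. It uses a local stand-in type instead of mountinfo.Info, and assumes that each entry carries the mount's root within its filesystem ("/" for a full mount, something like "/bus" for a bind mount of a subdirectory). The names mountEntry and mountsToMask are made up for this illustration, and this is not necessarily the change made in #2647:

```go
package main

import (
	"fmt"
	"strings"
)

// mountEntry is a local stand-in for a parsed mountinfo line (hypothetical).
type mountEntry struct {
	Root       string // root of the mount within its filesystem
	Mountpoint string // where the mount is attached
	FSType     string // filesystem type, e.g. "proc" or "sysfs"
}

// mountsToMask returns the mountpoints of full proc and sysfs mounts outside
// the container rootfs. Partial mounts (Root != "/") are skipped because they
// disappear together with their parent mount.
func mountsToMask(mounts []mountEntry, rootfs string) []string {
	var out []string
	for _, m := range mounts {
		if m.FSType != "proc" && m.FSType != "sysfs" {
			continue
		}
		if m.Root != "/" {
			continue
		}
		if strings.HasPrefix(m.Mountpoint, rootfs) {
			continue
		}
		out = append(out, m.Mountpoint)
	}
	return out
}

func main() {
	mounts := []mountEntry{
		{Root: "/", Mountpoint: "/proc", FSType: "proc"},
		{Root: "/bus", Mountpoint: "/proc/bus", FSType: "proc"},
		{Root: "/", Mountpoint: "/sys", FSType: "sysfs"},
		{Root: "/", Mountpoint: "/content/busybox/rootfs/proc", FSType: "proc"},
	}
	fmt.Println(mountsToMask(mounts, "/content/busybox/rootfs"))
}
```

With a filter like this, /proc/bus is never visited on its own, so the MS_SLAVE remount that failed with ENOENT in the trace above would not be attempted.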

cyphar commented 4 years ago

Okay, after applying #2647 I managed to get it to work in your environment @kenorb. @kolyshkin can you review #2647?

opencontainers / runc

process_linux.go:297: applying cgroup configuration for process: read-only file system #2639