Open kenorb opened 4 years ago
That's strange, we should already be allowing this for rootless containers:
// isIgnorableError returns whether err is a permission error (in the loose
// sense of the word). This includes EROFS (which for an unprivileged user is
// basically a permission error) and EACCES (for similar reasons) as well as
// the normal EPERM.
func isIgnorableError(rootless bool, err error) bool {
	// We do not ignore errors if we are root.
	if !rootless {
		return false
	}
	// TODO: rm errors.Cause once we switch to %w everywhere
	err = errors.Cause(err)
	// Is it an ordinary EPERM?
	if errors.Is(err, os.ErrPermission) {
		return true
	}
	// Handle some specific syscall errors.
	var errno unix.Errno
	if errors.As(err, &errno) {
		return errno == unix.EROFS || errno == unix.EPERM || errno == unix.EACCES
	}
	return false
}
(Note the EROFS check.) I wonder if this is due to errors.As not liking us using unix.Errno instead of syscall.Errno...
unix.Errno instead of syscall.Errno
AFAIK those are synonyms. Yes indeed, x/sys/unix defines it as
type Errno = syscall.Errno
So, you think you're using rc92 but it's not true. From your link:
...
wget -cqO /usr/local/bin/runc https://github.com/opencontainers/runc/releases/download/v1.0.0-rc92/runc.amd64 && chmod +x /usr/local/bin/runc
docker --version
runc --version
....
runc version 1.0.0-rc10
commit: dc9208a3303feef5b3839f4323d9beb36df0a9dd
spec: 1.0.1-dev
I guess you're not using runc from /usr/local/sbin.
Anyway, using the same runc version as you used, the code is slightly different, but it should still ignore EROFS.
Might be an actual bug.
For now, I suggest you retry with rc92 or the latest git HEAD.
@AkihiroSuda PTAL?
I modified their example to correctly use rc92 and it has the same result:
ERRO[0000] container_linux.go:370: starting container process caused: process_linux.go:326: applying cgroup configuration for process caused: mkdir /sys/fs/cgroup/cpuset/container1: read-only file system
I think there might be something wrong with how we implemented the errors.As conversion in b2272b2cba97817be7c0f173bbed9ab1d95e5349, because I'm pretty sure this worked with the original implementation. I'll take a closer look.
I think there might be something wrong with how we implemented the errors.As conversion in b2272b2
That was my primary suspect as well but since it's not working in rc10 either, it is probably not the case.
It works for me on cgroupv2 + systemd (Fedora 32), as well as cgroupv1 + systemd (CentOS 8) -- with both fs[2] and systemd[2] cgroup drivers. It's probably not working on colab because the host has cgroupfs readonly already.
I think the error is not ignored in this particular setup because shouldUseRootlessCgroupManager returns false.
Indeed, if I run it as
runc --rootless=true run --no-new-keyring --no-pivot container1
I get a different error:
ERRO[0000] container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: rootfs_linux.go:59: mounting "proc" to rootfs at "/proc" caused: operation not permitted
So, it's a peculiarity of a particular environment -- it reports as if you're root (getuid returns 0, /proc/self/uid_map shows you're root), while in fact you are not.
I'm inclined to close this one.
it reports as if you're root (getuid returns 0, /proc/self/uid_map shows you're root), while in fact you are not
This is what I mean:
id -a
cat /proc/self/uid_map
uid=0(root) gid=0(root) groups=0(root)
0 0 4294967295
Now, if we look into the implementation of shouldUseRootlessCgroupManager we'll see it will return false in this environment.
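For readers who want to see why this environment looks like "real root", here is a rough sketch of the kind of check involved. It is not runc's actual shouldUseRootlessCgroupManager (which considers more than this); looksLikeInitialUserNS and looksRootless are hypothetical names, and the rule "euid 0 plus a full identity uid_map means not rootless" is a simplifying assumption:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// looksLikeInitialUserNS is a hypothetical helper: it reports whether the
// contents of /proc/self/uid_map are the single full-range identity mapping
// "0 0 4294967295", which is what the initial user namespace shows.
func looksLikeInitialUserNS(uidMap string) bool {
	lines := strings.Split(strings.TrimSpace(uidMap), "\n")
	if len(lines) != 1 {
		return false
	}
	fields := strings.Fields(lines[0])
	if len(fields) != 3 {
		return false
	}
	var vals [3]uint64
	for i, f := range fields {
		v, err := strconv.ParseUint(f, 10, 64)
		if err != nil {
			return false
		}
		vals[i] = v
	}
	return vals[0] == 0 && vals[1] == 0 && vals[2] == 4294967295
}

// looksRootless is a simplified stand-in (an assumption, not runc code):
// treat the process as rootless if it is not uid 0, or if it is uid 0 only
// inside a non-initial user namespace.
func looksRootless(euid int, uidMap string) bool {
	if euid != 0 {
		return true
	}
	return !looksLikeInitialUserNS(uidMap)
}

func main() {
	// Values taken from the id / uid_map output above.
	fmt.Println(looksRootless(0, "0 0 4294967295\n"))
}
```

With the values from this environment the sketch answers "not rootless", which matches the behaviour described: the rootless error handling is never reached.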
One thing I found out that I don't really like is "rootless" flag handling in runc. Filed issue https://github.com/opencontainers/runc/issues/2645.
Thanks for addressing the issue.
I've checked, and using the root account, it's possible to remount /proc and similar with write access:
%shell
mount -vt proc proc /proc -o rw,remount
mount -vt sysfs sysfs /sys -o rw,remount
mount -vt tmpfs tmpfs /sys/fs/cgroup -o rw,remount
After remounting, they are shown as:
proc on /proc type proc (rw,relatime)
sysfs on /sys type sysfs (rw,relatime)
tmpfs on /sys/fs/cgroup type tmpfs (rw,relatime,mode=755)
But I haven't found a way to remount the other cgroup sub-dirs (such as /sys/fs/cgroup/cpuset, although it can be unmounted fine), because of a bad option, or maybe the cgroup type doesn't exist or something.
However, after the above and the code below:
%%shell
cd busybox
runc spec --rootless
cat config.json | xargs
runc --rootless=true run --no-new-keyring --no-pivot container1
the error is:
ERRO[0000] container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"rootfs_linux.go:58: mounting \\\"proc\\\" to rootfs \\\"/content/busybox/rootfs\\\" at \\\"/proc\\\" caused \\\"operation not permitted\\\"\""
container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"rootfs_linux.go:58: mounting \\\"proc\\\" to rootfs \\\"/content/busybox/rootfs\\\" at \\\"/proc\\\" caused \\\"operation not permitted\\\"\""
I'm not quite sure what the above error is about, as I'm able to successfully mount /proc inside busybox/rootfs manually by:
%%shell
mount -vt proc proc /content/busybox/rootfs/proc -o ro
mount -vt proc proc /content/busybox/rootfs/proc -o rw,remount
stat /content/busybox/rootfs/proc
Output:
mount: /content/busybox/rootfs/proc: proc already mounted on /proc.
mount: proc mounted on /content/busybox/rootfs/proc.
File: /content/busybox/rootfs/proc
...
So I think the relevant permission to mount proc is there.
mount -vt proc proc /content/busybox/rootfs/proc -o ro mount -vt proc proc /content/busybox/rootfs/proc -o rw,remount
I have yet to see the software that mounts something read-only first and then remounts it read-write. runc is not doing that.
What you try to achieve is interesting nevertheless; please keep digging and inform us about your progress.
@kolyshkin Oh right, I didn't notice they were running as root -- so you're quite right that all of the rootless handling will not be exercised. :man_facepalming: I tested this on my box and if you run as an unprivileged user it works but as root it (as expected) does not. I agree that rootless handling is a bit hairy (and always has been), but I'll comment on the issue you opened to give some more context and hopefully we can improve the situation.
@kenorb The kernel has several protections against mounting pseudo-filesystems in certain contexts, one of which is that you cannot mount a filesystem like proc inside a user namespace if all of the other visible mountpoints from your process have been "masked" by over-mounts. This is called the mnt_too_revealing check and it boils down to "are there mounts on top of subdirectories in /proc?" And when we look, we find quite a few:
% mount | grep 'on /proc'
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
proc on /proc/bus type proc (ro,relatime)
proc on /proc/fs type proc (ro,relatime)
proc on /proc/irq type proc (ro,relatime)
proc on /proc/sys type proc (ro,relatime)
proc on /proc/sysrq-trigger type proc (ro,relatime)
tmpfs on /proc/acpi type tmpfs (ro,relatime)
tmpfs on /proc/kcore type tmpfs (rw,nosuid,size=65536k,mode=755)
tmpfs on /proc/keys type tmpfs (rw,nosuid,size=65536k,mode=755)
tmpfs on /proc/timer_list type tmpfs (rw,nosuid,size=65536k,mode=755)
tmpfs on /proc/scsi type tmpfs (ro,relatime)
(As an aside, this looks very similar to what we do in runc containers.)
This means that you are in a situation where you wouldn't be able to mount a proper procfs inside a user namespace.
There is work in the kernel to fix this issue (hidepid=4,subset=pid) but unfortunately this probably isn't supported by the Google kernels. However since you are root, you can do something a little bit more dodgy like this:
% mkdir -p /tmp/.stashed-proc ; umount -f /tmp/.stashed-proc
% unshare -pf -- mount -t proc proc /tmp/.stashed-proc
This will create a procfs mount which is not masked but contains no processes, which allows you to mount another procfs inside a user namespace. And this works! Except now we hit a new issue when trying to switch roots:
WARN[0000] exit status 1
ERRO[0000] container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: rootfs_linux.go:117: jailing process inside rootfs caused: no such file or directory
So there's something going on inside the MS_MOVE code (and annoyingly none of the errors are wrapped so we have to use strace). A quick strace later (grepping for ENOENT) we find this:
% strace -fyy -o runc.trace -- runc --rootless=true run --no-new-keyring --no-pivot -b bundle container1
% grep ENOENT runc.trace | tail
[ snip ]
3190 mount("", "/proc/bus", 0xc0001e9a27, MS_REC|MS_SLAVE, NULL) = -1 ENOENT (No such file or directory)
[ snip ]
So it looks like the culprit is the masking code in msMoveRoot (as I suspected). And looking at the code:
func msMoveRoot(rootfs string) error {
	mountinfos, err := mountinfo.GetMounts(func(info *mountinfo.Info) (skip, stop bool) {
		skip = false
		stop = false
		// Collect every sysfs and proc file systems, except those under the container rootfs
		if (info.FSType != "proc" && info.FSType != "sysfs") || strings.HasPrefix(info.Mountpoint, rootfs) {
			skip = true
			return
		}
		return
	})
	if err != nil {
		return err
	}
	for _, info := range mountinfos {
		p := info.Mountpoint
		// Be sure umount events are not propagated to the host.
		if err := unix.Mount("", p, "", unix.MS_SLAVE|unix.MS_REC, ""); err != nil {
			return err
		}
		if err := unix.Unmount(p, unix.MNT_DETACH); err != nil {
			if err != unix.EINVAL && err != unix.EPERM {
				return err
			} else {
				// If we have not privileges for umounting (e.g. rootless), then
				// cover the path.
				if err := unix.Mount("tmpfs", p, "tmpfs", 0, ""); err != nil {
					return err
				}
			}
		}
	}
	if err := unix.Mount(rootfs, "/", "", unix.MS_MOVE, ""); err != nil {
		return err
	}
	return chroot()
}
The bug is that it will try to umount subdirectories of the filesystem after unmounting the parent (which will obviously result in ENOENT) -- this is happening because subdirectories of procfs are being remounted as ro which looks like a separate procfs mount to this code. So we will need to rework this code so that it doesn't try to do that anymore -- it might just be as simple as checking whether the mount is the root of a procfs.
Okay, after applying #2647 I managed to get it to work in your environment @kenorb. @kolyshkin can you review #2647?
I'm trying to run a busybox container in Colab, however I've got the following error:
Here are the steps:
busybox/rootfs:
Demo: https://colab.research.google.com/drive/19hVpEODrL8kb7KvyWrA9vE6Pd7ZKMA4G#scrollTo=VhgKc1a6zMTq
Is there any way to run the container having read-only access to cgroup configuration?