nestybox / sysbox

An open-source, next-generation "runc" that empowers rootless containers to run workloads such as Systemd, Docker, Kubernetes, just like VMs.
Apache License 2.0

Overlayfs can't be mounted from within a sys container (except on Ubuntu) #83

Closed rodnymolina closed 4 years ago

rodnymolina commented 4 years ago

In the mainline Linux kernel, it's not possible to mount overlayfs from within a container (or, more accurately, from outside the initial user namespace). Attempting to do so fails with a "permission denied" error.

To reproduce the issue, simply enter a user namespace and then mount overlayfs:

cd /home/chino/sandbox/overlayfs
mkdir lower upper work merged
unshare -i -m -n -p -u -U -C -f --mount-proc -r /bin/bash
mount -t overlay overlay -o lowerdir=/home/chino/sandbox/overlayfs/lower,upperdir=/home/chino/sandbox/overlayfs/upper,workdir=/home/chino/sandbox/overlayfs/work /home/chino/sandbox/overlayfs/merged
mount: /home/chino/sandbox/overlayfs/merged: permission denied.
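
The reproduction above can be wrapped in a small script for reuse on other hosts. This is only a sketch: `overlay_opts` and `try_overlay_in_userns` are hypothetical helper names, and it uses a reduced set of `unshare` flags (`-U -m -r`) rather than the full set shown above, on the assumption that a user plus mount namespace is enough to trigger the failure.

```shell
#!/bin/sh
# Build the overlayfs option string for a sandbox directory.
# overlay_opts is a hypothetical helper name, not part of any tool.
overlay_opts() {
    base=$1
    printf 'lowerdir=%s/lower,upperdir=%s/upper,workdir=%s/work' \
        "$base" "$base" "$base"
}

# Attempt the mount inside a fresh user + mount namespace and report the result.
# -U: new user namespace, -m: new mount namespace, -r: map current user to root.
try_overlay_in_userns() {
    base=$1
    mkdir -p "$base/lower" "$base/upper" "$base/work" "$base/merged" || return 2
    if unshare -U -m -r mount -t overlay overlay \
            -o "$(overlay_opts "$base")" "$base/merged"; then
        echo "overlayfs mount allowed in user namespace"
    else
        echo "overlayfs mount denied in user namespace"
        return 1
    fi
}

# Example (not run automatically):
#   try_overlay_in_userns /tmp/overlayfs-test
```

Because the mount happens in a throwaway mount namespace, nothing stays mounted on the host after the command exits.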

It's not clear to me why this restriction exists; it may be related to the security issue described in this lwn.net article.

This is a problem because it prevents us from running system containers on ext4: if an inner Docker daemon is launched, it will try to mount overlayfs for the container images, and that operation will fail.

Fortunately, the problem does not occur on Ubuntu, which appears to carry a kernel patch that allows this. As described here:

"Ubuntu carries a patch that allows overlayfs mounting inside of an unprivileged user namespace, so we were carrying the fix mentioned above as a delta against the upstream Linux kernel since the issue didn't affect upstream overlayfs. "

Note that the problem does not affect system containers on btrfs, because in that case overlayfs is not used by the inner docker; it uses btrfs subvolumes.
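
To check which case a given host falls into, something like the following could be used. It is a rough sketch: `expects_overlay` and `fs_type_of` are hypothetical helpers, and treating every non-btrfs filesystem the same way as ext4 is an assumption made for illustration, not something established in this issue.

```shell
# Map a backing filesystem type to the behavior described above.
# expects_overlay is a hypothetical helper; lumping all non-btrfs
# filesystems together with ext4 is an assumption.
expects_overlay() {
    case "$1" in
        btrfs) return 1 ;;   # inner Docker uses btrfs subvolumes
        *)     return 0 ;;   # e.g. ext4: inner Docker mounts overlayfs
    esac
}

# Filesystem type backing a path, as reported by GNU stat.
# Note that GNU stat reports ext4 as "ext2/ext3".
fs_type_of() {
    stat -f -c %T "$1"
}

# Example (not run automatically):
#   if expects_overlay "$(fs_type_of /var/lib/docker)"; then
#       echo "inner Docker will need overlayfs mounts here"
#   fi
```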

(Ref #62)

rodnymolina commented 4 years ago

Just spent some time testing this one on Fedora-30. Unfortunately, we are reproducing the same issue there:

[root@fedora-30 overlayfs]# mount -t overlay overlay -o lowerdir=/home/rodny/overlayfs/lower,upperdir=/home/rodny/overlayfs/upper,workdir=/home/rodny/overlayfs/work /home/rodny/overlayfs/merged
mount: /home/rodny/overlayfs/merged: permission denied.
[root@fedora-30 overlayfs]#

[rodny@fedora-30 ~]$ uname -a
Linux fedora-30 5.0.16-300.fc30.x86_64 #1 SMP Tue May 14 19:33:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
[rodny@fedora-30 ~]$

rodnymolina commented 4 years ago

We recently added mount syscall interception to sysbox system containers.

We can leverage this feature to solve this issue, by having sysbox intercept mounts of overlayfs by processes in the sys container and perform those on behalf of the sys container.

This would bypass the permission problems, since sysbox is true root on the host.

More importantly, it would make Sysbox less dependent on Ubuntu, opening the door to supporting other distros.
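
To make the idea concrete: a handler that performs the mount on the container's behalf first has to recognize an overlayfs request and pull the directories out of the option string. The sketch below shows only that parsing step in shell. It is not sysbox's actual code, and `get_mount_opt` and `describe_overlay_request` are hypothetical names.

```shell
# Extract one key's value from a comma-separated mount option string.
# get_mount_opt is a hypothetical helper name.
get_mount_opt() {
    key=$1
    rest=$2
    while [ -n "$rest" ]; do
        kv=${rest%%,*}                 # first comma-separated field
        case "$kv" in
            "$key"=*) printf '%s\n' "${kv#*=}"; return 0 ;;
        esac
        case "$rest" in
            *,*) rest=${rest#*,} ;;    # drop the field just examined
            *)   rest= ;;              # that was the last field
        esac
    done
    return 1
}

# Print the directories of an overlay mount request; fail for other types.
describe_overlay_request() {
    fstype=$1
    opts=$2
    [ "$fstype" = overlay ] || return 1   # other mounts are ignored in this sketch
    printf 'lower=%s upper=%s work=%s\n' \
        "$(get_mount_opt lowerdir "$opts")" \
        "$(get_mount_opt upperdir "$opts")" \
        "$(get_mount_opt workdir "$opts")"
}
```

A real handler would additionally have to resolve these paths relative to the container's root filesystem before mounting from the host side; that part is left out here.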

Note, however, that mount syscall interception relies on very recent Linux kernels (the seccomp-notify mechanism plus seccomp-notify "continue").
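
A quick way to check whether a host kernel is new enough is to compare its release string against a minimum version. In the sketch below, `kernel_at_least` is a hypothetical helper. The version numbers in the usage comment (5.0 for seccomp user notification, 5.5 for the "continue" flag) reflect my understanding of upstream kernels and are not stated in this thread; distro kernels may also backport features, so a version check is only a first approximation.

```shell
# Succeed if a `uname -r` style release string is at least MAJOR.MINOR.
# kernel_at_least is a hypothetical helper name.
kernel_at_least() {
    release=$1
    want_major=$2
    want_minor=$3
    major=${release%%.*}           # text before the first dot
    rest=${release#*.}
    minor=${rest%%[!0-9]*}         # leading digits after the first dot
    [ "$major" -gt "$want_major" ] ||
        { [ "$major" -eq "$want_major" ] && [ "$minor" -ge "$want_minor" ]; }
}

# Example (not run automatically); 5.0 and 5.5 are assumed thresholds:
#   kernel_at_least "$(uname -r)" 5 0 && echo "seccomp-notify likely available"
#   kernel_at_least "$(uname -r)" 5 5 && echo "seccomp-notify continue likely available"
```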

rodnymolina commented 4 years ago

NOTE: This issue, combined with issue #160, means that support for system containers on ext4 will require Ubuntu Disco (Linux kernel 5.0).

rodnymolina commented 4 years ago

As expected, the problem is easily reproduced on CentOS 8 too:

[root@centos-8-vm ~]# uname -a
Linux centos-8-vm 4.18.0-193.6.3.el8_2.x86_64 #1 SMP Wed Jun 10 11:09:32 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

[root@centos-8-vm ~]# mkdir lower upper work merged
[root@centos-8-vm ~]# ls -lrt
total 16
drwxr-xr-x. 2 root root    6 Jul 12 04:37 upper
drwxr-xr-x. 2 root root    6 Jul 12 04:37 lower
drwxr-xr-x. 2 root root    6 Jul 12 04:37 work
drwxr-xr-x. 2 root root    6 Jul 12 04:37 merged
[root@centos-8-vm ~]#

[root@centos-8-vm ~]# unshare -i -m -n -p -u -U -C -f --mount-proc -r /bin/bash
[root@centos-8-vm ~]# ls -lrt
total 16
drwxr-xr-x. 2 root root    6 Jul 12 04:37 upper
drwxr-xr-x. 2 root root    6 Jul 12 04:37 lower
drwxr-xr-x. 2 root root    6 Jul 12 04:37 work
drwxr-xr-x. 2 root root    6 Jul 12 04:37 merged
[root@centos-8-vm ~]#

[root@centos-8-vm ~]# pwd
/root

[root@centos-8-vm ~]# mount -t overlay overlay -o lowerdir=/root/lower,upperdir=/root/upper,workdir=/root/work /root/merged
mount: /root/merged: permission denied.
[root@centos-8-vm ~]#

rodnymolina commented 4 years ago

Most non-Debian-based distros cannot mount overlayfs from within unprivileged user namespaces. They seem to rely on the fuse-overlayfs tool to work around this issue.

In Red Hat's case, their kernel allows fuse-overlayfs starting with 4.18, and they are even considering backporting fuse-overlayfs support to the 3.10 kernel. On the other hand, they are fully aware of the runtime penalty of running this feature in user space. See more details here:

https://indico.cern.ch/event/757415/contributions/3421994/attachments/1855302/3047064/Podman_Rootless_Containers.pdf
https://www.redhat.com/sysadmin/behind-scenes-podman
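
For comparison, the fuse-overlayfs workaround takes the same option string as the kernel mount, passed to the `fuse-overlayfs` binary instead of `mount -t overlay`. The sketch below only prints the command line so it can be inspected before running; `fuse_overlay_cmd` is a hypothetical helper, and the directory layout (lower, upper, work, merged under one base directory) is the one used in the reproductions in this issue.

```shell
# Print the fuse-overlayfs command for a sandbox directory laid out as
# lower/ upper/ work/ merged/. fuse_overlay_cmd is a hypothetical helper;
# it does not run the command itself.
fuse_overlay_cmd() {
    base=$1
    printf 'fuse-overlayfs -o lowerdir=%s/lower,upperdir=%s/upper,workdir=%s/work %s/merged\n' \
        "$base" "$base" "$base" "$base"
}

# Example (not run automatically), inside a user namespace with
# fuse-overlayfs installed:
#   $(fuse_overlay_cmd /root)      # mount
#   fusermount -u /root/merged     # unmount
```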

rodnymolina commented 4 years ago

As part of this task, we should also investigate what other distros have this problem.

rodnymolina commented 4 years ago

Just got a working implementation of an overlayfs mount handler built on our syscall-trapping infrastructure. There is still one loose end to take care of (Docker nesting is not working yet), but we can now successfully mount overlayfs from within an unprivileged user namespace.

root@test-1:~# cd /var/lib/docker
root@test-1:/var/lib/docker#

root@test-1:/var/lib/docker# mkdir rodny
root@test-1:/var/lib/docker# cd rodny/

root@test-1:/var/lib/docker/rodny# mkdir lower upper work merged
root@test-1:/var/lib/docker/rodny#

<-- Before changes ...

root@test-1:/var/lib/docker/rodny# mount -t overlay overlay -olowerdir=/var/lib/docker/rodny/lower,upperdir=/var/lib/docker/rodny/upper,workdir=/var/lib/docker/rodny/work /var/lib/docker/rodny/merged
mount: /var/lib/docker/rodny/merged: permission denied.
root@test-1:/var/lib/docker/rodny#

<-- After ...

root@test-1:/var/lib/docker/rodny# mount -t overlay overlay -olowerdir=/var/lib/docker/rodny/lower,upperdir=/var/lib/docker/rodny/upper,workdir=/var/lib/docker/rodny/work /var/lib/docker/rodny/merged

root@test-1:/var/lib/docker/rodny# findmnt
TARGET                                SOURCE                 FSTYPE   OPTIONS
...
|-/var/lib/docker                     /dev/vda1[/var/lib/sysbox/docker/baseVol/63e6efda3700689e415b795304167c364190800f21c62b9bd7b915a6154d86d4]
|                                                            xfs      rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota
| `-/var/lib/docker/rodny/merged      overlay                overlay  rw,relatime,seclabel,lowerdir=/var/lib/docker/rodny/lower,upperdir=/var/lib/docker/rodny/upper,workdir=/var/lib/docker/rodny/work
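
The same check can be scripted instead of reading the findmnt output by eye. The sketch below scans a `/proc/mounts`-style file for an overlay entry at a given mount point; `is_overlay_mounted` is a hypothetical helper, and the optional second argument exists only so the function can be exercised against a sample file.

```shell
# Succeed if TARGET appears as an overlay mount in a /proc/mounts-style file.
# Each line has the form: source mountpoint fstype options dump pass.
# is_overlay_mounted is a hypothetical helper name.
is_overlay_mounted() {
    target=$1
    mounts_file=${2:-/proc/self/mounts}
    while read -r _src mnt fstype _rest; do
        if [ "$mnt" = "$target" ] && [ "$fstype" = overlay ]; then
            return 0
        fi
    done < "$mounts_file"
    return 1
}

# Example (not run automatically):
#   is_overlay_mounted /var/lib/docker/rodny/merged && echo "overlay mounted"
```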

rodnymolina commented 4 years ago

Fixed by PR https://github.com/nestybox/sysbox-fs/pull/10. Closing.