nestybox / sysbox

An open-source, next-generation "runc" that empowers rootless containers to run workloads such as Systemd, Docker, Kubernetes, just like VMs.
Apache License 2.0

Sysbox does not handle nested bind-mounts into the container correctly #147

Closed ctalledo closed 3 years ago

ctalledo commented 3 years ago

When launching a container with Docker + Sysbox with nested host bind-mounts, the inner bind-mount shows up inside the container owned by nobody:nogroup (listed as nobody:nobody in the Alpine-based image below).

For example, below we launch a sysbox container with nested host bind-mounts of ~/tmp/hometest -> /home/admin and ~/tmp/hometest/docker -> /var/lib/docker:

$ docker run --runtime=sysbox-runc -it --rm --mount type=bind,source=$HOME/tmp/hometest,target=/home/admin --mount type=bind,source=$HOME/tmp/hometest/docker,target=/var/lib/docker nestybox/alpine-docker-dbg

/ # cd var/lib
/var/lib # ls -l
total 28
drwxr-xr-x    2 root     root          4096 May 29  2020 apk
drwxr-xr-x    3 root     root          4096 Dec  2 03:59 containerd
drwxr-xr-x    2 nobody   nobody        4096 Dec  2 03:45 docker
drwxr-xr-x    2 root     root          4096 Oct  5 16:01 iptables
drwxr-xr-x    2 root     root          4096 Dec  2 03:59 kubelet
drwxr-xr-x    2 root     root          4096 May 29  2020 misc
drwxr-xr-x    2 root     root          4096 May 29  2020 udhcpd

/var/lib # ls -l /home
total 4
drwxr-xr-x    3 root     root          4096 Dec  2 03:45 admin

As shown, the inner bind-mount (~/tmp/hometest/docker -> /var/lib/docker) shows up inside the container owned by nobody:nogroup (displayed as nobody:nobody in the listing above).

ctalledo commented 3 years ago

Upon further investigation, the problem only occurs when bind-mounting nested host dirs (e.g., /a/b, and /a/b/c) into a sys container, where the inner host dir (/a/b/c) is mounted on what Sysbox treats as a special dir (e.g., /var/lib/docker, /var/lib/kubelet). In addition, it only occurs when using uid-shifting (i.e., shiftfs).

The reason is that in this scenario, Sysbox mounts shiftfs on /a/b, and that shiftfs mount also covers /a/b/c. That's a problem, because we don't want shiftfs mounted on host dirs that are bind-mounted into the container's special dirs (e.g., /var/lib/docker).

The best solution at this time is to disallow such nested bind mounts. This means that when bind-mounting a host dir into a special dir (e.g., /a/b/c -> /var/lib/docker), you can't bind-mount a parent dir (/, /a, /a/b) into the same container.
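
To make the rule concrete, here is a minimal Go sketch of what such a check could look like. It is not the code from the sysbox-runc PR: the `BindMount` type, the `specialDirs` set (limited to the two dirs named in this issue), and the function names are all hypothetical, and the sketch ignores the condition that the problem only arises when uid-shifting is in use.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// BindMount is a hypothetical, simplified view of one bind mount in the
// container spec (not the actual sysbox-runc type).
type BindMount struct {
	Source string // host path
	Dest   string // path inside the container
}

// specialDirs is an assumed set of container paths treated as special;
// only the two dirs mentioned in this issue are listed.
var specialDirs = map[string]bool{
	"/var/lib/docker":  true,
	"/var/lib/kubelet": true,
}

// isParentOf reports whether parent is a strict ancestor of child.
func isParentOf(parent, child string) bool {
	parent, child = filepath.Clean(parent), filepath.Clean(child)
	if parent == child {
		return false
	}
	rel, err := filepath.Rel(parent, child)
	if err != nil {
		return false
	}
	// A relative path that climbs out of parent means child is not below it.
	return rel != ".." && !strings.HasPrefix(rel, ".."+string(filepath.Separator))
}

// checkNestedBindMounts returns an error if a host dir mounted on a special
// dir has one of its parent dirs bind-mounted into the same container.
func checkNestedBindMounts(mounts []BindMount) error {
	for _, inner := range mounts {
		if !specialDirs[filepath.Clean(inner.Dest)] {
			continue
		}
		for _, outer := range mounts {
			if isParentOf(outer.Source, inner.Source) {
				return fmt.Errorf(
					"bind mount %s -> %s is not allowed: parent host dir %s is also bind-mounted (-> %s)",
					inner.Source, inner.Dest, outer.Source, outer.Dest)
			}
		}
	}
	return nil
}

func main() {
	// Mirrors the mounts from the report, with a placeholder home dir.
	mounts := []BindMount{
		{Source: "/home/user/tmp/hometest", Dest: "/home/admin"},
		{Source: "/home/user/tmp/hometest/docker", Dest: "/var/lib/docker"},
	}
	if err := checkNestedBindMounts(mounts); err != nil {
		fmt.Println("error:", err)
	}
}
```

The sketch compares cleaned host paths rather than raw string prefixes, so /a/bc is not mistaken for a child of /a/b. Nested host dirs that do not land on a special dir are left alone, matching the observation that the problem only shows up for special dirs.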

The following sysbox-runc PR adds logic that checks for this when the container starts, and generates an appropriate error message.

https://github.com/nestybox/sysbox-runc/pull/21

Closing.