Closed apyrgio closed 2 weeks ago
I believe this is the same issue being described by @terenceli in this blog post. @avagin helped with diagnosing this. See issue #8205 and https://github.com/opencontainers/runc/issues/1658.
That said I don't understand why this would be release-related, as I don't think something changed in the startup process that would change this behavior...
@avagin bisected this to commit cc1f5503f1fe220ed813875d1305003e04530051, specifically in runsc/cmd/chroot.go
:
- if pidns {
- flags := uint32(unix.MS_NOSUID | unix.MS_NODEV | unix.MS_NOEXEC | unix.MS_RDONLY)
- if err := mountInChroot(chroot, "proc", "/proc", "proc", flags); err != nil {
- return fmt.Errorf("error mounting proc in chroot: %v", err)
- }
- } else {
- if err := mountInChroot(chroot, "/proc", "/proc", "bind", unix.MS_BIND|unix.MS_RDONLY|unix.MS_REC); err != nil {
- return fmt.Errorf("error mounting proc in chroot: %v", err)
- }
+ flags := uint32(unix.MS_NOSUID | unix.MS_NODEV | unix.MS_NOEXEC | unix.MS_RDONLY)
+ if err := mountInChroot(chroot, "proc", "/proc", "proc", flags); err != nil {
+ return fmt.Errorf("error mounting proc in chroot: %v", err)
}
This means that the chroot used to use a recursive read-only bind mount of /proc
(MS_BIND | MS_READONLY | MS_REC
), but now it mounts a new read-only procfs instead, and this fails due to the reasons explained in #8205.
I will add a test to //runsc:container_test
that runs runsc do
in a non-runsc Docker container, in order to ensure this doesn't break again in the future.
This means that the chroot used to use a recursive read-only bind mount of /proc
That was only the case for ptrace & systrap platform (the pidns variable would be false for these). For other platforms, it was still doing what it was doing today. https://github.com/google/gvisor/commit/cc1f5503f1fe220ed813875d1305003e04530051 just got rid of the ptrace/systrap special case. Systrap and ptrace were running in the caller's pidns.
We still want to execute the sandbox process in a new pidns, so can't bind mount the procfs mount because it is presenting data for the parent pidns.
What fix do you suggest?
@ayushr2 the sandbox process access only generic files and /proc/self, so we actually don't need proc from the target pid namespace.
@avagin If you think bind mounting the proc mount is OK, I would defer to you. It seems a bit of a foot gun to do this to me, as it can lead to surprises if the sandbox tries to do any /proc/PID
-kinda work in the future. There is no good way to enforce that no future changes will use proc files other than "only generic files and /proc/self". Maybe seccomp?
What fix do you suggest?
It is important to run the sandbox in different pidns so the sandbox can not impact host pidns with fork bombs and such calamities. I don't have a different fix in mind yet unfortunately...
Ok, I guess I get what's going on with overmounting and how it affects procfs
. These particular comments helped a bit:
So, the underlying issue is that when the inner container attempts to mount procfs
, it will reveal some security-sensitive /proc/
paths (e.g., /proc/kcore
) that were masked by the outer container (not sure how Docker does it, perhaps mounting an empty dir on top of them). The Linux kernel is smart enough to prevent this, cool.
I went down the rabbit hole a bit and found out the following:
--security-opt systempaths=unconfined
flag, in docker run
(docs).
subset=pid
mount option, that "[...] hides all top level files and directories in the procfs that are not related to tasks.".So, could gVisor perhaps mount procfs
using that option?
Turns out that as of Linux 5.8, there's a subset=pid mount option, that "[...] hides all top level files and directories in the procfs that are not related to tasks.".
When runsc is started, it reads a few top level files such as /proc/cpuinfo, /proc/sys/vm/mmap_min_addr, /proc/self/auxv, /proc/sys/kernel/cap_last_cap.
Could we mount procfs
with subset=pid
, and then separately also bind-mount /proc/{cpuinfo,sys/vm/mmap_min_addr,sys/kernel/cap_last_cap}
somewhere?
Turns out that as of Linux 5.8, there's a subset=pid mount option, that "[...] hides all top level files and directories in the procfs that are not related to tasks.".
subset=pid doesn't help to avoid this issue. The kernel still does the same check and doesn't allow us to create a new proc instance:
$ unshare -Urmfp mount -t proc -o subset=pid proc /proc
avagin@avagin3:~$ echo $?
0
$ sudo mount --bind /dev/null /proc/sysrq-trigger
$ unshare -Urmfp mount -t proc -o subset=pid proc /proc
mount: /proc: permission denied.
dmesg(1) may have more information after failed mount system call.
We can do something like https://github.com/google/gvisor/commit/231c1521ce79b12ee921ecbb9b932b0763711d71
Thanks for looking into it @avagin. I tried your commands (https://github.com/google/gvisor/issues/10944#issuecomment-2387214023) and indeed they failed with "Mount too revealing". Weird...
For what is worth, your workaround looks fine to me.
Hey folks. Just checking if this issue will be worked on the next releases. We'll release a new Dangerzone version in a few days and it would be nice to offer the latest gVisor version. Currently, we are pinned to version 20240826
. If there's no fix any time soon, that's fine, we'll try to think of a workaround for this issue. Thanks!
I will be picking up @avagin's patch and submitting it soon.
Should be fixed with 6adc0720b2e66d3dee7e115d93ec3347f9a8a212.
Description
When running a Dangerzone container image with the latest gVisor release (release-20240916.0), we stumble onto the following error:
Building the container image with the previous release (release-20240826.0) works. Running the outer container with
--privileged
also works, but not withCAP_SYS_ADMIN
.(reminder, in the Dangerzone project, gVisor runs nested within a Docker/Podman container. I can verify the error is the same regardless of the container runtime, Linux kernel, enforced capabilities)
Steps to reproduce
Unfortunately, I don't have a minimum reproducible example for this. The way we have reproduced it for now is:
BUILD.md
poetry run ./dev/dangerzone-cli tests/test_docs/sample-pdf.pdf
.runsc version
docker version (if using docker)
No response
uname
Linux 88387f6d4d93 6.5.11-linuxkit #1 SMP PREEMPT_DYNAMIC Wed Dec 6 17:14:50 UTC 2023 x86_64 Linux
kubectl (if using Kubernetes)
No response
repo state (if built from source)
No response
runsc debug logs (if available)