Open cyphar opened 4 years ago
I think the only way we can reasonably do this at the moment is to try `MNT_EXPIRE`, and if it fails with `-EPERM` we log a warning and continue. If we do the `MNT_EXPIRE` after grabbing the fd, then we may get a false positive if a mount occurs afterwards (but that's okay). However, I need to figure out what happens if you `MNT_EXPIRE` a lazy-umounted mount (which is still alive through the fd). I imagine you get `-EBUSY` as normal.
(Also, `MS_MOVE` doesn't work because it doesn't permit moving a parent mount to a child -- but this check also happens to include moving a mount to itself.)
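
A minimal sketch of the `MNT_EXPIRE` probe described above, assuming the caller already has a path for the location to check (`umount2(2)` takes a path, and how that path is derived from the fd is out of scope here). The `mnt_probe` enum and `probe_mount_point()` are hypothetical names for illustration, and the errno mapping follows the reading in this thread rather than a verified kernel contract:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <sys/mount.h>

/* Hypothetical result type for this sketch; not part of any real API. */
enum mnt_probe {
    PROBE_NOT_MOUNT, /* -EINVAL: treated as "not a mount point" */
    PROBE_IS_MOUNT,  /* -EAGAIN or -EBUSY: something is mounted here */
    PROBE_UNKNOWN,   /* -EPERM or anything else: cannot tell */
};

static enum mnt_probe probe_mount_point(const char *path)
{
    /*
     * UMOUNT_NOFOLLOW keeps a trailing symlink from redirecting the probe.
     * Caveat: the probe has side effects. An idle mount gets marked as
     * expired (EAGAIN), and a mount that was already marked expired and is
     * still idle is really unmounted (the call then returns 0).
     */
    if (umount2(path, MNT_EXPIRE | UMOUNT_NOFOLLOW) == 0)
        return PROBE_IS_MOUNT;

    switch (errno) {
    case EINVAL:
        return PROBE_NOT_MOUNT;
    case EAGAIN: /* idle mount, now marked as expired */
    case EBUSY:  /* mount is in use */
        return PROBE_IS_MOUNT;
    case EPERM:
        /* No privileges to unmount: warn and let the caller continue. */
        fprintf(stderr, "warning: cannot probe %s for a mount point\n", path);
        return PROBE_UNKNOWN;
    default:
        return PROBE_UNKNOWN;
    }
}
```

A caller would treat `PROBE_UNKNOWN` as "no guarantee" rather than as "safe", which matches the log-a-warning-and-continue behaviour proposed above.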
`statx` has a mount ID field we can use for this, but it was added in Linux 5.8 (after `openat2`)...
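
For illustration, comparing the `statx` mount IDs of two already-open fds could look roughly like this. It assumes kernel headers new enough to define `STATX_MNT_ID`, and `get_mount_id()` and `crossed_mount()` are hypothetical helper names:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <sys/stat.h>

/* Returns 0 and stores the mount ID of fd, or -1 if it is unavailable. */
static int get_mount_id(int fd, uint64_t *mnt_id)
{
    struct statx stx;

    if (statx(fd, "", AT_EMPTY_PATH, STATX_MNT_ID, &stx) < 0)
        return -1;
    /* Kernels before 5.8 ignore the request and leave the mask bit unset. */
    if (!(stx.stx_mask & STATX_MNT_ID))
        return -1;
    *mnt_id = stx.stx_mnt_id;
    return 0;
}

/* Returns 1 if the fds are on different mounts, 0 if same, -1 if unknown. */
static int crossed_mount(int parent_fd, int child_fd)
{
    uint64_t parent_id, child_id;

    if (get_mount_id(parent_fd, &parent_id) < 0 ||
        get_mount_id(child_fd, &child_id) < 0)
        return -1;
    return parent_id != child_id;
}
```

The `-1` case is the one that matters for this issue: on the older kernels that need the emulated backend, the mask bit will not be set and the check cannot give an answer.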
Another option that works as an unprivileged user is `name_to_handle_at`, but the old mount IDs it provides are recycled so this isn't a guarantee (I sent a patch to allow fetching the new mount IDs, but that won't help on older kernels where we need this). But this is probably the best option.
Unfortunately, without `AT_HANDLE_FID` (Linux 6.7) this doesn't work on some filesystems. But I suspect most users would be okay with it working on most filesystems.
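
A rough sketch of fetching the (recyclable) mount ID through `name_to_handle_at`, with an `AT_HANDLE_FID` retry when the headers define it. `mount_id_from_handle()` is a hypothetical name, and retrying only on `EOPNOTSUPP` is an assumption about how unsupported filesystems report the failure:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>

/* Returns 0 and fills *mnt_id, or -1 with errno set. */
static int mount_id_from_handle(int fd, int *mnt_id)
{
    /* Room for the largest handle the kernel will hand back. */
    union {
        struct file_handle fh;
        char buf[sizeof(struct file_handle) + MAX_HANDLE_SZ];
    } u;

    u.fh.handle_bytes = MAX_HANDLE_SZ;
    if (name_to_handle_at(fd, "", &u.fh, mnt_id, AT_EMPTY_PATH) == 0)
        return 0;

#ifdef AT_HANDLE_FID
    /* Assumption: filesystems without decodable handles fail this way. */
    if (errno == EOPNOTSUPP) {
        u.fh.handle_bytes = MAX_HANDLE_SZ;
        if (name_to_handle_at(fd, "", &u.fh, mnt_id,
                              AT_EMPTY_PATH | AT_HANDLE_FID) == 0)
            return 0;
    }
#endif
    return -1;
}
```

The handle itself is discarded; only the mount ID is of interest here, and no privileges are needed because the handle is never opened.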
While adding `RESOLVE_NO_XDEV` support to the `openat2(2)` backend is incredibly trivial (add the `RESOLVE_NO_XDEV` flag), for the emulated backend it appears to be an open problem to detect a mount-point crossing (if it might be a bind-mount). Here is a list of things which don't work:

- Parsing `/proc/self/mountinfo` is both racy and requires you to trust `/proc` (which is not a given, since `RESOLVE_NO_XDEV` will be used for `/proc` hardening -- see #7). There is a poll backend which in principle might allow you to do a double-check that could be safe, but the dependency on `/proc` makes this a no-go.
- Doing `umount(MNT_EXPIRE)` or `mount(MS_MOVE)` to check if you get an `-EINVAL` (meaning it's not a mount-point) would appear to be the most obvious solution, but it requires privileges (either the ability to do the mount outright or the ability to create a user namespace to then do the mount). This blocks us from working in environments such as the default seccomp profile of most container runtimes.
- Creating a temporary (and not-bound-to-the-filesystem) `procfs` using the new mount API would also work -- except it requires quite a few privileges (if there are over-mounts you'll get permission issues in user namespaces, and you need to be able to mount things in general) and it requires a new-enough kernel.

Right now, I think there is no obvious way to do this on older kernels -- which means we will have to output some kind of warning if running on a kernel without `openat2(2)` support. We can at the very least ensure we're not following symlinks and we never jump to a non-`procfs` mount -- but these are completely bypass-able limitations.
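
For contrast, the `openat2(2)` side really is just one flag. A minimal sketch, assuming kernel headers that ship `<linux/openat2.h>` and define `SYS_openat2`; `open_no_xdev()` is a hypothetical wrapper name, since glibc may not provide an `openat2()` function:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <linux/openat2.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Opens path relative to dirfd and refuses to cross any mount point.
 * Returns an O_PATH fd, or -1 with errno set (EXDEV when a mount point
 * would be crossed, ENOSYS on kernels without openat2).
 */
static int open_no_xdev(int dirfd, const char *path)
{
    struct open_how how;

    memset(&how, 0, sizeof(how));
    how.flags = O_PATH | O_CLOEXEC;
    how.resolve = RESOLVE_NO_XDEV;
    return (int)syscall(SYS_openat2, dirfd, path, &how, sizeof(how));
}
```

An `ENOSYS` result is the point at which a library would have to fall back to the emulated backend and emit the warning discussed above.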