RESOLVE_NO_XDEV support for emulated backend

cyphar commented 4 years ago

While adding RESOLVE_NO_XDEV support to the openat2(2) backend is incredibly trivial (add the RESOLVE_NO_XDEV flag), for the emulated backend it appears to be an open problem to detect a mount-point crossing when it might be a bind-mount (comparing st_dev is not enough, since a bind-mount of the same filesystem has the same device number). Here is a list of things which don't work:

- Parsing /proc/self/mountinfo is both racy and requires you to trust /proc (which is not a given, since RESOLVE_NO_XDEV will be used for /proc hardening -- see #7). There is a poll backend which in principle might allow you to do a double-check that could be safe, but the dependency on /proc makes this a no-go.
- Doing umount(MNT_EXPIRE) or mount(MS_MOVE) to check if you get an -EINVAL (meaning it's not a mount-point) would appear to be the most obvious solution, but it requires privileges (either the ability to do the mount outright or the ability to create a user namespace to then do the mount). This blocks us from working in environments such as the default seccomp profile of most container runtimes.
- Creating a temporary (and not-bound-to-the-filesystem) procfs using the new mount API would also work -- except it requires quite a few privileges (if there are over-mounts you'll get permission issues in user namespaces, and you need to be able to mount things in general) and it requires a new-enough kernel.

Right now, I think there is no obvious way to do this on older kernels -- which means we will have to output some kind of warning if running on a kernel without openat2(2) support. We can at the very least ensure we're not following symlinks and that we never jump to a non-procfs mount -- but these limitations are completely bypassable.
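For reference, here is a minimal sketch of what the openat2(2) path and the "warn on older kernels" fallback could look like. The function name `open_no_xdev` and the warning text are hypothetical, and it assumes kernel headers new enough to ship `<linux/openat2.h>` (Linux 5.6+). The fallback branch only avoids following a trailing symlink; it does not detect mount crossings at all.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <linux/openat2.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: open `path` under `dirfd` as an O_PATH fd, asking the
 * kernel to refuse any mount-point crossing. With openat2(2) a crossing is
 * reported as -1 with errno == EXDEV. */
static int open_no_xdev(int dirfd, const char *path)
{
    struct open_how how = {
        .flags = O_PATH | O_CLOEXEC,
        .resolve = RESOLVE_NO_XDEV,
    };

    int fd = (int)syscall(SYS_openat2, dirfd, path, &how, sizeof(how));
    if (fd >= 0 || errno != ENOSYS)
        return fd;

    /* Kernel without openat2(2): warn and fall back to a plain open.
     * Nothing here stops the lookup from crossing a (bind-)mount. */
    fprintf(stderr, "warning: openat2(2) unavailable, "
                    "RESOLVE_NO_XDEV is not enforced for %s\n", path);
    return openat(dirfd, path, O_PATH | O_NOFOLLOW | O_CLOEXEC);
}
```

A caller would treat `EXDEV` as "crossed a mount point" and any other error as an ordinary open failure.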

cyphar commented 4 years ago

I think the only way we can reasonably do this at the moment is to try MNT_EXPIRE and, if it fails with -EPERM, log a warning and continue. If we do the MNT_EXPIRE after grabbing the fd, then we may get a false-positive if a mount occurs afterwards (but that's okay). However, I need to figure out what happens if you MNT_EXPIRE a lazy-umounted mount which is still alive through the fd. I imagine you get -EBUSY as normal.

(Also, MS_MOVE doesn't work because it doesn't permit moving a parent mount to a child -- but this check also happens to include moving a mount to itself).
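A rough sketch of the MNT_EXPIRE probe, to make the error handling concrete. The names `xdev_probe`, `classify_expire` and `probe_mountpoint` are made up for illustration, and the errno mapping is an assumption based on umount2(2): -EINVAL is treated as "not a mount-point", -EBUSY/-EAGAIN as "is a mount-point", and everything else (including -EPERM) as "unknown, warn and continue". Holding an O_PATH fd while probing is meant to keep the mount busy so the probe cannot actually expire it. Note that -EINVAL can also have other causes (for example a locked mount), so this is not airtight.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mount.h>
#include <unistd.h>

enum xdev_probe { XDEV_NOT_MOUNT, XDEV_IS_MOUNT, XDEV_UNKNOWN };

/* Map a umount2(MNT_EXPIRE) result onto a verdict (assumed mapping). */
static enum xdev_probe classify_expire(int ret, int err)
{
    if (ret == 0)
        return XDEV_IS_MOUNT;       /* an already-expired mount got unmounted */
    switch (err) {
    case EINVAL:
        return XDEV_NOT_MOUNT;      /* target is not a mount-point */
    case EBUSY:                     /* mount-point, in use (e.g. via our fd) */
    case EAGAIN:                    /* mount-point, now marked as expired */
        return XDEV_IS_MOUNT;
    default:
        return XDEV_UNKNOWN;        /* EPERM and friends: cannot tell */
    }
}

static enum xdev_probe probe_mountpoint(const char *path)
{
    /* Grab the fd first so the mount stays pinned while we probe. */
    int fd = open(path, O_PATH | O_NOFOLLOW | O_CLOEXEC);
    if (fd < 0)
        return XDEV_UNKNOWN;

    int ret = umount2(path, MNT_EXPIRE | UMOUNT_NOFOLLOW);
    int err = errno;
    close(fd);

    enum xdev_probe verdict = classify_expire(ret, err);
    if (verdict == XDEV_UNKNOWN)
        fprintf(stderr, "warning: cannot check %s for a mount crossing "
                        "(errno %d)\n", path, err);
    return verdict;
}
```

This probes by path, so it still races with anything that replaces `path` between the `open` and the `umount2`; a real implementation would need to tie the probe to the fd it already holds.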

cyphar commented 1 month ago

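statx(2) has a mount ID field (STATX_MNT_ID) we can use for this, but it was added in Linux 5.8 (after openat2, which landed in 5.6), so it doesn't help on the kernels where we need the emulated backend...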
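For illustration, comparing mount IDs with statx(2) could look roughly like this. `get_mnt_id` and `crosses_mount` are hypothetical names, and the sketch assumes glibc 2.28+ for the `statx` wrapper plus headers that define `STATX_MNT_ID` and `stx_mnt_id`. On kernels older than 5.8 the mask bit comes back unset and the helper reports "unknown".

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <sys/stat.h>

/* Fetch the mount ID of `path` relative to `dirfd` without following a
 * trailing symlink. Returns 0 on success, -1 if the call fails or the
 * kernel did not fill in the mount ID. */
static int get_mnt_id(int dirfd, const char *path, int flags, uint64_t *id)
{
    struct statx stx;

    if (statx(dirfd, path, flags | AT_SYMLINK_NOFOLLOW, STATX_MNT_ID, &stx) < 0)
        return -1;
    if (!(stx.stx_mask & STATX_MNT_ID))
        return -1;                  /* pre-5.8 kernel: field not reported */
    *id = stx.stx_mnt_id;
    return 0;
}

/* 1 if `name` (looked up under parent_fd) is on a different mount than
 * parent_fd itself, 0 if it is on the same mount, -1 if unknown. */
static int crosses_mount(int parent_fd, const char *name)
{
    uint64_t parent_id, child_id;

    if (get_mnt_id(parent_fd, "", AT_EMPTY_PATH, &parent_id) < 0 ||
        get_mnt_id(parent_fd, name, 0, &child_id) < 0)
        return -1;
    return parent_id != child_id;
}
```

In a resolver you would more likely call `get_mnt_id` on the freshly opened O_PATH fd of each component (with `AT_EMPTY_PATH`) rather than on a name, so the check applies to the object you actually hold.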

cyphar commented 1 month ago

Another option that works as an unprivileged user is name_to_handle_at(2), but the old mount IDs it provides are recycled, so this isn't a guarantee (I sent a patch to allow fetching the new mount IDs, but that won't help on older kernels where we need this). But this is probably the best option.

Unfortunately, without AT_HANDLE_FID (Linux 6.7) this doesn't work on some filesystems. But I suspect most users would be okay with it working on most filesystems.
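A small sketch of fetching the (old-style, recyclable) mount ID through name_to_handle_at(2), retrying with AT_HANDLE_FID when the filesystem cannot encode file handles. The helper name `handle_mnt_id` is hypothetical. The fallback `#define` for `AT_HANDLE_FID` is only there for older headers and assumes the kernel's value of 0x200; on kernels before 6.7 the retry is expected to fail.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>

#ifndef AT_HANDLE_FID
#define AT_HANDLE_FID 0x200 /* assumed value from the Linux 6.7+ uapi headers */
#endif

/* Store the mount ID of `path` (relative to `dirfd`) in *mnt_id.
 * `flags` may be 0 or AT_EMPTY_PATH. Returns 0 on success, -1 on failure. */
static int handle_mnt_id(int dirfd, const char *path, int flags, int *mnt_id)
{
    struct file_handle *fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);
    if (fh == NULL)
        return -1;

    fh->handle_bytes = MAX_HANDLE_SZ;
    int ret = name_to_handle_at(dirfd, path, fh, mnt_id, flags);
    if (ret < 0 && errno == EOPNOTSUPP) {
        /* Filesystem has no exportable handles: ask for a non-decodable
         * identifier instead (only understood by Linux 6.7+). */
        fh->handle_bytes = MAX_HANDLE_SZ;
        ret = name_to_handle_at(dirfd, path, fh, mnt_id, flags | AT_HANDLE_FID);
    }

    free(fh);
    return ret < 0 ? -1 : 0;
}
```

Comparing the ID of the parent directory with the ID of each component would then flag a crossing, with the caveat above that a recycled ID can make two different mounts look identical.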

openSUSE / libpathrs

RESOLVE_NO_XDEV support for emulated backend #8