Use squashfuse in native mode when 'allow kernel squashfs = no'

dtrudg commented 1 year ago

Is your feature request related to a problem? Please describe.

Recent versions of SingularityCE have a singularity.conf directive that permits disabling kernel mounts of squashfs, which are performed in the setuid flow. There is no elegant fall-back:

$ singularity run docker://alpine
INFO:    Using cached SIF image
FATAL:   container creation failed: squashfs image mounts are not authorized

Describe the solution you'd like

squashfuse is widely available, and a recent version is even bundled with SingularityCE.

It should be possible for the kernel mount to fall back to a squashfuse mount in the setuid flow.

DrDaveD commented 1 year ago

Note that for security reasons SingularityCE should avoid invoking an external program in setuid flow while having the ability to increase its privileges to root, and since it's not at that point in an unprivileged root-mapped user+mount namespace, I am planning to implement similar functionality in Apptainer based on the code for the --fusemount feature, where the starter-suid does the mount and then passes an open file descriptor to the FUSE program.

dtrudg commented 1 year ago

> Note that for security reasons SingularityCE should avoid invoking an external program in setuid flow while having the ability to increase its privileges to root,

I'm not entirely clear what you mean? There are various places / ways we can do things in the setuid flow that don't involve the possibility of privilege escalation of an external binary... whether before the starter is invoked, or in dropped-privileged portions of the code.

> and since it's not at that point in an unprivileged root-mapped user+mount namespace

Yes - it's important to us that this is not dependent on being in an unprivileged root-mapped user namespace. The SLES 12 kernel, which we need to support, does not include support for FUSE mounts in an unprivileged user namespace.

> I am planning to implement similar functionality in Apptainer based on the code for the --fusemount feature, where the starter-suid does the mount and then passes an open file descriptor to the FUSE program.

Thanks for the link to your issue... that was along the lines of my initial thoughts, also.

DrDaveD commented 1 year ago

If it isn't in an unprivileged root-mapped user+mount namespace, then the FUSE program won't be able to do its own mounting without privileges ... unless you want to depend on it invoking the setuid-root fusermount/fusermount3, which as you know is problematic. So the compromise is to do a generic /dev/fuse mount first in setuid mode, then pass the file descriptor to the unprivileged FUSE program, as the --fusemount option does. That does require the fuse3 library to work, which is likely to cause some pain as I noted in the Apptainer issue. The fuse3 library accepts a /dev/fd referencing an open file descriptor in place of the mount point parameter to avoid having to do the mount itself.
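As a minimal sketch of the fd-passing flow described above (not code from SingularityCE or Apptainer): the privileged side opens /dev/fuse and mounts it with an option string naming the open descriptor, and the unprivileged FUSE program is then given /dev/fd/N in place of a mount point. The helper names `fuse_mount_data` and `fuse_program_argv` are hypothetical, the image path is a placeholder, and the sketch only builds the strings involved; it does not perform a mount.

```python
import stat
from typing import List


def fuse_mount_data(fd: int, uid: int, gid: int) -> str:
    """Build the data string for a kernel 'fuse' mount (hypothetical helper).

    fd is the descriptor obtained by opening /dev/fuse; rootmode is the
    octal file type of the mount root, here a directory.
    """
    return f"fd={fd},rootmode={stat.S_IFDIR:o},user_id={uid},group_id={gid}"


def fuse_program_argv(program: str, image: str, fd: int) -> List[str]:
    """Build argv for an unprivileged fuse3-based program (hypothetical helper).

    The already-mounted descriptor is passed as /dev/fd/N where the mount
    point would normally go, so the program does not mount anything itself.
    """
    return [program, image, f"/dev/fd/{fd}"]


if __name__ == "__main__":
    print(fuse_mount_data(3, 1000, 1000))
    print(fuse_program_argv("squashfuse", "rootfs.squashfs", 3))
```

In a real setuid flow the descriptor would also have to be inherited by the FUSE process after privileges are dropped; that part is left out here.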

dtrudg commented 1 year ago

We haven't decided the exact flow for this so far. I don't yet subscribe to the view that using a fd mount and fuse3 is definitively the best or only suitable way... but it is the first thing being considered. It's quite possible that SingularityCE will take a different approach to Apptainer, depending on the trade-offs that are most appropriate for our respective users.

> If it isn't in an unprivileged root-mapped user+mount namespace, then the FUSE program won't be able to do its own mounting without privileges ... unless you want to depend on it invoking the setuid-root fusermount/fusermount3, which as you know is problematic.

At some point, we have to accept that setuid is required in certain places, in order to get particular behaviours that userns doesn't provide, or to support older systems that lack certain kernel features/backports. There are always trade-offs... there is no solution that has zero problems. Whether we have more privileged code in Singularity, or call out to a distro-provided tool, is an open question.

> So the compromise is to do a generic /dev/fuse mount first in setuid mode, then pass the file descriptor to the unprivileged FUSE program, as the --fusemount option does. That does require the fuse3 library to work, which is likely to cause some pain as I noted in the Apptainer issue. The fuse3 library accepts a /dev/fd referencing an open file descriptor in place of the mount point parameter to avoid having to do the mount itself.

Right - I'm not yet sure about bundling more FUSE binaries, where older distributions are shipping fuse2 versions, unless we have to.

fuse2fs does seem to be the biggest blocker for this approach - as far as I'm aware it cannot currently be built directly for fuse3, although it will work if a v3 fusermount is available?

Anyway.... these are some of the things that are being considered at this time. We're sure that each project will address the need in the appropriate way for themselves, while considering if the approach on the other side of the fork is applicable.

dtrudg commented 11 months ago

Having picked this up again, starting to poke around the code and think about it, there seem to be 3 basic approaches:

A - Rely on the availability of unprivileged user namespace creation to support FUSE in native mode, and wire up in a similar manner to that used in OCI mode. This is not ideal as we know of several sites / users who choose to disable unprivileged namespace creation due to their security posture. These sites are also some of those most likely to be interested in avoiding kernel mounts if possible. It is also unsupported on SLES 12, whose kernel does not support FUSE mounts in an unprivileged user namespace.

B - Use the --fusemount mechanism or a similar approach, with pre-provisioning of an fd-based mount inside the mount namespace. This is dependent on FUSE3, so is somewhat problematic for distributions which are generally using FUSE2. We don't really want to force usage of our own bundled FUSE3 squashfuse etc. Also, fuse2fs doesn't support FUSE3, which would prevent use of (much) older extfs format Singularity images. I'd guess that sites who can't move to newer distros, and rootless runtimes, are also some of the most likely to need to run rather old extfs container images.

C - Follow the pattern of --sif-fuse, in which the container rootfs is mounted in the host mount namespace prior to invoking the starter, and cleaned up at container exit. This does pollute the host tmp dir with a mounted rootfs, and also relies on the distribution's setuid FUSE helpers to perform the FUSE mount. I would assume that sites that are happy with FUSE generally do have the setuid helpers in place, to allow e.g. sshfs mounts by users. There may be cleanup issues on job kill... but those may be avoided by the tendency of schedulers to now use cgroup-based process monitoring, making it possible to kill the cgroup processes.
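To make the lifecycle in approach C concrete, here is a rough sketch (not the --sif-fuse implementation): mount the image on a temporary host directory before the container starts, then unmount and remove the directory at exit. The name `host_fuse_rootfs` and the injectable `run` parameter are hypothetical; the sketch assumes a bare squashfs file rather than a SIF, and assumes the distribution's helper is called `fusermount` (it may be `fusermount3`).

```python
import contextlib
import os
import subprocess
import tempfile
from typing import Callable, Iterator, List


def _run_checked(argv: List[str]) -> None:
    # Default runner: execute the command and raise if it fails.
    subprocess.run(argv, check=True)


@contextlib.contextmanager
def host_fuse_rootfs(
    image: str, run: Callable[[List[str]], object] = _run_checked
) -> Iterator[str]:
    """Mount `image` in the host mount namespace for the container's lifetime."""
    # This temporary directory is the "pollution" of the host tmp dir.
    mountpoint = tempfile.mkdtemp(prefix="rootfs-")
    try:
        # squashfuse relies on the distribution's setuid FUSE helper here.
        run(["squashfuse", image, mountpoint])
        try:
            yield mountpoint  # the starter would be invoked at this point
        finally:
            run(["fusermount", "-u", mountpoint])
    finally:
        # rmdir only removes an empty directory, so a failed unmount
        # leaves the mount in place instead of deleting through it.
        with contextlib.suppress(OSError):
            os.rmdir(mountpoint)
```

If the process is killed outright, neither `finally` block runs, which is the job-kill cleanup concern noted above.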

Leaning towards C at present due to the wish to support old images (extfs) and SLES 12.
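The constraints behind that preference can be written down as a small check. This is only an illustration of the reasoning in this comment, with hypothetical names (`Host`, `viable_approaches`) and a deliberately coarse model of each approach's requirements.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Host:
    unprivileged_userns: bool  # may users create user namespaces?
    fuse_in_userns: bool       # does the kernel allow FUSE mounts inside them?
    fuse3_programs: bool       # are FUSE3 builds of the FUSE programs available?
    setuid_fusermount: bool    # is the distro's setuid FUSE helper installed?


def viable_approaches(host: Host, image_fs: str) -> List[str]:
    """Return which of approaches A, B, C could work for this host and image."""
    viable = []
    # A: FUSE inside an unprivileged user namespace.
    if host.unprivileged_userns and host.fuse_in_userns:
        viable.append("A")
    # B: fd-based mount needs FUSE3; fuse2fs (extfs images) does not support it.
    if host.fuse3_programs and image_fs != "extfs":
        viable.append("B")
    # C: host-side mount through the distribution's setuid helper.
    if host.setuid_fusermount:
        viable.append("C")
    return viable


if __name__ == "__main__":
    # A SLES 12-like site with user namespaces disabled and only FUSE2.
    sles12_like = Host(False, False, False, True)
    print(viable_approaches(sles12_like, "extfs"))
```

For the SLES 12-like host with an extfs image, only C remains, which matches the leaning stated above.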

sylabs / singularity

Use squashfuse in native mode when 'allow kernel squashfs = no' #2216