syscall: add bind/mount operations to SysProcAttr on Linux

sargun commented 9 years ago

It would be nice to be able to pass a list of bind mounts to ForkExec via SysProcAttr, which Go would then apply after forking and before execing: https://golang.org/pkg/syscall/#SysProcAttr. Alternatively, it would be nice to be able to pass it a function that makes some syscalls (or raw syscalls) before the exec.

ianlancetaylor commented 9 years ago

Our basic guideline for SysProcAttr is that we only add operations that must take place between the fork and execve calls. As far as I know bind mounts are not such a call: you could instead exec a small helper program that did the bind mounts, and then execed the real program. Running that small helper program is inconvenient, but better than adding new obscure features to the syscall package.
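For illustration, the helper-program approach might look roughly like the sketch below. The program name bindexec, its SRC:DST argument syntax, and the parseArgs helper are all made up for this example and are not part of any Go package. It performs each bind mount with syscall.Mount and then replaces itself with the real program via syscall.Exec.

```go
// bindexec is a hypothetical helper: it performs bind mounts, then execs the
// real program. Assumed usage: bindexec SRC:DST [SRC:DST ...] -- PROGRAM [ARGS...]
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
	"syscall"
)

// bind is one source-to-target bind mount request.
type bind struct{ src, dst string }

// parseArgs splits the arguments into bind specs (before "--") and the
// command line of the real program (after "--").
func parseArgs(args []string) ([]bind, []string, error) {
	var binds []bind
	for i, a := range args {
		if a == "--" {
			if i+1 >= len(args) {
				return nil, nil, fmt.Errorf("no program given after --")
			}
			return binds, args[i+1:], nil
		}
		src, dst, ok := strings.Cut(a, ":")
		if !ok || src == "" || dst == "" {
			return nil, nil, fmt.Errorf("bad bind spec %q, want SRC:DST", a)
		}
		binds = append(binds, bind{src, dst})
	}
	return nil, nil, fmt.Errorf("missing -- separator")
}

func main() {
	binds, argv, err := parseArgs(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, "bindexec:", err)
		os.Exit(2)
	}
	for _, b := range binds {
		// MS_BIND makes b.src visible at b.dst; this needs CAP_SYS_ADMIN.
		if err := syscall.Mount(b.src, b.dst, "", syscall.MS_BIND, ""); err != nil {
			fmt.Fprintf(os.Stderr, "bindexec: mount %s on %s: %v\n", b.src, b.dst, err)
			os.Exit(1)
		}
	}
	path, err := exec.LookPath(argv[0])
	if err != nil {
		fmt.Fprintln(os.Stderr, "bindexec:", err)
		os.Exit(1)
	}
	// Exec replaces this process image, so the real program keeps the PID.
	// It only returns if the exec itself failed.
	err = syscall.Exec(path, argv, os.Environ())
	fmt.Fprintln(os.Stderr, "bindexec: exec:", err)
	os.Exit(1)
}
```

The parent would presumably start this helper with CLONE_NEWNS set in SysProcAttr.Cloneflags, so that the mounts land in a new mount namespace rather than the parent's own.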

It's not possible to have a function that runs between fork and exec, because many Go operations are not available during that time. Only very carefully written Go code can happen in that space. Permitting an arbitrary function would be an invitation to bugs and a severe restriction on what we could do going forward.

Is there any reason a bind mount has to happen between fork and execve?

rsc commented 8 years ago

@sargun, arbitrary func calls won't happen, for the reasons Ian explained. They'll be basically impossible to use correctly.

@ianlancetaylor, the argument you're making also applies to closing file descriptors, yet we do that. So there must be a slightly different line. My guess is that it's based on how common/complex the operations are. New mounts do seem a bit rare.

In Plan 9, rfork(2) let you change the current process; you only got a new process if you included the RFPROC bit. So on a Plan 9 system you could call rfork(RFNAMEG) to put the calling thread in its own name space group, do your bind mounts, and then fork/exec. Translated into Go, it would be something like

go func() {
    runtime.LockOSThread()
    syscall.Rfork(syscall.RFNAMEG)
    ... binds/mounts ...
    result = ForkExec(...)
    c <- result
}()
result := <-c

It would be very nice if there were some equivalent on Linux, but as far as I can tell that functionality (operating on the current thread) was dropped along the way from rfork to clone.

rsc commented 8 years ago

(Clarifying title, not making a decision.)

mdempsky commented 8 years ago

@rsc For what it's worth, Linux has unshare() and setns() system calls that operate on the current process/thread.

rsc commented 8 years ago

@mdempsky Thanks. Those look promising.

@sargun, can you take a look at those system calls and see if that works for you? The pattern would be something like:

runtime.LockOSThread()
unshare(2) to get private name space for thread
do bind/mounts
ForkExec or cmd.Start
setns(2) to reconnect to original name space
runtime.UnlockOSThread()

You can do the sequence in a goroutine if you want to make sure not to affect a possible thread lock in the caller.

corhere commented 2 years ago

While it is possible to unshare(2) a thread's mount namespace to set up mounts for the child process, it is not possible to restore the thread completely to its initial state afterwards. unshare(CLONE_NEWNS) implies unshare(CLONE_FS), unsharing the thread's working directory, root directory and umask attributes. The thread's mount namespace can be restored with setns(2), but there is no syscall to reverse the effects of unshare(CLONE_FS). Insidious bugs could manifest if arbitrary goroutines were to execute on a thread with unshared file system attributes, so the thread would have to be terminated. The changes to runtime.LockOSThread introduced in Go 1.10 make this possible, but the runtime would be down a thread, which will eventually have to be replaced. No thread would have to be terminated if the mounts could be performed between the fork (i.e. clone(CLONE_NEWNS)) and execve.
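A minimal sketch of the thread-sacrificing workaround described above, using only the standard library. The names startWithMounts and bindMount, and the paths in main, are hypothetical. The goroutine locks its thread, unshares the mount namespace, performs the mounts, starts the child, and then exits without unlocking, so the runtime (Go 1.10 and later) terminates the thread instead of reusing it.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"runtime"
	"syscall"
)

// bindMount is a hypothetical description of one bind mount.
type bindMount struct{ source, target string }

// startWithMounts starts the named program so that it sees the given bind
// mounts, without changing the mount namespace of the rest of the process.
func startWithMounts(mounts []bindMount, name string, args ...string) (*exec.Cmd, error) {
	cmd := exec.Command(name, args...)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	errc := make(chan error, 1)
	go func() {
		// Deliberately never unlocked: when this goroutine exits, the
		// runtime terminates the thread rather than handing it, with its
		// unshared file system attributes, to other goroutines.
		runtime.LockOSThread()
		errc <- func() error {
			// Give this thread a private copy of the mount namespace.
			if err := syscall.Unshare(syscall.CLONE_NEWNS); err != nil {
				return fmt.Errorf("unshare: %w", err)
			}
			// Keep new mounts from propagating to the original namespace.
			if err := syscall.Mount("none", "/", "", syscall.MS_REC|syscall.MS_PRIVATE, ""); err != nil {
				return fmt.Errorf("make / private: %w", err)
			}
			for _, m := range mounts {
				if err := syscall.Mount(m.source, m.target, "", syscall.MS_BIND, ""); err != nil {
					return fmt.Errorf("bind %s on %s: %w", m.source, m.target, err)
				}
			}
			// The child is forked from this thread and inherits its namespace.
			return cmd.Start()
		}()
	}()
	if err := <-errc; err != nil {
		return nil, err
	}
	return cmd, nil
}

func main() {
	// Example paths only; running this needs CAP_SYS_ADMIN.
	cmd, err := startWithMounts([]bindMount{{"/srv/data", "/mnt"}}, "ls", "/mnt")
	if err != nil {
		fmt.Fprintln(os.Stderr, "start failed:", err)
		return
	}
	fmt.Println("child finished, error:", cmd.Wait())
}
```

Each call costs one OS thread, which is the overhead that performing the mounts between fork and execve would avoid.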

In addition, it would be amazing if all the mount operations could be configured, including overriding the mount which is currently performed unconditionally if Unshareflags has CLONE_NEWNS set. Setting the mount propagation mode recursively to MS_PRIVATE can prevent filesystems from being unmounted in the initial mount namespace, because "dangling" mounts in the unshared namespace keep the filesystem from being released. Setting the propagation mode to MS_SLAVE would prevent such issues, but unconditionally changing it in the runtime would likely violate the Go 1 Compatibility Promise.
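For context, this is roughly how the existing behavior is reached today. The helper name namespacedCommand and the constant newMountNS are made up for this sketch. Setting Unshareflags makes the child unshare its mount namespace between fork and exec; SysProcAttr has no field for choosing the propagation mode that is applied afterwards.

```go
package main

import (
	"fmt"
	"os/exec"
	"syscall"
)

// newMountNS is a local alias used only in this sketch.
const newMountNS = syscall.CLONE_NEWNS

// namespacedCommand returns a command whose child process unshares its
// mount namespace after fork and before exec.
func namespacedCommand(name string, args ...string) *exec.Cmd {
	cmd := exec.Command(name, args...)
	cmd.SysProcAttr = &syscall.SysProcAttr{
		// With CLONE_NEWNS set here, the propagation change on "/"
		// described above happens unconditionally in the child.
		Unshareflags: newMountNS,
	}
	return cmd
}

func main() {
	cmd := namespacedCommand("true")
	// Run fails with a permission error unless the caller has CAP_SYS_ADMIN.
	if err := cmd.Run(); err != nil {
		fmt.Println("run failed:", err)
		return
	}
	fmt.Println("child ran in its own mount namespace")
}
```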

golang / go

syscall: add bind/mount operations to SysProcAttr on Linux #12125