proposal: os/exec: better support for user namespaces

outofforest commented 2 years ago

I have this scenario:

I want to start new process inside new user namespace (aka rootless container).
unshare syscall allows me to map root user only.
podman is able to map subids too by calling newuidmap and newgidmap (it is implemented here: https://github.com/containers/storage/tree/main/pkg/unshare/unshare_linux.go)
To make it work they implemented complex wrapper around exec.Cmd plus some C code to call unshare in the middle of the process, after setting full mapping set
For some reason calling syscall.Unshare(syscall.CLONE_NEWUSER) returns invalid argument error even if called from goroutine pinned to thread.

It would be nice to have more "goish" way to do it, like this:

    cmd.SysProcAttr = &syscall.SysProcAttr{
        Cloneflags: syscall.CLONE_NEWUSER,
        UidMappings: []syscall.SysProcIDMap{
            {
                HostID:      1000,
                ContainerID: 0,
                Size:        1,
            },
            {
                HostID:      1,
                ContainerID: 100000,
                Size:        65536,
            },
        },
        GidMappings: []syscall.SysProcIDMap{
            {
                HostID:      1000,
                ContainerID: 0,
                Size:        1,
            },
            {
                HostID:      1,
                ContainerID: 100000,
                Size:        65536,
            },
        },
    }

ianlancetaylor commented 2 years ago

This is supposed to be done by using the Unshareflags field of syscall.SysProcAttr. Why does that not suffice? Thanks.

outofforest commented 2 years ago

This is supposed to be done by using the Unshareflags field of syscall.SysProcAttr. Why does that not suffice? Thanks.

Behavior is the same no matter if I use Cloneflags or Unshareflags. They both work if only root uid is mapped but fails if I try to map subids too. This is by design because that's the limitation of linux unshare for unprivileged user. If I try to map both scopes I get error: fork/exec /tmp/executor-468678108: operation not permitted.

Software like podman use alternative implementation of exec.Cmd which calls newuidmap, newgidmap, reexec and C wrapper to bypass this limitation. It would be nice to have it implemented in go directly.

ianlancetaylor commented 2 years ago

What would we have to change to make this work? Thanks.

outofforest commented 2 years ago

What would we have to change to make this work? Thanks.

I believe ready to use implementation is here: https://github.com/containers/storage/tree/main/pkg/unshare/

outofforest commented 2 years ago

BTW: the best approach IMO would be to implement fork. Yes, I know it's dangerous and golang team refused to do it many time but actually this is the best way to switch to namespace:

Fork the process
Inside parent set id mappings
Inside child call unshare

Whenever someone shares this idea golang team answers that the correct way of doing this is to use exec.Cmd. But:

as I shown you above, it doesn't work anyway
it makes us impossible to provide "container as a library".

Imagine this situation: I want to create a library which allows to do some operations inside namespace. To do it now I have to run separate binary using exec.Cmd. All existing software do that by reexecuting /proc/self/exe. But this is not a good way if you want to create a library, reexecuting current binary from a library is extremely weird idea. Now I do it by embedding binary code inside library, saving it later in tmp location and executing it in a namespace. This is sooooo weird solution but the only one available because fork is not present.

ianlancetaylor commented 2 years ago

It's infeasible to implement fork in the Go standard library. After a program forks, there is only a single thread running in the child process. That means that the basic assumptions of the Go runtime fall apart. Almost any ordinary Go operation can potentially fail in that case. The syscall code that starts a child is very carefully written to avoid problems. We can't permit ordinary Go code to run between the fork and the execve.

Thanks for pointing to the package. There is a lot there, and I don't know what matters. It would help this proposal a great deal if you or somebody could write down exactly what would need to be added to syscall.ProcSysAttr to make things work. In your initial comment you mentioned Cloneflags, UidMappings, and GidMappings, but all of those already exist. What do we need that is new?

outofforest commented 2 years ago

@ianlancetaylor

I'm not a C expert but linked library works this way:

New process is started using exec.Cmd. It reexecutes /proc/self/exe so the same binary is executed again in this case, but in general one it should allow to start any binary.
Executed go binary is prepared in some tricky way. As far as I understand, it runs C function before go runtime takes control: https://github.com/containers/storage/blob/main/pkg/unshare/unshare_gccgo.go#L5
```
// #cgo CFLAGS: -Wall -Wextra
// extern void _containers_unshare(void);
// void __attribute__((constructor)) init(void) {
//   _containers_unshare();
// }
import "C"
```
_containers_unshare C function does the real stuff related to creating a namespace: https://github.com/containers/storage/blob/main/pkg/unshare/unshare.c#L299

And this is the flow:

Binary is reexecuted so we have a parent and child.
Child starts the C code, sends its PID to parent (https://github.com/containers/storage/blob/main/pkg/unshare/unshare.c#L319) and waits (https://github.com/containers/storage/blob/main/pkg/unshare/unshare.c#L328).
Parent takes child process ID and calls newuidmap and newgidmap linux binaries to set full mappings (both root user and subids if defined in /etc/subuid and /etc/subgid): https://github.com/containers/storage/blob/main/pkg/unshare/unshare_linux.go#L266 https://github.com/containers/storage/blob/main/pkg/unshare/unshare_linux.go#L217 It may be done by unprivileged user because newuidmap and newgidmap have appropriate capability or sticky bit set, depending on linux distro.
Now child gets a signal (using pipe) to continue.
Child reexecutes itself again (this time in C code): https://github.com/containers/storage/blob/main/pkg/unshare/unshare.c#L372
Finally after 2 reexecs we have go binary running inside user namespce with both root and subids being mapped.

In this setup flags for unshare are passed using _Containers-unshare env variable: https://github.com/containers/storage/blob/main/pkg/unshare/unshare_linux.go#L86 https://github.com/containers/storage/blob/main/pkg/unshare/unshare.c#L304

Entire process is sooooo complicated... It would be much easier if go executable could call unshare before real go runtime starts.

ianlancetaylor commented 2 years ago

We are not going to permit calling ordinary Go code between fork and execve.

I don't know what the newuidmap and newgidmap binaries are.

For this proposal to move forward we're going to need more precise details as to exactly what needs to be implemented in the syscall package. I'm going to put this on hold for now.

outofforest commented 2 years ago

I don't know what the newuidmap and newgidmap binaries are.

https://man7.org/linux/man-pages/man1/newuidmap.1.html https://man7.org/linux/man-pages/man1/newgidmap.1.html

For this proposal to move forward we're going to need more precise details as to exactly what needs to be implemented in the syscall package. I'm going to put this on hold for now.

I think I provided all the details on how the flow should look like. But once again:

fork
call newuidmap and newgidmap in parent process
call unshare in child process before go runtime takes control over execution

go is frequently used to develop containerization engines so it would be nice to implement this once for all.

ianlancetaylor commented 2 years ago

Why is it necessary to call an external binary to make this work? Why can't we do this entirely with system calls?

outofforest commented 2 years ago

Why is it necessary to call an external binary to make this work? Why can't we do this entirely with system calls?

Unprivileged user may create only a single mapping: id in container -> current uid/gid on host. Providing more mappings causes "permission denied" error, even if mapped ids are specified in /etc/subuid or /etc/subgid. This is by design in linux. Calling newuidmap / newgidmap bypasses this limitation because those binaries have sticky bit or capability set (depending on distro) so they always run with privileges required to set extended mappings.

hown3d commented 2 years ago

Hey everyone, I tried to come up with an implementation for this issue: https://github.com/golang/go/compare/master...hown3d:go:master

I tried to follow the logic of util-linux's implementation of unshare, although I ran into an issue: The newuidmap and newgidmap executables need to be fork/exec'ed while the child of our current fork is waiting for the idmap writes. Since the previous fork didn't return yet, the ForkLock will still be locked and we run into a deadlock. https://github.com/golang/go/blob/29b9a328d268d53833d2cc063d1d8b4bf6852675/src/syscall/exec_unix.go#L200-L216

I think if that problem is sorted out, the implementation can go on further. Can someone with a bit more experience on the stdlib help out?

kevin-matthew commented 1 year ago

I've been running into operation not permitted anytime I want to do anything advanced when setting UidMappings (note: uid/gid interchangeable in this post) similar to https://github.com/golang/go/issues/50098#issuecomment-991554095

A little debugging reveals it comes from when go tries to write to the child process's /proc/{childpid}/uid_map, the write returns EPERM. Looking through user_namespaces(7), it basically list many reasons why this can happen. However many of the (most likely) reasons deal with capabilities.

So to best logically deduce what's happening, I looked at the source code for the newuidmap(1) utility, as this utility worked just fine when setting uid mappings. I ended up here. When I compiled this utility without the HAVE_SYS_CAPABILITY_H flag then it fails with write returning EPERM, the same as go.

Conclusively, in a non-root environment, doing anything advanced with UidMappings will always fail with EPERM until capabilities are set prior to writeUidGidMappings being called.

In other words, we need to implement the go-equivalent of this mess:

    int cap;
    struct __user_cap_header_struct hdr = {_LINUX_CAPABILITY_VERSION_3, 0};
    struct __user_cap_data_struct data[2] = {{0}};

    if (strcmp(map_file, "uid_map") == 0) {
        cap = CAP_SETUID;
    } else if (strcmp(map_file, "gid_map") == 0) {
        cap = CAP_SETGID;
    } else {
        fprintf(stderr, _("%s: Invalid map file %s specified\n"), Prog, map_file);
        exit(EXIT_FAILURE);
    }

    /* Align setuid- and fscaps-based new{g,u}idmap behavior. */
    if (geteuid() == 0 && geteuid() != ruid) {
        if (prctl(PR_SET_KEEPCAPS, 1, 0, 0, 0) < 0) {
            fprintf(stderr, _("%s: Could not prctl(PR_SET_KEEPCAPS)\n"), Prog);
            exit(EXIT_FAILURE);
        }

        if (seteuid(ruid) < 0) {
            fprintf(stderr, _("%s: Could not seteuid to %d\n"), Prog, ruid);
            exit(EXIT_FAILURE);
        }
    }

    /* Lockdown new{g,u}idmap by dropping all unneeded capabilities. */
    memset(data, 0, sizeof(data));
    data[0].effective = CAP_TO_MASK(cap);
    data[0].permitted = data[0].effective;
    if (capset(&hdr, data) < 0) {
        fprintf(stderr, _("%s: Could not set caps\n"), Prog);
        exit(EXIT_FAILURE);
    }

outofforest commented 1 year ago

1.5 year later, having more experience in running programs in namespaces using go I think there is no good solution for this, as the limitation comes from the kernel. There are 4 possible ways to deal with it:

Run your software as root
Assign some extra capabilities to the binary (haven't tested but should be possible)
Run a magic daemon as root to serve non-root clients (this is what docker does)
Execute newuidmap and newgidmap binaries to assign mappings, those binaries must be owned by root and have sticky bit set (this is how podman approaches the problem)

ianlancetaylor commented 1 year ago

My current understanding of this issue, which may be mistaken, is that we don't need to change any user visible API in the standard library. This issue is about the fact that the UIDMappings and GIDMappings fields only work if the program is running as root. It is possible to implement them by having the parent process run newuidmap and/or newgidmap to set up the mappings. This has to be done after the child process has forked but before the child process has exec'ed the new program.

Does that sound right? If that's right, this doesn't need to be a proposal.

(As an aside, I don't see why ForkLock is a problem here. The parent process would be running in forkAndExecInChild, and would cause forkAndExecInChild1 again specifically to run newuidmap and newgidmap.)

outofforest commented 1 year ago

@ianlancetaylor

This has to be done after the child process has forked but before the child process has exec'ed the new program.

And this is the problem. Because, at the moment, we have no way to inject the logic into exec.Cmd to do this.

ianlancetaylor commented 1 year ago

If we know precisely what has to be done, we can change the syscall package to do it. We don't have to support having the program inject arbitrary code. Even if there were a reasonable way to do that, it's not what we want.

outofforest commented 1 year ago

I agree that adding code calling newuidmap or newgidmap to standard library is not a good idea, but then developers should have a way to inject this logic into the right place. Otherwise ability to configure namespaces is limited.

ianlancetaylor commented 1 year ago

I am actually suggesting that perhaps we could change the syscall package to invoke newuidmap or newgidmap when appropriate.

golang / go

proposal: os/exec: better support for user namespaces #50098