Open outofforest opened 2 years ago
This is supposed to be done by using the Unshareflags field of syscall.SysProcAttr. Why does that not suffice? Thanks.
Behavior is the same whether I use Cloneflags or Unshareflags. They both work if only the root uid is mapped but fail if I try to map subids too. This is by design, because that is the limitation of Linux unshare for an unprivileged user. If I try to map both scopes I get the error: fork/exec /tmp/executor-468678108: operation not permitted.
Software like podman uses an alternative implementation of exec.Cmd which calls newuidmap, newgidmap, reexec and a C wrapper to bypass this limitation. It would be nice to have it implemented in Go directly.
What would we have to change to make this work? Thanks.
I believe ready to use implementation is here: https://github.com/containers/storage/tree/main/pkg/unshare/
BTW: the best approach IMO would be to implement fork. Yes, I know it's dangerous and the Go team has refused to do it many times, but actually this is the best way to switch to a namespace:
unshare
Whenever someone shares this idea, the Go team answers that the correct way of doing this is to use exec.Cmd. But:
Imagine this situation:
I want to create a library which allows some operations to be done inside a namespace.
To do it now I have to run a separate binary using exec.Cmd. All existing software does that by re-executing /proc/self/exe. But this is not a good way if you want to create a library; re-executing the current binary from a library is an extremely weird idea. Now I do it by embedding binary code inside the library, saving it later in a tmp location and executing it in a namespace. This is a very weird solution, but the only one available because fork is not present.
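For readers unfamiliar with the /proc/self/exe pattern mentioned here, a minimal sketch follows. The environment variable name _EXAMPLE_REEXEC_CHILD and the helpers isChild and reexecCmd are hypothetical, chosen only for this illustration; real tools use their own markers and also set namespace flags on the command.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// childMarker is a hypothetical environment variable used only in this
// sketch to tell the re-executed copy that it is the child.
const childMarker = "_EXAMPLE_REEXEC_CHILD"

// isChild reports whether the marker is present in the given environment.
func isChild(env []string) bool {
	for _, kv := range env {
		if kv == childMarker+"=1" {
			return true
		}
	}
	return false
}

// reexecCmd prepares a command that runs the current binary again via
// /proc/self/exe, with the marker added to its environment.
func reexecCmd(args ...string) *exec.Cmd {
	cmd := exec.Command("/proc/self/exe", args...)
	cmd.Env = append(os.Environ(), childMarker+"=1")
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	return cmd
}

func main() {
	if isChild(os.Environ()) {
		// A real program would do its in-namespace work here.
		fmt.Println("running as re-executed child, pid", os.Getpid())
		return
	}
	if err := reexecCmd().Run(); err != nil {
		fmt.Fprintln(os.Stderr, "re-exec failed:", err)
		os.Exit(1)
	}
}
```

The sketch shows why the pattern is awkward for a library: the branch on the marker has to live in the main function of whatever program imports it.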
It's infeasible to implement fork in the Go standard library. After a program forks, there is only a single thread running in the child process. That means that the basic assumptions of the Go runtime fall apart. Almost any ordinary Go operation can potentially fail in that case. The syscall code that starts a child is very carefully written to avoid problems. We can't permit ordinary Go code to run between the fork and the execve.
Thanks for pointing to the package. There is a lot there, and I don't know what matters. It would help this proposal a great deal if you or somebody could write down exactly what would need to be added to syscall.SysProcAttr to make things work. In your initial comment you mentioned Cloneflags, UidMappings, and GidMappings, but all of those already exist. What do we need that is new?
@ianlancetaylor
I'm not a C expert but linked library works this way:
exec.Cmd
It re-executes /proc/self/exe, so the same binary is executed again in this case, but in general it should allow starting any binary.
// #cgo CFLAGS: -Wall -Wextra
// extern void _containers_unshare(void);
// void __attribute__((constructor)) init(void) {
// _containers_unshare();
// }
import "C"
The _containers_unshare C function does the real work of creating a namespace: https://github.com/containers/storage/blob/main/pkg/unshare/unshare.c#L299
And this is the flow:
newuidmap and newgidmap Linux binaries to set full mappings (both root user and subids if defined in /etc/subuid and /etc/subgid):
https://github.com/containers/storage/blob/main/pkg/unshare/unshare_linux.go#L266
https://github.com/containers/storage/blob/main/pkg/unshare/unshare_linux.go#L217
It may be done by an unprivileged user because newuidmap and newgidmap have the appropriate capability or the setuid bit set, depending on the Linux distro.
In this setup, flags for unshare are passed using the _Containers-unshare env variable:
https://github.com/containers/storage/blob/main/pkg/unshare/unshare_linux.go#L86
https://github.com/containers/storage/blob/main/pkg/unshare/unshare.c#L304
Entire process is sooooo complicated...
It would be much easier if the Go executable could call unshare before the real Go runtime starts.
We are not going to permit calling ordinary Go code between fork and execve.
I don't know what the newuidmap and newgidmap binaries are.
For this proposal to move forward we're going to need more precise details as to exactly what needs to be implemented in the syscall package. I'm going to put this on hold for now.
https://man7.org/linux/man-pages/man1/newuidmap.1.html https://man7.org/linux/man-pages/man1/newgidmap.1.html
I think I provided all the details on how the flow should look. But once again:
newuidmap and newgidmap in the parent process
unshare in the child process before the Go runtime takes control over execution
Go is frequently used to develop containerization engines, so it would be nice to implement this once and for all.
Why is it necessary to call an external binary to make this work? Why can't we do this entirely with system calls?
An unprivileged user may create only a single mapping: id in container -> current uid/gid on host. Providing more mappings causes a "permission denied" error, even if the mapped ids are specified in /etc/subuid or /etc/subgid. This is by design in Linux.
Calling newuidmap / newgidmap bypasses this limitation because those binaries have the setuid bit or a file capability set (depending on the distro), so they always run with the privileges required to set extended mappings.
Hey everyone, I tried to come up with an implementation for this issue: https://github.com/golang/go/compare/master...hown3d:go:master
I tried to follow the logic of util-linux's implementation of unshare, although I ran into an issue: The newuidmap and newgidmap executables need to be fork/exec'ed while the child of our current fork is waiting for the idmap writes. Since the previous fork didn't return yet, the ForkLock will still be locked and we run into a deadlock. https://github.com/golang/go/blob/29b9a328d268d53833d2cc063d1d8b4bf6852675/src/syscall/exec_unix.go#L200-L216
I think if that problem is sorted out, the implementation can go on further. Can someone with a bit more experience on the stdlib help out?
I've been running into operation not permitted any time I want to do anything advanced when setting UidMappings (note: uid/gid are interchangeable in this post), similar to https://github.com/golang/go/issues/50098#issuecomment-991554095
A little debugging reveals that it comes from when Go tries to write to the child process's /proc/{childpid}/uid_map: the write returns EPERM. Looking through user_namespaces(7), it lists many reasons why this can happen. However, many of the most likely reasons deal with capabilities.
To work out what is happening, I looked at the source code of the newuidmap(1) utility, as this utility worked just fine when setting uid mappings. I ended up here. When I compiled this utility without the HAVE_SYS_CAPABILITY_H flag, it failed with write returning EPERM, the same as Go.
In conclusion, in a non-root environment, doing anything advanced with UidMappings will always fail with EPERM unless capabilities are set before writeUidGidMappings is called.
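One way to check this diagnosis on a given machine is to inspect the effective capability mask of the process that performs the write. The sketch below is only a diagnostic aid under stated assumptions, not part of any proposed fix: hasCap and effectiveCaps are hypothetical helpers, and the bit numbers 7 (CAP_SETUID) and 6 (CAP_SETGID) are taken from the Linux capability definitions.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

const (
	capSetgid = 6 // CAP_SETGID bit number
	capSetuid = 7 // CAP_SETUID bit number
)

// hasCap reports whether the given bit is set in a hexadecimal capability
// mask, in the form printed on the CapEff line of /proc/<pid>/status.
func hasCap(hexMask string, bit uint) (bool, error) {
	mask, err := strconv.ParseUint(strings.TrimSpace(hexMask), 16, 64)
	if err != nil {
		return false, err
	}
	return mask&(1<<bit) != 0, nil
}

// effectiveCaps returns the CapEff value of the current process.
func effectiveCaps() (string, error) {
	f, err := os.Open("/proc/self/status")
	if err != nil {
		return "", err
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		if line := sc.Text(); strings.HasPrefix(line, "CapEff:") {
			return strings.TrimSpace(strings.TrimPrefix(line, "CapEff:")), nil
		}
	}
	if err := sc.Err(); err != nil {
		return "", err
	}
	return "", fmt.Errorf("CapEff not found in /proc/self/status")
}

func main() {
	mask, err := effectiveCaps()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	uidOK, _ := hasCap(mask, capSetuid)
	gidOK, _ := hasCap(mask, capSetgid)
	fmt.Printf("CapEff=%s CAP_SETUID=%v CAP_SETGID=%v\n", mask, uidOK, gidOK)
}
```

For an ordinary unprivileged process both values would normally be false, which is consistent with the EPERM described above.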
In other words, we need to implement the go-equivalent of this mess:
int cap;
struct __user_cap_header_struct hdr = {_LINUX_CAPABILITY_VERSION_3, 0};
struct __user_cap_data_struct data[2] = {{0}};

if (strcmp(map_file, "uid_map") == 0) {
    cap = CAP_SETUID;
} else if (strcmp(map_file, "gid_map") == 0) {
    cap = CAP_SETGID;
} else {
    fprintf(stderr, _("%s: Invalid map file %s specified\n"), Prog, map_file);
    exit(EXIT_FAILURE);
}

/* Align setuid- and fscaps-based new{g,u}idmap behavior. */
if (geteuid() == 0 && geteuid() != ruid) {
    if (prctl(PR_SET_KEEPCAPS, 1, 0, 0, 0) < 0) {
        fprintf(stderr, _("%s: Could not prctl(PR_SET_KEEPCAPS)\n"), Prog);
        exit(EXIT_FAILURE);
    }
    if (seteuid(ruid) < 0) {
        fprintf(stderr, _("%s: Could not seteuid to %d\n"), Prog, ruid);
        exit(EXIT_FAILURE);
    }
}

/* Lockdown new{g,u}idmap by dropping all unneeded capabilities. */
memset(data, 0, sizeof(data));
data[0].effective = CAP_TO_MASK(cap);
data[0].permitted = data[0].effective;
if (capset(&hdr, data) < 0) {
    fprintf(stderr, _("%s: Could not set caps\n"), Prog);
    exit(EXIT_FAILURE);
}
1.5 years later, having more experience running programs in namespaces using Go, I think there is no good solution for this, as the limitation comes from the kernel. There are 4 possible ways to deal with it:
newuidmap and newgidmap binaries to assign mappings; those binaries must be owned by root and have the setuid bit set (this is how podman approaches the problem)
My current understanding of this issue, which may be mistaken, is that we don't need to change any user-visible API in the standard library. This issue is about the fact that the UidMappings and GidMappings fields only work if the program is running as root. It is possible to implement them by having the parent process run newuidmap and/or newgidmap to set up the mappings. This has to be done after the child process has forked but before the child process has exec'ed the new program.
Does that sound right? If that's right, this doesn't need to be a proposal.
(As an aside, I don't see why ForkLock is a problem here. The parent process would be running in forkAndExecInChild, and would call forkAndExecInChild1 again specifically to run newuidmap and newgidmap.)
@ianlancetaylor
This has to be done after the child process has forked but before the child process has exec'ed the new program.
And this is the problem. Because, at the moment, we have no way to inject the logic into exec.Cmd to do this.
If we know precisely what has to be done, we can change the syscall package to do it. We don't have to support having the program inject arbitrary code. Even if there were a reasonable way to do that, it's not what we want.
I agree that adding code calling newuidmap or newgidmap to the standard library is not a good idea, but then developers should have a way to inject this logic in the right place. Otherwise the ability to configure namespaces is limited.
I am actually suggesting that perhaps we could change the syscall package to invoke newuidmap or newgidmap when appropriate.
I have this scenario:
unshare syscall allows me to map the root user only.
podman is able to map subids too by calling newuidmap and newgidmap (it is implemented here: https://github.com/containers/storage/tree/main/pkg/unshare/unshare_linux.go)
exec.Cmd plus some C code to call unshare in the middle of the process, after setting the full mapping set
syscall.Unshare(syscall.CLONE_NEWUSER) returns an invalid argument error even if called from a goroutine pinned to a thread.
It would be nice to have a more "goish" way to do it, like this: