rewrite in Go or in Rust

AkihiroSuda commented 2 years ago

Go version could be easily implemented just by forking from https://github.com/opencontainers/runc/tree/v1.1.0/contrib/cmd/seccompagent

rata commented 2 years ago

@AkihiroSuda hey, some questions to understand what an agent for this will look like.

The current code handles the connect(2) syscall: it creates a new socket identical to the one in use, swaps it in, and lets the application continue. So the agent will probably have to do just that, right? That is indeed quite trivial to implement (we probably just need to define the addfd bits in C, as the Go bindings don't have them yet?).
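As a rough illustration of the "create a new socket identical to the one used" step, here is a minimal Python sketch. It assumes the agent already holds a copy of the container's socket; the helper name clone_socket is hypothetical, and a real agent would likely need to copy more state (socket options, a bound address) than this does.

```python
import os
import socket


def clone_socket(original: socket.socket) -> socket.socket:
    """Create a fresh socket with the same domain, type and protocol.

    Hypothetical helper: the properties are read back with getsockopt,
    as an agent must do when all it holds is a file descriptor.
    """
    domain = original.getsockopt(socket.SOL_SOCKET, socket.SO_DOMAIN)
    sock_type = original.getsockopt(socket.SOL_SOCKET, socket.SO_TYPE)
    protocol = original.getsockopt(socket.SOL_SOCKET, socket.SO_PROTOCOL)
    # The new socket lives in the network namespace of the caller,
    # which for the agent would be the host netns.
    return socket.socket(domain, sock_type, protocol)


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as orig:
        with clone_socket(orig) as new:
            print(new.family, new.type, os.fstat(new.fileno()).st_ino)
```

The clone is a distinct socket (different inode) with the same parameters, which is what makes it a drop-in replacement from the application's point of view.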

However, some other questions to make sure I'm not missing something:

The seccomp agent runs in the host network namespace, so in a way the container is using something similar to hostNetwork (with some "filtering" to disallow connections to 127.0 and such). What is the advantage of this over using hostNetwork exactly? Can we use hostNetwork with rootless, or does it not work?
Wouldn't some restrictions be ineffective, as the agent creates a socket in the host network and later passes it to the container? Maybe traffic shaping or things like that won't be effective?
Will the container be able to handle incoming connections (let's say it is an nginx server)? I'm not sure I see how those connections will be routed to the container: is there an IP allocated and routed? Does that connection go over slow slirp, or how does it work (in rootless scenarios)?
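To make the "filtering" idea concrete, here is a small Python sketch of a destination check an agent could run before handing out a host-netns socket. The function name and the exact policy (refuse loopback and unspecified addresses) are assumptions for illustration; the thread only says that connections to loopback "and such" are disallowed.

```python
import ipaddress


def is_bypass_allowed(destination: str) -> bool:
    """Return True if a connect() to this IP may use a host-netns socket.

    Assumed policy (not taken from the project): refuse loopback and
    unspecified addresses, including IPv4-mapped IPv6 forms of them.
    """
    addr = ipaddress.ip_address(destination)
    # Unwrap ::ffff:a.b.c.d so the IPv4 rules apply to it as well.
    mapped = getattr(addr, "ipv4_mapped", None)
    if mapped is not None:
        addr = mapped
    return not (addr.is_loopback or addr.is_unspecified)


if __name__ == "__main__":
    for dest in ("127.0.0.1", "::1", "192.0.2.10"):
        print(dest, is_bypass_allowed(dest))
```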

AkihiroSuda commented 2 years ago

> What is the advantage of this over using hostNetwork exactly?

hostNetwork is insecure because it allows containers to connect to host abstract sockets (dbus, ibus, X11, ...) and host loopback addresses. I wrote a blog post about this issue: https://medium.com/nttlabs/dont-use-host-network-namespace-f548aeeef575
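For readers unfamiliar with abstract sockets, the following Python sketch shows why they matter here: an abstract Unix socket has no filesystem path, so reachability is decided by the network namespace rather than by what is mounted in the container. The socket name is made up for the demo.

```python
import os
import socket

# Hypothetical name for the demo; the leading NUL byte selects the
# abstract namespace instead of a filesystem path.
ABSTRACT_NAME = b"\0bypass4netns-demo-%d" % os.getpid()


def abstract_roundtrip(name: bytes = ABSTRACT_NAME) -> bytes:
    """Bind an abstract socket, connect to it by name, send one message."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as server:
        server.bind(name)  # no file is created on disk
        server.listen(1)
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
            # Any process in the same network namespace can do this.
            client.connect(name)
            conn, _ = server.accept()
            with conn:
                client.sendall(b"ping")
                return conn.recv(4)


if __name__ == "__main__":
    print(abstract_roundtrip())
```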

> Can we use hostNetwork with rootless, or does it not work?

Rootless Podman supports hostNetwork but it is insecure as explained above.

Rootless Docker and Rootless containerd do not support hostNetwork because dockerd and containerd are already executed inside netns (for ease of implementation).

> Wouldn't some restrictions be ineffective, as the agent creates a socket in the host network and later passes it to the container? Maybe traffic shaping or things like that won't be effective?

Yes, and the users will have to be aware of that. Perhaps, docker/podman/nerdctl should execute bypass4netns in the container's cgroup to enforce cgroup-scoped limitations.

> Will the container be able to handle incoming connections (let's say it is an nginx server)? I'm not sure I see how those connections will be routed to the container: is there an IP allocated and routed? Does that connection go over slow slirp, or how does it work (in rootless scenarios)?

More news to come, very soon

cc @naoki9911

rata commented 2 years ago

> > What is the advantage of this over using hostNetwork exactly?
>
> hostNetwork is insecure because it allows containers to connect to host abstract sockets (dbus, ibus, X11, ...) and host loopback addresses. I wrote a blog post about this issue: https://medium.com/nttlabs/dont-use-host-network-namespace-f548aeeef575

Oh, I had read that same blog post you wrote long ago. Good point, thanks! :)

> > Can we use hostNetwork with rootless, or does it not work?
>
> Rootless Podman supports hostNetwork but it is insecure as explained above.
>
> Rootless Docker and Rootless containerd do not support hostNetwork because dockerd and containerd are already executed inside netns (for ease of implementation).

Ohh, good to know. Thanks!

> > Wouldn't some restrictions be ineffective, as the agent creates a socket in the host network and later passes it to the container? Maybe traffic shaping or things like that won't be effective?
>
> Yes, and the users will have to be aware of that. Perhaps, docker/podman/nerdctl should execute bypass4netns in the container's cgroup to enforce cgroup-scoped limitations.

That seems interesting. I'm not sure doing so will be enough to avoid bypassing every restriction, though. I remember Eric Dumazet mentioning these limitations in a Linux Plumbers talk, and my memory of it is that it wasn't obvious whether what you suggest would be enough.

> > Will the container be able to handle incoming connections (let's say it is an nginx server)? I'm not sure I see how those connections will be routed to the container: is there an IP allocated and routed? Does that connection go over slow slirp, or how does it work (in rootless scenarios)?
>
> More news to come, very soon
>
> cc @naoki9911

haha, great! I assume for now the answer is that it can't receive incoming connections, right?

Also, now that I think about it, there are other limitations that come from doing this at connect(2) time. I think most can be solved if we handle socket(2) instead.

When we replace the fd at connect, it is already too late:

If the process added the fd to an epoll list (select and others probably fail too, but I tested this in the past with epoll), then the newly injected fd won't be used for that list. We found that some Java framework was using that pattern underneath, so this broke applications using it. The Go runtime, with net.Dial and such, adds the fd to an epoll list just after running connect(2), so Go apps were lucky.
If the fd is duplicated with dup(), dup2() or similar before the connect call, the duplicates keep referring to the original socket, so dup() and friends won't have the expected effect.
Probably any other fd-specific operation done before connect will fail or not behave as expected.
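The first two problems can be reproduced without seccomp at all. In the Python sketch below, os.dup2 inside a single process stands in for the agent replacing the fd at connect time (that substitution is an assumption of the demo, as are all function names); socketpair sockets are used so nothing touches the network.

```python
import os
import select
import socket


def replace_fd(target_fd: int, new_sock: socket.socket) -> None:
    """Stand-in for the agent's fd replacement: target_fd now refers to new_sock."""
    os.dup2(new_sock.fileno(), target_fd)


def dup_keeps_old_socket() -> bool:
    """True if a dup() taken before the replacement still points at the old socket."""
    old, old_peer = socket.socketpair()
    new, new_peer = socket.socketpair()
    try:
        old_inode = os.fstat(old.fileno()).st_ino
        early_dup = os.dup(old.fileno())  # application dups before connect
        replace_fd(old.fileno(), new)     # replacement happens at connect time
        same = os.fstat(early_dup).st_ino == old_inode
        os.close(early_dup)
        return same
    finally:
        for s in (old, old_peer, new, new_peer):
            s.close()


def epoll_after_replacement():
    """Return (epoll events, data read from the fd) after the fd was replaced."""
    old, old_peer = socket.socketpair()
    new, new_peer = socket.socketpair()
    ep = select.epoll()
    try:
        ep.register(old.fileno(), select.EPOLLIN)  # registered before connect
        replace_fd(old.fileno(), new)              # old socket is dropped here
        new_peer.send(b"x")                        # the new socket is now readable
        events = ep.poll(0)                        # what epoll reports
        data = old.recv(1)                         # what the fd number actually has
        return events, data
    finally:
        ep.close()
        for s in (old, old_peer, new, new_peer):
            s.close()


if __name__ == "__main__":
    print("dup still on old socket:", dup_keeps_old_socket())
    print("epoll events / readable data:", epoll_after_replacement())
```

The epoll registration is tied to the original socket, not to the fd number, so once the socket behind that number is swapped the application is never woken up even though data is waiting.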

If we replace the socket at socket(2) time, not connect, none of the mentioned problems will happen. If you dup the socket, add it to epoll lists, etc. it will all just work.

To inject it at socket(2) time safely, though, we need to use SECCOMP_ADDFD_FLAG_SEND in the addfd call. I added that flag to the kernel due to a race condition you can easily hit otherwise: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v5.17-rc2&id=0ae71c7720e3ae3aabd2e8a072d27f7bd173d25c.

With connect, as we do today, that flag is not needed: we always replace the same fd, so we don't inadvertently leak fds. With socket, we won't set the newfd field (just using 0 will allocate a free fd safely, as if they were calling socket themselves: lowest free int, etc.), so we need to use that flag to avoid the race.
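The difference between the two cases shows up in how the addfd request is filled in. The Python sketch below only packs the request structure as laid out in linux/seccomp.h; it does not issue the ioctl, and the two helper names are hypothetical.

```python
import struct

# Flag values from <linux/seccomp.h>
SECCOMP_ADDFD_FLAG_SETFD = 1 << 0  # install at the fd number given in newfd
SECCOMP_ADDFD_FLAG_SEND = 1 << 1   # install the fd and answer the syscall atomically

# struct seccomp_notif_addfd {
#     __u64 id; __u32 flags; __u32 srcfd; __u32 newfd; __u32 newfd_flags;
# };
ADDFD_FORMAT = "=QIIII"


def addfd_for_connect(notif_id: int, agent_fd: int, target_fd: int) -> bytes:
    """connect(2) case: overwrite the fd number the application already uses."""
    return struct.pack(
        ADDFD_FORMAT, notif_id, SECCOMP_ADDFD_FLAG_SETFD, agent_fd, target_fd, 0
    )


def addfd_for_socket(notif_id: int, agent_fd: int) -> bytes:
    """socket(2) case: let the kernel pick the fd and return it as the result."""
    return struct.pack(
        ADDFD_FORMAT, notif_id, SECCOMP_ADDFD_FLAG_SEND, agent_fd, 0, 0
    )


if __name__ == "__main__":
    print(addfd_for_connect(1, 5, 3).hex())
    print(addfd_for_socket(1, 5).hex())
```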

Also, today we only handle the connect syscall, so this won't work for UDP sockets that never call connect, which is quite common for UDP. (If the UDP socket does use connect, which is allowed but rare, it will work.) It probably won't work for other connection-less protocols either. But if we inject the socket at socket(2) time, it should work for UDP too, right? We will need to check the socket domain, type and so on in the agent, of course. But it should work IIUC. Am I missing something?
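The check on the socket(2) arguments could look roughly like the Python sketch below. The function name and the policy (only IPv4/IPv6 TCP and UDP sockets are replaced) are assumptions for illustration, not something the thread settles.

```python
import socket

# Flags that socket(2) accepts OR-ed into the type argument.
TYPE_FLAGS = socket.SOCK_NONBLOCK | socket.SOCK_CLOEXEC


def should_replace(domain: int, sock_type: int, protocol: int) -> bool:
    """Decide whether a socket(2) call should get a host-netns socket.

    Assumed policy: only AF_INET/AF_INET6 stream (TCP) and datagram (UDP)
    sockets qualify; everything else is left to the kernel as usual.
    """
    base_type = sock_type & ~TYPE_FLAGS  # strip SOCK_NONBLOCK / SOCK_CLOEXEC
    if domain not in (socket.AF_INET, socket.AF_INET6):
        return False
    if base_type == socket.SOCK_STREAM:
        return protocol in (0, socket.IPPROTO_TCP)
    if base_type == socket.SOCK_DGRAM:
        return protocol in (0, socket.IPPROTO_UDP)
    return False


if __name__ == "__main__":
    print(should_replace(socket.AF_INET, socket.SOCK_DGRAM, 0))
    print(should_replace(socket.AF_UNIX, socket.SOCK_STREAM, 0))
```

Because the decision is made when the socket is created, a UDP socket that only ever uses sendto would be covered without a connect call.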

So, what do you think of rewriting this to use runc 1.1, a seccomp agent in golang (using the example agent as base, or https://github.com/kinvolk/seccompagent/) and handling the socket(2) syscall instead of connect?

rata commented 2 years ago

@AkihiroSuda sorry, forgot to tag you! ^

AkihiroSuda commented 2 years ago

> haha, great! I assume for now the answer is that it can't receive incoming connections, right?

Now possible with #9. Thanks to @naoki9911!

> rewriting this to use runc 1.1, a seccomp agent in golang

Done in #9.

> handling the socket(2) syscall instead of connect?

How can we support accept() with that?

rata commented 2 years ago

@AkihiroSuda @naoki9911 cool!

@AkihiroSuda to support accept, I think we don't need to do anything. The container just runs accept... What is the issue with that?

I mean, if we inject the fd when the container calls socket(2), then the fd is an fd created from the host netns and accept will probably just work? Not sure if we might need to handle listen too, but I think accept would work. Wouldn't it?

AkihiroSuda commented 2 years ago

The problem is that we want to accept from processes inside the netns, as well as processes outside the netns.

i.e., when we run nerdctl run -p 80:80 --name foo --net-opt bypass4netns nginx, the nginx should be accessible from the processes in the same container (foo) too.

rata commented 2 years ago

Why accept from processes inside the netns? If a process inside the container gets back a socket from the host netns when it calls socket, it will just work when it runs connect, right?
