opencontainers / runc

CLI tool for spawning and running containers according to the OCI specification
https://www.opencontainers.org/
Apache License 2.0
11.58k stars 2.06k forks source link

[v1.2 regression] `SCMP_ACT_NOTIFY` rule for `fcntl` causes runc to hang, before connecting to the seccomp listener agent #4328

Open AkihiroSuda opened 3 days ago

AkihiroSuda commented 3 days ago

nerdctl run --rm --annotation nerdctl/bypass4netns=1 alpine echo hi works with runc v1.1.13, but hangs with runc v1.2.0-rc.2.

Reproduction steps:

AkihiroSuda commented 3 days ago

Looks like seccomping fcntl with SCMP_ACT_NOTIFY is no longer possible with runc v1.2 https://github.com/rootless-containers/bypass4netns/blob/v0.4.1/pkg/oci/oci.go#L14


Minimal repro:

AkihiroSuda commented 3 days ago

Regression in 20b95f23ca33cb1751ee0b5318e1d5f3676032c6 libcontainer: seccomp: pass around *os.File for notifyfd (PR #3982)

Seems to be related to https://cs.opensource.google/go/go/+/refs/tags/go1.22.4:src/os/file_unix.go;l=113

rata commented 3 days ago

This is borderline for me. Is it a regression? I mean, I feel like we should be allowed to change which syscalls we do after the seccomp policies are applied and before the container entrypoint exec. Especially since we are using go, that we don't have so much control on which syscalls are used.

In fact, this is a great example, it is not documented that os.File uses fcntl and they can probably change it in different versions of go. So it's hard for us to make promises, IMHO, when using such a high-level language.

Can't this be handled properly in the seccomp agent? I think it should be easy to handle if the agent does: (a) accept all the syscalls on fds it doesn't know about (those where created before the agent received the message on the unix socket, so those are runc fds), or (b) accept all the syscalls before the first exec.

I think one of those should be possible (IMHO the latter is more robust, I'd go for that)

@AkihiroSuda what do you think?

AkihiroSuda commented 3 days ago

cc @naoki9911 (maintainer of https://github.com/rootless-containers/bypass4netns)

accept all the syscalls before the first exec.

How can the agent detect the first exec?

rata commented 3 days ago

@AkihiroSuda if you notify on the exec syscall, then you can see if exec was done or not and accept all before processing an exec, after the exec do whatever the agent wants to do.

If you do not want to handle exec, then just accepting all on unknown fds will do the trick too for this issue.

rata commented 3 days ago

But I honestly don't think this is something we want to fix in runc. I'm okay if the PR to fix this is very small, but it seems like the agent should handle this better (also, maybe the agent doesn't work with youki or other runtimes until it has a more robust handling of this).

In fact, looking at the golang code you shared (https://cs.opensource.google/go/go/+/refs/tags/go1.22.4:src/os/file_unix.go;l=113) if only the agent returns an error, instead of hanging the syscall forever, it might just work fine. If fcntl returns an error, the runtime doesn't abort it at all in that function.

I do think a better handling in the agent as I proposed in the comments above is better, but even this small fix might just cut it for this case.

naoki9911 commented 2 days ago

I confirmed the issue.

bypass4netns always allows fcntl exec. It only records fcntl's arguments.

With runc v 1.1.13, bypass4netns hooks fcntl correctly.

$ .local/bin/runc -v
runc version 1.1.13
commit: v1.1.13-0-g58aa9203
spec: 1.0.2-dev
go: go1.22.4
libseccomp: 2.5.1

$ rootlesskit ./.local/bin/runc run foo
exec /bin/sh: no such file or directory
[rootlesskit:child ] error: command [./.local/bin/runc run foo] exited: exit status 1
[rootlesskit:parent] error: child exited: exit status 1

$ ./bypass4netns --trace
INFO[0000] Trace mode enabled
INFO[0000] SocketPath: /run/user/1002/bypass4netns.sock
INFO[0000] 127.0.0.0/8 is added to ignore
INFO[0000] Waiting for seccomp file descriptors
INFO[0014] accept connection
INFO[0014] Received new seccomp fd: 8
INFO[0014] tracer is disabled
INFO[0014] background task is ready. start to handle
TRAC[0014] Received syscall "fcntl", pid 564221, arch "amd64", args [4 3 0 0 0 0]
DEBU[0014] process is registered                         pid=564221 sockfd=4 syscall=fcntl
DEBU[0014] got sockfd=10
DEBU[0014] failed to get socket args err="getsockopt(SO_DOMAIN) failed: socket operation on non-socket"  pid=564221 sockfd=4 syscall=fcntl
DEBU[0014] socket is registered (state=NotBypassable)    pid=564221 sockfd=4 syscall=fcntl
TRAC[0014] Received syscall "fcntl", pid 564221, arch "amd64", args [4 4 34816 0 0 0]
TRAC[0014] Received syscall "fcntl", pid 564221, arch "amd64", args [4 3 0 0 0 0]
TRAC[0014] Received syscall "fcntl", pid 564221, arch "amd64", args [4 4 32768 0 0 0]

But with runc v1.2.0-rc.2, bypass4nets does not receive any notifications.

$ ./runc-amd64-v1.2.0-rc.2 -v
runc version 1.2.0-rc.2
commit: v1.2.0-rc.2-0-gf2d2ee5e-dirty
spec: 1.2.0
go: go1.22.3
libseccomp: 2.5.5

$ rootlesskit ./runc-amd64-v1.2.0-rc.2 run foo
<- Hangs here

$ ./bypass4netns --trace
INFO[0000] Trace mode enabled
INFO[0000] SocketPath: /run/user/1002/bypass4netns.sock
INFO[0000] 127.0.0.0/8 is added to ignore
INFO[0000] Waiting for seccomp file descriptors
<- No new notify connection received.

bypass4netns does not receive notify fd with runc v1.2.0-rc.2. I changed the hooking syscall to connect, bypass4netns successfully accepts notify fd.

      "syscalls": [
        {
          "names": [
-            "fcntl"
+           "connect"
         ],
          "action": "SCMP_ACT_NOTIFY"
        }
      ]
$ rootlesskit ./runc-amd64-v1.2.0-rc.2 run foo
exec /bin/sh: no such file or directory
[rootlesskit:child ] error: command [./runc-amd64-v1.2.0-rc.2 run foo] exited: exit status 255
[rootlesskit:parent] error: child exited: exit status 255

$ ./bypass4netns --trace
INFO[0000] Trace mode enabled
INFO[0000] SocketPath: /run/user/1002/bypass4netns.sock
INFO[0000] 127.0.0.0/8 is added to ignore
INFO[0000] Waiting for seccomp file descriptors
INFO[0002] accept connection <- accepted new connection.
INFO[0002] Received new seccomp fd: 8
INFO[0002] tracer is disabled
INFO[0002] background task is ready. start to handle

I think this issue occurs in the establishment of seccomp notification connection. Filtering fcntl may be causing some 'dead-lock' state in it.

rata commented 8 hours ago

Thinking about it more, we can probably break other setups changing the syscalls we do at that point. I think it is unlikely we do, but we can check if it's easy to fix this. Wanna take a look and open a PR?

(for the agent, though, I think the changes we talked about are the way to go anyways)