This is caused by a bug in the gofer filesystem client. One process creates a Unix domain socket file:
I1122 01:34:04.790152 1 strace.go:567] [ 91: 91] raylet E bind(0x23 socket:[56], 0x7ecfa65d44d0 {Family: AF_UNIX, Addr: "/tmp/ray/session_2024-11-22_01-34-01_491092_1/sockets/raylet"}, 0x3e)
I1122 01:34:04.790206 1 directfs_dentry.go:506] [ 91: 91] XXX in directfsDentry.bindAt(raylet)
I1122 01:34:04.790510 1 directfs_dentry.go:517] [ 91: 91] XXX directfsDentry.bindAt(raylet): controlFD=13 boundSocketFD=&{14 1442 824634530160}
I1122 01:34:04.790756 1 strace.go:605] [ 91: 91] raylet X bind(0x23 socket:[56], 0x7ecfa65d44d0 {Family: AF_UNIX, Addr: "/tmp/ray/session_2024-11-22_01-34-01_491092_1/sockets/raylet"}, 0x3e) = 0 (0x0) (561.104µs)
This creates a gofer.directfsDentry that retains the socket/unix/transport.HostBoundEndpoint representing the bound socket's host file descriptor. However, no reference is held on the dentry, so it can be evicted from the dentry cache, causing the endpoint to be lost. Another process connects to the socket before it is evicted:
I1122 01:34:04.898737 1 strace.go:567] [ 1: 1] python E connect(0x12 socket:[62], 0x7ed426e6bc50 {Family: AF_UNIX, Addr: "/tmp/ray/session_2024-11-22_01-34-01_491092_1/sockets/raylet"}, 0x3e)
I1122 01:34:04.898755 1 strace.go:570] [ 1: 163] worker.io E epoll_wait(0x10 anon_inode:[eventpoll], 0x7eddccff8170 {}, 0x80, 0xffffffff)
I1122 01:34:04.898783 1 strace.go:608] [ 1: 163] worker.io X epoll_wait(0x10 anon_inode:[eventpoll], 0x7eddccff8170 {{events=EPOLLIN data=[-0x43ff56e0, 0x7edd]}}, 0x80, 0xffffffff) = 1 (0x1) (4.286µs)
I1122 01:34:04.898812 1 strace.go:576] [ 148: 148] python E mmap(0x0, 0x40000, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, 0xffffffffffffffff (bad FD), 0x0)
I1122 01:34:04.898754 1 filesystem.go:1670] [ 1: 1] XXX gofer.filesystem.BoundEndpointAt: returning existing endpoint &{baseEndpoint:{Queue:0xc000ee2cc0 DefaultSocketOptionsHandler:{} endpointMutex:{mu:{m:{m:{state:0 sema:0}}}} receiver:<nil> connected:<nil> path:/tmp/ray/session_2024-11-22_01-34-01_491092_1/sockets/raylet ops:{handler:0xc000948000 stackHandler:0x20c6740 broadc>
I1122 01:34:04.898874 1 strace.go:605] [ 1: 1] python X connect(0x12 socket:[62], 0x7ed426e6bc50 {Family: AF_UNIX, Addr: "/tmp/ray/session_2024-11-22_01-34-01_491092_1/sockets/raylet"}, 0x3e) = 0 (0x0) (126.39µs)
After vigorous filesystem activity, the socket file's dentry is evicted, so when a later process connects to the same socket it gets a different endpoint:
I1122 01:34:08.672932 1 strace.go:567] [ 164: 164] ray::IDLE E connect(0x10 socket:[94], 0x7ecd34d90cd0 {Family: AF_UNIX, Addr: "/tmp/ray/session_2024-11-22_01-34-01_491092_1/sockets/raylet"}, 0x3e)
I1122 01:34:08.673009 1 filesystem.go:1676] [ 164: 164] XXX gofer.filesystem.BoundEndpointAt: returning new endpoint
I1122 01:34:08.673419 1 strace.go:605] [ 164: 164] ray::IDLE X connect(0x10 socket:[94], 0x7ecd34d90cd0 {Family: AF_UNIX, Addr: "/tmp/ray/session_2024-11-22_01-34-01_491092_1/sockets/raylet"}, 0x3e) = 0 (0x0) (435.392µs)
(The gofer filesystem client synthesizes a gofer.endpoint in this case for historical reasons, cl/146172912 internally.)
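The bug does not need ray to trigger. Below is a minimal standalone sketch of the race, assuming the same flags (-host-uds=create or -host-uds=all, with -overlay2=none) and assuming that creating and removing many files on the same mount is enough to evict the socket's dentry; the path and the amount of churn are guesses, not taken from the original repro:

```go
// repro.go: hedged sketch of the eviction race described above, not a verbatim
// reproduction of the ray workload. Paths and the amount of churn are assumptions.
package main

import (
	"fmt"
	"net"
	"os"
	"path/filepath"
	"time"
)

func main() {
	dir := "/tmp/uds-repro" // assumed to live on the gofer-backed mount
	os.MkdirAll(dir, 0o755)
	sock := filepath.Join(dir, "server.sock")
	os.Remove(sock)

	// Bind the socket; per the analysis above, this is when the gofer client
	// attaches the HostBoundEndpoint to the socket file's dentry.
	l, err := net.Listen("unix", sock)
	if err != nil {
		panic(err)
	}
	go func() {
		// Reply with one byte per connection so the client has something to wait for.
		for {
			c, err := l.Accept()
			if err != nil {
				return
			}
			c.Write([]byte("x"))
			c.Close()
		}
	}()

	// Churn the same mount so the socket's dentry can be evicted from the
	// dentry cache before the client looks the path up again.
	for i := 0; i < 100000; i++ {
		f := filepath.Join(dir, fmt.Sprintf("churn-%d", i))
		os.WriteFile(f, nil, 0o644)
		os.Remove(f)
	}

	// Under the bug, connect() still succeeds (a fresh endpoint is synthesized),
	// but the reply from the real listener never arrives, so the read times out.
	c, err := net.Dial("unix", sock)
	if err != nil {
		panic(err)
	}
	c.SetReadDeadline(time.Now().Add(5 * time.Second))
	buf := make([]byte, 1)
	if _, err := c.Read(buf); err != nil {
		fmt.Println("read blocked or failed (consistent with the bug):", err)
		return
	}
	fmt.Println("read reply OK (no repro)")
}
```

Under runc, or without the create permission, the final read should return immediately, matching the behavior described in the report below.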
I don't think the HostBoundEndpoint can reasonably be recovered if the dentry is dropped, so I think the correct fix is to make dentries with an endpoint (or a pipe) unevictable by, e.g., holding an extra reference that is dropped by UnlinkAt(), as we do for synthetic dentries (a toy sketch of that pinning idea follows the diff below). Simply leaking a reference is sufficient to fix the repro, but is obviously not a complete fix:
diff --git a/pkg/sentry/fsimpl/gofer/filesystem.go b/pkg/sentry/fsimpl/gofer/filesystem.go
index 18810d485..8bcec78c8 100644
--- a/pkg/sentry/fsimpl/gofer/filesystem.go
+++ b/pkg/sentry/fsimpl/gofer/filesystem.go
@@ -881,6 +881,8 @@ func (fs *filesystem) MknodAt(ctx context.Context, rp *vfs.ResolvingPath, opts v
return fs.doCreateAt(ctx, rp, false /* dir */, func(parent *dentry, name string, ds **[]*dentry) (*dentry, error) {
creds := rp.Credentials()
if child, err := parent.mknod(ctx, name, creds, &opts); err == nil {
+ // XXX
+ child.IncRef()
return child, nil
} else if !linuxerr.Equals(linuxerr.EPERM, err) {
return nil, err
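To make the pinning idea concrete, here is a toy model of the proposed behavior; every type and method name is invented for illustration, and this is not the actual gofer dentry code:

```go
// Toy model of "pin dentries that carry a bound endpoint": the entry takes an
// extra reference when it gains an endpoint, so a cache sweep cannot drop it,
// and unlink releases that pin.
package main

import "fmt"

type endpoint struct{ path string }

type dentry struct {
	name     string
	refs     int
	endpoint *endpoint
}

func (d *dentry) incRef() { d.refs++ }
func (d *dentry) decRef() { d.refs-- }

// setEndpoint models bindAt(): holding an extra reference once the dentry
// carries a bound endpoint means eviction can never lose the endpoint.
func (d *dentry) setEndpoint(ep *endpoint) {
	d.endpoint = ep
	d.incRef() // pin: released only by unlink
}

// unlink models UnlinkAt(): the pin is dropped, and the dentry becomes
// evictable again once all other references are gone.
func (d *dentry) unlink() {
	if d.endpoint != nil {
		d.endpoint = nil
		d.decRef()
	}
}

// evictable is what a dentry-cache sweep would check before dropping an entry.
func (d *dentry) evictable() bool { return d.refs == 0 }

func main() {
	d := &dentry{name: "raylet"}
	d.setEndpoint(&endpoint{path: "/tmp/ray/sockets/raylet"})
	fmt.Println("evictable while bound:", d.evictable()) // false
	d.unlink()
	fmt.Println("evictable after unlink:", d.evictable()) // true
}
```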
Description

I was trying to run the ray python library in a container with container uds create permissions (i.e., -host-uds=create or -host-uds=all) and -overlay2=none. I expect the initialization command ray.init() to return quickly, but the command instead seems to hang. I observe the correct behavior of returning quickly when I use:

- -host-uds with no container create permissions (i.e., no flag, or -host-uds=open), along with -overlay2=none
- -host-uds with no overlay flag

Running the same command with runc produces the correct behavior of returning quickly.

Running gdb in the container while it was running produced this output, indicating that the process was hanging while receiving a message over a socket. Inspecting the fd reveals that this is a unix socket.
cc @thundergolfer @pawalt
Steps to reproduce

Dockerfile

Configure docker json

Configure the flags -host-uds=all and -overlay2=none in /etc/docker/daemon.json:
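The daemon.json contents were attached as a collapsed block in the original report; a typical shape for this setup might look like the following, where the runsc binary path is an assumption:

```json
{
  "runtimes": {
    "runsc": {
      "path": "/usr/local/bin/runsc",
      "runtimeArgs": [
        "--host-uds=all",
        "--overlay2=none"
      ]
    }
  }
}
```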
I also observe the bug when -host-uds=create.

Run with runsc
The expected behavior is that running the script should exit quickly. When running with runsc, I observe the following output before the process hangs:

When running with runc, I observe the same output, but the process exits quickly.

runsc version
docker version (if using docker)
uname
Linux ip-10-1-13-177.ec2.internal 5.15.0-301.163.5.2.el9uek.x86_64 #2 SMP Wed Oct 16 18:55:42 PDT 2024 x86_64 x86_64 x86_64 GNU/Linux
kubectl (if using Kubernetes)
repo state (if built from source)
N/A
runsc debug logs (if available)