google / gvisor

Application Kernel for Containers
https://gvisor.dev

ray.init() unexpectedly hangs when -host-uds=all and -overlay2=none #11202

Open · azliu0 opened 21 hours ago

azliu0 commented 21 hours ago

Description

I was trying to run the ray Python library in a container with container UDS create permissions (i.e., -host-uds=create or -host-uds=all) and -overlay2=none. I expect the initialization command ray.init() to return quickly, but it hangs instead.

Running the same command with runc produces the correct behavior of returning quickly.

Running gdb in the container while the process was hanging produced this output:

#0  __libc_recv (flags=<optimized out>, len=8, buf=0x7edf2690aa00, fd=25)
    at ../sysdeps/unix/sysv/linux/recv.c:28
28      in ../sysdeps/unix/sysv/linux/recv.c
(gdb) bt
#0  __libc_recv (flags=<optimized out>, len=8, buf=0x7edf2690aa00, fd=25)
    at ../sysdeps/unix/sysv/linux/recv.c:28
#1  __libc_recv (fd=25, buf=0x7edf2690aa00, len=8, flags=0)
    at ../sysdeps/unix/sysv/linux/recv.c:23
#2  0x00007ea8237b9f0f in boost::asio::detail::socket_ops::recv1(int, void*, unsigned long, int, boost::system::error_code&) ()
   from /usr/local/lib/python3.11/site-packages/ray/_raylet.so
#3  0x00007ea8237bac42 in boost::asio::detail::socket_ops::sync_recv1(int, unsigned char, void*, unsigned long, int, boost::system::error_code&) ()
   from /usr/local/lib/python3.11/site-packages/ray/_raylet.so
#4  0x00007ea8231c1b18 in ray::ServerConnection::ReadBuffer(std::vector<boost::asio::mutable_buffer, std::allocator<boost::asio::mutable_buffer> > const&) ()
   from /usr/local/lib/python3.11/site-packages/ray/_raylet.so
#5  0x00007ea8231c4ef2 in ray::ServerConnection::ReadMessage(long, std::vector<unsigned char, std::allocator<unsigned char> >*) ()
   from /usr/local/lib/python3.11/site-packages/ray/_raylet.so
#6  0x00007ea822f92ef8 in ray::raylet::RayletConnection::AtomicRequestReply(ray::protocol::MessageType, ray::protocol::MessageType, std::vector<unsigned char, std::allocator<unsigned char> >*, flatbuffers::FlatBufferBuilder*) ()
   from /usr/local/lib/python3.11/site-packages/ray/_raylet.so
#7  0x00007ea822f93d29 in ray::raylet::RayletClient::RayletClient(instrumented_io_context&, std::shared_ptr<ray::rpc::NodeManagerWorkerClient>, std::string const&, ray::WorkerID const&, ray::rpc::WorkerType, ray::JobID const&, int const&, ray::rpc::Language const&, std::string const&, ray::Status*, ray::NodeID*, int*, std::string const&, --Type <RET> for more, q to quit, c to continue without paging--
long) () from /usr/local/lib/python3.11/site-packages/ray/_raylet.so
#8  0x00007ea822ed611b in ray::core::CoreWorker::CoreWorker(ray::core::CoreWorkerOptions const&, ray::WorkerID const&) ()
   from /usr/local/lib/python3.11/site-packages/ray/_raylet.so

indicating that the process was hanging while receiving a message over a socket. Inspecting the fd (25) reveals that it is a Unix domain socket.
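
As a side note, here is a minimal sketch of how the fd can be inspected via its /proc symlink (this is not from the original report; the file name checkfd.go and the command-line arguments are illustrative):

// checkfd.go: print what a process's file descriptor points at, e.g.
// "go run checkfd.go <pid> 25" for the fd seen in the backtrace above.
package main

import (
	"fmt"
	"os"
)

func main() {
	if len(os.Args) != 3 {
		fmt.Fprintln(os.Stderr, "usage: checkfd <pid> <fd>")
		os.Exit(1)
	}
	target, err := os.Readlink(fmt.Sprintf("/proc/%s/fd/%s", os.Args[1], os.Args[2]))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// A socket fd resolves to "socket:[<inode>]"; on Linux the inode can then be
	// looked up in /proc/net/unix to confirm it is a Unix domain socket and to
	// find the bound path.
	fmt.Println(target)
}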

cc @thundergolfer @pawalt

Steps to reproduce

Dockerfile

FROM python:3.9-slim
WORKDIR /app
COPY <<EOF /app/main.py
import ray
ray.init()
print("Ray initialized")
EOF
RUN pip install ray
CMD ["python", "main.py"]

Configure Docker's daemon.json

Configure the flags -host-uds=all and -overlay2=none in /etc/docker/daemon.json:

{
    "runtimes": {
        "runsc": {
            "path": "/path/to/runsc",
            "runtimeArgs": ["-host-uds=all", "-overlay2=none"]
        }
    }
}

I also observe the bug when using -host-uds=create.

Run with runsc

docker build -t ray-test .
docker run --runtime=runsc ray-test

The expected behavior is that the script exits quickly. When running with runsc, I observe the following output before the process hangs:

2024-11-21 18:44:04,736 WARNING services.py:2022 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=10.24gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2024-11-21 18:44:05,024 INFO worker.py:1819 -- Started a local Ray instance.

When running with runc, I observe the same output, but the process exits quickly.

runsc version

runsc version release-20241028.0
spec: 1.1.0-rc.1

docker version (if using docker)

Docker version 27.3.1, build ce12230

uname

Linux ip-10-1-13-177.ec2.internal 5.15.0-301.163.5.2.el9uek.x86_64 #2 SMP Wed Oct 16 18:55:42 PDT 2024 x86_64 x86_64 x86_64 GNU/Linux

kubectl (if using Kubernetes)

N/A

repo state (if built from source)

N/A

runsc debug logs (if available)

nixprime commented 14 hours ago

This is caused by a bug in the gofer filesystem client. One process creates a Unix domain socket file:

I1122 01:34:04.790152       1 strace.go:567] [  91:  91] raylet E bind(0x23 socket:[56], 0x7ecfa65d44d0 {Family: AF_UNIX, Addr: "/tmp/ray/session_2024-11-22_01-34-01_491092_1/sockets/raylet"}, 0x3e)
I1122 01:34:04.790206       1 directfs_dentry.go:506] [  91:  91] XXX in directfsDentry.bindAt(raylet)
I1122 01:34:04.790510       1 directfs_dentry.go:517] [  91:  91] XXX directfsDentry.bindAt(raylet): controlFD=13 boundSocketFD=&{14 1442 824634530160}
I1122 01:34:04.790756       1 strace.go:605] [  91:  91] raylet X bind(0x23 socket:[56], 0x7ecfa65d44d0 {Family: AF_UNIX, Addr: "/tmp/ray/session_2024-11-22_01-34-01_491092_1/sockets/raylet"}, 0x3e) = 0 (0x0) (561.104µs)

This creates a gofer.directfsDentry that retains the socket/unix/transport.HostBoundEndpoint representing the bound socket's host file descriptor. However, no reference is held on the dentry, so it can be evicted from the dentry cache, causing the endpoint to be lost. Another process connects to the socket before it is evicted:

I1122 01:34:04.898737       1 strace.go:567] [   1:   1] python E connect(0x12 socket:[62], 0x7ed426e6bc50 {Family: AF_UNIX, Addr: "/tmp/ray/session_2024-11-22_01-34-01_491092_1/sockets/raylet"}, 0x3e)
I1122 01:34:04.898755       1 strace.go:570] [   1: 163] worker.io E epoll_wait(0x10 anon_inode:[eventpoll], 0x7eddccff8170 {}, 0x80, 0xffffffff)
I1122 01:34:04.898783       1 strace.go:608] [   1: 163] worker.io X epoll_wait(0x10 anon_inode:[eventpoll], 0x7eddccff8170 {{events=EPOLLIN data=[-0x43ff56e0, 0x7edd]}}, 0x80, 0xffffffff) = 1 (0x1) (4.286µs)
I1122 01:34:04.898812       1 strace.go:576] [ 148: 148] python E mmap(0x0, 0x40000, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, 0xffffffffffffffff (bad FD), 0x0)
I1122 01:34:04.898754       1 filesystem.go:1670] [   1:   1] XXX gofer.filesystem.BoundEndpointAt: returning existing endpoint &{baseEndpoint:{Queue:0xc000ee2cc0 DefaultSocketOptionsHandler:{} endpointMutex:{mu:{m:{m:{state:0 sema:0}}}} receiver:<nil> connected:<nil> path:/tmp/ray/session_2024-11-22_01-34-01_491092_1/sockets/raylet ops:{handler:0xc000948000 stackHandler:0x20c6740 broadc>
I1122 01:34:04.898874       1 strace.go:605] [   1:   1] python X connect(0x12 socket:[62], 0x7ed426e6bc50 {Family: AF_UNIX, Addr: "/tmp/ray/session_2024-11-22_01-34-01_491092_1/sockets/raylet"}, 0x3e) = 0 (0x0) (126.39µs)

After vigorous filesystem activity, the socket file's dentry is evicted, so when a later process connects to the same socket it gets a different endpoint:

I1122 01:34:08.672932       1 strace.go:567] [ 164: 164] ray::IDLE E connect(0x10 socket:[94], 0x7ecd34d90cd0 {Family: AF_UNIX, Addr: "/tmp/ray/session_2024-11-22_01-34-01_491092_1/sockets/raylet"}, 0x3e)
I1122 01:34:08.673009       1 filesystem.go:1676] [ 164: 164] XXX gofer.filesystem.BoundEndpointAt: returning new endpoint
I1122 01:34:08.673419       1 strace.go:605] [ 164: 164] ray::IDLE X connect(0x10 socket:[94], 0x7ecd34d90cd0 {Family: AF_UNIX, Addr: "/tmp/ray/session_2024-11-22_01-34-01_491092_1/sockets/raylet"}, 0x3e) = 0 (0x0) (435.392µs)

(The gofer filesystem client synthesizes a gofer.endpoint in this case for historical reasons, cl/146172912 internally.)
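
To make the failure mode concrete, below is a self-contained Go sketch (hypothetical types, not gVisor code) of the sequence described above: the cached dentry retains the bound endpoint, eviction of the unreferenced dentry discards it, and a later connect() is handed a fresh endpoint that no raylet is listening on.

// Hypothetical model of the gofer dentry cache; none of these types are real
// gVisor code.
package main

import "fmt"

// boundEndpoint stands in for transport.HostBoundEndpoint: it wraps the host
// FD of the bound socket.
type boundEndpoint struct{ hostFD int }

// dentry stands in for gofer.dentry; refs > 0 pins it in the cache.
type dentry struct {
	refs     int
	endpoint *boundEndpoint
}

// cache stands in for the gofer dentry cache, keyed by name.
type cache map[string]*dentry

// evictUnreferenced models dentry-cache eviction under filesystem activity.
func (c cache) evictUnreferenced() {
	for name, d := range c {
		if d.refs == 0 {
			delete(c, name) // the retained HostBoundEndpoint is lost with the dentry
		}
	}
}

// boundEndpointAt models gofer.filesystem.BoundEndpointAt: return the retained
// endpoint if the dentry is still cached, otherwise synthesize a new one that
// has no listener attached.
func (c cache) boundEndpointAt(name string) *boundEndpoint {
	if d, ok := c[name]; ok && d.endpoint != nil {
		return d.endpoint // "returning existing endpoint"
	}
	return &boundEndpoint{hostFD: -1} // "returning new endpoint"
}

func main() {
	c := cache{}

	// bind(): the dentry retains the bound socket's host FD (14 in the logs
	// above), but no reference is held on the dentry.
	c["raylet"] = &dentry{refs: 0, endpoint: &boundEndpoint{hostFD: 14}}

	fmt.Println(c.boundEndpointAt("raylet").hostFD) // 14: early connect() reaches the raylet
	c.evictUnreferenced()                           // "vigorous filesystem activity"
	fmt.Println(c.boundEndpointAt("raylet").hostFD) // -1: later connect() succeeds, but recv() hangs
}

In this toy model, taking a reference at bind time and dropping it at unlink is exactly the pinning approach proposed below.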

I don't think the HostBoundEndpoint can reasonably be recovered if the dentry is dropped, so I think the correct fix is to make dentries with an endpoint (or a pipe) unevictable by e.g. holding an extra reference that is dropped by UnlinkAt(), as we do for synthetic dentries. Simply leaking a reference is sufficient to fix the repro, but is obviously not a complete fix:

diff --git a/pkg/sentry/fsimpl/gofer/filesystem.go b/pkg/sentry/fsimpl/gofer/filesystem.go
index 18810d485..8bcec78c8 100644
--- a/pkg/sentry/fsimpl/gofer/filesystem.go
+++ b/pkg/sentry/fsimpl/gofer/filesystem.go
@@ -881,6 +881,8 @@ func (fs *filesystem) MknodAt(ctx context.Context, rp *vfs.ResolvingPath, opts v
        return fs.doCreateAt(ctx, rp, false /* dir */, func(parent *dentry, name string, ds **[]*dentry) (*dentry, error) {
                creds := rp.Credentials()
                if child, err := parent.mknod(ctx, name, creds, &opts); err == nil {
+                       // XXX
+                       child.IncRef()
                        return child, nil
                } else if !linuxerr.Equals(linuxerr.EPERM, err) {
                        return nil, err