ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Core] gcs_server Failed accept4: Too many open files #38248

Open v4if opened 1 year ago

v4if commented 1 year ago

What happened + What you expected to happen

Repeatedly creating and deleting connections through ray.init causes gcs_server to leak socket file descriptors. The leaked sockets on the listening side stay stuck in the TCP CLOSE_WAIT state forever.

gcs_server.err log:

E0808 05:56:35.921231490      36 tcp_server_posix.cc:216]    Failed accept4: Too many open files

lsof |grep CLOSE_WAIT:

gcs_serve   19 1918 grpc_glob     ray *558u     IPv6             435469        0t0     TCP raycluster-dev-head-jcbxv:6379->172.17.0.93:36062 (CLOSE_WAIT)
gcs_serve   19 1918 grpc_glob     ray *573u     IPv6             437344        0t0     TCP raycluster-dev-head-jcbxv:6379->172.17.0.109:58762 (CLOSE_WAIT)
gcs_serve   19 1918 grpc_glob     ray *708u     IPv6             432725        0t0     TCP raycluster-dev-head-jcbxv:6379->172.17.0.110:42040 (CLOSE_WAIT)
gcs_serve   19 1918 grpc_glob     ray *924u     IPv6             436698        0t0     TCP raycluster-dev-head-jcbxv:6379->172.17.0.91:47284 (CLOSE_WAIT)
gcs_serve   19 1918 grpc_glob     ray *071u     IPv6             435608        0t0     TCP raycluster-dev-head-jcbxv:6379->172.17.0.111:52680 (CLOSE_WAIT)
gcs_serve   19 1918 grpc_glob     ray *104u     IPv6             438350        0t0     TCP raycluster-dev-head-jcbxv:6379->172.17.0.102:54740 (CLOSE_WAIT)
gcs_serve   19 1918 grpc_glob     ray *284u     IPv6             432888        0t0     TCP raycluster-dev-head-jcbxv:6379->172.17.0.96:58470 (CLOSE_WAIT)
gcs_serve   19 1918 grpc_glob     ray *458u     IPv6             436783        0t0     TCP raycluster-dev-head-jcbxv:6379->172.17.0.108:34328 (CLOSE_WAIT)
gcs_serve   19 1918 grpc_glob     ray *499u     IPv6             435765        0t0     TCP raycluster-dev-head-jcbxv:6379->172.17.0.112:45462 (CLOSE_WAIT)
...

lsof |grep CLOSE_WAIT|wc -l: 57072
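
To see which peers the stuck sockets belong to, the lsof output above can be tallied per remote host. The following is a minimal sketch, not part of the original report: the function name count_close_wait is hypothetical, and it assumes each line carries a "local:port->remote:port" field followed by "(CLOSE_WAIT)", as in the lines shown above.

```python
from collections import Counter


def count_close_wait(lsof_lines):
    """Tally CLOSE_WAIT sockets per remote host from lsof output lines.

    Assumes each socket line contains a "local:port->remote:port" field,
    as in the lsof output quoted in this issue.
    """
    per_peer = Counter()
    for line in lsof_lines:
        if "(CLOSE_WAIT)" not in line:
            continue
        # Find the connection field, e.g. "head:6379->172.17.0.93:36062".
        conn = next((field for field in line.split() if "->" in field), None)
        if conn is None:
            continue
        remote = conn.split("->", 1)[1]
        # Drop the trailing ":port" to keep only the remote host.
        host = remote.rsplit(":", 1)[0]
        per_peer[host] += 1
    return per_peer


# Two lines copied from the lsof output in this issue.
sample = [
    "gcs_serve   19 1918 grpc_glob     ray *558u     IPv6             435469        0t0     TCP raycluster-dev-head-jcbxv:6379->172.17.0.93:36062 (CLOSE_WAIT)",
    "gcs_serve   19 1918 grpc_glob     ray *573u     IPv6             437344        0t0     TCP raycluster-dev-head-jcbxv:6379->172.17.0.109:58762 (CLOSE_WAIT)",
]
counts = count_close_wait(sample)
for host, n in counts.most_common():
    print(host, n)
```

If a small number of hosts account for most of the entries, that may point to the specific clients that disconnect without the server side closing its end.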

Versions / Dependencies

ray, version 2.5.0

Reproduction script

Repeatedly create and delete connections through ray.init.
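
A connect/disconnect loop of the kind described could look roughly like the sketch below. It is an assumption-laden illustration, not a confirmed reproduction: the original report involves the Java client, while this uses the Python ray.init and ray.shutdown calls, and the helper names churn_connections and churn_ray, the "auto" address, and the iteration count are all hypothetical choices.

```python
def churn_connections(connect, disconnect, iterations):
    """Call connect() then disconnect() `iterations` times.

    Returns the number of completed connect/disconnect cycles.
    """
    completed = 0
    for _ in range(iterations):
        connect()
        disconnect()
        completed += 1
    return completed


def churn_ray(address="auto", iterations=1000):
    """Run the loop against a Ray cluster (assumed to be already running)."""
    # Imported lazily so churn_connections can be exercised without Ray installed.
    import ray

    return churn_connections(
        lambda: ray.init(address=address),
        ray.shutdown,
        iterations,
    )
```

Calling churn_ray() on a machine that can reach the head node, while watching lsof | grep CLOSE_WAIT | wc -l there, would show whether the count grows with each cycle. It is not established that the Python client triggers the same leak as the Java client.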

Issue Severity

High: It blocks me from completing my task.

xieus commented 1 year ago

@iycheng can you triage this one?

fishbone commented 1 year ago

@v4if could you share a script that reproduces this? It looks like an fd leak.

v4if commented 1 year ago

We use the Java client to call ray.init repeatedly, with lots of interactive tasks being created and deleted. @imperio-wxm, could you provide a reproduction script?

jjyao commented 1 year ago

Mark as P2 until we have a repro script.

fishbone commented 1 year ago

cc @SongGuyang for java client.