ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
34.23k stars, 5.81k forks

Ray client dumps core when the timeout is reached while connecting to the server #48267

Open deanzxl opened 1 month ago

deanzxl commented 1 month ago

What happened + What you expected to happen

When I use the following code to connect to the Ray server, the timeout is sometimes reached and the client generates a core dump file:

ray.init(address=f"ray://{head_ip}:{client_server_port}", _node_ip_address=head_ip)

When I inspect the core dump file, it only shows "/data/dscn/common/python/venv310/bin/python3.10 -m ray.util.client.server --add", and I cannot debug the core file. The exception thrown by the Ray client is:

Traceback (most recent call last):
  File "/data/dscn/ml/python/fate_arch/_ray.py", line 96, in ray_init
    ray.init(address=f"ray://{head_ip}:{client_server_port}", _node_ip_address=head_ip)
  File "/data/dscn/common/python/venv310/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/data/dscn/common/python/venv310/lib/python3.10/site-packages/ray/_private/worker.py", line 1483, in init
    ctx = builder.connect()
  File "/data/dscn/common/python/venv310/lib/python3.10/site-packages/ray/client_builder.py", line 175, in connect
    client_info_dict = ray.util.client_connect.connect(
  File "/data/dscn/common/python/venv310/lib/python3.10/site-packages/ray/util/client_connect.py", line 55, in connect
    conn = ray.connect(
  File "/data/dscn/common/python/venv310/lib/python3.10/site-packages/ray/util/client/__init__.py", line 233, in connect
    conn = self.get_context().connect(*args, **kw_args)
  File "/data/dscn/common/python/venv310/lib/python3.10/site-packages/ray/util/client/__init__.py", line 97, in connect
    self.client_worker._server_init(job_config, ray_init_kwargs)
  File "/data/dscn/common/python/venv310/lib/python3.10/site-packages/ray/util/client/worker.py", line 860, in _server_init
    raise ConnectionAbortedError(
ConnectionAbortedError: Initialization failure from server:
Traceback (most recent call last):
  File "/data/dscn/common/python/venv310/lib/python3.10/site-packages/ray/util/client/server/proxier.py", line 711, in Datapath
    raise RuntimeError(
RuntimeError: Starting Ray client server failed. See ray_client_server_23017.err for detailed logs.
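
The error message points at a file named ray_client_server_23017.err. A minimal sketch for listing such files on the head node is shown here. It assumes Ray's default log directory, /tmp/ray/session_latest/logs, which may differ if a custom temp directory was configured. The helper name find_client_server_logs is hypothetical and not part of Ray.

```python
import glob
import os


def find_client_server_logs(log_dir="/tmp/ray/session_latest/logs"):
    """Return sorted paths of ray_client_server*.err files found in log_dir.

    The default log_dir is an assumption (Ray's usual session log location);
    pass a different directory if the cluster uses a custom temp dir.
    """
    pattern = os.path.join(log_dir, "ray_client_server*.err")
    return sorted(glob.glob(pattern))


if __name__ == "__main__":
    for path in find_client_server_logs():
        print(path)
```

An empty result only means that no matching file exists in that directory on the machine where the script runs; the file may have been written to a different node or session directory.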

However, I cannot find ray_client_server_23017.err on either the client or the server. I added a retry mechanism that tries to connect to the Ray server again, and it works. But I don't want the core file to be generated. I know this can be done by modifying the server configuration, but can it be fixed on the Ray side? Thanks.
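
The retry mechanism mentioned above is not shown in the report. A minimal sketch of what such a wrapper might look like follows. The helper connect_with_retry, its parameters, and the backoff policy are hypothetical, not the reporter's actual code. It only retries the client-side call and does not prevent the server process from dumping core.

```python
import time


def connect_with_retry(connect, attempts=3, delay_s=2.0, retry_on=(ConnectionError,)):
    """Call connect() until it succeeds or attempts are exhausted.

    ConnectionAbortedError (raised in the traceback above) is a subclass of
    ConnectionError, so it is covered by the default retry_on.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_exc = None
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except retry_on as exc:
            last_exc = exc
            if attempt < attempts:
                # Linear backoff: wait longer after each failed attempt.
                time.sleep(delay_s * attempt)
    raise last_exc


# Hypothetical usage with the call from this report (requires ray and a
# reachable head node, so it is left commented out):
# import ray
# ctx = connect_with_retry(
#     lambda: ray.init(
#         address=f"ray://{head_ip}:{client_server_port}",
#         _node_ip_address=head_ip,
#     )
# )
```

Passing the connection call as a function keeps the retry logic independent of Ray, so it can be exercised without a cluster.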

Versions / Dependencies

Server: Ubuntu 22.04.4
Ray version: 2.35
Python version: 3.10

Reproduction script

None

Issue Severity

None

jcotant1 commented 3 weeks ago

Hey @kevin85421 @jjyao - do you know if there's any update on this one? Thx

jjyao commented 1 week ago

Hi @deanzxl do you have a repro?

deanzxl commented 1 week ago

> Hi @deanzxl do you have a repro?

This issue doesn't show up every time. As mentioned above, it happens when I execute ray.init(***). I saved a core dump file; is there a way to send it to you?