ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Core] [Bug] No timeout or deadlock on scheduling job in remote cluster #21419

Open jakub-valenta opened 2 years ago

jakub-valenta commented 2 years ago

Search before asking

Ray Component

Ray Clusters

What happened + What you expected to happen

Sometimes calling my_function.remote(args) never returns.

I used Python's faulthandler module to get a stack trace of the frozen process, and it looks like there is a deadlock or a missing timeout on a network call:

  File "/usr/local/lib/python3.9/dist-packages/ray/util/client/dataclient.py", line 281 in _async_send
  File "/usr/local/lib/python3.9/dist-packages/ray/util/client/dataclient.py", line 363 in ReleaseObject
  File "/usr/local/lib/python3.9/dist-packages/ray/util/client/worker.py", line 532 in _release_server
  File "/usr/local/lib/python3.9/dist-packages/ray/util/client/worker.py", line 526 in call_release
  File "/usr/local/lib/python3.9/dist-packages/ray/util/client/api.py", line 118 in call_release
  File "/usr/lib/python3.9/queue.py", line 133 in put
  File "/usr/local/lib/python3.9/dist-packages/ray/util/client/dataclient.py", line 287 in _async_send
  File "/usr/local/lib/python3.9/dist-packages/ray/util/client/dataclient.py", line 368 in Schedule
  File "/usr/local/lib/python3.9/dist-packages/ray/util/client/worker.py", line 500 in _call_schedule_for_task
  File "/usr/local/lib/python3.9/dist-packages/ray/util/client/worker.py", line 459 in call_remote
  File "/usr/local/lib/python3.9/dist-packages/ray/util/client/api.py", line 106 in call_remote
  File "/usr/local/lib/python3.9/dist-packages/ray/util/client/common.py", line 380 in remote
  File "/usr/local/lib/python3.9/dist-packages/ray/util/client/common.py", line 130 in _remote
  File "/usr/local/lib/python3.9/dist-packages/ray/_private/client_mode_hook.py", line 173 in client_mode_convert_function
  File "/usr/local/lib/python3.9/dist-packages/ray/remote_function.py", line 222 in _remote
  File "/usr/local/lib/python3.9/dist-packages/ray/util/tracing/tracing_helper.py", line 295 in _invocation_remote_span
  File "/usr/local/lib/python3.9/dist-packages/ray/remote_function.py", line 180 in remote

In general, it would be great to have timeouts on all Ray functions that touch the network. That would make recovery possible in client code.
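Until such timeouts exist, one possible client-side guard is to run the potentially blocking call in a worker thread and give up after a deadline. This is a hedged sketch, not Ray API: it lets the caller recover, but the hung call keeps running in the background thread and the underlying RPC is not cancelled.

```python
# Sketch of a client-side deadline guard (not part of Ray's API).
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def call_with_timeout(fn, *args, timeout=30.0, **kwargs):
    """Run fn(*args, **kwargs); raise TimeoutError if it does not return in time."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(fn, *args, **kwargs)
        return future.result(timeout=timeout)
    except FutureTimeout as exc:
        raise TimeoutError(f"call did not return within {timeout}s") from exc
    finally:
        # wait=False so a hung worker thread does not block shutdown.
        pool.shutdown(wait=False)
```

With this, the hanging call from the report could be guarded as `call_with_timeout(lambda: my_function.remote(args), timeout=30)` and the client could at least raise instead of freezing forever.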

Versions / Dependencies

Ray 1.8, Debian stable based Docker image with Python 3.9

Reproduction script

I did not find a way to reliably reproduce this, but it can be triggered by any remote function.

Anything else

No response

Are you willing to submit a PR?

rkooo567 commented 2 years ago

Are you using the ray client btw?

jakub-valenta commented 2 years ago

Yes, this is using the Ray client to connect to the cluster.

scv119 commented 2 years ago

Should this be bumped to P0?

scv119 commented 2 years ago

@jakub-valenta do you have a simple script that we could use to reproduce this?

jakub-valenta commented 2 years ago

I was not able to reliably reproduce this. It does not happen often, only about once a week. The problem is that it freezes the Python process forever. Having a timeout on the remote method, so that client code can recover from this situation, would be great.

jakub-valenta commented 2 years ago

I was able to reproduce this while debugging in an IPython console. I was testing Ray client reconnection after a network failure or after restarting the Ray cluster. I created a dummy function:

@ray.remote
def a(x):
    return x + 1

After connecting to the remote cluster, I disabled and re-enabled the network, and also restarted the Ray head and operator nodes. Calling the remote function raised an exception:

In [31]: r = a.remote(5)
2022-02-21 10:22:30,952        WARNING dataclient.py:220 -- Encountered connection issues in the data channel. Attempting to reconnect.
2022-02-21 10:22:57,512 ERROR dataclient.py:150 -- Unrecoverable error in data channel.
Unexpected exception:
Traceback (most recent call last):
  File "/home/jakub/workspace/model/model-import/venv/lib/python3.9/site-packages/ray/util/client/logsclient.py", line 68, in _log_main
    for record in log_stream:
  File "/home/jakub/workspace/model/model-import/venv/lib/python3.9/site-packages/grpc/_channel.py", line 426, in __next__
    return self._next()
  File "/home/jakub/workspace/model/model-import/venv/lib/python3.9/site-packages/grpc/_channel.py", line 826, in _next
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
        status = StatusCode.NOT_FOUND
        details = "Logstream proxy failed to connect. Channel for client 466a2a3fe7d64d4eb1786e9d531fe810 not found."
        debug_error_string = "{"created":"@1645435387.502123665","description":"Error received from peer ipv4:127.0.1.1:10001","file":"src/core/lib/surface/call.cc","file_line":1069,"grpc_message":"Logstream proxy failed to connect. Channel for client 466a2a3fe7d64d4eb1786e9d531fe810 not found.","grpc_status":5}"
>

But the Ray client still reported that it was connected to the remote cluster:

In [32]: ray.util.client.ray.is_connected()
Out[32]: True
In [33]: r = a.remote(5) # freezes python process

The second remote function call froze the Python process.
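Since `ray.util.client.ray.is_connected()` can apparently report `True` even when the data channel is dead, a round-trip probe with a deadline is one way to detect this state before issuing real work. A minimal generic sketch follows; `roundtrip` is a placeholder for an actual echo against the cluster (e.g. `lambda: ray.get(ping.remote(), timeout=5)`), and none of this is Ray API:

```python
# Sketch: trust an actual round-trip, not a cached connection flag.
import queue
import threading


def cluster_alive(roundtrip, timeout=5.0):
    """Return True only if roundtrip() completes successfully within timeout."""
    done = queue.Queue(maxsize=1)

    def run():
        try:
            roundtrip()
            done.put(True)
        except Exception:
            done.put(False)

    # Daemon thread: a hung probe will not keep the interpreter alive.
    threading.Thread(target=run, daemon=True).start()
    try:
        return done.get(timeout=timeout)
    except queue.Empty:
        return False
```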

Python version: 3.9.10, Ray version: 1.10.0

xychu commented 2 years ago

Any updates here? I am also hitting this error, and Ray tasks hang. Even a new client connection with `ray.init("ray://ip:10001")` hangs as well, if the "ray_client_server.err" log has gRPC-related exceptions.
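Until the hang is fixed, one workaround for the stuck-reconnect case is to retry the connection with exponential backoff rather than letting a single init call block forever. A hedged, generic sketch (the `connect` callable is a placeholder for something like `ray.shutdown()` followed by `ray.init(...)` wrapped in a deadline guard; this is not Ray API):

```python
# Sketch: retry a flaky connect step with exponential backoff.
import time


def connect_with_retries(connect, attempts=3, base_delay=0.5,
                         retry_on=(ConnectionError, TimeoutError)):
    """Call connect() up to `attempts` times, backing off between failures."""
    for attempt in range(attempts):
        try:
            return connect()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * (2 ** attempt))
```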

zhe-thoughts commented 2 years ago

Putting this as a P1 (looks like there's a repro script). Let's fix it.