Open jakub-valenta opened 2 years ago
Are you using the ray client btw?
Yes, this is using ray client connecting to cluster.
Should this be bumped to p0?
@jakub-valenta do you have a simple script that we could use to reproduce this?
I was not able to reliably reproduce this. It does not happen often, only once a week. The problem is that it freezes the Python process forever. Having a timeout on the remote method, so that client code can recover from this situation, would be great.
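Until a built-in timeout exists, one possible client-side workaround is to issue the call from a daemon thread and stop waiting after a deadline. This is only a sketch: `call_with_timeout` is a hypothetical helper, not part of Ray, and it assumes that calling `.remote()` from a secondary thread is acceptable in your setup. It does not cancel the stuck call. It only lets the caller regain control.

```python
import queue
import threading


def call_with_timeout(fn, timeout_s, *args, **kwargs):
    """Run fn(*args, **kwargs) in a daemon thread and wait at most timeout_s seconds.

    Hypothetical helper, not a Ray API. On timeout the worker thread is left
    behind (it cannot be killed), but being a daemon it will not block exit.
    """
    outcome = queue.Queue(maxsize=1)

    def worker():
        try:
            outcome.put((True, fn(*args, **kwargs)))
        except BaseException as exc:  # forward any failure to the caller
            outcome.put((False, exc))

    threading.Thread(target=worker, daemon=True).start()
    try:
        ok, value = outcome.get(timeout=timeout_s)
    except queue.Empty:
        raise TimeoutError(f"call did not return within {timeout_s} s") from None
    if ok:
        return value
    raise value


# Usage with a Ray remote function (assumes `a` is defined with @ray.remote):
#     ref = call_with_timeout(a.remote, 30, 5)
```

On `TimeoutError` the client code can then decide to disconnect and reconnect, or to restart the process, instead of hanging forever.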
I was able to reproduce this while debugging in an IPython console. I was testing Ray client reconnection after a network failure or after restarting the Ray cluster. I created a dummy function:
@ray.remote
def a(x):
    return x + 1
After connecting to the remote cluster, I disabled and re-enabled the network, and I also restarted the Ray head and operator nodes. Calling the remote function raised an exception:
In [31]: r = a.remote(5)
2022-02-21 10:22:30,952 WARNING dataclient.py:220 -- Encountered connection issues in the data channel. Attempting to reconnect.
2022-02-21 10:22:57,512 ERROR dataclient.py:150 -- Unrecoverable error in data channel.
Unexpected exception:
Traceback (most recent call last):
File "/home/jakub/workspace/model/model-import/venv/lib/python3.9/site-packages/ray/util/client/logsclient.py", line 68, in _log_main
for record in log_stream:
File "/home/jakub/workspace/model/model-import/venv/lib/python3.9/site-packages/grpc/_channel.py", line 426, in __next__
return self._next()
File "/home/jakub/workspace/model/model-import/venv/lib/python3.9/site-packages/grpc/_channel.py", line 826, in _next
raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.NOT_FOUND
details = "Logstream proxy failed to connect. Channel for client 466a2a3fe7d64d4eb1786e9d531fe810 not found."
debug_error_string = "{"created":"@1645435387.502123665","description":"Error received from peer ipv4:127.0.1.1:10001","file":"src/core/lib/surface/call.cc","file_line":1069,"grpc_message":"Logstream proxy failed to connect. Channel for client 466a2a3fe7d64d4eb1786e9d531fe810 not found.","grpc_status":5}"
>
But the Ray client still reported that it was connected to the remote cluster:
In [32]: ray.util.client.ray.is_connected()
Out[32]: True
In [33]: r = a.remote(5) # freezes python process
The second remote function call froze the Python process.
Python version: 3.9.10 Ray version: 1.10.0
Any updates here? I am also hitting this error, and Ray tasks hang.
Even a new client connection with ray.init("ray://ip:10001") hangs as well if the "ray_client_server.err" log contains gRPC-related exceptions.
Putting this as a P1 (looks like there's a repro script). Let's fix it
Search before asking
Ray Component
Ray Clusters
What happened + What you expected to happen
Sometimes calling my_function.remote(args) never returns.
I used the Python faulthandler module to get a stack trace of the frozen process, and it looks like there is a deadlock or a missing timeout on a network call:
Generally, it would be great to have timeouts on all Ray functions that deal with the network. That would make recovery possible in client code.
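For anyone else trying to capture the same kind of stack trace, here is a minimal sketch of how faulthandler can be armed around a call that might hang. `run_with_hang_dump` is a hypothetical helper name. The only standard-library calls used are `faulthandler.dump_traceback_later` and `faulthandler.cancel_dump_traceback_later`. The log file must be a real file with a file descriptor, for example `sys.stderr` or an opened file.

```python
import faulthandler


def run_with_hang_dump(fn, dump_after_s, log_file):
    """Call fn(); if it is still running after dump_after_s seconds,
    write the stack of every thread to log_file.

    Hypothetical helper. The dump does not interrupt fn(); it only records
    where each thread is blocked (exit=False keeps the process alive).
    """
    faulthandler.dump_traceback_later(dump_after_s, exit=False, file=log_file)
    try:
        return fn()
    finally:
        # Disarm the watchdog once the call has returned or raised.
        faulthandler.cancel_dump_traceback_later()


# Usage with a Ray remote function (assumes `my_function` is a @ray.remote function):
#     import sys
#     ref = run_with_hang_dump(lambda: my_function.remote(args), 60, sys.stderr)
```

If the call freezes, the dump shows which frame each thread is stuck in, which is the information needed to tell a deadlock from a network call without a timeout.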
Versions / Dependencies
Ray 1.8, Debian stable based Docker image with Python 3.9
Reproduction script
Did not find a way to reliably reproduce this, but it can be triggered by any function.
Anything else
No response
Are you willing to submit a PR?