We are trying to use Ray's debugging option, but we are having some issues connecting to the breakpoints.
We have been investigating the problem locally with little success so far, and we are now reaching out for some help :)
TL;DR: whenever we try to connect to a breakpoint with ray debug, the host crashes with the error:
  File "/root/.pyenv/versions/3.10.11/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfe in position 63: invalid start byte
This error seems to originate from worker.py#L2662.
The full process is described below.
Thank you very much in advance for your help!
First, we start a server with the ray-debugger flag: poetry run ray start --ray-debugger-external --...
Second, we define an actor as follows:
    import ray

    @ray.remote(resources={"worker": 1})
    class MyRayObject:
        def __init__(self, ...):
            breakpoint()  # opens the RemotePdb session reported in the log below
Third, we initialize Ray.
Finally, the breakpoint is triggered correctly:
(MyRayObject pid=241, ip=XX.XX.X.XX) RemotePdb session open at XX.XX.X.XX:YYYYY, use 'ray debug' to connect...
We then connect to the debugger with
export RAY_ADDRESS=localhost:YYYYY ; ray debug
We then get the following error from the debugger:
2024-03-07 16:31:04,242 INFO scripts.py:204 -- Connecting to Ray instance at 10.12.8.29:40321.
2024-03-07 16:31:04,242 INFO worker.py:1540 -- Connecting to existing Ray cluster at address: 10.12.8.29:40321...
Traceback (most recent call last):
  File "/home/.../python3.10/site-packages/ray/_private/worker.py", line 1680, in init
    _global_node = ray._private.node.Node(
  File "/home/.../python3.10/site-packages/ray/_private/node.py", line 153, in __init__
    self._init_gcs_client()
  File "/home/.../python3.10/site-packages/ray/_private/node.py", line 730, in _init_gcs_client
    raise RuntimeError(
RuntimeError: Failed to connect to GCS.
and from the host:
  File "/.../my_ray_object.py", line XX, in __init__
  File "/root/.pyenv/versions/3.10.11/lib/python3.10/bdb.py", line 90, in trace_dispatch
    return self.dispatch_line(frame)
  File "/root/.pyenv/versions/3.10.11/lib/python3.10/bdb.py", line 114, in dispatch_line
    self.user_line(frame)
  File "/root/.pyenv/versions/3.10.11/lib/python3.10/pdb.py", line 253, in user_line
    self.interaction(frame, None)
  File "/root/.pyenv/versions/3.10.11/lib/python3.10/pdb.py", line 348, in interaction
    self._cmdloop()
  File "/root/.pyenv/versions/3.10.11/lib/python3.10/pdb.py", line 313, in _cmdloop
    self.cmdloop()
  File "/root/.pyenv/versions/3.10.11/lib/python3.10/cmd.py", line 132, in cmdloop
    line = self.stdin.readline()
  File "/root/.pyenv/versions/3.10.11/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfe in position 63: invalid start byte
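The last frames of the host-side traceback show pdb reading a line from its input stream and failing while decoding it as UTF-8. The sketch below reproduces that failure mode with the standard library only. It assumes, without confirmation from the logs above, that the debugger's stream received bytes that are not UTF-8 text; the payload and the `read_command` helper are made up for illustration.

```python
import io


def read_command(raw: bytes) -> str:
    """Read one line from a byte stream through a UTF-8 text wrapper."""
    # TextIOWrapper decodes the underlying bytes when a line is requested.
    stream = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8")
    return stream.readline()


# Hypothetical payload (an assumption, not taken from the logs):
# 63 ASCII bytes followed by 0xfe, which can never start a UTF-8 sequence.
payload = b"x" * 63 + b"\xfe" + b"\n"

try:
    read_command(payload)
except UnicodeDecodeError as exc:
    failure = exc
    print(f"{type(exc).__name__}: {exc}")
else:
    failure = None
```

Whether this matches the actual cause depends on what was sent to the RemotePdb port, which the logs above do not show.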
Exploring further, it seems that the error is raised here:
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
    for i, value in enumerate(values):
        if isinstance(value, RayError):
            if isinstance(value, ray.exceptions.ObjectLostError):
                worker.core_worker.dump_object_store_memory_usage()
            if isinstance(value, RayTaskError):
                raise value.as_instanceof_cause()  # HERE
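The marked line re-raises, on the calling side, an exception that was captured in the remote task, so the UnicodeDecodeError probably surfaces at this line rather than originating from it. Below is a simplified sketch of that wrap-and-re-raise pattern. `TaskError` and `get` are hypothetical stand-ins written for illustration; this is not Ray's implementation of `RayTaskError`.

```python
class TaskError(Exception):
    """Hypothetical wrapper for an exception captured in a remote task."""

    def __init__(self, cause: BaseException):
        super().__init__(str(cause))
        self.cause = cause

    def as_instanceof_cause(self) -> BaseException:
        """Return an error that is both a TaskError and the cause's type."""
        cause_cls = type(self.cause)
        # Build a class inheriting from both, so `except cause_cls` matches.
        derived = type(
            f"TaskError({cause_cls.__name__})", (TaskError, cause_cls), {}
        )
        return derived(self.cause)


def get(value):
    """Mimic the calling side: re-raise wrapped task failures."""
    if isinstance(value, TaskError):
        raise value.as_instanceof_cause()
    return value


try:
    get(TaskError(ValueError("raised inside the task")))
except ValueError as exc:
    caught = exc
    print(type(exc).__name__, "->", exc)
```

Under this pattern, the traceback on the calling side points at the re-raise, while the root cause has to be read from the task's own traceback.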
This is all the information we could collect, and we do not know how to proceed from here :-/
Versions / Dependencies
ray 2.9.3 Ray provides a simple, universal API for building distributed applications.
├── aiosignal
│   └── frozenlist >=1.1.0
├── click >=7.0
│   └── colorama
├── filelock
├── frozenlist
├── jsonschema
│   ├── attrs >=22.2.0
│   ├── jsonschema-specifications >=2023.03.6
│   │   └── referencing >=0.31.0
│   │       ├── attrs >=22.2.0 (circular dependency aborted here)
│   │       └── rpds-py >=0.7.0
│   ├── referencing >=0.28.4 (circular dependency aborted here)
│   └── rpds-py >=0.7.1 (circular dependency aborted here)
├── msgpack >=1.0.0,<2.0.0
├── packaging
├── protobuf >=3.15.3,<3.19.5 || >3.19.5
├── pyyaml
└── requests
    ├── certifi >=2017.4.17
    ├── charset-normalizer >=2,<4
    ├── idna >=2.5,<4
    └── urllib3 >=1.21.1,<3
Reproduction script
See above.
Issue Severity
Medium: It is a significant difficulty but I can work around it.