ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[<Ray component: Core|RLlib|etc...>] #43782

Open fstrub95-cohere opened 4 months ago

fstrub95-cohere commented 4 months ago

What happened + What you expected to happen

Hi folks,

We are trying to use Ray's debugging option, but we are having some issues connecting to the breakpoints. We have been investigating the problem locally with little success so far, and we are now reaching out for some help :)

TL;DR: whenever we try to connect to a breakpoint with ray debug, the host crashes with the error:

  File "/root/.pyenv/versions/3.10.11/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfe in position 63: invalid start byte

which is an error that seems to originate from worker.py#L2662.

The full process is described below.

Thank you very much in advance for your help!


First, we start a server with the ray-debugger flag: poetry run ray start --ray-debugger-external --...

Second, we have an object as follows:

@ray.remote(resources={"worker": 1})
class MyRayObject:
    def __init__(self, ...):
        breakpoint()

Third, we initialize Ray as follows:

  ray_cluster_info = ray.init(
      address="local",
      include_dashboard=...,
      dashboard_host="0.0.0.0",
      resources=...,
      runtime_env={"worker_process_setup_hook": some_fn},
  )

Finally, the breakpoint is triggered correctly:

(MyRayObject pid=241, ip=XX.XX.X.XX) RemotePdb session open at XX.XX.X.XX:YYYYY, use 'ray debug' to connect...

We then connect to the debugger with:

export RAY_ADDRESS=localhost:YYYYY ; ray debug

We then get the following error from the debugger:

2024-03-07 16:31:04,242 INFO scripts.py:204 -- Connecting to Ray instance at 10.12.8.29:40321.
2024-03-07 16:31:04,242 INFO worker.py:1540 -- Connecting to existing Ray cluster at address: 10.12.8.29:40321...
Traceback (most recent call last):
  File "/home/.../python3.10/site-packages/ray/_private/worker.py", line 1680, in init
    _global_node = ray._private.node.Node(
  File "/home/.../python3.10/site-packages/ray/_private/node.py", line 153, in __init__
    self._init_gcs_client()
  File "/home/.../python3.10/site-packages/ray/_private/node.py", line 730, in _init_gcs_client
    raise RuntimeError(
RuntimeError: Failed to connect to GCS.

and from the host:

  File "/.../my_ray_object.py", line XX, in __init__
  File "/root/.pyenv/versions/3.10.11/lib/python3.10/bdb.py", line 90, in trace_dispatch
    return self.dispatch_line(frame)
  File "/root/.pyenv/versions/3.10.11/lib/python3.10/bdb.py", line 114, in dispatch_line
    self.user_line(frame)
  File "/root/.pyenv/versions/3.10.11/lib/python3.10/pdb.py", line 253, in user_line
    self.interaction(frame, None)
  File "/root/.pyenv/versions/3.10.11/lib/python3.10/pdb.py", line 348, in interaction
    self._cmdloop()
  File "/root/.pyenv/versions/3.10.11/lib/python3.10/pdb.py", line 313, in _cmdloop
    self.cmdloop()
  File "/root/.pyenv/versions/3.10.11/lib/python3.10/cmd.py", line 132, in cmdloop
    line = self.stdin.readline()
  File "/root/.pyenv/versions/3.10.11/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfe in position 63: invalid start byte
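
For reference, the final exception in the host traceback can be reproduced in isolation. The sketch below is not Ray code: it assumes a hypothetical payload of 63 ASCII bytes followed by 0xfe, and feeds it to the standard library's incremental UTF-8 decoder, the same mechanism that appears in the codecs.py frame above. It only illustrates that this message would be consistent with non-text bytes arriving on the debugger's input stream; it does not show where those bytes come from.

```python
import codecs

# Hypothetical payload (an assumption for illustration): 63 ASCII bytes,
# then 0xfe, which can never start a valid UTF-8 sequence.
payload = b"a" * 63 + b"\xfe" + b"\n"

# Incremental decoder, as used by text streams that wrap a binary source.
decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")

error = None
try:
    decoder.decode(payload, final=False)
except UnicodeDecodeError as exc:
    error = exc
    print(exc)
```

The position in the message is simply the offset of the first undecodable byte within the chunk handed to the decoder.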

Exploring further, it seems that the error is raised here:

            values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
            for i, value in enumerate(values):
                if isinstance(value, RayError):
                    if isinstance(value, ray.exceptions.ObjectLostError):
                        worker.core_worker.dump_object_store_memory_usage()
                    if isinstance(value, RayTaskError):
                        raise value.as_instanceof_cause()  # HERE
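
A note on why the traceback points at this line: as far as we understand, as_instanceof_cause() returns the remote failure as an exception that is also an instance of the original error type, so this line would be where the worker-side UnicodeDecodeError surfaces in the caller rather than where it originates. Below is a minimal sketch of that pattern. TaskError, Derived, and simulate_remote_failure are hypothetical names for illustration, not Ray's actual implementation.

```python
class TaskError(Exception):
    """Hypothetical wrapper carrying an exception raised in a remote task."""

    def __init__(self, cause: BaseException):
        super().__init__(f"remote task failed: {cause!r}")
        self.cause = cause

    def as_instanceof_cause(self) -> BaseException:
        cause = self.cause
        cause_cls = type(cause)

        # Derive from both the wrapper and the cause's class so that
        # `except UnicodeDecodeError:` on the caller side still matches.
        class Derived(TaskError, cause_cls):
            def __init__(self):
                self.cause = cause
                self.args = (cause,)

            def __str__(self):
                return f"remote task failed: {cause!r}"

        Derived.__name__ = f"TaskError({cause_cls.__name__})"
        return Derived()


def simulate_remote_failure() -> TaskError:
    """Stand-in for a remote task whose failure is captured and shipped back."""
    try:
        b"\xfe".decode("utf-8")
    except UnicodeDecodeError as exc:
        return TaskError(exc)
    raise AssertionError("unreachable: decoding 0xfe always fails")


wrapped = simulate_remote_failure()
try:
    # The caller re-raises here, far from where the decode actually failed.
    raise wrapped.as_instanceof_cause()
except UnicodeDecodeError as exc:
    surfaced = exc
    print(type(exc).__name__, "caught as UnicodeDecodeError")
```

Under this reading, the quoted lines in worker.py report the failure but the root cause sits in the pdb frames of the host traceback.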

This is all the information we could collect, and we do not know what to try next :-/

Versions / Dependencies

ray 2.9.3 Ray provides a simple, universal API for building distributed applications.
├── aiosignal
│   └── frozenlist >=1.1.0
├── click >=7.0
│   └── colorama
├── filelock
├── frozenlist
├── jsonschema
│   ├── attrs >=22.2.0
│   ├── jsonschema-specifications >=2023.03.6
│   │   └── referencing >=0.31.0
│   │       ├── attrs >=22.2.0 (circular dependency aborted here)
│   │       └── rpds-py >=0.7.0
│   ├── referencing >=0.28.4 (circular dependency aborted here)
│   └── rpds-py >=0.7.1 (circular dependency aborted here)
├── msgpack >=1.0.0,<2.0.0
├── packaging
├── protobuf >=3.15.3,<3.19.5 || >3.19.5
├── pyyaml
└── requests
    ├── certifi >=2017.4.17
    ├── charset-normalizer >=2,<4
    ├── idna >=2.5,<4
    └── urllib3 >=1.21.1,<3

Reproduction script

See above.

Issue Severity

Medium: It is a significant difficulty but I can work around it.

anyscalesam commented 1 month ago

cc @chris-ray-zhang

@fstrub95-cohere have you checked out the new Ray Debugger? https://docs.ray.io/en/latest/ray-observability/user-guides/debug-apps/ray-debugging.html