ray-project / kuberay

A toolkit to run Ray applications on Kubernetes
Apache License 2.0

[Bug] `ray debug --address=<ip:port>` results in UnicodeDecodeError #913

Closed · zcarrico-fn closed this issue 1 year ago

zcarrico-fn commented 1 year ago

Search before asking

KubeRay Component

ray-operator

What happened + What you expected to happen

Observation

Expectation

Useful Information

Reproduction script

simple_task.py

import ray
ray.init("ray://ray-examples-head-svc:10001")

@ray.remote
def f(x):
    breakpoint()
    return x * x

futures = [f.remote(i) for i in range(2)]
print(ray.get(futures))

terminal 1

~ python ray_examples/simple_task.py
(f pid=4037, ip=10.192.6.5) RemotePdb session open at 10.192.6.5:44835, use 'ray debug' to connect...
(f pid=4036, ip=10.192.6.5) RemotePdb session open at 10.192.6.5:38031, use 'ray debug' to connect...

terminal 2

~ ray debug --address=10.192.6.5:38031
2023-02-02 14:58:05,783 INFO scripts.py:206 -- Connecting to Ray instance at 10.192.6.5:38031.
2023-02-02 14:58:05,783 INFO worker.py:1352 -- Connecting to existing Ray cluster at address: 10.192.6.5:38031...

After running `ray debug --address=10.192.6.5:38031`, terminal 1 shows this:

~ python ray_examples/simple_task.py
(f pid=4037, ip=10.192.6.5) RemotePdb session open at 10.192.6.5:44835, use 'ray debug' to connect...
(f pid=4036, ip=10.192.6.5) RemotePdb session open at 10.192.6.5:38031, use 'ray debug' to connect...
Traceback (most recent call last):
  File "/home/zcarrico/ray-examples/ray_examples/simple_task.py", line 13, in <module>
    print(ray.get(futures))
  File "/home/zcarrico/.virtualenvs/freenome-ray-examples-QntpqrFU-py3.10/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    return getattr(ray, func.__name__)(*args, **kwargs)
  File "/home/zcarrico/.virtualenvs/freenome-ray-examples-QntpqrFU-py3.10/lib/python3.10/site-packages/ray/util/client/api.py", line 42, in get
    return self.worker.get(vals, timeout=timeout)
  File "/home/zcarrico/.virtualenvs/freenome-ray-examples-QntpqrFU-py3.10/lib/python3.10/site-packages/ray/util/client/worker.py", line 434, in get
    res = self._get(to_get, op_timeout)
  File "/home/zcarrico/.virtualenvs/freenome-ray-examples-QntpqrFU-py3.10/lib/python3.10/site-packages/ray/util/client/worker.py", line 462, in _get
    raise err
types.RayTaskError(UnicodeDecodeError): ray::f() (pid=4036, ip=10.192.6.5)
  File "/home/zcarrico/ray-examples/ray_examples/simple_task.py", line 9, in f
  File "/home/zcarrico/ray-examples/ray_examples/simple_task.py", line 9, in f
  File "/fn/lib/python3.10/bdb.py", line 90, in trace_dispatch
    return self.dispatch_line(frame)
  File "/fn/lib/python3.10/bdb.py", line 114, in dispatch_line
    self.user_line(frame)
  File "/fn/lib/python3.10/pdb.py", line 262, in user_line
    self.interaction(frame, None)
  File "/fn/lib/python3.10/pdb.py", line 357, in interaction
    self._cmdloop()
  File "/fn/lib/python3.10/pdb.py", line 322, in _cmdloop
    self.cmdloop()
  File "/fn/lib/python3.10/cmd.py", line 132, in cmdloop
    line = self.stdin.readline()
  File "/fn/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 49: invalid start byte
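For context on the final frames: pdb reads its commands through a UTF-8 incremental decoder, and 0xff can never start a valid UTF-8 sequence, so any such byte arriving on the debugger's stdin raises exactly this error. A minimal, standalone sketch of the decode failure itself (not of how the stray byte reached pdb's stdin in this setup):

```python
# 0xff is never a valid UTF-8 start byte, so decoding fails just like in
# the traceback above ("invalid start byte").
raw = b"where\n\xff"  # a pdb-style command followed by a stray 0xff byte

try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc.reason)           # invalid start byte
    print(hex(raw[exc.start]))  # 0xff
```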

Anything else

No response

Are you willing to submit a PR?

kevin85421 commented 1 year ago

Hi @zcarrico-fn, would you mind sharing more details (e.g. your environment)? I cannot reproduce the error with the following commands. By the way, you should use port 6379 (GCS) when you use ray debug --address=$IP:$PORT.

kind create cluster

# Install KubeRay operator and RayCluster
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm install kuberay-operator kuberay/kuberay-operator --version 0.4.0
helm install raycluster kuberay/ray-cluster --version 0.4.0

# Terminal 1
# Note: Update the name of Kubernetes service => ray.init("ray://raycluster-kuberay-head-svc:10001")
kubectl exec -it $HEAD_POD -- bash
python3 simple_task.py

# (base) ray@raycluster-kuberay-head-shpnl:~$ python3 simple_task.py
# (f pid=946) RemotePdb session open at localhost:34183, use 'ray debug' to connect...
# (f pid=113, ip=10.244.0.7) RemotePdb session open at localhost:41001, use 'ray debug' to connect...

# Terminal 2
kubectl exec -it $HEAD_POD -- bash
ray debug --address=raycluster-kuberay-head-svc:6379
# 2023-02-21 16:10:34,207 INFO scripts.py:209 -- Connecting to Ray instance at raycluster-kuberay-head-svc:6379.
# 2023-02-21 16:10:34,208 INFO worker.py:1333 -- Connecting to existing Ray cluster at address: raycluster-kuberay-head-svc:6379...
# 2023-02-21 16:10:34,228 INFO worker.py:1515 -- Connected to Ray cluster. View the dashboard at http://10.244.0.6:8265
# Active breakpoints:
# index | timestamp           | Ray task | filename:lineno
# 0     | 2023-02-22 00:06:27 | ray::f() | simple_task.py:6
# 1     | 2023-02-22 00:06:26 | ray::f() | simple_task.py:6
# Enter breakpoint index or press enter to refresh:
zcarrico-fn commented 1 year ago

Hi @kevin85421 , thank you for the example! We're using GKE, with the KubeRay operator's infrastructure-as-code (IaC) managed by Pulumi. Below are snippets from Pulumi and the CRD yaml we're using, with certain values x'd out for privacy.

I attempted to connect to port 6379 as you suggested, but it results in the error below. Is there any other information I can provide about our environment, or do you have any ideas for testing things out?

ray debug --address=ray-examples-head-svc:6379
2023-02-23 21:26:25,308 INFO scripts.py:206 -- Connecting to Ray instance at ray-examples-head-svc:6379.
2023-02-23 21:26:25,309 INFO worker.py:1352 -- Connecting to existing Ray cluster at address: ray-examples-head-svc:6379...
[2023-02-23 21:26:25,323 W 2304175 2304175] global_state_accessor.cc:390: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?
...
Traceback (most recent call last):
  File "/home/zcarrico/.virtualenvs/freenome-ray-examples-QntpqrFU-py3.10/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/home/zcarrico/.virtualenvs/freenome-ray-examples-QntpqrFU-py3.10/lib/python3.10/site-packages/ray/scripts/scripts.py", line 2386, in main
    return cli()
  File "/home/zcarrico/.virtualenvs/freenome-ray-examples-QntpqrFU-py3.10/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/zcarrico/.virtualenvs/freenome-ray-examples-QntpqrFU-py3.10/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/zcarrico/.virtualenvs/freenome-ray-examples-QntpqrFU-py3.10/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/zcarrico/.virtualenvs/freenome-ray-examples-QntpqrFU-py3.10/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/zcarrico/.virtualenvs/freenome-ray-examples-QntpqrFU-py3.10/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/zcarrico/.virtualenvs/freenome-ray-examples-QntpqrFU-py3.10/lib/python3.10/site-packages/ray/scripts/scripts.py", line 207, in debug
    ray.init(address=address, log_to_driver=False)
  File "/home/zcarrico/.virtualenvs/freenome-ray-examples-QntpqrFU-py3.10/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/zcarrico/.virtualenvs/freenome-ray-examples-QntpqrFU-py3.10/lib/python3.10/site-packages/ray/_private/worker.py", line 1494, in init
    _global_node = ray._private.node.Node(
  File "/home/zcarrico/.virtualenvs/freenome-ray-examples-QntpqrFU-py3.10/lib/python3.10/site-packages/ray/_private/node.py", line 226, in __init__
    node_info = ray._private.services.get_node_to_connect_for_driver(
  File "/home/zcarrico/.virtualenvs/freenome-ray-examples-QntpqrFU-py3.10/lib/python3.10/site-packages/ray/_private/services.py", line 442, in get_node_to_connect_for_driver
    return global_state.get_node_to_connect_for_driver(node_ip_address)
  File "/home/zcarrico/.virtualenvs/freenome-ray-examples-QntpqrFU-py3.10/lib/python3.10/site-packages/ray/_private/state.py", line 730, in get_node_to_connect_for_driver
    node_info_str = self.global_state_accessor.get_node_to_connect_for_driver(
  File "python/ray/includes/global_state_accessor.pxi", line 155, in ray._raylet.GlobalStateAccessor.get_node_to_connect_for_driver
RuntimeError: b"This node has an IP address of 10.192.27.25, and Ray expects this IP address to be either the GCS address or one of the Raylet addresses. Connected to GCS at ray-examples-head-svc and found raylets at 10.193.134.119, 10.192.97.6, 10.192.97.5 but none of these match this node's IP 10.192.27.25. Are any of these actually a different IP address for the same node?You might need to provide --node-ip-address to specify the IP address that the head should use when sending to this node."
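The failure mode in this traceback can be sketched in isolation. The function below is a simplified stand-in (an assumption for illustration, not Ray's actual implementation) for the check in `get_node_to_connect_for_driver`: the driver's own IP must be one of the raylet IPs registered with GCS, which is why running `ray debug` from a machine outside the cluster fails.

```python
# Simplified sketch (not Ray's real code) of the node-matching check that
# raises the RuntimeError in the traceback above.
def node_for_driver(driver_ip: str, raylet_ips: list[str]) -> str:
    """Return the raylet entry for this driver, or fail like Ray does."""
    if driver_ip not in raylet_ips:
        raise RuntimeError(
            f"This node has an IP address of {driver_ip}, "
            f"and found raylets at {', '.join(raylet_ips)} "
            "but none of these match this node's IP."
        )
    return driver_ip

# Running on a cluster node: the driver's IP is a known raylet.
print(node_for_driver("10.192.97.6", ["10.193.134.119", "10.192.97.6"]))

# Running outside the cluster (as in the traceback): raises RuntimeError.
```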

Pulumi configuration:

  computational:kuberay-operator:
    version: 0.4.0
from pulumi_kubernetes.helm.v3 import Release, ReleaseArgs, RepositoryOptsArgs

if kuberay_operator_config := config.get_object("kuberay-operator"):
    kuberay_operator = Release(
        "kuberay-operator",
        ReleaseArgs(
            chart="kuberay-operator",
            repository_opts=RepositoryOptsArgs(
                repo="https://ray-project.github.io/kuberay-helm/"
            ),
            name="kuberay-operator",
            namespace="ray-system",
            version=kuberay_operator_config["version"],
            create_namespace=True,
        ),
    )

cluster CRD yaml (deployed using kubectl apply -f $(CRD_YAML) -n $(DEPLOYMENT_NAMESPACE)):

apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
    name: ray-examples
    namespace: default
spec:
    rayVersion: "2.2.0"
    headGroupSpec:
        serviceType: ClusterIP
        replicas: 1
        rayStartParams:
            ray-debugger-external: "true"
            block: "true"
            metrics-export-port: "8080"
            node-ip-address: $(__POD_IP__)
            num-cpus: "0"
        template:
            spec:
                affinity:
                    nodeAffinity:
                        requiredDuringSchedulingIgnoredDuringExecution:
                            nodeSelectorTerms:
                                - matchExpressions:
                                      - key: xxxxxxxxxx
                                        operator: In
                                        values:
                                            - xxxxxx
                                      - key: xxxxxxxxxx
                                        operator: In
                                        values:
                                            - xxxxxxxx
                                      - key: cloud.google.com/gke-preemptible
                                        operator: DoesNotExist
                containers:
                    - name: ray-head
                      image: xxxxxxxxxxxxxxx
                      imagePullPolicy: IfNotPresent
                      args:
                          - source
                          - /fn/lib/venv/bin/activate
                      env:
                          - name: __POD_IP__
                            valueFrom:
                                fieldRef:
                                    fieldPath: status.podIP
                          - name: xxxxxxxxxxxxx
                            value: xxxxxxxxxxx
                      volumeMounts:
                          - mountPath: xxxxxxxxxxx
                            name: xxxxxxxxxxxx
                      ports:
                          - containerPort: 6379
                            name: redis
                            protocol: TCP
                          - containerPort: 8080
                            name: metrics
                            protocol: TCP
                          - containerPort: 10001
                            name: server
                            protocol: TCP
                      resources:
                          limits:
                              memory: 16Gi
                          requests:
                              cpu: "4"
                      lifecycle:
                          preStop:
                              exec:
                                  command:
                                      - /bin/sh
                                      - -c
                                      - ray stop
                serviceAccountName: xxxxxxxxx
                volumes:
                    - name: xxxxxxxxxxxxx
                      secret:
                          defaultMode: xxxxxx
                          secretName: xxxxxxxxxxxx
                tolerations:
                    - effect: NoSchedule
                      key: xxxxxxxxxx
                      operator: Equal
                      value: xxxxxxx
    workerGroupSpecs:
        - groupName: main
          replicas: 2
          minReplicas: 2
          maxReplicas: 2
          rayStartParams:
              ray-debugger-external: "true"
              block: "true"
              metrics-export-port: "8080"
              node-ip-address: $(__POD_IP__)
          template:
              spec:
                  affinity:
                      nodeAffinity:
                          requiredDuringSchedulingIgnoredDuringExecution:
                              nodeSelectorTerms:
                                  - matchExpressions:
                                        - key: xxxxxxxxxxx
                                          operator: In
                                          values:
                                              - xxxxxxxx
                                        - key: xxxxxxxxx
                                          operator: In
                                          values:
                                              - xxxxxxxx
                                        - key: cloud.google.com/gke-preemptible
                                          operator: Exists
                  initContainers:
                      - name: wait-for-head-service
                        image: public.ecr.aws/docker/library/busybox:stable
                        command:
                            - sh
                            - -c
                            - |
                                until nc -z $RAY_IP.$(__POD_NAMESPACE__).svc.cluster.local 10001; do
                                  sleep 0.1
                                done
                        env:
                            - name: __POD_NAMESPACE__
                              valueFrom:
                                  fieldRef:
                                      fieldPath: metadata.namespace
                  containers:
                      - name: ray-worker
                        image: xxxxxxxxxxxx
                        imagePullPolicy: IfNotPresent
                        args:
                            - source
                            - /fn/lib/venv/bin/activate
                        env:
                            - name: RAY_DISABLE_DOCKER_CPU_WARNING
                              value: "1"
                            - name: TYPE
                              value: worker
                            - name: __POD_IP__
                              valueFrom:
                                  fieldRef:
                                      fieldPath: status.podIP
                            - name: xxxxxxxx
                              value: xxxxxxxx
                        volumeMounts:
                            - mountPath: xxxxxxxxxxxx 
                              name: xxxxxxxxx
                        ports:
                            - containerPort: 8080
                              name: metrics
                              protocol: TCP
                        resources:
                            limits:
                                memory: 16Gi
                            requests:
                                cpu: "4"
                        lifecycle:
                            preStop:
                                exec:
                                    command:
                                        - /bin/sh
                                        - -c
                                        - ray stop
                  serviceAccountName: xxxxxxxx
                  volumes:
                      - name: xxxxxxxxxxx
                        secret:
                            defaultMode: xxxxx
                            secretName: xxxxxxxxx
                  tolerations:
                      - effect: NoSchedule
                        key: xxxxxxxxx
                        operator: Equal
                        value: xxxxxx
zcarrico-fn commented 1 year ago

@kevin85421 , If we exec into the head node and run ray debug from there, debugging works. Do you know if this is the intended behavior or should it be possible to ray debug from outside the head/worker nodes? Are all Ray CLI commands intended to be run from the head node or a worker node or is ray debug unique in this?

kevin85421 commented 1 year ago

> @kevin85421 , If we exec into the head node and run ray debug from there, debugging works. Do you know if this is the intended behavior or should it be possible to ray debug from outside the head/worker nodes? Are all Ray CLI commands intended to be run from the head node or a worker node or is ray debug unique in this?

kind create cluster

# Install KubeRay operator and RayCluster
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm install kuberay-operator kuberay/kuberay-operator --version 0.4.0
helm install raycluster kuberay/ray-cluster --version 0.4.0

# Terminal 1: Create a new Pod "raypod"
# Note: Update the name of Kubernetes service => ray.init("ray://raycluster-kuberay-head-svc:10001")
kubectl run raypod --image=rayproject/ray:2.0.0 -i --tty
python3 simple_task.py

# (base) ray@raypod:~$ python3 simple_task.py
# (f pid=2582) RemotePdb session open at localhost:43059, use 'ray debug' to connect...
# (f pid=174, ip=10.244.0.6) RemotePdb session open at localhost:39447, use 'ray debug' to connect...

# Terminal 2
kubectl exec -it raypod -- bash

# Check the healthiness of Ray GCS. If the exit code is 0, the cluster is healthy.
# (base) ray@ray:~$ ray health-check --address raycluster-kuberay-head-svc:6379
# (base) ray@ray:~$ echo $?
# 0

ray debug --address=raycluster-kuberay-head-svc:6379
# 2023-03-06 10:49:21,296 INFO scripts.py:209 -- Connecting to Ray instance at raycluster-kuberay-head-svc:6379.
# 2023-03-06 10:49:21,296 INFO worker.py:1333 -- Connecting to existing Ray cluster at address: raycluster-kuberay-head-svc:6379...
# .
# .
# .
# [2023-03-06 10:49:21,315 W 932 932] global_state_accessor.cc:390: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?
# .
# .
# .
# File "python/ray/includes/global_state_accessor.pxi", line 155, in ray._raylet.GlobalStateAccessor.get_node_to_connect_for_driver
# RuntimeError: b"This node has an IP address of 10.244.0.12, and Ray expects this IP address to be either the GCS address or one
# of the Raylet addresses. Connected to GCS at raycluster-kuberay-head-svc and found raylets at 10.244.0.7, 10.244.0.6
# but none of these match this node's IP 10.244.0.12. Are any of these actually a different IP address for the same node?
# You might need to provide --node-ip-address to specify the IP address that the head should use when sending to this node."

I am not an expert in `ray debug`, but I will answer the following questions based on the experiment above.

> Do you know if this is the intended behavior or should it be possible to ray debug from outside the head/worker nodes?

I would say no, based on the error message in the experiment, but cc'ing `ray debug` experts @pcmoritz @rkooo567 to confirm.

> Are all Ray CLI commands intended to be run from the head node or a worker node or is ray debug unique in this?

I believe that `ray debug` is a special case. Many Ray CLI commands, e.g. `ray job` and `ray health-check`, can run on a node that is not registered with GCS.

zcarrico-fn commented 1 year ago

Thank you @kevin85421 ! By adding dashboard-host: "0.0.0.0" to rayStartParams in the CRD configuration file, @jeevb was able to get many of the Ray CLI commands to work from JupyterHub nodes in the same Kubernetes namespace as our Ray cluster.

I will update this comment if I find other CLI commands that only work from the head node.
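For reference, the change described above amounts to adding one line to `rayStartParams` in the RayCluster CRD (a sketch based on the headGroupSpec posted earlier in this thread; other fields unchanged):

```yaml
headGroupSpec:
    serviceType: ClusterIP
    rayStartParams:
        ray-debugger-external: "true"
        dashboard-host: "0.0.0.0"  # added: bind the dashboard to all interfaces
```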

kevin85421 commented 1 year ago

> Thank you @kevin85421 ! By adding dashboard-host: "0.0.0.0" to rayStartParams in the CRD configuration file, @jeevb was able to get many of the Ray CLI commands to work from JupyterHub nodes in the same Kubernetes namespace as our Ray cluster.
>
> - Ray CLI commands that so far only work from the head node are debug and logs (possibly related to this open issue)
> - Ray CLI commands that we've tested and work from outside the head node are list, memory, and status (there are probably many more that work from outside this node).
>
> I will update this comment if I find other CLI commands that only work from the head node.

Thank you! This is very helpful! cc @gvspraveen