Closed: zcarrico-fn closed this issue 1 year ago.
Hi @zcarrico-fn, would you mind sharing more details (e.g. your environment)? I cannot reproduce the error with the following commands. By the way, you should use port 6379 (the GCS port) when you use ray debug --address=$IP:$PORT.
kind create cluster
# Install KubeRay operator and RayCluster
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm install kuberay-operator kuberay/kuberay-operator --version 0.4.0
helm install raycluster kuberay/ray-cluster --version 0.4.0
# Terminal 1
# Note: Update the name of Kubernetes service => ray.init("ray://raycluster-kuberay-head-svc:10001")
kubectl exec -it $HEAD_POD -- bash
python3 simple_task.py
# (base) ray@raycluster-kuberay-head-shpnl:~$ python3 simple_task.py
# (f pid=946) RemotePdb session open at localhost:34183, use 'ray debug' to connect...
# (f pid=113, ip=10.244.0.7) RemotePdb session open at localhost:41001, use 'ray debug' to connect...
# Terminal 2
kubectl exec -it $HEAD_POD -- bash
ray debug --address=raycluster-kuberay-head-svc:6379
# 2023-02-21 16:10:34,207 INFO scripts.py:209 -- Connecting to Ray instance at raycluster-kuberay-head-svc:6379.
# 2023-02-21 16:10:34,208 INFO worker.py:1333 -- Connecting to existing Ray cluster at address: raycluster-kuberay-head-svc:6379...
# 2023-02-21 16:10:34,228 INFO worker.py:1515 -- Connected to Ray cluster. View the dashboard at http://10.244.0.6:8265
# Active breakpoints:
# index | timestamp | Ray task | filename:lineno
# 0 | 2023-02-22 00:06:27 | ray::f() | simple_task.py:6
# 1 | 2023-02-22 00:06:26 | ray::f() | simple_task.py:6
# Enter breakpoint index or press enter to refresh:
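For reference, a minimal simple_task.py consistent with the output above might look like the following sketch (based on the Ray debugger example; the actual script is not shown in this thread, so details such as the breakpoint's line number may differ):

import ray

# Connect through Ray Client to the head service (adjust the service name and
# namespace to match your cluster, as noted above).
ray.init("ray://raycluster-kuberay-head-svc:10001")

@ray.remote
def f(x):
    breakpoint()  # opens a RemotePdb session; attach with `ray debug`
    return x * x

# Run two tasks so breakpoints open on more than one node, as in the output above.
print(ray.get([f.remote(i) for i in range(2)]))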
Hi @kevin85421, thank you for the example! We're using GKE, with the KubeRay operator IaC handled by Pulumi. Below are snippets from Pulumi and the CRD YAML we're using, with certain values x'd out for privacy.
I attempted connecting to port 6379 as you suggested, but it results in the error below. Is there any other information I can provide about our environment, or do you have any other ideas we could test?
ray debug --address=ray-examples-head-svc:6379
2023-02-23 21:26:25,308 INFO scripts.py:206 -- Connecting to Ray instance at ray-examples-head-svc:6379.
2023-02-23 21:26:25,309 INFO worker.py:1352 -- Connecting to existing Ray cluster at address: ray-examples-head-svc:6379...
[2023-02-23 21:26:25,323 W 2304175 2304175] global_state_accessor.cc:390: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?
...
Traceback (most recent call last):
File "/home/zcarrico/.virtualenvs/freenome-ray-examples-QntpqrFU-py3.10/bin/ray", line 8, in <module>
sys.exit(main())
File "/home/zcarrico/.virtualenvs/freenome-ray-examples-QntpqrFU-py3.10/lib/python3.10/site-packages/ray/scripts/scripts.py", line 2386, in main
return cli()
File "/home/zcarrico/.virtualenvs/freenome-ray-examples-QntpqrFU-py3.10/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/home/zcarrico/.virtualenvs/freenome-ray-examples-QntpqrFU-py3.10/lib/python3.10/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/home/zcarrico/.virtualenvs/freenome-ray-examples-QntpqrFU-py3.10/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/zcarrico/.virtualenvs/freenome-ray-examples-QntpqrFU-py3.10/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/zcarrico/.virtualenvs/freenome-ray-examples-QntpqrFU-py3.10/lib/python3.10/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/home/zcarrico/.virtualenvs/freenome-ray-examples-QntpqrFU-py3.10/lib/python3.10/site-packages/ray/scripts/scripts.py", line 207, in debug
ray.init(address=address, log_to_driver=False)
File "/home/zcarrico/.virtualenvs/freenome-ray-examples-QntpqrFU-py3.10/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/home/zcarrico/.virtualenvs/freenome-ray-examples-QntpqrFU-py3.10/lib/python3.10/site-packages/ray/_private/worker.py", line 1494, in init
_global_node = ray._private.node.Node(
File "/home/zcarrico/.virtualenvs/freenome-ray-examples-QntpqrFU-py3.10/lib/python3.10/site-packages/ray/_private/node.py", line 226, in __init__
node_info = ray._private.services.get_node_to_connect_for_driver(
File "/home/zcarrico/.virtualenvs/freenome-ray-examples-QntpqrFU-py3.10/lib/python3.10/site-packages/ray/_private/services.py", line 442, in get_node_to_connect_for_driver
return global_state.get_node_to_connect_for_driver(node_ip_address)
File "/home/zcarrico/.virtualenvs/freenome-ray-examples-QntpqrFU-py3.10/lib/python3.10/site-packages/ray/_private/state.py", line 730, in get_node_to_connect_for_driver
node_info_str = self.global_state_accessor.get_node_to_connect_for_driver(
File "python/ray/includes/global_state_accessor.pxi", line 155, in ray._raylet.GlobalStateAccessor.get_node_to_connect_for_driver
RuntimeError: b"This node has an IP address of 10.192.27.25, and Ray expects this IP address to be either the GCS address or one of the Raylet addresses. Connected to GCS at ray-examples-head-svc and found raylets at 10.193.134.119, 10.192.97.6, 10.192.97.5 but none of these match this node's IP 10.192.27.25. Are any of these actually a different IP address for the same node?You might need to provide --node-ip-address to specify the IP address that the head should use when sending to this node."
Pulumi configuration:
computational:kuberay-operator:
  version: 0.4.0
# Imports implied by the snippet; `config` is a pulumi.Config instance defined elsewhere.
from pulumi_kubernetes.helm.v3 import Release, ReleaseArgs, RepositoryOptsArgs

if kuberay_operator_config := config.get_object("kuberay-operator"):
    kuberay_operator = Release(
        "kuberay-operator",
        ReleaseArgs(
            chart="kuberay-operator",
            repository_opts=RepositoryOptsArgs(
                repo="https://ray-project.github.io/kuberay-helm/"
            ),
            name="kuberay-operator",
            namespace="ray-system",
            version=kuberay_operator_config["version"],
            create_namespace=True,
        ),
        create_namespace=True,
    )
Cluster CRD YAML (deployed using kubectl apply -f $(CRD_YAML) -n $(DEPLOYMENT_NAMESPACE)):
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: ray-examples
  namespace: default
spec:
  rayVersion: "2.2.0"
  headGroupSpec:
    serviceType: ClusterIP
    replicas: 1
    rayStartParams:
      ray-debugger-external: "true"
      block: "true"
      metrics-export-port: "8080"
      node-ip-address: $(__POD_IP__)
      num-cpus: "0"
    template:
      spec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
                - matchExpressions:
                    - key: xxxxxxxxxx
                      operator: In
                      values:
                        - xxxxxx
                    - key: xxxxxxxxxx
                      operator: In
                      values:
                        - xxxxxxxx
                    - key: cloud.google.com/gke-preemptible
                      operator: DoesNotExist
        containers:
          - name: ray-head
            image: xxxxxxxxxxxxxxx
            imagePullPolicy: IfNotPresent
            args:
              - source
              - /fn/lib/venv/bin/activate
            env:
              - name: __POD_IP__
                valueFrom:
                  fieldRef:
                    fieldPath: status.podIP
              - name: xxxxxxxxxxxxx
                value: xxxxxxxxxxx
            volumeMounts:
              - mountPath: xxxxxxxxxxx
                name: xxxxxxxxxxxx
            ports:
              - containerPort: 6379
                name: redis
                protocol: TCP
              - containerPort: 8080
                name: metrics
                protocol: TCP
              - containerPort: 10001
                name: server
                protocol: TCP
            resources:
              limits:
                memory: 16Gi
              requests:
                cpu: "4"
            lifecycle:
              preStop:
                exec:
                  command:
                    - /bin/sh
                    - -c
                    - ray stop
        serviceAccountName: xxxxxxxxx
        volumes:
          - name: xxxxxxxxxxxxx
            secret:
              defaultMode: xxxxxx
              secretName: xxxxxxxxxxxx
        tolerations:
          - effect: NoSchedule
            key: xxxxxxxxxx
            operator: Equal
            value: xxxxxxx
  workerGroupSpecs:
    - groupName: main
      replicas: 2
      minReplicas: 2
      maxReplicas: 2
      rayStartParams:
        ray-debugger-external: "true"
        block: "true"
        metrics-export-port: "8080"
        node-ip-address: $(__POD_IP__)
      template:
        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                  - matchExpressions:
                      - key: xxxxxxxxxxx
                        operator: In
                        values:
                          - xxxxxxxx
                      - key: xxxxxxxxx
                        operator: In
                        values:
                          - xxxxxxxx
                      - key: cloud.google.com/gke-preemptible
                        operator: Exists
          initContainers:
            - name: wait-for-head-service
              image: public.ecr.aws/docker/library/busybox:stable
              command:
                - sh
                - -c
                - |
                  until nc -z $RAY_IP.$(__POD_NAMESPACE__).svc.cluster.local 10001; do
                    sleep 0.1
                  done
              env:
                - name: __POD_NAMESPACE__
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.namespace
          containers:
            - name: ray-worker
              image: xxxxxxxxxxxx
              imagePullPolicy: IfNotPresent
              args:
                - source
                - /fn/lib/venv/bin/activate
              env:
                - name: RAY_DISABLE_DOCKER_CPU_WARNING
                  value: "1"
                - name: TYPE
                  value: worker
                - name: __POD_IP__
                  valueFrom:
                    fieldRef:
                      fieldPath: status.podIP
                - name: xxxxxxxx
                  value: xxxxxxxx
              volumeMounts:
                - mountPath: xxxxxxxxxxxx
                  name: xxxxxxxxx
              ports:
                - containerPort: 8080
                  name: metrics
                  protocol: TCP
              resources:
                limits:
                  memory: 16Gi
                requests:
                  cpu: "4"
              lifecycle:
                preStop:
                  exec:
                    command:
                      - /bin/sh
                      - -c
                      - ray stop
          serviceAccountName: xxxxxxxx
          volumes:
            - name: xxxxxxxxxxx
              secret:
                defaultMode: xxxxx
                secretName: xxxxxxxxx
          tolerations:
            - effect: NoSchedule
              key: xxxxxxxxx
              operator: Equal
              value: xxxxxx
@kevin85421, if we exec into the head node and run ray debug from there, debugging works. Do you know if this is the intended behavior, or should it be possible to ray debug from outside the head/worker nodes? Are all Ray CLI commands intended to be run from the head node or a worker node, or is ray debug unique in this?
kind create cluster
# Install KubeRay operator and RayCluster
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm install kuberay-operator kuberay/kuberay-operator --version 0.4.0
helm install raycluster kuberay/ray-cluster --version 0.4.0
# Terminal 1: Create a new Pod "raypod"
# Note: Update the name of Kubernetes service => ray.init("ray://raycluster-kuberay-head-svc:10001")
kubectl run raypod --image=rayproject/ray:2.0.0 -i --tty
python3 simple_task.py
# (base) ray@raypod:~$ python3 simple_task.py
# (f pid=2582) RemotePdb session open at localhost:43059, use 'ray debug' to connect...
# (f pid=174, ip=10.244.0.6) RemotePdb session open at localhost:39447, use 'ray debug' to connect...
# Terminal 2
kubectl exec -it raypod -- bash
# Check the healthiness of Ray GCS. If the exit code is 0, the cluster is healthy.
# (base) ray@ray:~$ ray health-check --address raycluster-kuberay-head-svc:6379
# (base) ray@ray:~$ echo $?
# 0
ray debug --address=raycluster-kuberay-head-svc:6379
# 2023-03-06 10:49:21,296 INFO scripts.py:209 -- Connecting to Ray instance at raycluster-kuberay-head-svc:6379.
# 2023-03-06 10:49:21,296 INFO worker.py:1333 -- Connecting to existing Ray cluster at address: raycluster-kuberay-head-svc:6379...
# .
# .
# .
# [2023-03-06 10:49:21,315 W 932 932] global_state_accessor.cc:390: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?
# .
# .
# .
# File "python/ray/includes/global_state_accessor.pxi", line 155, in ray._raylet.GlobalStateAccessor.get_node_to_connect_for_driver
# RuntimeError: b"This node has an IP address of 10.244.0.12, and Ray expects this IP address to be either the GCS address or one
# of the Raylet addresses. Connected to GCS at raycluster-kuberay-head-svc and found raylets at 10.244.0.7, 10.244.0.6
# but none of these match this node's IP 10.244.0.12. Are any of these actually a different IP address for the same node?
# You might need to provide --node-ip-address to specify the IP address that the head should use when sending to this node."
I am not an expert in ray debug, but I will answer the following questions based on the above experiment.

Do you know if this is the intended behavior or should it be possible to ray debug from outside the head/worker nodes?

I would say no, based on the error message in the experiment, but cc ray debug experts @pcmoritz @rkooo567 to confirm.

Are all Ray CLI commands intended to be run from the head node or a worker node or is ray debug unique in this?

I believe ray debug is a special case. Many Ray CLI commands, e.g. ray job and ray health-check, can run from a node that is not registered with the GCS.
Thank you @kevin85421! By adding dashboard-host: "0.0.0.0" to rayStartParams in the CRD configuration file, @jeevb was able to get many of the Ray CLI commands to work from JupyterHub nodes in the same Kubernetes namespace as our Ray cluster.
- Ray CLI commands that so far only work from the head node: debug and logs (possibly related to this open issue).
- Ray CLI commands that we've tested and that work from outside the head node: list, memory, and status (there are probably many more that work from this node).
I will update this comment if I find other CLI commands that only work from the head node.
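For anyone hitting the same issue, the change is roughly the following in the RayCluster spec (a sketch; only the dashboard-host line is new relative to the config above):

headGroupSpec:
  rayStartParams:
    ray-debugger-external: "true"
    dashboard-host: "0.0.0.0"  # bind the dashboard/API server so CLI commands work from other pods in the namespace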
Thank you! This is very helpful! cc @gvspraveen
Search before asking
KubeRay Component
ray-operator
What happened + What you expected to happen
Observation
ray debug --address=<ip:port> results in a UnicodeDecodeError on a Kubernetes cluster.
Expectation
ray debug --address=<ip:port> would enable remote debugging the way it is shown in the Ray documentation.
Useful Information
Reproduction script
simple_task.py
terminal 1
terminal 2
After running ray debug --address=10.192.6.5:38031, terminal 1 shows this.
Anything else
No response
Are you willing to submit a PR?