RishabhMalviya opened this issue 1 year ago
Can head-node-gcs.company.internal.domain.com be resolved to the correct head node IP on the worker node?
Networking isn't my strong suit. How can I check that?
Try dig +short head-node-gcs.company.internal.domain.com?
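If dig can't be installed, the Python interpreter that ships with Ray can do the same lookup; a minimal sketch, using the hostname from this thread:
import socket
# Resolve the head node's Ingress hostname; roughly what dig +short would print
print(socket.gethostbyname("head-node-gcs.company.internal.domain.com"))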
Ok so, we can't install dig in the company infra. But I ran the ping
command on the worker node and got the following (it's working):
$ > ping head-node-gcs.company.internal.domain.com
PING head-node-gcs.company.internal.domain.com (10.61.191.26) 56(84) bytes of data.
64 bytes from 10.61.191.26 (10.61.191.26): icmp_seq=1 ttl=62 time=0.488 ms
64 bytes from 10.61.191.26 (10.61.191.26): icmp_seq=2 ttl=62 time=0.229 ms
64 bytes from 10.61.191.26 (10.61.191.26): icmp_seq=3 ttl=62 time=0.243 ms
64 bytes from 10.61.191.26 (10.61.191.26): icmp_seq=4 ttl=62 time=0.235 ms
@RishabhMalviya,
Could you run the following Python code on the worker node
import ray
gcs_client = ray._private.gcs_utils.GcsClient(address="head-node-gcs.company.internal.domain.com:80")
gcs_client.get_all_node_info()
and paste the output?
Sure, this is the output:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.7/site-packages/ray/_private/gcs_utils.py", line 198, in wrapper
return f(self, *args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/ray/_private/gcs_utils.py", line 399, in get_all_node_info
reply = self._node_info_stub.GetAllNodeInfo(req, timeout=timeout)
File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 946, in __call__
return _end_unary_response_blocking(state, call, False, None)
File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses; last error: INTERNAL: Trying to connect an http1.x server"
debug_error_string = "UNKNOWN:Failed to pick subchannel {created_time:"2023-03-30T07:00:15.555560522+09:00", children:[UNKNOWN:failed to connect to all addresses; last error: INTERNAL: Trying to connect an http1.x server {created_time:"2023-03-30T07:00:15.555559305+09:00", grpc_status:14}]}"
@RishabhMalviya actually the port seems wrong: is 80 your GCS port? It should probably be 6379, i.e. head-node-gcs.company.internal.domain.com:6379
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.7/site-packages/ray/_private/gcs_utils.py", line 198, in wrapper
return f(self, *args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/ray/_private/gcs_utils.py", line 399, in get_all_node_info
reply = self._node_info_stub.GetAllNodeInfo(req, timeout=timeout)
File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 946, in __call__
return _end_unary_response_blocking(state, call, False, None)
File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused"
debug_error_string = "UNKNOWN:Failed to pick subchannel {created_time:"2023-03-31T07:57:52.183814056+09:00", children:[UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused {grpc_status:14, created_time:"2023-03-31T07:57:52.183812008+09:00"}]}"
Could you run

import ray
gcs_client = ray._private.gcs_utils.GcsClient(address="head-node-gcs.company.internal.domain.com:6379")
gcs_client.get_all_node_info()

gcs_client = ray._private.gcs_utils.GcsClient(address="IP:6379")
gcs_client.get_all_node_info()

ping head-node-gcs.company.internal.domain.com

ray start --address='head-node-gcs.company.internal.domain.com:6379'

ray start --address='IP:6379'

I'd like to see the output of each.
Hi @RishabhMalviya, @jjyao, were you able to find a solution to this? I'm running into a similar issue trying to run a Ray Serve application on a remote cluster from my local machine. Even submitting jobs to the dashboard URL doesn't seem to work, although I can see the dashboard just fine if I go to the dashboard URL. My workaround right now is to forward the port using kubectl and then submit a job, like the following:
kubectl port-forward service/raycluster-service-name-head-svc 8265:8265
ray job submit --address http://localhost:8265 --working-dir="./" -- serve run --host="0.0.0.0" --working-dir="./" --non-blocking model_file:model
The strange thing is that it was working fine about 1.5 weeks ago, but when I came back to the cluster yesterday I received this error. Deleting the cluster and creating a new one didn't seem to help.
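For reference, the same submission can also be done from Python via the job SDK once the port-forward is running; a sketch that mirrors the CLI invocation above:
from ray.job_submission import JobSubmissionClient

# Same forwarded dashboard address as in the kubectl port-forward workaround above
client = JobSubmissionClient("http://localhost:8265")
submission_id = client.submit_job(
    entrypoint='serve run --host="0.0.0.0" --working-dir="./" --non-blocking model_file:model',
    runtime_env={"working_dir": "./"},
)
print(submission_id)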
@charu-vl what's the error you are receiving?
@jjyao Hey man, my company infra was facing some issues for about a week and a half, so I did not get a chance to look at this, and then it kind of faded into the background. Anyway, I recently found a work-around (@charu-vl):
In Kubernetes, every component in the cluster has access to a Kubernetes DNS (https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/) which lets you reference other components using their Kubernetes name. So, for connecting to my head node, I used the Kubernetes service name instead of using the URL, and it worked:
ray start --address='<head-node-service-name>:6379'
I believe the issue was that K8S Ingress only manages communication over HTTP/HTTPS (makes sense, since it does a bunch of other things specific to HTTP/HTTPS under the hood, like load-balancing, TLS termination, etc.). Ray nodes, on the other hand, talk to each other over gRPC. Since the head node's URL was exposed through a K8S Ingress, Ray inter-node connections through that URL failed.
NOTE: If you have access to the K8S configuration .yaml file for your head node's K8S Service, you can check the name that you should use for connecting to it under metadata -> name.
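A quick sanity check of this workaround from inside a worker pod, before running ray start, is to confirm that the Service name resolves through the cluster DNS and that the GCS port accepts connections; a minimal sketch (<head-node-service-name> is a placeholder for the Service's metadata -> name):
import socket

SERVICE = "<head-node-service-name>"  # placeholder; use your Service's metadata -> name

# Cluster DNS should resolve the bare Service name from pods in the same namespace
print(socket.gethostbyname(SERVICE))

# The head node's GCS should accept a raw TCP connection on 6379
with socket.create_connection((SERVICE, 6379), timeout=5):
    print("GCS port reachable")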
Hi @RishabhMalviya @charu-vl, you can also consider KubeRay as a solution for Ray on Kubernetes. KubeRay also uses ray start --address='<head-node-service-name>:6379' in workers to connect with the head Pod. The only difference is that KubeRay uses the FQDN rather than the bare Kubernetes service name. See here for more details. Thanks!
@kevin85421 Yes, we looked into that. But we are using Kubernetes in a company-managed infrastructure with a large number of restrictions due to compliance and security. Because of this, we have no access to kubectl, and do not have the ability to install Kubernetes Operators.
@RishabhMalviya Got it. Just out of curiosity, which tool can be used to access Kubernetes resources without kubectl? I have also heard that some companies are unable to install CRDs in their Kubernetes clusters.
@kevin85421 So, we don't have any alternative to kubectl per se. The way our company has set up K8S is very opinionated.
They've built a UI that we can access for creating new 'apps'. These 'apps' are essentially a K8S Deployment (pod) + a K8S Service + a K8S Ingress component. The UI also allows us to specify mounted storage locations and hardware resource allocation.
Once the 'app' is created though, we do get access to the underlying K8S configuration .yaml for the three components.
@RishabhMalviya Got it. Thank you for sharing!
The port-forward workaround @charu-vl described above actually worked for me.
Otherwise I experience exactly the same issue @RishabhMalviya described. I tried Ray 2.5.1 and 3.0.0.dev0. I can serve run directly from the head node using kubectl exec --tty, but kubectl port-forward and exposing the dashboard port using an Ingress prevent ray.init from connecting to the cluster somehow.
It's still a valid issue on Ray 2.6.3 (I'm using the kuberay-operator-0.5.0 Helm chart on GKE). Just a basic setup as shown in the quick start guide, with port-forward and the following code (run in a notebook):
import ray
ray.init("127.0.0.1:8266")
fails with:
2023-08-30 13:01:44,006 INFO worker.py:1431 -- Connecting to existing Ray cluster at address: 127.0.0.1:8266...
2023-08-30 13:01:49,650 ERROR utils.py:1395 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
2023-08-30 13:01:49,651 WARNING utils.py:1401 -- Unable to connect to GCS (ray head) at 127.0.0.1:8266. Check that (1) Ray with matching version started successfully at the specified address, (2) this node can reach the specified address, and (3) there is no firewall setting preventing access.
@jjyao - could you provide some update on this?
I'm facing the same issue as @marrrcin and @jednymslowem described. I've used the Getting Started guide to set up a Ray cluster on kind, did the required port-forwarding, and set the RAY_ADDRESS env variable to point to the right URL (I'm able to access the dashboard), and I get the same error:
2023-09-16 01:16:49,657 INFO worker.py:1313 -- Using address 127.0.0.1:8265 set in the environment variable RAY_ADDRESS
2023-09-16 01:16:49,658 INFO worker.py:1431 -- Connecting to existing Ray cluster at address: 127.0.0.1:8265...
2023-09-16 01:16:54,857 ERROR utils.py:1395 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
2023-09-16 01:16:54,857 WARNING utils.py:1401 -- Unable to connect to GCS (ray head) at 127.0.0.1:8265. Check that (1) Ray with matching version started successfully at the specified address, (2) this node can reach the specified address, and (3) there is no firewall setting preventing access.
I would very much appreciate any help; I currently can't figure this out.
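For what it's worth, 8265 is the dashboard / Job API port; a plain ray.init("host:port") expects the GCS port and is really intended for machines that are part of the cluster, while remote access from a laptop usually goes through Ray Client or the job SDK. A hedged sketch of the Ray Client route, assuming the head pod's client server listens on its default port 10001 and that port is also forwarded:
import ray

# Assumes something like `kubectl port-forward service/<head-svc> 10001:10001` is running
# and that the local Ray version matches the cluster's.
ray.init(address="ray://127.0.0.1:10001")
print(ray.cluster_resources())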
I run a RayCluster on k8s and access the Ray client server through an Ingress with SSL:
import grpc
import ray
ray.init(address="ray://xxxx:443", _credentials=grpc.ssl_channel_credentials())
But I run into another problem:
E1012 16:43:07.559710570 2789667 hpack_parser.cc:833] Error parsing 'content-type' metadata: error=invalid value key=content-type
E1012 16:43:07.637759904 2789667 hpack_parser.cc:833] Error parsing 'content-type' metadata: error=invalid value key=content-type
E1012 16:43:07.749863840 2789674 hpack_parser.cc:833] Error parsing 'content-type' metadata: error=invalid value key=content-type
E1012 16:43:07.877639829 2789674 hpack_parser.cc:833] Error parsing 'content-type' metadata: error=invalid value key=content-type
2023-10-12 16:43:12,903 WARNING dataclient.py:403 -- Encountered connection issues in the data channel. Attempting to reconnect.
Log channel is reconnecting. Logs produced while the connection was down can be found on the head node of the cluster in `ray_client_server_[port].out`
raise self._exception
ConnectionError: Failed during this or a previous request. Exception that broke the connection: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.NOT_FOUND
details = "Attempted to reconnect to a session that has already been cleaned up."
debug_error_string = "UNKNOWN:Error received from peer ipv4:10.8.8.50:443 {created_time:"2023-10-12T16:43:59.571742914+08:00", grpc_status:5, grpc_message:"Attempted to reconnect to a session that has al
ready been cleaned up."}"
>
What happened + What you expected to happen
The head node's ports are exposed through K8S Ingress: the dashboard through http://head-node-dashboard.company.internal.domain.com, and the GCS port 6379 through http://head-node-gcs.company.internal.domain.com. When I try to submit jobs to the dashboard URL, everything works fine:
But when I try to connect to the GCS, it fails. There are two ways that this happens: connecting a worker node with ray start, and connecting a driver with ray.init().
a) If I connect without any protocol defined:
b) If I connect with the http:// protocol specified:
c) If I connect with the ray:// protocol specified:
Ray runtime started.
To terminate the Ray runtime, run ray stop
This is the relevant part of the Ingress config of the head node:
Versions / Dependencies
Reproduction script
I don't think this is reproducible since I'm running this in a managed Kubernetes environment. But the Service and Ingress configuration snippets provided above should help set up the basic networking.
Issue Severity
High: It blocks me from completing my task.