ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Core] Cannot Connect to Head Node GCS Through URL #33428

Open RishabhMalviya opened 1 year ago

RishabhMalviya commented 1 year ago

What happened + What you expected to happen

  1. I am working in a managed Kubernetes environment. We have a three-node setup (managed K8S Deployment + Service + Ingress) - one head node and two worker nodes. Using the Service and Ingress configurations, I expose port 8265 of my container through the (internal) URL http://head-node-dashboard.company.internal.domain.com, and port 6379 through http://head-node-gcs.company.internal.domain.com.

When I try to submit jobs to the dashboard URL, everything works fine:

ray job submit --working-dir ./ --address='http://head-node-dashboard.company.internal.domain.com' -- python ./script.py
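
For completeness, the equivalent submission from Python (a sketch using Ray's JobSubmissionClient from Ray 2.x, against the same dashboard URL):

```python
from ray.job_submission import JobSubmissionClient

# Same dashboard URL as the CLI command above.
client = JobSubmissionClient("http://head-node-dashboard.company.internal.domain.com")
job_id = client.submit_job(
    entrypoint="python ./script.py",
    runtime_env={"working_dir": "./"},
)
print(job_id)
```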

But when I try to connect to the GCS through its URL, it fails. Here are the details:

  2. The worker-to-head-node connection should work with the URL specified, but it only works if I give the local IP of the head node:
    
    $ > ray start --address='10.251.222.100:6379'
    Local node IP: 10.251.222.101
    2023-03-18 07:20:41,943 WARNING services.py:1791 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=2.47gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
    [2023-03-18 07:20:41,964 I 115596 115596] global_state_accessor.cc:356: This node has an IP address of 10.251.222.101, while we can not find the matched Raylet address. This maybe come from when you connect the Ray cluster with a different IP address or connect a container.

    Ray runtime started.

    To terminate the Ray runtime, run ray stop

This is the behavior I'm hoping to get from the command `ray start --address='head-node-gcs.company.internal.domain.com:80'`.

3. This is the relevant part of the Service config of the head node:
```yaml
  ports:
  - name: ray-dashboard
    port: 8265
    targetPort: 8265
    protocol: TCP
  - name: ray-gcs
    port: 6379
    targetPort: 6379
    protocol: TCP
  - name: ray-client
    port: 10001
    targetPort: 10001
    protocol: TCP
  - name: ray-serve
    port: 8000
    targetPort: 8000
    protocol: TCP
  type: ClusterIP
```

This is the relevant part of the Ingress config of the head node:

```yaml
spec:
  rules:
  - host: head-node-dashboard.company.internal.domain.com
    http:
      paths:
      - path: /
        backend:
          serviceName: head-node-svc
          servicePort: 8265
  - host: head-node-gcs.company.internal.domain.com
    http:
      paths:
      - path: /
        backend:
          serviceName: head-node-svc
          servicePort: 6379
  - host: head-node-client.company.internal.domain.com
    http:
      paths:
      - path: /
        backend:
          serviceName: head-node-svc
          servicePort: 10001
  - host: head-node-serve.company.internal.domain.com
    http:
      paths:
      - path: /
        backend:
          serviceName: head-node-svc
          servicePort: 8000
```

Versions / Dependencies

$ > ray --version
ray, version 2.3.0

$ > python --version
Python 3.7.4

$ > uname -a
Linux head-node-659568794c-rwmpk 3.10.0-1160.el7.x86_64 #1 SMP Tue Aug 18 14:50:17 EDT 2020 x86_64 GNU/Linux

Reproduction script

I don't think this is reproducible outside our environment, since I'm running it in a managed Kubernetes environment. But the Service and Ingress configuration snippets provided above should help set up the basic networking.

Issue Severity

High: It blocks me from completing my task.

jjyao commented 1 year ago

Can head-node-gcs.company.internal.domain.com be resolved to the correct head node IP from the worker node?

RishabhMalviya commented 1 year ago

Networking isn't my strong suit. How can I check that?

jjyao commented 1 year ago

Try `dig +short head-node-gcs.company.internal.domain.com`?
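
If dig isn't available, the standard library can do the same lookup (a sketch using Python's socket module; no extra packages needed):

```python
import socket

# Resolve the GCS hostname from the worker node; this should print the
# IP that the head node's Service exposes if DNS resolution works.
print(socket.gethostbyname("head-node-gcs.company.internal.domain.com"))
```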

RishabhMalviya commented 1 year ago

Ok so, we can't install dig in the company infra. But I ran the ping command on the worker node and got the following (it's working):

$ > ping head-node-gcs.company.internal.domain.com
PING head-node-gcs.company.internal.domain.com (10.61.191.26) 56(84) bytes of data.
64 bytes from 10.61.191.26 (10.61.191.26): icmp_seq=1 ttl=62 time=0.488 ms
64 bytes from 10.61.191.26 (10.61.191.26): icmp_seq=2 ttl=62 time=0.229 ms
64 bytes from 10.61.191.26 (10.61.191.26): icmp_seq=3 ttl=62 time=0.243 ms
64 bytes from 10.61.191.26 (10.61.191.26): icmp_seq=4 ttl=62 time=0.235 ms

jjyao commented 1 year ago

@RishabhMalviya,

Could you run the following Python code on the worker node

import ray
gcs_client = ray._private.gcs_utils.GcsClient(address="head-node-gcs.company.internal.domain.com:80")
gcs_client.get_all_node_info()

and paste the output?

RishabhMalviya commented 1 year ago

Sure, this is the output:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.7/site-packages/ray/_private/gcs_utils.py", line 198, in wrapper
    return f(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/ray/_private/gcs_utils.py", line 399, in get_all_node_info
    reply = self._node_info_stub.GetAllNodeInfo(req, timeout=timeout)
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "failed to connect to all addresses; last error: INTERNAL: Trying to connect an http1.x server"
        debug_error_string = "UNKNOWN:Failed to pick subchannel {created_time:"2023-03-30T07:00:15.555560522+09:00", children:[UNKNOWN:failed to connect to all addresses; last error: INTERNAL: Trying to connect an http1.x server {created_time:"2023-03-30T07:00:15.555559305+09:00", grpc_status:14}]}"

jjyao commented 1 year ago

@RishabhMalviya actually the port seems wrong: is 80 your GCS port? It seems it should be 6379: head-node-gcs.company.internal.domain.com:6379

RishabhMalviya commented 1 year ago

With head-node-gcs.company.internal.domain.com:6379, this is the output:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.7/site-packages/ray/_private/gcs_utils.py", line 198, in wrapper
    return f(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/ray/_private/gcs_utils.py", line 399, in get_all_node_info
    reply = self._node_info_stub.GetAllNodeInfo(req, timeout=timeout)
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused"
        debug_error_string = "UNKNOWN:Failed to pick subchannel {created_time:"2023-03-31T07:57:52.183814056+09:00", children:[UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused {grpc_status:14, created_time:"2023-03-31T07:57:52.183812008+09:00"}]}"

jjyao commented 1 year ago

Could you run

```python
import ray

gcs_client = ray._private.gcs_utils.GcsClient(address="head-node-gcs.company.internal.domain.com:6379")
gcs_client.get_all_node_info()

gcs_client = ray._private.gcs_utils.GcsClient(address="IP:6379")
gcs_client.get_all_node_info()
```

and, from the shell,

```
ping head-node-gcs.company.internal.domain.com
ray start --address='head-node-gcs.company.internal.domain.com:6379'
ray start --address='IP:6379'
```

I'd like to see the output of each.

charu-vl commented 1 year ago

Hi @RishabhMalviya, @jjyao, were you able to find a solution to this? I'm running into a similar issue trying to run a Ray Serve application on a remote cluster from my local machine. Even submitting jobs to the dashboard URL doesn't seem to work, although I can see the dashboard just fine if I go to the dashboard URL. My workaround right now is to forward the port with kubectl and then submit a job, like the following:

kubectl port-forward service/raycluster-service-name-head-svc 8265:8265
ray job submit --address http://localhost:8265 --working-dir="./" -- serve run --host="0.0.0.0" --working-dir="./" --non-blocking model_file:model

The strange thing was that it was working fine about 1.5 weeks ago, but then I came back to the cluster yesterday and received this error. Deleting the cluster and creating a new one didn't seem to help.

jjyao commented 1 year ago

@charu-vl what's the error you are receiving?

RishabhMalviya commented 1 year ago

@jjyao Hey man, my company infra was facing some issues for about a week and a half, so I did not get a chance to look at this, and then it kind of faded into the background. Anyway, I recently found a workaround (@charu-vl):

In Kubernetes, every component in the cluster has access to the Kubernetes DNS (https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/), which lets you reference other components by their Kubernetes names. So, to connect to my head node, I used the Kubernetes Service name instead of the URL, and it worked: ray start --address='<head-node-service-name>:6379'
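
The same name also works when connecting a driver from inside the cluster (a sketch, using the Service name head-node-svc from the Ingress config above):

```python
import ray

# Connect a driver to the existing cluster via the Kubernetes Service name
# (head-node-svc here; check metadata -> name in your Service manifest).
ray.init(address="head-node-svc:6379")
```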

I believe the issue was that K8S Ingress only handles communication over HTTP/HTTPS (which makes sense, since it does a bunch of HTTP-specific things under the hood, like load-balancing and TLS termination). Ray nodes, on the other hand, talk to each other over gRPC, which runs on HTTP/2. Since the head node's URL was exposed through K8S Ingress, Ray inter-node connections through the URL failed, which also matches the "Trying to connect an http1.x server" error earlier in this thread.
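
A quick way to check whether an endpoint actually speaks gRPC (rather than being terminated by an HTTP/1.x-only proxy) is a bare channel probe; a minimal sketch, assuming grpcio is installed and using the Service name and port from above:

```python
import grpc

# Open a channel and wait for it to become ready; an HTTP/1.x-only
# endpoint will never complete the HTTP/2 handshake, so this times out.
channel = grpc.insecure_channel("head-node-svc:6379")
try:
    grpc.channel_ready_future(channel).result(timeout=5)
    print("gRPC channel is ready")
except grpc.FutureTimeoutError:
    print("could not establish a gRPC connection within 5s")
```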

NOTE: If you have access to the K8S configuration .yaml file for your head node's K8S Service, you can find the name to use for connecting under metadata -> name.

kevin85421 commented 1 year ago

Hi @RishabhMalviya @charu-vl, you can also consider KubeRay as a solution for Ray on Kubernetes. KubeRay also uses ray start --address='<head-node-service-name>:6379' in workers to connect to the head Pod. The only difference is that KubeRay uses the FQDN rather than the Kubernetes Service name. See here for more details. Thanks!

RishabhMalviya commented 1 year ago

@kevin85421 Yes, we looked into that. But we are using Kubernetes in a company-managed infrastructure with a large number of restrictions due to compliance and security. Because of this, we have no access to kubectl, and do not have the ability to install Kubernetes Operators.

kevin85421 commented 1 year ago

@RishabhMalviya Got it. Just out of curiosity, which tool do you use to access Kubernetes resources without kubectl? I have also heard that some companies are unable to install CRDs in their Kubernetes clusters.

RishabhMalviya commented 1 year ago

@kevin85421 So, we don't have any alternative to kubectl per se. The way our company has set up K8S is very opinionated.

They've built a UI that we can access for creating new 'apps'. These 'apps' are essentially a K8S Deployment (pod) + a K8S Service + a K8S Ingress component. The UI also allows us to specify mounted storage locations and hardware resource allocation.

Once the 'app' is created though, we do get access to the underlying K8S configuration .yaml for the three components.

kevin85421 commented 1 year ago

@RishabhMalviya Got it. Thank you for sharing!

jednymslowem commented 1 year ago

> Hi @RishabhMalviya, @jjyao, were you able to find a solution to this? I'm running into a similar issue trying to run a Ray Serve application on a remote cluster from my local machine. Even submitting jobs to the dashboard URL doesn't seem to work, although I can see the dashboard just fine if I go to the dashboard URL. My workaround right now is to forward the port with kubectl and then submit a job, like the following:
>
> kubectl port-forward service/raycluster-service-name-head-svc 8265:8265
> ray job submit --address http://localhost:8265 --working-dir="./" -- serve run --host="0.0.0.0" --working-dir="./" --non-blocking model_file:model
>
> The strange thing was that it was working fine about 1.5 weeks ago, but then I came back to the cluster yesterday and received this error. Deleting the cluster and creating a new one didn't seem to help.

This workaround actually worked for me.

Otherwise I experience exactly the same issue @RishabhMalviya described. I tried Ray 2.5.1 and 3.0.0.dev0. I can serve run directly from the head node using kubectl exec --tty, but kubectl port-forward and exposing the dashboard port through an ingress somehow prevent ray.init from connecting to the cluster.

marrrcin commented 1 year ago

It's still a valid issue on Ray 2.6.3 (I'm using the kuberay-operator-0.5.0 Helm chart on GKE). Just a basic setup as shown in the quick start guide with port-forward, and the code (run in a notebook):

import ray
ray.init("127.0.0.1:8266")

fails with:

2023-08-30 13:01:44,006 INFO worker.py:1431 -- Connecting to existing Ray cluster at address: 127.0.0.1:8266...
2023-08-30 13:01:49,650 ERROR utils.py:1395 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
2023-08-30 13:01:49,651 WARNING utils.py:1401 -- Unable to connect to GCS (ray head) at 127.0.0.1:8266. Check that (1) Ray with matching version started successfully at the specified address, (2) this node can reach the specified address, and (3) there is no firewall setting preventing access.

@jjyao - could you provide some update on this?

omriel1 commented 1 year ago

I'm facing the same issue @marrrcin and @jednymslowem described. I used the Getting Started Guide to set up a Ray cluster on kind, did the required port-forwarding, and set the RAY_ADDRESS env variable to point to the right URL (I'm able to access the dashboard), and I get the same error:

2023-09-16 01:16:49,657 INFO worker.py:1313 -- Using address 127.0.0.1:8265 set in the environment variable RAY_ADDRESS
2023-09-16 01:16:49,658 INFO worker.py:1431 -- Connecting to existing Ray cluster at address: 127.0.0.1:8265...
2023-09-16 01:16:54,857 ERROR utils.py:1395 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
2023-09-16 01:16:54,857 WARNING utils.py:1401 -- Unable to connect to GCS (ray head) at 127.0.0.1:8265. Check that (1) Ray with matching version started successfully at the specified address, (2) this node can reach the specified address, and (3) there is no firewall setting preventing access.

I would very much appreciate any help; I currently can't figure this out.

lonsdale8734 commented 1 year ago

I run a RayCluster on K8s and access the Ray client server through an ingress with SSL:

import grpc
import ray

ray.init(address="ray://xxxx:443", _credentials=grpc.ssl_channel_credentials())

But I run into another problem:

E1012 16:43:07.559710570 2789667 hpack_parser.cc:833]                  Error parsing 'content-type' metadata: error=invalid value key=content-type
E1012 16:43:07.637759904 2789667 hpack_parser.cc:833]                  Error parsing 'content-type' metadata: error=invalid value key=content-type
E1012 16:43:07.749863840 2789674 hpack_parser.cc:833]                  Error parsing 'content-type' metadata: error=invalid value key=content-type
E1012 16:43:07.877639829 2789674 hpack_parser.cc:833]                  Error parsing 'content-type' metadata: error=invalid value key=content-type
2023-10-12 16:43:12,903 WARNING dataclient.py:403 -- Encountered connection issues in the data channel. Attempting to reconnect.
Log channel is reconnecting. Logs produced while the connection was down can be found on the head node of the cluster in `ray_client_server_[port].out`

 raise self._exception
ConnectionError: Failed during this or a previous request. Exception that broke the connection: <_MultiThreadedRendezvous of RPC that terminated with:
        status = StatusCode.NOT_FOUND
        details = "Attempted to reconnect to a session that has already been cleaned up."
        debug_error_string = "UNKNOWN:Error received from peer ipv4:10.8.8.50:443 {created_time:"2023-10-12T16:43:59.571742914+08:00", grpc_status:5, grpc_message:"Attempted to reconnect to a session that has already been cleaned up."}"
>