ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
33.3k stars 5.63k forks

[Core] ray-client timeout with ingress on kubernetes #38882

Open TheisFerre opened 1 year ago

TheisFerre commented 1 year ago

What happened + What you expected to happen

I am running a Ray cluster inside my Kubernetes cluster. I have exposed the dashboard over HTTPS using an nginx Ingress, which works as expected:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.class: nginx
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
    nginx.ingress.kubernetes.io/ssl-passthrough: "true"
  name: raycluster-ingress-head-ingress
  namespace: ray
spec:
  tls:
  - hosts:
    - dashboard.my.company.url.com
    secretName: ingress-tls
  rules:
  - host: dashboard.my.company.url.com
    http:
      paths:
      - path: /
        backend:
          service:
            name: ray-cluster-kuberay-head-svc
            port:
              number: 8265
        pathType: Prefix

With the dashboard exposed, I can submit a simple hello-world job using the Ray CLI as follows:

ray job submit \
    --address=https://dashboard.my.company.url.com/ \
    --working-dir=scripts \
    -- python script.py

# script.py
import ray

@ray.remote
def hello_world():
    return "hello world"

ray.init()
print(ray.get(hello_world.remote()))

According to the documentation, submitting a job directly from a Python script (instead of using the CLI) requires passing the address of the Ray Client server to ray.init(). That address should point to the Ray Client server running on the Ray cluster head pod.

For this, I added a second Ingress resource that uses gRPC to target the Ray Client server. I followed the sample given here.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.class: nginx
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/ssl-passthrough: "true"
    nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
  name: raycluster-ingress-head-ingress-grpc
  namespace: ray
spec:
  tls:
  - hosts:
    - client.my.company.url.com
    secretName: ingress-tls
  rules:
  - host: client.my.company.url.com
    http:
      paths:
      - path: /
        backend:
          service:
            name: ray-cluster-kuberay-head-svc
            port:
              number: 10001
        pathType: Prefix

I then ran the modified script below with python script.py:

# script.py
import ray

@ray.remote
def hello_world():
    return "hello world"

ray.init("ray://client.my.company.url.com")
print(ray.get(hello_world.remote()))

This results in a connection timeout error:

ConnectionError: ray client connection timeout
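
One way to narrow down a timeout like this is to check which host and port the ray:// address resolves to, and whether that endpoint accepts a plain TCP connection at all. The following is a minimal diagnostic sketch using only the Python standard library, not part of Ray. The helper names split_ray_address and can_connect are hypothetical, the fallback port 10001 is an assumption taken from the Ray Client port in the Ingress above, and 443 is assumed to be the port the nginx Ingress listens on.

```python
import socket
from urllib.parse import urlsplit


def split_ray_address(address: str, default_port: int = 10001) -> tuple[str, int]:
    """Split a ray:// address into (host, port).

    default_port=10001 is an assumption (the Ray Client port used in the
    Ingress above); it is applied only when the address has no explicit port.
    """
    parts = urlsplit(address)
    if parts.scheme != "ray" or not parts.hostname:
        raise ValueError(f"not a ray:// address: {address!r}")
    return parts.hostname, parts.port or default_port


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a plain TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


def report(address: str, ingress_port: int = 443) -> None:
    """Print TCP reachability for the address port and the assumed ingress port."""
    host, port = split_ray_address(address)
    for candidate in (port, ingress_port):
        print(f"{host}:{candidate} reachable: {can_connect(host, candidate)}")
```

Calling report("ray://client.my.company.url.com") from the machine that runs script.py prints one line per candidate port. A successful TCP connection only shows that something is listening there; it does not prove that gRPC traffic is routed to the Ray Client server.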

Versions / Dependencies

ray-cluster Helm chart version: 0.6.0
Python: 3.11.4
Ray version installed in Python environment: 2.6.3
Kubernetes version: 1.26.3

Reproduction script

# script.py
import ray

@ray.remote
def hello_world():
    return "hello world"

ray.init("ray://client.my.company.url.com")
print(ray.get(hello_world.remote()))

Issue Severity

High: It blocks me from completing my task.

jjyao commented 1 year ago

@TheisFerre,

Is it possible for you to use Ray job submission (https://docs.ray.io/en/releases-2.6.1/cluster/running-applications/job-submission/index.html) instead of Ray Client? We no longer recommend using Ray Client.
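
For reference, job submission can also be driven from Python instead of the CLI, through the JobSubmissionClient class in ray.job_submission. The following is a minimal sketch, assuming the dashboard Ingress from the original report is reachable at https://dashboard.my.company.url.com and that the script lives in a local scripts directory. The helper names build_job_request and submit_job are hypothetical wrappers, not part of Ray.

```python
def build_job_request(entrypoint: str, working_dir: str) -> dict:
    """Build the keyword arguments passed to JobSubmissionClient.submit_job."""
    return {
        "entrypoint": entrypoint,
        # working_dir is uploaded to the cluster as part of the runtime environment.
        "runtime_env": {"working_dir": working_dir},
    }


def submit_job(dashboard_url: str, entrypoint: str, working_dir: str) -> str:
    """Submit a job through the Ray dashboard and return its submission ID."""
    # Imported lazily so this module can be imported without Ray installed.
    from ray.job_submission import JobSubmissionClient

    client = JobSubmissionClient(dashboard_url)
    return client.submit_job(**build_job_request(entrypoint, working_dir))


# Example call (requires Ray and a reachable dashboard; the URL is an assumption):
# submit_job("https://dashboard.my.company.url.com", "python script.py", "scripts")
```

This mirrors the ray job submit command from the original report and goes through the dashboard port (8265) rather than the Ray Client port (10001).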

TheisFerre commented 1 year ago

I am trying to connect to my Ray cluster with Prefect, which means I will need to use the client.

Does your response mean that it is not possible to connect to the Ray Client through an Ingress?