ray-project / kuberay

A toolkit to run Ray applications on Kubernetes
Apache License 2.0

[autoscaler] Autoscaler can not know the overwritten port number. #1644

Open surenyufuz opened 1 year ago

surenyufuz commented 1 year ago

What happened + What you expected to happen

I deployed a Ray cluster on Kubernetes and specified port "61379" instead of the default "6379" in rayStartParams:

  rayStartParams:
    node-ip-address: $MY_POD_IP
    dashboard-host: "0.0.0.0"
    ray-client-server-port: "61001"
    dashboard-port: "61265"
    port: "61379"  # overridden GCS port
    metrics-export-port: "61080"  
    include-dashboard: "true"
    min-worker-port: "10002"  
    max-worker-port: "19999"
    num-cpus: "0"  

It appears that the head service works well.
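To double-check this independently of the dashboard, a small TCP probe against the head service can confirm that the overridden GCS port is actually reachable (a hypothetical helper, not part of Ray or KubeRay; the hostname in the comment is an assumption):

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# e.g. is_port_open("test-head-svc.test.svc.cluster.local", 61379)
```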

[screenshots: head service endpoints showing the overridden ports]

And the workers could connect to the head node successfully.

But the autoscaler container in the head pod raised the following exception:

[screenshot: autoscaler exception traceback]

It seems that the autoscaler cannot detect the overridden port number.
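From the symptoms, the autoscaler appears to assume the default GCS port instead of reading the `port` entry from `rayStartParams`. A minimal sketch of the lookup it seems to be missing (a hypothetical helper for illustration, not Ray's actual implementation):

```python
DEFAULT_GCS_PORT = 6379  # Ray's default GCS port


def gcs_port_from_ray_start_params(ray_start_params: dict) -> int:
    """Resolve the GCS port, honoring an overridden `port` entry if present."""
    return int(ray_start_params.get("port", DEFAULT_GCS_PORT))
```

With the config above, `gcs_port_from_ray_start_params({"port": "61379"})` would resolve to 61379 rather than falling back to 6379.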

Versions / Dependencies

Ray version: 2.7.1
Python version: 3.8.13

Reproduction script

apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: test
  namespace: test
spec:
  rayVersion: "2.7.1"
  autoscalerOptions:
    upscalingMode: Default
    idleTimeoutSeconds: 60
    imagePullPolicy: IfNotPresent
    securityContext: { }
    env: [ ]
    envFrom: [ ]
    resources:
      limits:
        cpu: "500m"
        memory: "512Mi"
      requests:
        cpu: "500m"
        memory: "512Mi"
  headGroupSpec:
    serviceType: ClusterIP  # Options are ClusterIP, NodePort, and LoadBalancer
    rayStartParams:
      node-ip-address: $MY_POD_IP
      dashboard-host: "0.0.0.0"
      ray-client-server-port: "61001"
      dashboard-port: "61265"
      port: "61379"
      metrics-export-port: "61080"  # not random by default on k8s, default to 8080
      include-dashboard: "true"
      min-worker-port: "10002"   # set random port range
      max-worker-port: "19999"
      num-cpus: "0"  # prevent scheduling workloads to head
    template: # Pod template
        metadata: # Pod metadata
        spec: # Pod spec
            nodeSelector:
              app-type.tf: "true"
            containers:
            - name: ray-head
              image: "ray-2.7.1"
              env:
                - name: MY_POD_IP
                  valueFrom:
                    fieldRef:
                      fieldPath: status.podIP
              resources:
                limits:
                  cpu: 4
                  memory: 8Gi
                requests:
                  cpu: 4
                  memory: 8Gi
              # Keep this preStop hook in each Ray container config.
              lifecycle:
                preStop:
                  exec:
                    command: ["/bin/sh","-c","ray stop"]
              ports: # Optional service port overrides
              - containerPort: 61379
                name: gcs # gcs
              - containerPort: 61265
                name: dashboard
              - containerPort: 61001
                name: head
              - containerPort: 61080  # not random by default on k8s
                name: metrics
  workerGroupSpecs:
    - groupName: small-group
      replicas: 1
      minReplicas: 1
      maxReplicas: 5
      rayStartParams:
          node-ip-address: $MY_POD_IP
          metrics-export-port: "61081"  # not random by default on k8s, default to 8080
          min-worker-port: "10002"
          max-worker-port: "19999"
      template: # Pod template
        spec:
          nodeSelector:
            app-type.tf: "true"
          containers:
            - name: ray-worker
              image: "ray-2.7.1"
              env:
                - name: MY_POD_IP
                  valueFrom:
                    fieldRef:
                      fieldPath: status.podIP
              resources:
                limits:
                  cpu: 4
                  memory: 8Gi
                requests:
                  cpu: 4
                  memory: 8Gi
              # Keep this preStop hook in each Ray container config.
              lifecycle:
                preStop:
                  exec:
                    command: [ "/bin/sh","-c","ray stop" ]
              ports: # Optional service port overrides
                - containerPort: 61081
                  name: metrics

Issue Severity

High: It blocks me from completing my task.

kevin85421 commented 1 year ago

Hi @surenyufuz, thank you for raising this issue! Is it possible to use the default port 6379 as a workaround before the Ray community fixes this issue?
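Concretely, the workaround would mean dropping the `port` override from the head group's `rayStartParams` so the head falls back to the default GCS port that the autoscaler expects, along the lines of (a sketch of the suggested workaround, not a confirmed fix):

```yaml
rayStartParams:
  # `port` omitted: head uses Ray's default GCS port 6379,
  # which the autoscaler assumes
  dashboard-port: "61265"
  ray-client-server-port: "61001"
```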

surenyufuz commented 1 year ago

Thanks for the attention. I will probably not use the autoscaler on Kubernetes until this issue is fixed; in some situations I have to use a random port with hostNetwork.

surenyufuz commented 1 year ago

Since the autoscaler is recommended for GPU workloads, I hope this problem gets resolved. Thanks a lot.

hangg112233 commented 2 months ago

Also running into this same issue. Unfortunately, in our case we cannot use the default port 6379: it is also the default Redis port, and we have special routing configs for that port that are incompatible with Ray.