ray-project / kuberay

A toolkit to run Ray applications on Kubernetes
Apache License 2.0

[Bug] "unable to find head service" error when specifying app.kubernetes.io/name on headGroupSpec #2151

Closed jonapgar-groupby closed 1 month ago

jonapgar-groupby commented 1 month ago

KubeRay Component

ray-operator

What happened + What you expected to happen

If you specify an app.kubernetes.io/name label (with a value other than "kuberay") in headGroupSpec.template.metadata.labels for a RayJob, the ray-operator cannot find the head service, and the cluster's status is never updated.

If you see "unable to find head service" errors in your logs, followed by a loop of "Wait for the RayCluster.Status.State to be ready before submitting the job" messages, you may be hitting the same problem.

The error occurs because any custom app.kubernetes.io/name label is also added to the head service, but when the ray-operator looks up that service, it uses a filter that always matches on app.kubernetes.io/name: kuberay.
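To make the mismatch concrete, here is a minimal Python sketch of equality-based label matching. It is only a model of the behaviour described above, not KubeRay's code: the function names, the service name `head-svc`, and the assumption that pod-template labels override the default labels on the service are all hypothetical and chosen for illustration.

```python
# Illustrative model only; this is NOT the ray-operator's implementation.

# Label key taken from the issue description.
NAME_LABEL = "app.kubernetes.io/name"


def build_service_labels(default_labels, head_template_labels):
    """Hypothetical merge step: head pod template labels override the defaults."""
    merged = dict(default_labels)
    merged.update(head_template_labels)
    return merged


def matches(selector, labels):
    """Equality-based selector: every selector key must be present with the same value."""
    return all(labels.get(key) == value for key, value in selector.items())


def find_head_service(services, selector):
    """Return the first service whose labels satisfy the selector, or None."""
    for service in services:
        if matches(selector, service["labels"]):
            return service
    return None


default_labels = {NAME_LABEL: "kuberay"}
# Assumed lookup filter: always asks for the default value.
lookup_selector = {NAME_LABEL: "kuberay"}

# No custom label: the service keeps the default value and the lookup succeeds.
ok_service = {
    "name": "head-svc",  # hypothetical name
    "labels": build_service_labels(default_labels, {}),
}

# Custom label from the reproduction script: the value becomes "TEST".
broken_service = {
    "name": "head-svc",  # hypothetical name
    "labels": build_service_labels(default_labels, {NAME_LABEL: "TEST"}),
}

print(find_head_service([ok_service], lookup_selector))
print(find_head_service([broken_service], lookup_selector))  # prints None
```

In this model the second lookup returns nothing because the selector still asks for "kuberay" while the service now carries "TEST", which corresponds to the "unable to find head service" symptom.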

Reproduction script

cat <<EOF | kubectl apply -f -
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sample
spec:
  entrypoint: python /home/ray/samples/sample_code.py
  runtimeEnvYAML: |
    pip:
      - requests==2.26.0
      - pendulum==2.1.2
    env_vars:
      counter_name: "test_counter"
  rayClusterSpec:
    rayVersion: '2.9.0'
    headGroupSpec:
      rayStartParams:
        dashboard-host: '0.0.0.0'
      template:
        metadata:
          labels:
            app.kubernetes.io/name: "TEST"
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.9.0
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265 # Ray dashboard
                  name: dashboard
                - containerPort: 10001
                  name: client
              resources:
                limits:
                  cpu: "1"
                requests:
                  cpu: "200m"
              volumeMounts:
                - mountPath: /home/ray/samples
                  name: code-sample
          volumes:
            - name: code-sample
              configMap:
                name: ray-job-code-sample
                items:
                  - key: sample_code.py
                    path: sample_code.py
EOF

Anything else

No response

kwohlfahrt commented 1 month ago

I think this is a duplicate of #2147?

kevin85421 commented 1 month ago

@jonapgar-groupby thank you for reporting the issue! This is a duplicate of #2147. Track the progress in #2147.

jonapgar-groupby commented 1 month ago

oh weird I didn't see that one after looking into this issue for a few hours! good to know it's being tracked :) thanks!

On Fri, May 17, 2024, 1:52 p.m. Kai-Hsun Chen wrote:

@jonapgar-groupby thank you for reporting the issue! This is a duplicate of #2147 (https://github.com/ray-project/kuberay/issues/2147). Track the progress in #2147.
