ray-project / kuberay

A toolkit to run Ray applications on Kubernetes
Apache License 2.0

[Bug] RayJob does not work when `app.kubernetes.io/name` is set #2147

Closed: kwohlfahrt closed this issue 1 month ago

kwohlfahrt commented 1 month ago

KubeRay Component

ray-operator

What happened + What you expected to happen

When I create a RayJob resource that overrides app.kubernetes.io/name, the job is never launched. The following error is logged by the operator:

2024-05-14T11:50:09.614Z    ERROR   controller.raycluster-controller    Reconciler error    {"reconciler group": "ray.io", "reconciler kind": "RayCluster", "name": "foo-raycluster-fxnq5", "namespace": "default", "error": "unable to find head service. cluster name foo-raycluster-fxnq5, filter labels map[app.kubernetes.io/created-by:kuberay-operator app.kubernetes.io/name:kuberay ray.io/cluster:foo-raycluster-fxnq5 ray.io/identifier:foo-raycluster-fxnq5-head ray.io/node-type:head]"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:227
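The filter labels in the error message point at the root cause: the operator looks up the head Service with a selector that still contains the default `app.kubernetes.io/name: kuberay` label, while the Service itself carries the user-supplied override. A minimal sketch in plain Go illustrates the mismatch (maps stand in for Kubernetes label selectors; the label values are taken from the error above, the `matches` helper is illustrative, not KubeRay code):

```go
package main

import "fmt"

// matches reports whether every key/value pair in selector is present in
// labels, mirroring how an equality-based Kubernetes label selector
// matches an object's labels.
func matches(selector, labels map[string]string) bool {
	for k, v := range selector {
		if labels[k] != v {
			return false
		}
	}
	return true
}

func main() {
	// Labels the operator filters on, from the error message above.
	selector := map[string]string{
		"app.kubernetes.io/name": "kuberay", // operator default
		"ray.io/cluster":         "foo-raycluster-fxnq5",
		"ray.io/node-type":       "head",
	}
	// Labels on the head Service after the user's override is applied.
	serviceLabels := map[string]string{
		"app.kubernetes.io/name": "foo", // user override from the RayJob
		"ray.io/cluster":         "foo-raycluster-fxnq5",
		"ray.io/node-type":       "head",
	}
	fmt.Println(matches(selector, serviceLabels)) // false: lookup fails
}
```

Because the `app.kubernetes.io/name` values disagree, the lookup matches nothing, producing the "unable to find head service" error even though the Service exists.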

Reproduction script

I create the following RayJob. The cluster starts, but the job is never launched:

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: foo
  labels:
    app.kubernetes.io/name: foo
spec:
  entrypoint: ray status
  rayClusterSpec:
    rayVersion: 2.9.3
    autoscalerOptions:
      imagePullPolicy: Always
    enableInTreeAutoscaling: true
    headGroupSpec:
      rayStartParams:
        dashboard-host: 0.0.0.0
      serviceType: ClusterIP
      template:
        metadata:
          labels:
            app.kubernetes.io/name: foo
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.9.3
              imagePullPolicy: Always
              lifecycle:
                preStop:
                  exec:
                    command: ["/bin/sh", "-c", "ray stop"]
              resources:
                limits:
                  cpu: 500m
                  memory: 1Gi
    workerGroupSpecs:
      - groupName: worker
        maxReplicas: 1
        minReplicas: 0
        replicas: 0
        rayStartParams: {}
        template:
          metadata:
            labels:
              app.kubernetes.io/name: foo
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.9.3
                lifecycle:
                  preStop:
                    exec:
                      command: ["/bin/sh", "-c", "ray stop"]
                resources:
                  limits:
                    cpu: 500m
                    memory: 1Gi

If I omit the labels, then the RayJob works as expected.

Anything else

IMO, the operator should not rely on labels outside the ray.io/ prefix for anything internal, since users expect to be able to override these well-known labels.
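One way to follow that principle: build internal lookup selectors only from operator-owned `ray.io/`-prefixed labels, so user overrides of common labels can never break the lookup. A hedged sketch (illustrative helper, not the actual KubeRay fix):

```go
package main

import (
	"fmt"
	"strings"
)

// rayOnlySelector keeps only operator-owned "ray.io/" labels when building
// a lookup selector, so user overrides of well-known labels such as
// app.kubernetes.io/name cannot affect internal Service lookups.
func rayOnlySelector(labels map[string]string) map[string]string {
	out := make(map[string]string)
	for k, v := range labels {
		if strings.HasPrefix(k, "ray.io/") {
			out[k] = v
		}
	}
	return out
}

func main() {
	full := map[string]string{
		"app.kubernetes.io/name": "foo", // user-controlled, dropped
		"ray.io/cluster":         "foo-raycluster-fxnq5",
		"ray.io/node-type":       "head",
	}
	fmt.Println(rayOnlySelector(full)) // only the ray.io/ labels remain
}
```

With a selector like this, the overridden `app.kubernetes.io/name: foo` is simply ignored during the head-Service lookup.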

kwohlfahrt commented 1 month ago

Interestingly, it looks like the Service's selector correctly respects the overridden labels; the lookup only fails on the operator side.

rueian commented 1 month ago

Hi @kevin85421, I will take this.

kevin85421 commented 1 month ago

Closed by #2166