Closed kwohlfahrt closed 1 month ago
ray-operator
When I create a RayJob resource that overrides app.kubernetes.io/name, the job is never launched. The following error is logged by the operator:
2024-05-14T11:50:09.614Z ERROR controller.raycluster-controller Reconciler error {"reconciler group": "ray.io", "reconciler kind": "RayCluster", "name": "foo-raycluster-fxnq5", "namespace": "default", "error": "unable to find head service. cluster name foo-raycluster-fxnq5, filter labels map[app.kubernetes.io/created-by:kuberay-operator app.kubernetes.io/name:kuberay ray.io/cluster:foo-raycluster-fxnq5 ray.io/identifier:foo-raycluster-fxnq5-head ray.io/node-type:head]"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:227
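The mismatch behind this error can be illustrated in isolation. The sketch below is not KubeRay's code: `matchesAll`, `operatorFilter`, and `overriddenServiceLabels` are hypothetical helpers, and the head service's label set is an assumption (the filter labels from the log, with `app.kubernetes.io/name` replaced by the user's value `foo`). It only shows that an equality-based label filter cannot match once one of its keys is overridden.

```go
package main

import "fmt"

// matchesAll reports whether every key/value pair in filter is present in
// labels, which is how an equality-based label selector behaves.
// Hypothetical helper, not taken from the KubeRay code base.
func matchesAll(labels, filter map[string]string) bool {
	for k, want := range filter {
		if got, ok := labels[k]; !ok || got != want {
			return false
		}
	}
	return true
}

// operatorFilter returns the filter labels copied from the error message.
func operatorFilter() map[string]string {
	return map[string]string{
		"app.kubernetes.io/created-by": "kuberay-operator",
		"app.kubernetes.io/name":       "kuberay",
		"ray.io/cluster":               "foo-raycluster-fxnq5",
		"ray.io/identifier":            "foo-raycluster-fxnq5-head",
		"ray.io/node-type":             "head",
	}
}

// overriddenServiceLabels is an ASSUMED label set for the head service:
// identical to the filter, except that the user-supplied value "foo"
// replaces the default app.kubernetes.io/name.
func overriddenServiceLabels() map[string]string {
	labels := operatorFilter()
	labels["app.kubernetes.io/name"] = "foo"
	return labels
}

func main() {
	// The overridden label breaks the match, so the lookup finds nothing.
	fmt.Println(matchesAll(overriddenServiceLabels(), operatorFilter()))
	// With the default labels the same filter matches.
	fmt.Println(matchesAll(operatorFilter(), operatorFilter()))
}
```

Under these assumptions the first call prints `false` and the second prints `true`, which is consistent with the job working when the labels are omitted.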
I create the following RayJob. The cluster starts, but the job is never launched:
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: foo
  labels:
    app.kubernetes.io/name: foo
spec:
  entrypoint: ray status
  rayClusterSpec:
    rayVersion: 2.9.3
    autoscalerOptions:
      imagePullPolicy: Always
    enableInTreeAutoscaling: true
    headGroupSpec:
      rayStartParams:
        dashboard-host: 0.0.0.0
      serviceType: ClusterIP
      template:
        metadata:
          labels:
            app.kubernetes.io/name: foo
        spec:
          containers:
          - name: ray-head
            image: rayproject/ray:2.9.3
            imagePullPolicy: Always
            lifecycle:
              preStop:
                exec:
                  command: ["/bin/sh", "-c", "ray stop"]
            resources:
              limits:
                cpu: 500m
                memory: 1Gi
    workerGroupSpecs:
    - groupName: worker
      maxReplicas: 1
      minReplicas: 0
      replicas: 0
      rayStartParams: {}
      template:
        metadata:
          labels:
            app.kubernetes.io/name: foo
        spec:
          containers:
          - name: ray-worker
            image: rayproject/ray:2.9.3
            lifecycle:
              preStop:
                exec:
                  command: ["/bin/sh", "-c", "ray stop"]
            resources:
              limits:
                cpu: 500m
                memory: 1Gi
If I omit the labels, then the RayJob works as expected.
IMO, the operator should not be relying on labels outside the ray.io/ namespace for anything internal, as users expect to be able to override these.
Interestingly, it looks like the service selector correctly respects the new labels; the failure occurs only on the operator side, when it looks up the head service.
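The suggestion to rely only on labels in the `ray.io/` namespace could look roughly like the sketch below. This is an illustration of the idea, not the change that was actually merged: `rayOnly` and `matchesAll` are hypothetical helpers, and the service labels are an assumed set in which the user has overridden `app.kubernetes.io/name`.

```go
package main

import (
	"fmt"
	"strings"
)

// rayOnly returns a copy of labels containing only keys in the ray.io/
// namespace. Hypothetical helper illustrating the proposal in this issue.
func rayOnly(labels map[string]string) map[string]string {
	out := make(map[string]string)
	for k, v := range labels {
		if strings.HasPrefix(k, "ray.io/") {
			out[k] = v
		}
	}
	return out
}

// matchesAll reports whether every key/value pair in filter is present in
// labels (equality-based matching). Hypothetical helper.
func matchesAll(labels, filter map[string]string) bool {
	for k, want := range filter {
		if got, ok := labels[k]; !ok || got != want {
			return false
		}
	}
	return true
}

func main() {
	// Filter labels as reported in the operator's error message.
	fullFilter := map[string]string{
		"app.kubernetes.io/created-by": "kuberay-operator",
		"app.kubernetes.io/name":       "kuberay",
		"ray.io/cluster":               "foo-raycluster-fxnq5",
		"ray.io/identifier":            "foo-raycluster-fxnq5-head",
		"ray.io/node-type":             "head",
	}
	// ASSUMED head service labels: the user's override replaces the name.
	service := map[string]string{
		"app.kubernetes.io/created-by": "kuberay-operator",
		"app.kubernetes.io/name":       "foo",
		"ray.io/cluster":               "foo-raycluster-fxnq5",
		"ray.io/identifier":            "foo-raycluster-fxnq5-head",
		"ray.io/node-type":             "head",
	}

	filter := rayOnly(fullFilter)
	// Restricting the filter to ray.io/ keys makes the override irrelevant.
	fmt.Println(len(filter), matchesAll(service, filter))
}
```

With the filter reduced to the three `ray.io/` keys, the lookup in this sketch succeeds regardless of what the user sets for `app.kubernetes.io/name`.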
Hi @kevin85421, I will take this.
Closed by #2166