ray-project / kuberay

A toolkit to run Ray applications on Kubernetes
Apache License 2.0

[Bug] no feedback about failure to create submitter pod due to invalid spec #2210

Open mickvangelderen opened 2 days ago

mickvangelderen commented 2 days ago

Search before asking

KubeRay Component

ray-operator

What happened + What you expected to happen

I created a RayJob with a submitterPodTemplate but did not set restartPolicy. The submitter pod was never created, and the RayJob itself gave no feedback about why.

I had to search the ray-operator logs to find:

{"level":"error","ts":"2024-06-28T18:09:14.679Z","logger":"controllers.RayJob","msg":"failed to create k8s Job","RayJob":{"name":"mick-gxccf","namespace":"launch"},"reconcileID":"3b03831c-d14d-497f-9c8c-4ac790e1ff35","error":"Job.batch \"mick-gxccf\" is invalid: spec.template.spec.restartPolicy: Required value: valid values: \"OnFailure\", \"Never\"","stacktrace":"github.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayJobReconciler).createNewK8sJob\n\t/home/runner/work/kuberay/kuberay/ray-operator/controllers/ray/rayjob_controller.go:440\ngithub.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayJobReconciler).createK8sJobIfNeed\n\t/home/runner/work/kuberay/kuberay/ray-operator/controllers/ray/rayjob_controller.go:350\ngithub.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayJobReconciler).Reconcile\n\t/home/runner/work/kuberay/kuberay/ray-operator/controllers/ray/rayjob_controller.go:168\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227"}
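Since the operator emits one JSON object per log line, a small filter can pull out the errors for a single RayJob instead of reading the raw output. This is only a sketch: `rayjob_errors` is a hypothetical helper, not part of KubeRay, and it assumes every relevant entry carries the `level`, `RayJob`, and `error` fields seen in the line above.

```python
import json
from typing import Iterable, Iterator


def rayjob_errors(lines: Iterable[str], name: str, namespace: str) -> Iterator[str]:
    """Yield the error text of error-level log entries for one RayJob.

    Hypothetical helper; the field names are taken from the log line above.
    """
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log line
        if not isinstance(entry, dict):
            continue
        rayjob = entry.get("RayJob")
        if not isinstance(rayjob, dict):
            continue  # entry is not tied to a RayJob
        if (
            entry.get("level") == "error"
            and rayjob.get("name") == name
            and rayjob.get("namespace") == namespace
        ):
            # Prefer the detailed error; fall back to the short message.
            yield entry.get("error", entry.get("msg", ""))
```

The function takes any iterable of lines, so it can be fed `sys.stdin` when piping in the operator's log output, or an open file holding saved logs.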

I expected the RayJob spec to be validated when it is submitted to the API. Is that validation not the same as the one applied to the Kubernetes Job the operator creates from it?

Reproduction script

"submitterPodTemplate": {
    "spec": {
        // "restartPolicy": "Never", <- OFFENDER
        // ... as usual
    }
}
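Until the failure is surfaced on the RayJob, a check on the client side can catch this mistake before submission. The following is a minimal sketch, assuming the manifest is available as a plain dict with the template at `spec.submitterPodTemplate`. `check_submitter_restart_policy` is a hypothetical helper, not part of KubeRay, and it only mirrors the one rule quoted in the error message above.

```python
from typing import Any, Mapping, Optional

# Values the Kubernetes API listed as valid in the error message above.
VALID_JOB_RESTART_POLICIES = ("OnFailure", "Never")


def check_submitter_restart_policy(rayjob: Mapping[str, Any]) -> Optional[str]:
    """Return an error message if the submitter pod template would be rejected.

    Returns None when no custom template is set or the policy is valid.
    Hypothetical helper; this is not KubeRay's own validation.
    """
    template = rayjob.get("spec", {}).get("submitterPodTemplate")
    if template is None:
        # No custom template supplied, so there is nothing to check here.
        return None
    policy = template.get("spec", {}).get("restartPolicy")
    if policy not in VALID_JOB_RESTART_POLICIES:
        return (
            "spec.submitterPodTemplate.spec.restartPolicy: "
            f"got {policy!r}, expected one of {VALID_JOB_RESTART_POLICIES}"
        )
    return None
```

Calling it on the manifest from this report (template present, `restartPolicy` missing) returns a message instead of `None`, which is the feedback that was missing at submission time.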

Anything else

No response

Are you willing to submit a PR?