ray-project / kuberay

A toolkit to run Ray applications on Kubernetes

[Bug] no feedback about failure to create submitter pod due to invalid spec #2210

Open · mickvangelderen opened this issue 4 months ago

mickvangelderen commented 4 months ago

Search before asking

KubeRay Component

ray-operator

What happened + What you expected to happen

created a RayJob with a submitterPodTemplate but no restartPolicy

had to search the logs of the ray-operator to find:

{"level":"error","ts":"2024-06-28T18:09:14.679Z","logger":"controllers.RayJob","msg":"failed to create k8s Job","RayJob":{"name":"mick-gxccf","namespace":"launch"},"reconcileID":"3b03831c-d14d-497f-9c8c-4ac790e1ff35","error":"Job.batch \"mick-gxccf\" is invalid: spec.template.spec.restartPolicy: Required value: valid values: \"OnFailure\", \"Never\"","stacktrace":"github.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayJobReconciler).createNewK8sJob\n\t/home/runner/work/kuberay/kuberay/ray-operator/controllers/ray/rayjob_controller.go:440\ngithub.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayJobReconciler).createK8sJobIfNeed\n\t/home/runner/work/kuberay/kuberay/ray-operator/controllers/ray/rayjob_controller.go:350\ngithub.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayJobReconciler).Reconcile\n\t/home/runner/work/kuberay/kuberay/ray-operator/controllers/ray/rayjob_controller.go:168\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227"}

I thought the RayJob spec was supposed to be validated on submission to the API. Does that validation not cover this case?

Reproduction script

"submitterPodTemplate": {
    "spec": {
        // "restartPolicy": "Never", <- OFFENDER
        // ... as usual
    }
}
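
For reference, a minimal sketch of a full manifest with the field set, assuming the ray.io/v1 RayJob API; the name, image, and entrypoint are placeholders, and rayClusterSpec is omitted for brevity:

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sample                  # placeholder name
spec:
  entrypoint: python -c "import ray; ray.init()"
  submitterPodTemplate:
    spec:
      restartPolicy: Never             # required by Job.batch: "Never" or "OnFailure"
      containers:
        - name: ray-job-submitter      # a container name is also required by Job.batch
          image: rayproject/ray:2.9.0  # placeholder image tag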

Anything else

No response

Are you willing to submit a PR?

kevin85421 commented 4 months ago

We are currently working on improving the observability of CRD status. You can see this doc for more details.

kevin85421 commented 3 months ago

@rueian can you confirm whether this issue is closed by #2259? Thanks!

Update: I realized this issue is about RayJob. Please ignore this comment.

davidxia commented 2 months ago

You can see this doc for more details.

@kevin85421 can you link the doc? Thanks!

davidxia commented 2 months ago

I have the same issue, except I also didn't specify a container name in my submitterPodTemplate. The controller failed to create the K8s Job:

{"level":"error","ts":"2024-08-28T19:56:13.072Z","logger":"controllers.RayJob","msg":"failed to create k8s Job","RayJob":{"name":"ray-job-s9rqx","namespace":"hyperkube"},"reconcileID":"f4eb1538-36dd-4082-8628-4f66c59fa497","error":"Job.batch \"ray-job-s9rqx\" is invalid: [spec.template.spec.containers[0].name: Required value, spec.template.spec.restartPolicy: Required value: valid values: \"OnFailure\", \"Never\"]","stacktrace":"github.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayJobReconciler).createNewK8sJob\n\t/home/runner/work/kuberay/kuberay/ray-operator/controllers/ray/rayjob_controller.go:440\ngithub.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayJobReconciler).createK8sJobIfNeed\n\t/home/runner/work/kuberay/kuberay/ray-operator/controllers/ray/rayjob_controller.go:350\ngithub.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayJobReconciler).Reconcile\n\t/home/runner/work/kuberay/kuberay/ray-operator/controllers/ray/rayjob_controller.go:168\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227"}
{"level":"info","ts":"2024-08-28T19:56:13.072Z","logger":"controllers.RayJob","msg":"Warning: Reconciler returned both a non-zero result and a non-nil error. The result will always be ignored if the error is non-nil and the non-nil error causes reqeueuing with exponential backoff. For more details, see: https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/reconcile#Reconciler","RayJob":{"name":"ray-job-s9rqx","namespace":"hyperkube"},"reconcileID":"f4eb1538-36dd-4082-8628-4f66c59fa497"}
{"level":"error","ts":"2024-08-28T19:56:13.072Z","logger":"controllers.RayJob","msg":"Reconciler error","RayJob":{"name":"ray-job-s9rqx","namespace":"hyperkube"},"reconcileID":"f4eb1538-36dd-4082-8628-4f66c59fa497","error":"Job.batch \"ray-job-s9rqx\" is invalid: [spec.template.spec.containers[0].name: Required value, spec.template.spec.restartPolicy: Required value: valid values: \"OnFailure\", \"Never\"]","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227"}

Is it expected that the RayJob will have a Deployment Status of Initializing forever?

kubectl get rayjobs ray-job-s9rqx

NAME            JOB STATUS   DEPLOYMENT STATUS   START TIME             END TIME   AGE
ray-job-s9rqx                Initializing        2024-08-28T19:54:10Z              9m40s

Describing the RayJob doesn't show any errors.

kubectl describe rayjobs ray-job-s9rqx

Status:
  Job Deployment Status:  Initializing
  Job Id:                 ray-job-s9rqx-crhbq
  Ray Cluster Name:       ray-job-s9rqx-raycluster-k9976
  Ray Cluster Status:
    Desired CPU:     0
    Desired GPU:     0
    Desired Memory:  0
    Desired TPU:     0
    Head:
  Start Time:  2024-08-28T19:54:10Z
Events:
  Type    Reason           Age    From                            Message
  ----    ------           ----   ----                            -------
  Normal  CreatedWorkload  9m18s  ray.io/rayjob-kueue-controller  Created Workload: hyperkube/rayjob-ray-job-s9rqx-0d702
  Normal  Started          9m17s  ray.io/rayjob-kueue-controller  Admitted by clusterQueue cluster-queue
  Normal  Created          9m17s  rayjob-controller               Created RayCluster ray-job-s9rqx-raycluster-k9976

davidxia commented 2 months ago

Any thoughts on adding a RayJob validating webhook to also apply logic like this when creating or updating a RayJob? That would give earlier, more visible errors to users.

andrewsykim commented 1 month ago

I think KubeRay has taken a general stance to avoid webhooks, as many users mentioned that company policies forbid webhooks, but maybe this is worth revisiting? @kevin85421 would have more insights on this.

I think much of the logic in validateRayJobSpec can be in a ValidatingAdmissionPolicy using CEL expressions. That could be worth considering too https://kubernetes.io/docs/reference/access-authn-authz/validating-admission-policy/
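
For illustration, a minimal sketch of such a policy, assuming K8s 1.30+ with admissionregistration.k8s.io/v1; the policy name, CEL expression, and message are placeholders rather than an existing KubeRay policy:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: rayjob-submitter-restart-policy          # placeholder name
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
      - apiGroups: ["ray.io"]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["rayjobs"]
  validations:
    - expression: >-
        !has(object.spec.submitterPodTemplate) ||
        (has(object.spec.submitterPodTemplate.spec) &&
        has(object.spec.submitterPodTemplate.spec.restartPolicy) &&
        object.spec.submitterPodTemplate.spec.restartPolicy in ['Never', 'OnFailure'])
      message: "submitterPodTemplate.spec.restartPolicy must be 'Never' or 'OnFailure'"
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: rayjob-submitter-restart-policy-binding  # placeholder name
spec:
  policyName: rayjob-submitter-restart-policy
  validationActions: ["Deny"]

The binding is needed to put the policy into effect, and the same pattern could presumably cover the other checks in validateRayJobSpec.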

kevin85421 commented 1 month ago

@kevin85421 can you link the doc? Thanks!

Sorry, I forgot to link the doc in the comment above. We track the CRD observability work in https://docs.google.com/document/d/1NCX3fCTiE817AV4chvR46bkNHR7q0zy4QXQfDu-dqF4/edit, and @rueian is leading the effort.

I think KubeRay has taken a general stance to avoid webhooks, as many users mentioned that company policies forbid webhooks, but maybe this is worth revisiting?

At that time, we were mainly discussing CR conversion webhooks. We decided not to maintain a CR conversion webhook because (1) many companies are not allowed to install webhooks, and (2) maintaining a conversion webhook is expensive.

Although (1) is still true for a validation webhook, maintaining a validation webhook is much cheaper than maintaining a conversion webhook. I may not prioritize it, but PRs are welcome.

I think much of the logic in validateRayJobSpec can be in a ValidatingAdmissionPolicy using CEL expressions. That could be worth considering too https://kubernetes.io/docs/reference/access-authn-authz/validating-admission-policy/

I will read the doc and reply later. My current thought is that the ValidatingAdmissionPolicy API became 'stable' in K8s 1.30, which is quite new.

mickvangelderen commented 1 month ago

The original issue I posted here has, as far as I know, nothing to do with webhooks. To reproduce the issue you need to run kubectl apply -f config.yaml where config.yaml creates a RayJob without "restartPolicy" set in the submitterPodTemplate (see the reproduction snippet above).

davidxia commented 1 month ago

We typically traced the CRD observability by https://docs.google.com/document/d/1NCX3fCTiE817AV4chvR46bkNHR7q0zy4QXQfDu-dqF4/edit, and @rueian is leading the effort.

@kevin85421 thanks! I see from the doc KubeRay - RayCluster status improvement workstream that a "Draft is out" by Tina.

I also see the doc has "Add new events to RayJob controller: ~RayJobSpecValidated~", which links to the KubeRay EventRecorder Guidelines. In here I see a Success / Failure RayJobSpecValidated event. But this PR doesn't add more visibility into RayJob spec validation errors on the invalid RayJob object itself. Are there plans to do so? cc @rueian

kevin85421 commented 1 month ago

The original issue I posted here has, as far as I know, nothing to do with webhooks. To reproduce the issue you need to run kubectl apply -f config.yaml where config.yaml creates a RayJob without "restartPolicy" set in the submitterPodTemplate (see the reproduction snippet above).

submitterPodTemplate is a PodTemplateSpec, and restartPolicy is not a required field for a PodTemplateSpec, so it is not validated by the OpenAPI spec (PodTemplateSpec). However, a K8s Job is invalid if restartPolicy is not set or is set to Always, so this is only caught by the K8s API server when the submitter Job is created (code).
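
As a sanity check (a sketch with placeholder values), applying a bare batch Job whose pod template omits restartPolicy should reproduce the same API server error, independent of KubeRay:

apiVersion: batch/v1
kind: Job
metadata:
  name: restart-policy-demo    # placeholder name
spec:
  template:
    spec:
      # restartPolicy deliberately omitted; the API server rejects this with
      # spec.template.spec.restartPolicy: Required value: valid values: "OnFailure", "Never"
      containers:
        - name: demo
          image: busybox       # placeholder image
          command: ["sh", "-c", "true"]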

Is it enough if we create a K8s event for the RayJob CR for this error? @mickvangelderen

I also see the doc has "Add new events to RayJob controller: RayJobSpecValidated" which links to the KubeRay EventRecorder Guidelines. In here I see https://github.com/ray-project/kuberay/issues/1813. But this PR doesn't add more visibility into RayJob spec validation errors on the invalid RayJob object itself. Are there plans to do so?

@davidxia Before Ray Summit, we are focusing on the interactions between the KubeRay operator and the K8s API server, i.e. Create / Delete. We will focus on other events after the Ray Summit.

mickvangelderen commented 1 month ago

Thank you for the explanation. From your description I understand that when the KubeRay operator attempts to create a K8s Job from the provided parameters, the K8s API server rejects it, and that the only record of this is currently in the KubeRay operator logs. I have no opinion about what a better feedback mechanism would look like.

I just want to mention, without wanting to discourage you, that solving this issue is no longer that important to me personally. I know I may sometimes have to check the operator logs as part of debugging. If there is a path forward where a feedback mechanism is introduced and some time is needed to apply it consistently to the various errors, that is totally understandable.