mickvangelderen opened 5 months ago
We are currently working on improving the observability of CRD status. You can see this doc for more details.
@rueian can you confirm whether this issue is closed by #2259? Thanks!
Update: I found the issue is about RayJob. Ignore this comment.
You can see this doc for more details.
@kevin85421 can you link the doc? Thanks!
I have the same issue except I didn't specify a name for my submitterPodTemplate. The controller failed to create the K8s Job:
{"level":"error","ts":"2024-08-28T19:56:13.072Z","logger":"controllers.RayJob","msg":"failed to create k8s Job","RayJob":{"name":"ray-job-s9rqx","namespace":"hyperkube"},"reconcileID":"f4eb1538-36dd-4082-8628-4f66c59fa497","error":"Job.batch \"ray-job-s9rqx\" is invalid: [spec.template.spec.containers[0].name: Required value, spec.template.spec.restartPolicy: Required value: valid values: \"OnFailure\", \"Never\"]","stacktrace":"github.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayJobReconciler).createNewK8sJob\n\t/home/runner/work/kuberay/kuberay/ray-operator/controllers/ray/rayjob_controller.go:440\ngithub.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayJobReconciler).createK8sJobIfNeed\n\t/home/runner/work/kuberay/kuberay/ray-operator/controllers/ray/rayjob_controller.go:350\ngithub.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayJobReconciler).Reconcile\n\t/home/runner/work/kuberay/kuberay/ray-operator/controllers/ray/rayjob_controller.go:168\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227"}
{"level":"info","ts":"2024-08-28T19:56:13.072Z","logger":"controllers.RayJob","msg":"Warning: Reconciler returned both a non-zero result and a non-nil error. The result will always be ignored if the error is non-nil and the non-nil error causes reqeueuing with exponential backoff. For more details, see: https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/reconcile#Reconciler","RayJob":{"name":"ray-job-s9rqx","namespace":"hyperkube"},"reconcileID":"f4eb1538-36dd-4082-8628-4f66c59fa497"}
{"level":"error","ts":"2024-08-28T19:56:13.072Z","logger":"controllers.RayJob","msg":"Reconciler error","RayJob":{"name":"ray-job-s9rqx","namespace":"hyperkube"},"reconcileID":"f4eb1538-36dd-4082-8628-4f66c59fa497","error":"Job.batch \"ray-job-s9rqx\" is invalid: [spec.template.spec.containers[0].name: Required value, spec.template.spec.restartPolicy: Required value: valid values: \"OnFailure\", \"Never\"]","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227"}
Is it expected that the RayJob will have a Deployment Status of Initializing forever?
kubectl get rayjobs ray-job-s9rqx
NAME            JOB STATUS   DEPLOYMENT STATUS   START TIME             END TIME   AGE
ray-job-s9rqx                Initializing        2024-08-28T19:54:10Z              9m40s
Describing the RayJob doesn't show any errors.
kubectl describe rayjobs ray-job-s9rqx
Status:
  Job Deployment Status:  Initializing
  Job Id:                 ray-job-s9rqx-crhbq
  Ray Cluster Name:       ray-job-s9rqx-raycluster-k9976
  Ray Cluster Status:
    Desired CPU:     0
    Desired GPU:     0
    Desired Memory:  0
    Desired TPU:     0
    Head:
  Start Time:  2024-08-28T19:54:10Z
Events:
  Type    Reason           Age    From                            Message
  ----    ------           ----   ----                            -------
  Normal  CreatedWorkload  9m18s  ray.io/rayjob-kueue-controller  Created Workload: hyperkube/rayjob-ray-job-s9rqx-0d702
  Normal  Started          9m17s  ray.io/rayjob-kueue-controller  Admitted by clusterQueue cluster-queue
  Normal  Created          9m17s  rayjob-controller               Created RayCluster ray-job-s9rqx-raycluster-k9976
Any thoughts on adding a RayJob validating webhook that runs this kind of validation logic when creating or updating a RayJob? It would give users earlier, more visible errors.
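For concreteness, here is a rough sketch of what such a webhook could look like using controller-runtime's CustomValidator interface. Nothing below is existing KubeRay code: the type name and the inline check are made up, and a real implementation would presumably reuse validateRayJobSpec instead of the single check shown here.

```go
package webhooks

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/webhook/admission"

	rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
)

// RayJobValidator is a hypothetical validating webhook for RayJob CRs.
type RayJobValidator struct{}

var _ admission.CustomValidator = &RayJobValidator{}

func (v *RayJobValidator) ValidateCreate(ctx context.Context, obj runtime.Object) (admission.Warnings, error) {
	rayJob, ok := obj.(*rayv1.RayJob)
	if !ok {
		return nil, fmt.Errorf("expected a RayJob but got %T", obj)
	}
	// Example check mirroring the K8s Job requirement from this issue: the
	// submitter pod template, if provided, must use a restartPolicy of
	// Never or OnFailure. A real webhook would call the controller's
	// existing validation (e.g. validateRayJobSpec) here instead.
	if tpl := rayJob.Spec.SubmitterPodTemplate; tpl != nil {
		rp := tpl.Spec.RestartPolicy
		if rp != corev1.RestartPolicyNever && rp != corev1.RestartPolicyOnFailure {
			return nil, fmt.Errorf("spec.submitterPodTemplate.spec.restartPolicy must be Never or OnFailure")
		}
	}
	return nil, nil
}

func (v *RayJobValidator) ValidateUpdate(ctx context.Context, oldObj, newObj runtime.Object) (admission.Warnings, error) {
	return v.ValidateCreate(ctx, newObj)
}

func (v *RayJobValidator) ValidateDelete(ctx context.Context, obj runtime.Object) (admission.Warnings, error) {
	return nil, nil
}

// SetupWithManager registers the validator with the controller manager.
func (v *RayJobValidator) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewWebhookManagedBy(mgr).
		For(&rayv1.RayJob{}).
		WithValidator(v).
		Complete()
}
```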
I think KubeRay has taken a general stance to avoid webhooks, as many users mentioned that company policies forbid webhooks, but maybe this is worth revisiting? @kevin85421 would have more insights on this.
I think much of the logic in validateRayJobSpec could be expressed in a ValidatingAdmissionPolicy using CEL expressions. That could be worth considering too: https://kubernetes.io/docs/reference/access-authn-authz/validating-admission-policy/
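For illustration only, a policy along these lines could catch the missing restartPolicy at admission time. This is a sketch, not something KubeRay ships: the policy and binding names are made up, and on clusters older than 1.30 the v1beta1 API would be needed instead.

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: rayjob-submitter-restart-policy   # hypothetical name
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
      - apiGroups: ["ray.io"]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["rayjobs"]
  validations:
    - expression: >-
        !has(object.spec.submitterPodTemplate) ||
        (has(object.spec.submitterPodTemplate.spec) &&
         has(object.spec.submitterPodTemplate.spec.restartPolicy) &&
         object.spec.submitterPodTemplate.spec.restartPolicy in ['Never', 'OnFailure'])
      message: "submitterPodTemplate.spec.restartPolicy must be set to Never or OnFailure"
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: rayjob-submitter-restart-policy-binding   # hypothetical name
spec:
  policyName: rayjob-submitter-restart-policy
  validationActions: ["Deny"]
```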
@kevin85421 can you link the doc? Thanks!
Sorry, I forgot to link the doc in the above comment. We typically track the CRD observability work in https://docs.google.com/document/d/1NCX3fCTiE817AV4chvR46bkNHR7q0zy4QXQfDu-dqF4/edit, and @rueian is leading the effort.
I think KubeRay has taken a general stance to avoid webhooks, as many users mentioned that company policies forbid webhooks, but maybe this is worth revisiting?
At that time, we were mainly discussing CR conversion webhooks. We decided not to maintain a CR conversion webhook because (1) many companies are not allowed to install webhooks, and (2) maintaining a conversion webhook is expensive.
Although (1) is still true for a validation webhook, maintaining a validation webhook is much cheaper than maintaining a conversion webhook. I may not prioritize it, but PRs are welcome.
I think much of the logic in validateRayJobSpec could be expressed in a ValidatingAdmissionPolicy using CEL expressions. That could be worth considering too: https://kubernetes.io/docs/reference/access-authn-authz/validating-admission-policy/
I will read the doc and reply later. My current thought is that the ValidatingAdmissionPolicy API became 'stable' in K8s 1.30, which is quite new.
The original issue I posted here has, as far as I know, nothing to do with webhooks. To reproduce the issue you need to run kubectl apply -f config.yaml, where config.yaml creates a RayJob without "restartPolicy": "Never" set for the submitter job.
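For reference, a minimal config.yaml along these lines should trigger it; the names, image, and entrypoint below are just placeholders, not my actual config.

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: ray-job-sample              # placeholder name
spec:
  entrypoint: python -c "print('hello')"   # placeholder entrypoint
  # Note: no restartPolicy under submitterPodTemplate.spec, which is what
  # the K8s API server rejects when the operator creates the submitter Job.
  submitterPodTemplate:
    spec:
      containers:
        - name: ray-job-submitter
          image: rayproject/ray:2.9.0       # placeholder image
  rayClusterSpec:
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.9.0   # placeholder image
```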
We typically track the CRD observability work in https://docs.google.com/document/d/1NCX3fCTiE817AV4chvR46bkNHR7q0zy4QXQfDu-dqF4/edit, and @rueian is leading the effort.
@kevin85421 thanks! I see from the doc KubeRay - RayCluster status improvement workstream that a "Draft is out" by Tina.
I also see the doc has "Add new events to RayJob controller: ~RayJobSpecValidated~", which links to the KubeRay EventRecorder Guidelines. There I see a Success / Failure RayJobSpecValidated event. But this PR doesn't add more visibility into RayJob spec validation errors on the invalid RayJob object itself. Are there plans to do so? cc @rueian
The original issue I posted here has, as far as I know, nothing to do with webhooks. To reproduce the issue you need to run kubectl apply -f config.yaml, where config.yaml creates a RayJob without "restartPolicy": "Never" set for the submitter job.
submitterPodTemplate is a PodTemplateSpec, and restartPolicy is not a required field of PodTemplateSpec. Therefore, it will not be rejected by the CRD's OpenAPI validation. However, a K8s Job is invalid if restartPolicy is not set or is set to Always. Hence, it is checked by the K8s API server (code).
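In the meantime, the user-side fix is to set those fields explicitly on the submitter pod template, for example (container name and image below are placeholders):

```yaml
spec:
  submitterPodTemplate:
    spec:
      restartPolicy: Never              # required by K8s Job validation
      containers:
        - name: ray-job-submitter       # a container name is also required
          image: rayproject/ray:2.9.0   # placeholder image
```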
Is it enough if we create a K8s event for the RayJob CR for this error? @mickvangelderen
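Roughly, that would mean something like the following in the controller. This is only a sketch: the function name, event reason, and message are made up, not existing KubeRay code.

```go
package ray

import (
	"context"
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/record"
	"sigs.k8s.io/controller-runtime/pkg/client"

	rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
)

// createSubmitterJob sketches how the controller could attach the API server
// rejection to the RayJob CR as a Kubernetes event, so that
// `kubectl describe rayjob` shows the failure instead of only the operator logs.
func createSubmitterJob(ctx context.Context, c client.Client, recorder record.EventRecorder, rayJob *rayv1.RayJob, job *batchv1.Job) error {
	if err := c.Create(ctx, job); err != nil {
		// Record the validation failure on the RayJob object itself.
		recorder.Eventf(rayJob, corev1.EventTypeWarning, "FailedToCreateSubmitterJob",
			"Failed to create submitter K8s Job %s/%s: %v", job.Namespace, job.Name, err)
		return fmt.Errorf("failed to create submitter K8s Job: %w", err)
	}
	return nil
}
```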
I also see the doc has "Add new events to RayJob controller: RayJobSpecValidated" which links to KubeRay EventRecorder Guidelines. In here I see https://github.com/ray-project/kuberay/issues/1813. But this PR doesn't add more visibility on RayJob spec validation errors on the invalid RayJob object itself. Are there plans to do so?
@davidxia Before Ray Summit, we are focusing on the interactions between the KubeRay operator and the K8s API server, i.e. Create / Delete. We will focus on other events after the Ray Summit.
Thank you for the explanation. From your description I understand that when the KubeRay operator attempts to create a K8s Job using the provided parameters, the K8s API server rejects it, and the only record of this is currently in the KubeRay operator logs. I have no opinion about what a better feedback mechanism would look like.
I just want to mention, without wanting to discourage you, that solving this issue is not that important to me personally anymore. I know I may sometimes have to check the operator logs as part of my debugging. If there is a path forward where a feedback mechanism is introduced and some time is needed to apply it consistently to the various errors, that is totally understandable.
Search before asking
KubeRay Component
ray-operator
What happened + What you expected to happen
I created a RayJob with a submitterPodTemplate but no restartPolicy, and had to search the logs of the ray-operator to find the resulting error ("spec.template.spec.restartPolicy: Required value" from the K8s API server when the submitter Job is created).
I thought the RayJob spec was supposed to be validated on submission to the API? Is that validation not the same?
Reproduction script
Anything else
No response
Are you willing to submit a PR?