congpeiqing opened 7 months ago
The controller currently requeues the item when there are errors during worker pod creation. Requeueing regardless of the kind of error is problematic: if the pod spec itself is invalid, the job should simply fail instead of being requeued repeatedly.
https://github.com/kubeflow/mpi-operator/blob/master/pkg/controller/mpi_job_controller.go#L964
Ideally, we should have a webhook, but this was never prioritized.
Alternatively, we can add a CEL validator https://kubernetes.io/docs/reference/access-authn-authz/validating-admission-policy/#validation-expression
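A CEL-based check might look something like the fragment below. This is only an illustrative sketch: the policy name, the exact field paths under `spec.mpiReplicaSpecs`, and the API version are assumptions and would need to be checked against the actual MPIJob schema.

```yaml
# Hypothetical sketch of a ValidatingAdmissionPolicy rejecting MPIJobs
# with a non-positive Worker replica count. Field paths are illustrative.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: mpijob-worker-replicas
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
    - apiGroups: ["kubeflow.org"]
      apiVersions: ["v2beta1"]
      operations: ["CREATE", "UPDATE"]
      resources: ["mpijobs"]
  validations:
  - expression: >-
      !('Worker' in object.spec.mpiReplicaSpecs) ||
      object.spec.mpiReplicaSpecs['Worker'].replicas >= 1
    message: "Worker replicas must be at least 1"
```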
Happy to review a PR if you are interested in working on it.
Previously, I tried to introduce CEL validation to the training-operator:
https://github.com/kubeflow/training-operator/issues/1708
However, I gave up on introducing it, since it is hard to validate the podTemplate due to the cost budget of CEL validations.
https://github.com/kubeflow/training-operator/issues/1708#issuecomment-1661876525
Hence, we must introduce webhooks if we want to validate the podTemplates.
You mean that CEL was too slow or what exactly?
No, I meant CEL validation cannot work due to the following error:
Forbidden: contributed to estimated rule cost total exceeding cost limit for entire OpenAPIv3 schema, spec.validation.openAPIV3Schema: Forbidden: x-kubernetes-validations estimated rule cost total for entire OpenAPIv3 schema exceeds budget by factor of more than 100x (try simplifying the rule, or adding maxItems, maxProperties, and maxLength where arrays, maps, and strings are declared)
This was caused by the cost budget.
Oh, so too many validation rules :)
I guess these cost overruns happen because replicaSpecs is defined as a map:
we cannot set a limit on the number of replicas, and the search depth is unbounded :(
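To illustrate the point (this is a hypothetical CRD fragment, not the actual training-operator schema): when a map-typed field has no `maxProperties` bound, the CEL cost estimator assumes worst-case size, so any rule iterating over it is charged an effectively unbounded cost.

```yaml
# Illustrative CRD schema fragment. Without maxProperties on the
# map-typed replicaSpecs, the cost estimator assumes an unbounded
# map, so even a simple x-kubernetes-validations rule over it can
# blow the per-schema cost budget.
replicaSpecs:
  type: object
  # maxProperties: 4   # bounding the map would cap the estimated cost
  additionalProperties:
    type: object
    properties:
      replicas:
        type: integer
  x-kubernetes-validations:
  - rule: "self.all(k, self[k].replicas >= 0)"
    message: "replicas must be non-negative"
```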
Ah, we shot ourselves in the foot by using a map instead of explicit fields.
@terrytangyuan PR submitted: https://github.com/kubeflow/mpi-operator/pull/606. It works in our environment.
I created an MPIJob with an invalid pod template, and its status never gets set (I think the status should be Failed).
Now I cannot distinguish MPIJobs that are too new to have a status from MPIJobs with an invalid pod template.
My MPIJob is shown below:
kubectl get mpijob ai62da0dbe-6406-4252-85d6-51ef87eab10d -n cpod -oyaml
The output is:
When describing the MPIJob, the output is: