Open tenzen-y opened 1 year ago
/assign
Through my investigation, I found that we can not replace all validations with CEL validation due to exceeding cost budget.
For example, we can replace the below validation
with the below CEL validation:
// ReplicaSpec is a description of the replica
type ReplicaSpec struct {
...
// Template is the object that describes the pod that
// will be created for this replica. RestartPolicy in PodTemplateSpec
// will be overide by RestartPolicy in ReplicaSpec
// +kubebuilder:validation:XValidation:rule="has(self.spec.containers) && self.spec.containers.all(c, has(c.image) && size(c.image) > 0)",message=""
Template v1.PodTemplateSpec `json:"template,omitempty"`
...
Forbidden: contributed to estimated rule cost total exceeding cost limit for entire OpenAPIv3 schema, spec.validation.openAPIV3Schema: Forbidden: x-kubernetes-validations estimated rule cost total for entire OpenAPIv3 schema exceeds budget by factor of more than 100x (try simplifying the rule, or adding maxItems, maxProperties, and maxLength where arrays, maps, and strings are declared)]
So, I think that we need to introduce the webhook validation instead of replacing the current validation with CEL validation.
@kubeflow/wg-training-leads WDYT?
I remember, we had a discussion regarding webhooks. The arguments were deployment easiness(without web hooks) vs clean validation(with webhooks)
The arguments were deployment easiness
@johnugeorge Are they concerned about certs for the webhook?
Yes. That was the major raod block from what I remember
@johnugeorge I think we can remove webhook installation barriers once we introduce cert generation logic similar to katib. WDYT?
Actually, I have experience implementing internal cert generation logic in 2 components (kubeflow/katib, kubernetes-sigs/kueue).
Maybe, we generalize the katib cert generator, and then just import the cert-generator to the training-operator.
cc: @andreyvelich
@tenzen-y Shall we move this to the next release?
@tenzen-y Shall we move this to the next release?
Which does that mean training-operator v1.7 or v1.8?
We are cutting first 1.7 RC in few days. I am afraid, we cannot complete testing within time if we want to aim this feature in 1.7. What are you thoughts on moving to 1.8?
We are cutting first 1.7 RC in few days. I am afraid, we cannot complete testing within time if we want to aim this feature in 1.7. What are you thoughts on moving to 1.8?
It makes sense. We can work on this feature for v1.8. Actually, I don't have enough bandwidth for this feature.
Also, we can create another issue for the webhook since this issue aims CEL validation.
Thank you for creating this @tenzen-y. I agree that we should postpone the discussion around this. We need to identify pros and cons of using CEL vs Webhook Validation. As Johnu mentioned before, adding Webhook as a mandatory component for Training Operator will complicate end-user usage.
Anyway, I will create an issue to discuss webhook since CEL validation isn't enough due to exceeding the cost budge.
https://github.com/kubeflow/training-operator/issues/1708#issuecomment-1661876525
/remove-area 1.7.0
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/remove-lifecycle stale
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/remove-lifecycle stale
Moved to 1.9 release
Moved to 1.9 release
I agreed with this decision in the offline meeting with @johnugeorge.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/remove-lifecycle stale
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/lifecycle frozen
/kind feature
Since k8s v1.25, the CustomResourceValidationExpressions feature to validate CRDs using Common Expression Language (CEL) without webhook servers is enabled by default.
Since this feature helps to find CRD validation errors for users, we need to introduce that feature once we stop supporting K8s v1.24.
For example, if the container name of replicaSpec is invalid, for now, we only output that error to controller logs; once we introduce CEL validation, we can return that error to end-users.
https://github.com/kubeflow/training-operator/blob/69813fbc37f33857e59044d14caa7e2765851d98/pkg/apis/kubeflow.org/v1/tensorflow_validation.go#L69-L74