kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0
1.6k stars 697 forks source link

V1: Introduce the CustomResourceValidationExpressions feature (CEL validation) #1708

Open tenzen-y opened 1 year ago

tenzen-y commented 1 year ago

/kind feature

Since k8s v1.25, the CustomResourceValidationExpressions feature to validate CRDs using Common Expression Language (CEL) without webhook servers is enabled by default.

Since this feature helps to find CRD validation errors for users, we need to introduce that feature once we stop supporting K8s v1.24.

For example, if the container name of replicaSpec is invalid, for now, we only output that error to controller logs; once we introduce CEL validation, we can return that error to end-users.

https://github.com/kubeflow/training-operator/blob/69813fbc37f33857e59044d14caa7e2765851d98/pkg/apis/kubeflow.org/v1/tensorflow_validation.go#L69-L74

tenzen-y commented 1 year ago

/assign

tenzen-y commented 1 year ago

Through my investigation, I found that we can not replace all validations with CEL validation due to exceeding cost budget.

For example, we can replace the below validation

https://github.com/kubeflow/training-operator/blob/4dd0d0909210edc7e1a31a5987cb32dc834d682b/pkg/apis/kubeflow.org/v1/pytorch_validation.go#L69-L72

with the below CEL validation:

// ReplicaSpec is a description of the replica
type ReplicaSpec struct {
...
    // Template is the object that describes the pod that
    // will be created for this replica. RestartPolicy in PodTemplateSpec
    // will be overide by RestartPolicy in ReplicaSpec
    // +kubebuilder:validation:XValidation:rule="has(self.spec.containers) && self.spec.containers.all(c, has(c.image) && size(c.image) > 0)",message=""
    Template v1.PodTemplateSpec `json:"template,omitempty"`
...

Forbidden: contributed to estimated rule cost total exceeding cost limit for entire OpenAPIv3 schema, spec.validation.openAPIV3Schema: Forbidden: x-kubernetes-validations estimated rule cost total for entire OpenAPIv3 schema exceeds budget by factor of more than 100x (try simplifying the rule, or adding maxItems, maxProperties, and maxLength where arrays, maps, and strings are declared)]

So, I think that we need to introduce the webhook validation instead of replacing the current validation with CEL validation.

@kubeflow/wg-training-leads WDYT?

johnugeorge commented 1 year ago

I remember, we had a discussion regarding webhooks. The arguments were deployment easiness(without web hooks) vs clean validation(with webhooks)

tenzen-y commented 1 year ago

The arguments were deployment easiness

@johnugeorge Are they concerned about certs for the webhook?

johnugeorge commented 1 year ago

Yes. That was the major raod block from what I remember

tenzen-y commented 1 year ago

@johnugeorge I think we can remove webhook installation barriers once we introduce cert generation logic similar to katib. WDYT?

Actually, I have experience implementing internal cert generation logic in 2 components (kubeflow/katib, kubernetes-sigs/kueue).

tenzen-y commented 1 year ago

Maybe, we generalize the katib cert generator, and then just import the cert-generator to the training-operator.

https://github.com/kubeflow/katib/blob/eac65bf8e31fe4d3a2c4dbb20be30d600c086157/cmd/katib-controller/v1beta1/main.go#L137-L145

https://github.com/kubeflow/katib/tree/eac65bf8e31fe4d3a2c4dbb20be30d600c086157/pkg/certgenerator/v1beta1

cc: @andreyvelich

johnugeorge commented 1 year ago

@tenzen-y Shall we move this to the next release?

tenzen-y commented 1 year ago

@tenzen-y Shall we move this to the next release?

Which does that mean training-operator v1.7 or v1.8?

johnugeorge commented 1 year ago

We are cutting first 1.7 RC in few days. I am afraid, we cannot complete testing within time if we want to aim this feature in 1.7. What are you thoughts on moving to 1.8?

tenzen-y commented 1 year ago

We are cutting first 1.7 RC in few days. I am afraid, we cannot complete testing within time if we want to aim this feature in 1.7. What are you thoughts on moving to 1.8?

It makes sense. We can work on this feature for v1.8. Actually, I don't have enough bandwidth for this feature.

tenzen-y commented 1 year ago

Also, we can create another issue for the webhook since this issue aims CEL validation.

andreyvelich commented 1 year ago

Thank you for creating this @tenzen-y. I agree that we should postpone the discussion around this. We need to identify pros and cons of using CEL vs Webhook Validation. As Johnu mentioned before, adding Webhook as a mandatory component for Training Operator will complicate end-user usage.

tenzen-y commented 1 year ago

Anyway, I will create an issue to discuss webhook since CEL validation isn't enough due to exceeding the cost budge.

https://github.com/kubeflow/training-operator/issues/1708#issuecomment-1661876525

tenzen-y commented 1 year ago

/remove-area 1.7.0

github-actions[bot] commented 11 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

tenzen-y commented 10 months ago

/remove-lifecycle stale

github-actions[bot] commented 7 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

tenzen-y commented 7 months ago

/remove-lifecycle stale

johnugeorge commented 6 months ago

Moved to 1.9 release

tenzen-y commented 6 months ago

Moved to 1.9 release

I agreed with this decision in the offline meeting with @johnugeorge.

github-actions[bot] commented 3 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

tenzen-y commented 3 months ago

/remove-lifecycle stale

github-actions[bot] commented 3 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

tenzen-y commented 3 weeks ago

/lifecycle frozen