kubernetes-sigs / cluster-api

Home for Cluster API, a subproject of sig-cluster-lifecycle
https://cluster-api.sigs.k8s.io
Apache License 2.0
3.57k stars 1.31k forks

Webhook validation for Topology NodeDeletionTimeout and NodeDrainTimeout #7104

Open killianmuldoon opened 2 years ago

killianmuldoon commented 2 years ago

NodeDeletionTimeout and NodeDrainTimeout were added to Topology managed clusters in #7098 and #6278. Currently the values of these fields are not validated on creation, and validation is instead done when the templates are turned into objects.

This lack of up-front validation led to the unexpected failure in #7047. We could do some basic validation in the webhook on object creation to ensure these values are correctly formatted and within a given range before the object is created.

/kind feature

killianmuldoon commented 2 years ago

/area topology

sbueringer commented 2 years ago

What would be the valid range for those fields?

killianmuldoon commented 2 years ago

We don't have these defined right now in the machine webhook (and I don't know if there's any need to), but defining a min/max is an optional part of this.

I think the main part is to ensure that we do enough validation to catch errors like #7047 on object creation, instead of during the reconcile.
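
As a rough illustration of the optional min/max part, here is a minimal Go sketch of a range check that could run at admission time. The function name `validateTimeoutRange` and the bounds used in `main` (0 to 24h) are hypothetical; no range is defined anywhere in this thread.

```go
package main

import (
	"fmt"
	"time"
)

// validateTimeoutRange checks an already-parsed timeout against inclusive
// bounds. The bounds are assumptions for illustration only; this issue does
// not define a valid range.
func validateTimeoutRange(field string, timeout, lower, upper time.Duration) error {
	if timeout < lower || timeout > upper {
		return fmt.Errorf("%s: %s is outside the allowed range [%s, %s]", field, timeout, lower, upper)
	}
	return nil
}

func main() {
	// A negative drain timeout is rejected up front instead of during reconcile.
	err := validateTimeoutRange("nodeDrainTimeout", -5*time.Second, 0, 24*time.Hour)
	fmt.Println(err)
}
```

A caller would skip this check when the field is unset, since both timeouts are optional.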

sbueringer commented 2 years ago

Yup. The problem is that metav1.Duration just has type "string" as OpenAPI schema, right?

If it also used the OpenAPI format "duration", OpenAPI validation would probably handle it for us? (via: // +kubebuilder:validation:Format)

https://github.com/kubernetes/apiextensions-apiserver/blob/master/pkg/apiserver/validation/formats.go#L49

But given the recent trend, we would implement it in the webhook instead of using the marker. (The format godoc sounds like we should use time.ParseDuration.)
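
For reference, a format check built on time.ParseDuration could look roughly like the sketch below. It treats the value as a raw string, which is a simplification, and the helper name `validateDurationFormat` is hypothetical rather than anything in the Cluster API webhooks.

```go
package main

import (
	"fmt"
	"time"
)

// validateDurationFormat returns an error if raw is not accepted by
// time.ParseDuration, for example because the unit suffix is missing.
func validateDurationFormat(field, raw string) error {
	if _, err := time.ParseDuration(raw); err != nil {
		return fmt.Errorf("%s: invalid duration %q: %w", field, raw, err)
	}
	return nil
}

func main() {
	// "10" has no unit and "ten seconds" is not a duration at all;
	// both are rejected, while "10s" and "1h30m" parse.
	for _, raw := range []string{"10s", "1h30m", "10", "ten seconds"} {
		if err := validateDurationFormat("nodeDeletionTimeout", raw); err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("accepted:", raw)
	}
}
```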

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

sbueringer commented 1 year ago

/remove-lifecycle stale

fabriziopandini commented 1 year ago

/triage accepted
/remove-kind feature
/kind bug

fabriziopandini commented 6 months ago

/priority important-soon

k8s-triage-robot commented 3 months ago

This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged. Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

sbueringer commented 2 months ago

/triage accepted

Dhairya-Arora01 commented 1 month ago

/assign

JoelSpeed commented 1 month ago

What happens to the existing users who have persisted bad values when we update the validation here? Has it been considered to use ratcheting validation at all?

sbueringer commented 4 weeks ago

I think it was not considered

JoelSpeed commented 4 weeks ago

Ratcheting validation exists directly within the API server from Kube 1.30, but since we need to support older versions, ratcheting can be implemented either in a webhook or with a couple of well-crafted CEL transition rules (though these aren't perfect, as they don't cover the create case).

Without ratcheting, this has the potential to break users on upgrade: they wouldn't be able to write anything to the object until the values of these broken fields were fixed.
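
To make the webhook option concrete, here is a minimal sketch of a ratcheting check in Go. It assumes the update handler can see both the old and the new value, models them as raw strings with "" meaning unset, and uses a hypothetical helper name `validateTimeoutUpdate`. It is not the API server's ratcheting implementation.

```go
package main

import (
	"fmt"
	"time"
)

// validateTimeoutUpdate applies ratcheting: a value that is already
// persisted and left unchanged is tolerated even if it is invalid, while
// any new or changed value must parse. An empty string means "unset".
// On create, pass "" as oldRaw so the new value is always validated.
func validateTimeoutUpdate(field, oldRaw, newRaw string) error {
	if newRaw == "" {
		return nil // optional field, nothing to validate
	}
	if oldRaw == newRaw {
		return nil // unchanged: do not block unrelated writes to the object
	}
	if _, err := time.ParseDuration(newRaw); err != nil {
		return fmt.Errorf("%s: invalid duration %q: %w", field, newRaw, err)
	}
	return nil
}

func main() {
	// A persisted bad value is tolerated while it stays untouched.
	fmt.Println(validateTimeoutUpdate("nodeDrainTimeout", "10", "10"))
	// Changing it to another bad value is rejected.
	fmt.Println(validateTimeoutUpdate("nodeDrainTimeout", "10", "20"))
	// Fixing it is always allowed.
	fmt.Println(validateTimeoutUpdate("nodeDrainTimeout", "10", "10s"))
}
```

With this shape, existing objects stay writable after the upgrade, and the bad value can only move toward a valid one.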

sbueringer commented 4 weeks ago

> Ratcheting validation exists directly within the API server from Kube 1.30

If it's enabled by default, it could be okay to just wait until 1.30 is the min supported version (Cluster API v1.10; basically, we could then merge in December).