kubernetes-sigs / cluster-api

Home for Cluster API, a subproject of sig-cluster-lifecycle
https://cluster-api.sigs.k8s.io
Apache License 2.0
3.51k stars 1.3k forks source link

Improve calculation of NodeDrainTimeout & NodeVolumeDetachTimeout exceeded #11126

Open sbueringer opened 2 weeks ago

sbueringer commented 2 weeks ago

Today, we check if the NodeDrainTimeout & NodeVolumeDetachTimeout are exceeded by comparing time.Now with the LastTransitionTime of the corresponding conditions.

This is fragile for multiple reasons:

It would be already a good improvement if the following is implemented:

existing conditions utils changes lastTranistionTime whenever something changes in the condition, while future/upstream aligned conditions utils are changing lastTranistionTime only when status change https://github.com/kubernetes-sigs/cluster-api/issues/11033#issuecomment-2317535694

But I think instead of relying on the LastTransitionTime we should have additional status fields tracking when we started to drain & when we started to wait for volume detach and then calculate if the timeouts are exceeded based on that.

(Note: this is not relevant for NodeDeletionTimeout because for this one we use the deletionTimestamp field)

sbueringer commented 2 weeks ago

/triage accepted

sbueringer commented 2 weeks ago

/cc @fabriziopandini @chrischdi @enxebre

fabriziopandini commented 2 weeks ago

But I think instead of relying on the LastTransitionTime we should have additional status fields tracking when we started to drain & when we started to wait for volume detach and then calculate if the timeouts are exceeded based on that.

+1 I think that we should add a status.Termination struct with NodeDrainStartTime & NodeVolumeDetachStartTime fields. This will be somehow "consistent" with the Initialization struct proposed by https://github.com/kubernetes-sigs/cluster-api/pull/10897.

chrischdi commented 1 week ago

/assign

sbueringer commented 1 week ago

One small note, wondering if we should call it status.deletion instead of termination. I think in CAPI we usually always talk about "deletion"

+ NodeVolumeDetachStartTime should be WaitForNodeVolumeDetachStartTime

fabriziopandini commented 1 week ago

No strong opinions about the name. Termination makes a good pair to Initialization, it also reminds of stuff like gracetermination period etc etc

sbueringer commented 1 week ago

Yup, we just use deletion in a lot of places and we don't have things like terminationGracePeriod for Machines (but we have a deletionTimestamp)

(we're also always talking about Machine deletion, never really used/heard Machine termination)