Improve calculation of NodeDrainTimeout & NodeVolumeDetachTimeout exceeded

sbueringer commented 2 weeks ago

Today, we check if the NodeDrainTimeout & NodeVolumeDetachTimeout are exceeded by comparing time.Now with the LastTransitionTime of the corresponding conditions.

This is fragile for multiple reasons:

LastTransitionTime is updated when the condition flickers between true & false
LastTransitionTime is updated every time anything changes in the condition (e.g. message), xref:
- https://github.com/kubernetes-sigs/cluster-api/blob/4f1637e59c48528f7a5c1a3116054dbc96216f8c/util/conditions/setter.go#L54
- https://github.com/kubernetes-sigs/cluster-api/blob/4f1637e59c48528f7a5c1a3116054dbc96216f8c/util/conditions/setter.go#L199

It would be already a good improvement if the following is implemented:

existing conditions utils changes lastTranistionTime whenever something changes in the condition, while future/upstream aligned conditions utils are changing lastTranistionTime only when status change https://github.com/kubernetes-sigs/cluster-api/issues/11033#issuecomment-2317535694

But I think instead of relying on the LastTransitionTime we should have additional status fields tracking when we started to drain & when we started to wait for volume detach and then calculate if the timeouts are exceeded based on that.

(Note: this is not relevant for NodeDeletionTimeout because for this one we use the deletionTimestamp field)

sbueringer commented 2 weeks ago

/triage accepted

sbueringer commented 2 weeks ago

/cc @fabriziopandini @chrischdi @enxebre

fabriziopandini commented 2 weeks ago

But I think instead of relying on the LastTransitionTime we should have additional status fields tracking when we started to drain & when we started to wait for volume detach and then calculate if the timeouts are exceeded based on that.

+1 I think that we should add a status.Termination struct with NodeDrainStartTime & NodeVolumeDetachStartTime fields. This will be somehow "consistent" with the Initialization struct proposed by https://github.com/kubernetes-sigs/cluster-api/pull/10897.

chrischdi commented 1 week ago

/assign

sbueringer commented 1 week ago

One small note, wondering if we should call it status.deletion instead of termination. I think in CAPI we usually always talk about "deletion"

+ NodeVolumeDetachStartTime should be WaitForNodeVolumeDetachStartTime

fabriziopandini commented 1 week ago

No strong opinions about the name. Termination makes a good pair to Initialization, it also reminds of stuff like gracetermination period etc etc

sbueringer commented 1 week ago

Yup, we just use deletion in a lot of places and we don't have things like terminationGracePeriod for Machines (but we have a deletionTimestamp)

(we're also always talking about Machine deletion, never really used/heard Machine termination)

kubernetes-sigs / cluster-api

Improve calculation of NodeDrainTimeout & NodeVolumeDetachTimeout exceeded #11126