As discussed in #823, pending machineconfig updates can be dangerous. A minor change (such as removing a node taint or a user changing the pod disruption budget for their application) could unblock a pending update, causing an unexpected reboot of one or more cluster nodes.
We need an alert that fires when a machineconfigpool has been in the "Updating" state for longer than a reasonable amount of time.
The production cluster is in this state right now:
$ kubectl get mcp worker -o custom-columns='NAME:.metadata.name,UPDATING:.status.conditions[?(@.type=="Updating")].status,SINCE:.status.conditions[?(@.type=="Updating")].lastTransitionTime'
NAME     UPDATING   SINCE
worker   True       2024-11-13T15:25:34Z
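Until a proper alert exists, the check itself could be sketched as a small shell function that takes the two values printed above. This is only an illustration: the function name `mcp_update_stuck`, the one-hour default threshold, and the reliance on GNU `date` are all assumptions, not something we have agreed on.

```shell
#!/usr/bin/env bash
# Sketch only: decide whether a pool has been "Updating" for too long.
# Assumptions (not from this issue): GNU date is available, the default
# threshold is 3600 seconds, and the name mcp_update_stuck is hypothetical.

# Usage: mcp_update_stuck STATUS SINCE [THRESHOLD_SECONDS] [NOW_EPOCH]
# Returns 0 if STATUS is "True" and SINCE is more than THRESHOLD_SECONDS ago,
# 1 if the pool is not considered stuck, 2 if SINCE cannot be parsed.
mcp_update_stuck() {
  local status="$1" since="$2" threshold="${3:-3600}" now="${4:-$(date -u +%s)}"

  # A pool that is not updating can never be "stuck updating".
  [ "$status" = "True" ] || return 1

  # Convert the RFC 3339 lastTransitionTime into epoch seconds.
  local since_epoch
  since_epoch="$(date -u -d "$since" +%s)" || return 2

  # Stuck when the elapsed time exceeds the threshold.
  [ $(( now - since_epoch )) -gt "$threshold" ]
}

# Example wiring against a live cluster (left commented out in this sketch):
# read -r status since < <(kubectl get mcp worker \
#   -o jsonpath='{.status.conditions[?(@.type=="Updating")].status}{" "}{.status.conditions[?(@.type=="Updating")].lastTransitionTime}')
# mcp_update_stuck "$status" "$since" && echo "worker pool has been updating for too long"
```

With the values shown above (`True`, `2024-11-13T15:25:34Z`), the function would report the worker pool as stuck once more than an hour has passed since that timestamp. What counts as "reasonable" here is still open; the threshold is a parameter for that reason.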