nerc-project / operations

Issues related to the operation of the NERC OpenShift environment

Create alert for blocked machineconfig updates #824

Open larsks opened 1 day ago

larsks commented 1 day ago

As discussed in #823, pending machineconfig updates can be dangerous. A minor change (such as removing a node taint or a user changing the pod disruption budget for their application) could unblock a pending update, causing an unexpected reboot of one or more cluster nodes.

We need an alert that fires if a machineconfigpool is "updating" for more than a reasonable amount of time.

The production cluster is in this state right now:

$ kubectl get mcp worker -o custom-columns='NAME:.metadata.name,UPDATING:.status.conditions[?(@.type=="Updating")].status,SINCE:.status.conditions[?(@.type=="Updating")].lastTransitionTime'
NAME     UPDATING   SINCE
worker   True       2024-11-13T15:25:34Z
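
As a rough sketch of the check such an alert would encode, the Python below reads the same `Updating` condition shown in the command above and reports whether the pool has been updating for longer than a threshold. The function names `updating_since` and `is_blocked` and the two-hour default are illustrative assumptions, not something agreed in this issue; the real alert would more likely be a Prometheus rule.

```python
from datetime import datetime, timedelta, timezone


def updating_since(mcp: dict):
    """Return when the pool entered Updating=True, or None if it is not updating.

    `mcp` is a MachineConfigPool object as a dict, e.g. parsed from
    `kubectl get mcp worker -o json`.
    """
    for cond in mcp.get("status", {}).get("conditions", []):
        if cond.get("type") == "Updating" and cond.get("status") == "True":
            # Timestamps look like 2024-11-13T15:25:34Z (UTC).
            parsed = datetime.strptime(
                cond["lastTransitionTime"], "%Y-%m-%dT%H:%M:%SZ"
            )
            return parsed.replace(tzinfo=timezone.utc)
    return None


def is_blocked(mcp: dict, now: datetime,
               threshold: timedelta = timedelta(hours=2)) -> bool:
    """True if the pool has been Updating for longer than `threshold`.

    The two-hour default is a placeholder assumption; the issue only asks
    for "a reasonable amount of time".
    """
    since = updating_since(mcp)
    return since is not None and now - since > threshold
```

To try it against a live cluster, one could pass in `json.loads(...)` of the `kubectl get mcp worker -o json` output together with `datetime.now(timezone.utc)`.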

schwesig commented 1 day ago

/CC @schwesig