HighOverallControlPlaneCPU

Alertmanager URL: https://console-openshift-console.apps.nostromo.erdgeschoss.b4mad.emea.operate-first.cloud/monitoring

firing https://console-openshift-console.apps.nostromo.erdgeschoss.b4mad.emea.operate-first.cloud/monitoring/graph?g0.expr=sum%28100+-+%28avg+by+%28instance%29+%28rate%28node_cpu_seconds_total%7Bmode%3D%22idle%22%7D%5B1m%5D%29%29+%2A+100%29+and+on+%28instance%29+label_replace%28kube_node_role%7Brole%3D%22master%22%7D%2C+%22instance%22%2C+%22%241%22%2C+%22node%22%2C+%22%28.%2B%29%22%29%29+%2F+count%28kube_node_role%7Brole%3D%22master%22%7D%29+%3E+60&g0.tab=1
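
The query behind the firing link is URL-encoded, which makes it hard to read in place. A minimal sketch of decoding it with Python's standard library is shown here; the names `GRAPH_URL` and `extract_expr` are hypothetical helpers introduced only for illustration, and the URL is the one from the line above.

```python
from urllib.parse import urlsplit, parse_qs

# Graph link copied from the alert above, split across literals for readability.
GRAPH_URL = (
    "https://console-openshift-console.apps.nostromo.erdgeschoss.b4mad.emea.operate-first.cloud"
    "/monitoring/graph?g0.expr="
    "sum%28100+-+%28avg+by+%28instance%29+"
    "%28rate%28node_cpu_seconds_total%7Bmode%3D%22idle%22%7D%5B1m%5D%29%29+%2A+100%29+"
    "and+on+%28instance%29+"
    "label_replace%28kube_node_role%7Brole%3D%22master%22%7D%2C+%22instance%22%2C+"
    "%22%241%22%2C+%22node%22%2C+%22%28.%2B%29%22%29%29+"
    "%2F+count%28kube_node_role%7Brole%3D%22master%22%7D%29+%3E+60"
    "&g0.tab=1"
)


def extract_expr(url: str) -> str:
    """Return the decoded PromQL expression from the g0.expr query parameter."""
    # parse_qs percent-decodes values and turns '+' into spaces.
    query = parse_qs(urlsplit(url).query)
    return query["g0.expr"][0]


if __name__ == "__main__":
    print(extract_expr(GRAPH_URL))
```

Decoded, the expression reads: `sum(100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100) and on (instance) label_replace(kube_node_role{role="master"}, "instance", "$1", "node", "(.+)")) / count(kube_node_role{role="master"}) > 60`. In other words, it averages the non-idle CPU percentage over the nodes with the `master` role and fires when that average exceeds 60.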

Labels:
- alertname = HighOverallControlPlaneCPU
- environment = nostromo
- namespace = openshift-kube-apiserver
- openshift_io_alert_source = platform
- org = b4mad
- prometheus = openshift-monitoring/k8s
- region = emea
- severity = warning
Annotations:
- description = On a multi-node cluster with three control plane nodes, the overall CPU utilization may only be about 2/3 of all available capacity. This is because if a single control plane node fails, the remaining two must handle the load of the cluster in order to be HA. If the cluster is using more than 2/3 of all capacity and one control plane node fails, the remaining two are likely to fail when they take the load. To fix this, increase the CPU and memory on your control plane nodes. On a single node OpenShift (SNO) cluster, this alert will also fire if 2/3 of the CPU cores of the node are in use by any workload. This level of CPU utilization on an SNO cluster is probably not a problem under most circumstances, but high levels of utilization may result in degraded performance. To manage this alert, or to silence it in case of false positives, see the following link: https://docs.openshift.com/container-platform/latest/monitoring/managing-alerts.html
- runbook_url = https://github.com/openshift/runbooks/blob/master/alerts/cluster-kube-apiserver-operator/ExtremelyHighIndividualControlPlaneCPU.md
- summary = CPU utilization across all control plane nodes is more than 60% of the total available CPU. Control plane node outage may cause a cascading failure; increase available CPU.
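
The 2/3 reasoning in the description can be checked with a little arithmetic. The sketch below assumes, purely for illustration, that total CPU demand stays constant after a node failure and is spread evenly over the surviving control plane nodes; the function name `per_node_load_after_failure` is hypothetical and not part of any OpenShift tooling.

```python
def per_node_load_after_failure(avg_utilization_pct: float,
                                nodes: int = 3,
                                failed: int = 1) -> float:
    """Estimate per-node CPU utilization (%) on the surviving nodes.

    Assumption: total demand is unchanged and redistributed evenly.
    """
    survivors = nodes - failed
    if survivors <= 0:
        raise ValueError("at least one node must survive")
    # Total demand is avg * nodes; divide it over the survivors.
    return avg_utilization_pct * nodes / survivors


if __name__ == "__main__":
    for pct in (50, 60, 200 / 3):
        after = per_node_load_after_failure(pct)
        print(f"{pct:.1f}% average before failure, about {after:.1f}% per survivor")
```

Under these assumptions, the alert threshold of 60% average utilization corresponds to roughly 90% on each of the two remaining nodes, and 2/3 average utilization corresponds to full saturation, which is why the description treats 2/3 as the practical ceiling for a three-node control plane.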

TODO: add graph url from annotations.

operate-first / alerts

HighOverallControlPlaneCPU #29685