kubernetes-monitoring / kubernetes-mixin

A set of Grafana dashboards and Prometheus alerts for Kubernetes.
Apache License 2.0

RFE: CPU and Memory Overcommit for only worker labeled nodes #291

Open mitchellmaler opened 5 years ago

mitchellmaler commented 5 years ago

Currently the over-commit alert takes all nodes in the cluster into account. We ran into an issue where the cluster did not alert for over-commit even though requests on the schedulable "worker" nodes were maxed out. We were able to fix the issue by scaling. The issue stems from the alert rule taking all nodes into account instead of only the schedulable nodes, such as the "workers" or non-tainted nodes. Nodes such as the masters, and tainted nodes used for special cases, do not usually schedule pods and shouldn't be part of the global over-commit rule. I propose keeping the existing rules for the whole cluster and adding rules that take only the worker nodes into account, with the option to also filter out tainted worker nodes that are specialized and rarely schedule pods either.
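
To illustrate the gap, here is a minimal Python sketch of the arithmetic, not the mixin's actual rule. It assumes an over-commit check of the form "total requests divided by total allocatable exceeds (n - 1) / n", meaning the requests would no longer fit if one node were lost. The `Node` type, the role names, and the numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    role: str               # hypothetical role label, e.g. "master" or "worker"
    tainted: bool           # True if the node carries a scheduling taint
    allocatable_cpu: float  # allocatable CPU cores
    requested_cpu: float    # sum of pod CPU requests placed on the node


def is_overcommitted(nodes, role=None, skip_tainted=False):
    """Assumed check: requests / allocatable > (n - 1) / n over the selected pool."""
    pool = [
        n for n in nodes
        if (role is None or n.role == role)
        and not (skip_tainted and n.tainted)
    ]
    allocatable = sum(n.allocatable_cpu for n in pool)
    if not pool or allocatable == 0:
        return False
    requested = sum(n.requested_cpu for n in pool)
    return requested / allocatable > (len(pool) - 1) / len(pool)


# Three lightly used, tainted masters hide three nearly full workers.
nodes = [
    Node("master-1", "master", True, 4.0, 0.5),
    Node("master-2", "master", True, 4.0, 0.5),
    Node("master-3", "master", True, 4.0, 0.5),
    Node("worker-1", "worker", False, 4.0, 3.8),
    Node("worker-2", "worker", False, 4.0, 3.8),
    Node("worker-3", "worker", False, 4.0, 3.8),
]

print(is_overcommitted(nodes))                 # False: masters dilute the ratio
print(is_overcommitted(nodes, role="worker"))  # True: the worker pool is over-committed
```

In this example the whole-cluster check stays quiet because the idle masters add allocatable capacity that pods cannot normally use, while the worker-only check fires.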

Maybe even add a warning rule that alerts when one or more nodes are maxing out their requests, instead of having only the whole-cluster rule.
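
A per-node warning could be sketched roughly as follows. This is only an illustration in Python: the function name, the input dictionaries, and the 0.9 threshold are assumptions, not anything that exists in the mixin.

```python
def saturated_nodes(requested_cpu, allocatable_cpu, threshold=0.9):
    """Return the sorted names of nodes whose CPU requests reach the threshold.

    requested_cpu / allocatable_cpu: dicts mapping node name -> CPU cores.
    threshold: hypothetical fraction of allocatable CPU that triggers a warning.
    """
    return sorted(
        name
        for name, allocatable in allocatable_cpu.items()
        if allocatable > 0
        and requested_cpu.get(name, 0.0) / allocatable >= threshold
    )


allocatable_cpu = {"master-1": 4.0, "worker-1": 4.0, "worker-2": 4.0}
requested_cpu = {"master-1": 0.5, "worker-1": 3.8, "worker-2": 2.0}

print(saturated_nodes(requested_cpu, allocatable_cpu))  # ['worker-1']
```

A rule of this shape would flag individual nodes before the cluster-wide ratio moves at all.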

brancz commented 5 years ago

I think this is reasonable. If possible, it would be neat if we could do it for all node roles used in a cluster. If not, then I'd say we just make the alerts configurable.

mitchellmaler commented 5 years ago

Yeah, I agree. That would be really useful.