kubernetes-monitoring / kubernetes-mixin

A set of Grafana dashboards and Prometheus alerts for Kubernetes.
Apache License 2.0

Fix KubeClientCertificateExpiration alerts #941

Open 7840vz opened 2 months ago

7840vz commented 2 months ago

1) Fix the label matching: on(job) becomes on(job, cluster, instance). Otherwise, a single instance with a certificate expiration problem is enough to set all apiservers to 'firing' (a false positive!).

2) Also, change the aggregation from by (le) to without(service,endpoint...), which drops only the useless labels and keeps external labels (such as environment) intact. With by (le), those external labels get dropped.

3) Change the order of the metrics in the expression: apiserver_client_certificate_expiration_seconds_bucket now comes first, so Grafana->Explore queries show the actual expiration date as the result instead of the apiserver_client_certificate_expiration_seconds_count value (which is quite useless). This makes it easier to troubleshoot.
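
The false positive described in point 1 can be illustrated with a small Python simulation of how PromQL's `and on(...)` matching behaves. This is only a sketch, not the alert rule itself: the helper `and_on`, the three instance addresses, and the `cluster="prod"` label are hypothetical values made up for the illustration.

```python
from typing import Dict, List

Sample = Dict[str, str]  # a time series reduced to its label set


def and_on(left: List[Sample], right: List[Sample], on: List[str]) -> List[Sample]:
    """Mimic PromQL `left and on(<labels>) right`.

    A left-hand sample is kept when some right-hand sample has the same
    values for every label listed in `on`; all other labels are ignored.
    """
    right_keys = {tuple(s.get(label, "") for label in on) for s in right}
    return [s for s in left if tuple(s.get(label, "") for label in on) in right_keys]


# Hypothetical setup: three apiserver instances in one cluster.
instances = ["10.0.0.1:6443", "10.0.0.2:6443", "10.0.0.3:6443"]

# Left side: every instance has seen client certificates (count > 0).
count_gt_zero = [
    {"job": "apiserver", "cluster": "prod", "instance": i} for i in instances
]

# Right side: only one instance sees certificates that expire soon.
expiring_soon = [
    {"job": "apiserver", "cluster": "prod", "instance": "10.0.0.2:6443"}
]

# Matching on job alone: the single expiring instance matches all three.
firing_on_job = and_on(count_gt_zero, expiring_soon, ["job"])

# Matching on job, cluster, and instance: only the affected instance fires.
firing_on_instance = and_on(
    count_gt_zero, expiring_soon, ["job", "cluster", "instance"]
)

print(len(firing_on_job))
print([s["instance"] for s in firing_on_instance])
```

With `on(job)`, all three hypothetical instances end up in the result even though only one has a problem; with the wider label list, only the affected instance remains.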

github-actions[bot] commented 3 days ago

This PR has been automatically marked as stale because it has not had any activity in the past 30 days. The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed in 7 days if there is no new activity. Thank you for your contributions!

skl commented 2 days ago

Commenting to keep this open a little longer; it looks like a genuine issue.