Fix KubeClientCertificateExpiration alerts

kubernetes-monitoring / kubernetes-mixin

A set of Grafana dashboards and Prometheus alerts for Kubernetes.

Apache License 2.0

2.12k stars 597 forks source link

Fix KubeClientCertificateExpiration alerts #941

Closed 7840vz closed 2 weeks ago

7840vz commented 5 months ago

1) Fix aggregation for on(job) to become (job, cluster, instance). Otherwise, It would be enough to have just single instance with certificate expiration problem, and it would set all apiservers to 'firing' (false positive!).

2) Also, change aggregation by (le) to without(service,endpoint...), dropping only useless labels, but keeping external labels (like environment etc) intact. Otherwise they get dropped.

3) Change order of metrics in expression: apiserver_client_certificate_expiration_seconds_bucket metric comes first so actual expiration date is shown as result in Grafana->Explore queries, not apiserver_client_certificate_expiration_seconds_count value (which is quite useless). This make it easier to troubleshoot.

github-actions[bot] commented 2 months ago

This PR has been automatically marked as stale because it has not had any activity in the past 30 days. The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed in 7 days if there is no new activity. Thank you for your contributions!

skl commented 2 months ago

commenting to keep this open a little longer, looks like a genuine issue

skl commented 4 weeks ago

@7840vz would you have a moment to address the conflicts?

7840vz commented 3 weeks ago

resolved conflicts

7840vz commented 3 weeks ago

could you check this one as well pls? https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/942

lorenzofelletti commented 3 weeks ago

It would be nice to have also the cluster indicated in the alert description (here, and here):

description: 'A client certificate used to authenticate to kubernetes apiserver is expiring in less than %s on cluster {{ $labels.%s }}.' % [(utils.humanizeSeconds($._config.certExpirationWarningSeconds)), $._config.clusterLabel],

skl commented 2 weeks ago

@lorenzofelletti Having the cluster labels in the description is definitely desirable but not everyone has the cluster label, so might need some logic in the descriptions to check if they have multiCluster enabled or not (I think some alerts do this iirc). I think that can be addressed as a separate issue as there's probably a whole bunch of alerts that would benefit from that.