prometheus-community / helm-charts

Prometheus community Helm charts
Apache License 2.0

[kube-prometheus-stack] Alert "KubeClientCertificateExpiration" expression output showing wrong values #3441

Open jaisegrg opened 1 year ago

jaisegrg commented 1 year ago

Describe the bug

The kube-prometheus-stack Helm chart is installed in an AKS cluster, but the "KubeClientCertificateExpiration" alert shows wrong values for the expression output. I validated the "kube-apiserver" certificate expiration by converting the output value, which is in seconds, to days, but it does not match the alerts.

Version: kube-prometheus-stack-45.8.1 v0.63.0

  - alert: KubeClientCertificateExpiration
    annotations:
      description: A client certificate used to authenticate to kubernetes apiserver
        is expiring in less than 7.0 days.
      runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeclientcertificateexpiration
      summary: Client certificate is about to expire.
    expr: apiserver_client_certificate_expiration_seconds_count{job="apiserver"}
      > 0 and on(job) histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m])))
      < 604800
    for: 5m
    labels:
      severity: warning

  - alert: KubeClientCertificateExpiration
    annotations:
      description: A client certificate used to authenticate to kubernetes apiserver
        is expiring in less than 24.0 hours.
      runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeclientcertificateexpiration
      summary: Client certificate is about to expire.
    expr: apiserver_client_certificate_expiration_seconds_count{job="apiserver"}
      > 0 and on(job) histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m])))
      < 86400
    for: 5m
    labels:
      severity: critical


What's your helm version?

version.BuildInfo{Version:"v3.8.0", GitCommit:"d14138609b01886f544b2025f5000351c9eb092e", GitTreeState:"clean", GoVersion:"go1.17.5"}

What's your kubectl version?

Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.4", GitCommit:"e6c093d87ea4cbb530a7b2ae91e54c0842d8308a", GitTreeState:"clean", BuildDate:"2022-02-16T12:38:05Z", GoVersion:"go1.17.7", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.5", GitCommit:"fd6aae27a28fca7e8b996d7201b0da6fbf6f732a", GitTreeState:"clean", BuildDate:"2023-04-08T13:27:20Z", GoVersion:"go1.19.4", Compiler:"gc", Platform:"linux/amd64"}

Which chart?

prometheus-community/kube-prometheus-stack

What's the chart version?

kube-prometheus-stack-45.8.1 v0.63.0

What happened?

No response

What you expected to happen?

No response

How to reproduce it?

No response

Enter the changed values of values.yaml?

No response

Enter the command that you execute and failing/misfunctioning.

helm upgrade --install prometheus-central \
  --namespace monitoring \
  prometheus-community/kube-prometheus-stack

Anything else we need to know?

No response

ykfq commented 1 year ago

I faced the same issue when I restarted Prometheus. After I upgraded Prometheus to the latest version, 2.24.0, and restarted it again, the problem disappeared.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

koooge commented 3 months ago

Hi there, I encountered the same issue, and I don't think the expr query works as expected. I think the `apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and on(job)` part is not correct. As a result, the value of the whole query is not decreasing but monotonically increasing. I know it should be fixed in https://github.com/kubernetes-monitoring/kubernetes-mixin

koooge commented 3 months ago

As a workaround, this worked for me:

histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m])))
and on (job) apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0

refs https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/941
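
For anyone who wants to try the reordered expression through the chart, here is a minimal values.yaml sketch. In PromQL, `and` keeps the sample values of its left-hand operand, so with the `_count` series on the left the alert reports the ever-growing observation count; with the quantile on the left it reports seconds until expiry. The `< 604800` and `< 86400` thresholds are carried over from the original rules and are not part of the workaround as posted above. The `defaultRules.disabled` and `additionalPrometheusRulesMap` keys are assumed to be available in your chart version (please check its values.yaml), and the `client-cert-expiration` name is hypothetical. This is not necessarily the fix that was merged upstream.

```yaml
# values.yaml override: a sketch under the assumptions stated above.
defaultRules:
  disabled:
    # Assumed key: switches off the built-in rules with this alert name
    # (both the warning and the critical variant).
    KubeClientCertificateExpiration: true

additionalPrometheusRulesMap:
  # Hypothetical name for this custom rule set.
  client-cert-expiration:
    groups:
      - name: client-cert-expiration
        rules:
          - alert: KubeClientCertificateExpiration
            annotations:
              description: A client certificate used to authenticate to kubernetes apiserver is expiring in less than 7.0 days.
              summary: Client certificate is about to expire.
            # Quantile on the left of "and", so the alert value is in seconds.
            expr: |
              histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m])))
                < 604800
              and on (job) apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0
            for: 5m
            labels:
              severity: warning
          - alert: KubeClientCertificateExpiration
            annotations:
              description: A client certificate used to authenticate to kubernetes apiserver is expiring in less than 24.0 hours.
              summary: Client certificate is about to expire.
            expr: |
              histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m])))
                < 86400
              and on (job) apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0
            for: 5m
            labels:
              severity: critical
```

If the keys apply to your chart version, the file can be passed to the same command with `-f values.yaml`.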