prometheus-operator / kube-prometheus

Use Prometheus to monitor Kubernetes and applications running on Kubernetes
https://prometheus-operator.dev/
Apache License 2.0

Grafana dashboard API server availability (30d) is showing more than 100% #2349

Open ssharma2089 opened 7 months ago

ssharma2089 commented 7 months ago

I have upgraded kube-prometheus from v0.12 to v0.13, and the API server availability is now showing more than 100%. In the previous version it was shown correctly.

Attaching an image of the dashboard:

grafana_apiserver (screenshot)

Environment

k3s

ts=2024-02-13T11:59:38.233Z caller=main.go:585 level=info msg="Starting Prometheus Server" mode=server version="(version=2.46.0, branch=HEAD, revision=cbb69e51423565ec40f46e74f4ff2dbb3b7fb4f0)"
ts=2024-02-13T11:59:38.233Z caller=main.go:590 level=info build_context="(go=go1.20.6, platform=linux/amd64, user=root@42454fc0f41e, date=20230725-12:31:24, tags=netgo,builtinassets,stringlabels)"
ts=2024-02-13T11:59:38.233Z caller=main.go:591 level=info host_details="(Linux 4.18.0-513.11.1.el8_9.x86_64 #1 SMP Thu Dec 7 03:06:13 EST 2023 x86_64 prometheus-k8s-0 (none))"
ts=2024-02-13T11:59:38.233Z caller=main.go:592 level=info fd_limits="(soft=1048576, hard=1048576)"
ts=2024-02-13T11:59:38.233Z caller=main.go:593 level=info vm_limits="(soft=unlimited, hard=unlimited)"
ts=2024-02-13T11:59:38.235Z caller=web.go:563 level=info component=web msg="Start listening for connections" address=0.0.0.0:9090
ts=2024-02-13T11:59:38.236Z caller=main.go:1026 level=info msg="Starting TSDB ..."
ts=2024-02-13T11:59:38.237Z caller=tls_config.go:274 level=info component=web msg="Listening on" address=[::]:9090
ts=2024-02-13T11:59:38.238Z caller=head.go:595 level=info component=tsdb msg="Replaying on-disk memory mappable chunks if any"
ts=2024-02-13T11:59:38.260Z caller=head.go:676 level=info component=tsdb msg="On-disk memory mappable chunks replay completed" duration=4.959µs
ts=2024-02-13T11:59:38.260Z caller=head.go:684 level=info component=tsdb msg="Replaying WAL, this may take a while"
ts=2024-02-13T11:59:38.260Z caller=tls_config.go:313 level=info component=web msg="TLS is disabled." http2=false address=[::]:9090
ts=2024-02-13T11:59:38.261Z caller=head.go:755 level=info component=tsdb msg="WAL segment loaded" segment=0 maxSegment=0
ts=2024-02-13T11:59:38.261Z caller=head.go:792 level=info component=tsdb msg="WAL replay completed" checkpoint_replay_duration=64.923µs wal_replay_duration=591.788µs wbl_replay_duration=142ns total_replay_duration=702.782µs
ts=2024-02-13T11:59:38.261Z caller=main.go:1047 level=info fs_type=XFS_SUPER_MAGIC
ts=2024-02-13T11:59:38.261Z caller=main.go:1050 level=info msg="TSDB started"
ts=2024-02-13T11:59:38.261Z caller=main.go:1231 level=info msg="Loading configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
ts=2024-02-13T11:59:38.290Z caller=kubernetes.go:329 level=info component="discovery manager scrape" discovery=kubernetes config=kubernetes-cadvisor msg="Using pod service account via in-cluster config"
ts=2024-02-13T11:59:38.291Z caller=kubernetes.go:329 level=info component="discovery manager scrape" discovery=kubernetes config=kubernetes-pods msg="Using pod service account via in-cluster config"
ts=2024-02-13T11:59:38.291Z caller=kubernetes.go:329 level=info component="discovery manager scrape" discovery=kubernetes config=serviceMonitor/monitoring/alertmanager-main/1 msg="Using pod service account via in-cluster config"
ts=2024-02-13T11:59:38.291Z caller=kubernetes.go:329 level=info component="discovery manager scrape" discovery=kubernetes config=serviceMonitor/monitoring/coredns/0 msg="Using pod service account via in-cluster config"
ts=2024-02-13T11:59:38.292Z caller=kubernetes.go:329 level=info component="discovery manager scrape" discovery=kubernetes config=serviceMonitor/monitoring/kafka-service-monitor/0 msg="Using pod service account via in-cluster config"
ts=2024-02-13T11:59:38.292Z caller=kubernetes.go:329 level=info component="discovery manager scrape" discovery=kubernetes config=serviceMonitor/monitoring/kube-apiserver/0 msg="Using pod service account via in-cluster config"
ts=2024-02-13T11:59:38.292Z caller=kubernetes.go:329 level=info component="discovery manager scrape" discovery=kubernetes config=kubernetes-services msg="Using pod service account via in-cluster config"
ts=2024-02-13T11:59:38.292Z caller=kubernetes.go:329 level=info component="discovery manager notify" discovery=kubernetes config=config-0 msg="Using pod service account via in-cluster config"
ts=2024-02-13T11:59:38.432Z caller=main.go:1268 level=info msg="Completed loading of configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml totalDuration=170.302838ms db_storage=1.527µs remote_storage=1.349µs web_handler=561ns query_engine=792ns scrape=316.7µs scrape_sd=2.107591ms notify=25.468µs notify_sd=176.505µs rules=139.444513ms tracing=10.148µs
ts=2024-02-13T11:59:38.432Z caller=main.go:1011 level=info msg="Server is ready to receive web requests."
ts=2024-02-13T11:59:38.432Z caller=manager.go:1009 level=info component="rule manager" msg="Starting rule manager..."
ts=2024-02-13T11:59:42.826Z caller=main.go:1231 level=info msg="Loading configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
ts=2024-02-13T11:59:42.849Z caller=kubernetes.go:329 level=info component="discovery manager scrape" discovery=kubernetes config=serviceMonitor/monitoring/kube-state-metrics/1 msg="Using pod service account via in-cluster config"
ts=2024-02-13T11:59:42.851Z caller=kubernetes.go:329 level=info component="discovery manager scrape" discovery=kubernetes config=serviceMonitor/monitoring/coredns/0 msg="Using pod service account via in-cluster config"
ts=2024-02-13T11:59:42.852Z caller=kubernetes.go:329 level=info component="discovery manager scrape" discovery=kubernetes config=serviceMonitor/monitoring/kube-apiserver/0 msg="Using pod service account via in-cluster config"
ts=2024-02-13T11:59:42.852Z caller=kubernetes.go:329 level=info component="discovery manager scrape" discovery=kubernetes config=kubernetes-pods msg="Using pod service account via in-cluster config"
ts=2024-02-13T11:59:42.853Z caller=kubernetes.go:329 level=info component="discovery manager scrape" discovery=kubernetes config=serviceMonitor/monitoring/kafka-service-monitor/0 msg="Using pod service account via in-cluster config"
ts=2024-02-13T11:59:42.854Z caller=kubernetes.go:329 level=info component="discovery manager scrape" discovery=kubernetes config=kubernetes-cadvisor msg="Using pod service account via in-cluster config"
ts=2024-02-13T11:59:42.854Z caller=kubernetes.go:329 level=info component="discovery manager scrape" discovery=kubernetes config=kubernetes-services msg="Using pod service account via in-cluster config"
ts=2024-02-13T11:59:42.854Z caller=kubernetes.go:329 level=info component="discovery manager notify" discovery=kubernetes config=config-0 msg="Using pod service account via in-cluster config"
ts=2024-02-13T11:59:43.006Z caller=main.go:1268 level=info msg="Completed loading of configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml totalDuration=179.767429ms db_storage=2.059µs remote_storage=1.432µs web_handler=667ns query_engine=978ns scrape=65.891µs scrape_sd=5.288807ms notify=16.27µs notify_sd=171.972µs rules=150.995181ms tracing=6.049µs

shun095 commented 1 week ago

I ran into a similar problem in a k3s home lab environment. As in the following issue, adding {job="apiserver"} in the section below solves the problem: https://github.com/prometheus-operator/kube-prometheus/issues/2465

(The other terms in the formula are filtered by {job="apiserver"}, but apiserver_request_sli_duration_seconds_bucket is the only one that is not.)
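
For reference, the rendered recording-rule expression with the filter applied would look roughly like this (a sketch of the non-templated form; the chart's additional aggregation labels are omitted):

    # only count SLI histogram increases scraped from the apiserver job
    sum by (cluster, verb, scope, le) (
      increase(apiserver_request_sli_duration_seconds_bucket{job="apiserver"}[1h])
    )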

https://github.com/prometheus-community/helm-charts/blob/main/charts/kube-prometheus-stack/templates/prometheus/rules-1.14/kube-apiserver-availability.rules.yaml

$ git diff | cat
diff --git a/charts/kube-prometheus-stack/templates/prometheus/rules-1.14/kube-apiserver-availability.rules.yaml b/charts/kube-prometheus-stack/templates/prometheus/rules-1.14/kube-apiserver-availability.rules.yaml
index 27399b19..e978af06 100644
--- a/charts/kube-prometheus-stack/templates/prometheus/rules-1.14/kube-apiserver-availability.rules.yaml
+++ b/charts/kube-prometheus-stack/templates/prometheus/rules-1.14/kube-apiserver-availability.rules.yaml
@@ -82,7 +82,7 @@ spec:
           {{- toYaml . | nindent 8 }}
         {{- end }}
       {{- end }}
-    - expr: sum by ({{ range $.Values.defaultRules.additionalAggregationLabels }}{{ . }},{{ end }}cluster, verb, scope, le) (increase(apiserver_request_sli_duration_seconds_bucket[1h]))
+    - expr: sum by ({{ range $.Values.defaultRules.additionalAggregationLabels }}{{ . }},{{ end }}cluster, verb, scope, le) (increase(apiserver_request_sli_duration_seconds_bucket{job="apiserver"}[1h]))
       record: cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase1h
       {{- if or .Values.defaultRules.additionalRuleLabels .Values.defaultRules.additionalRuleGroupLabels.kubeApiserverAvailability }}
       labels:

K3s runs several Kubernetes components in a single process, so the components' metrics are duplicated across different Prometheus jobs. That is why this filter was important in my case. https://github.com/k3s-io/k3s/issues/2262

In my k3s cluster, the apiserver_request_sli_duration_seconds_bucket metric is collected by the following 5 jobs (and availability was reported at around 500% :) ).
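
A quick diagnostic query along these lines (not part of the chart, just something to run in the Prometheus UI) shows which jobs expose the metric and how many series each contributes:

    # number of apiserver_request_sli_duration_seconds_bucket series per scrape job
    count by (job) (apiserver_request_sli_duration_seconds_bucket)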

I guess the current code, which is missing the job="apiserver" filter, probably doesn't cause a problem on a normal kubeadm cluster, so this issue may not have received much attention. (I haven't verified this because I don't have a test cluster built with kubeadm.)
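
As a rough way to check how much the missing filter inflates the availability number on a particular cluster, the unfiltered and filtered increases can be compared. On a cluster where only the apiserver job exposes the metric the ratio should be close to 1; on a k3s setup like mine it would be around 5 (this assumes job="apiserver" is the intended source of the metric):

    # duplication factor of the SLI histogram across scrape jobs
      sum(increase(apiserver_request_sli_duration_seconds_bucket[1h]))
    /
      sum(increase(apiserver_request_sli_duration_seconds_bucket{job="apiserver"}[1h]))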