TJM closed this issue 3 years ago.
The same happened on 9.4.4 and reproduces on 10.0.2.
Prometheus is constantly outputting the following errors:
level=warn ts=2020-10-13T15:25:49.854Z caller=manager.go:577 component="rule manager" group=kubelet.rules msg="Evaluating rule failed" rule="record: node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile\nexpr: histogram_quantile(0.5, sum by(instance, le) (rate(kubelet_pleg_relist_duration_seconds_bucket[5m]))\n * on(instance) group_left(node) kubelet_node_name{job=\"kubelet\",metrics_path=\"/metrics\"})\nlabels:\n quantile: \"0.5\"\n" err="found duplicate series for the match group {instance=\"10.9.25.189:10250\"} on the right hand-side of the operation: [{__name__=\"kubelet_node_name\", endpoint=\"https-metrics\", instance=\"10.9.25.189:10250\", job=\"kubelet\", metrics_path=\"/metrics\", namespace=\"kube-system\", node=\"den3l5kubew05.company.corp\", service=\"kube-prometheus-stack-kubelet\"}, {__name__=\"kubelet_node_name\", endpoint=\"https-metrics\", instance=\"10.9.25.189:10250\", job=\"kubelet\", metrics_path=\"/metrics\", namespace=\"kube-system\", node=\"den3l5kubew05.company.corp\", service=\"prometheus-operator-kubelet\"}];many-to-many matching not allowed: matching labels must be unique on one side"
level=warn ts=2020-10-13T15:25:53.291Z caller=manager.go:577 component="rule manager" group=kubernetes-system-kubelet msg="Evaluating rule failed" rule="alert: KubeletPodStartUpLatencyHigh\nexpr: histogram_quantile(0.99, sum by(instance, le) (rate(kubelet_pod_worker_duration_seconds_bucket{job=\"kubelet\",metrics_path=\"/metrics\"}[5m])))\n * on(instance) group_left(node) kubelet_node_name{job=\"kubelet\",metrics_path=\"/metrics\"}\n > 60\nfor: 15m\nlabels:\n severity: warning\nannotations:\n message: Kubelet Pod startup 99th percentile latency is {{ $value }} seconds on\n node {{ $labels.node }}.\n runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeletpodstartuplatencyhigh\n" err="found duplicate series for the match group {instance=\"10.9.25.189:10250\"} on the right hand-side of the operation: [{__name__=\"kubelet_node_name\", endpoint=\"https-metrics\", instance=\"10.9.25.189:10250\", job=\"kubelet\", metrics_path=\"/metrics\", namespace=\"kube-system\", node=\"den3l5kubew05.company.corp\", service=\"kube-prometheus-stack-kubelet\"}, {__name__=\"kubelet_node_name\", endpoint=\"https-metrics\", instance=\"10.9.25.189:10250\", job=\"kubelet\", metrics_path=\"/metrics\", namespace=\"kube-system\", node=\"den3l5kubew05.company.corp\", service=\"prometheus-operator-kubelet\"}];many-to-many matching not allowed: matching labels must be unique on one side"
Or specifically:
many-to-many matching not allowed: matching labels must be unique on one side
Perhaps that helps?
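To see which instances are affected, querying the Prometheus API directly can help. This is a sketch (my addition, not from the report above), assuming Prometheus is reachable on `localhost:9090`, e.g. via `kubectl port-forward`:

```shell
# Count kubelet_node_name series per instance; any instance returned here is
# being scraped through more than one Service, which breaks the group_left join.
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=count by (instance) (kubelet_node_name{job="kubelet"}) > 1'
```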
Possible workaround: go to the `kube-system` namespace, check whether there are multiple services named `*-kube-prometheus-stack-kubelet` or `*-prometheus-operator-kubelet`, and remove the unnecessary ones:
```shell
$ kubectl get service | grep kubelet
prom-kube-prometheus-stack-kubelet   ClusterIP   None   <none>   10250/TCP,10255/TCP,4194/TCP   91m
prometheus-operator-kubelet          ClusterIP   None   <none>   10250/TCP,10255/TCP,4194/TCP   104d
$ kubectl delete service prometheus-operator-kubelet
service "prometheus-operator-kubelet" deleted
```
How it works:
I got this issue after I migrated from the `stable/prometheus-operator` chart. It seems that Helm didn't remove the services it had installed in the `kube-system` namespace when I uninstalled the deprecated chart, so the ServiceMonitor collects the same metrics from multiple services that have the same endpoints. Some of the Prometheus recording rules, for example:
```
histogram_quantile(0.9,
  sum by(instance, le) (rate(kubelet_pleg_relist_duration_seconds_bucket[5m])) * on(instance)
  group_left(node) kubelet_node_name{job="kubelet",metrics_path="/metrics"})
```
require only one `kubelet_node_name{job="kubelet",metrics_path="/metrics"}` series per instance. Therefore, all I had to do was delete the redundant services; the sweep sketched below can confirm nothing else is left over.
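It is worth looking beyond `kube-system`, since leftovers can live in other namespaces too. A minimal sketch (my addition, not from the thread), assuming the Prometheus Operator CRDs are installed so `servicemonitors` is a known resource type:

```shell
# Look across all namespaces for leftover kubelet Services and ServiceMonitors
# from previous chart releases; each instance should be covered exactly once.
kubectl get services --all-namespaces | grep kubelet
kubectl get servicemonitors --all-namespaces | grep kubelet
```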
I'm not sure whether this is a Helm bug, but this worked for me in this specific context.
You got it! We had about 5 different ones. There are multiple operators installed on this cluster because we have Prometheus monitoring applications outside Prometheus itself, and we found that there were SO MANY jobs that it was bogging down Prometheus, so we split it up into multiple Prometheuses (or is that prometheii?). Anyhow, the initial installations may have had some "indentation" issues when they tried to disable all the extra stuff to get just Prometheus/Grafana, and they left behind some services.
So, I do see it mentioned in the docs: https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack#migrating-from-stableprometheus-operator-chart
Tommy
I saw this same issue as well, thanks for the great explanation @wi1dcard
@wi1dcard did you retain the chart name by any chance while migrating to the new version?
> @wi1dcard did you retain the chart name by any chance while migrating to the new version?
No, I didn't. Is the Helm release name related to the issue?
- **Why does it create a service that it doesn't remove? (Is that fixed in future versions?)**
- **Why doesn't it use a "template" to select the service it is looking for, instead of a wildcard?**

~~Isn't the point of the templates + values to properly configure the services? Why, then, do we have to delete anything after startup? Can't we configure the chart to start with the necessary services so the queries return accurate results out of the box?~~
Update: my problem was related to VMware Tanzu Mission Control being installed in the same cluster. Otherwise it works just fine out of the box.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.
I think we are still waiting on...
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.
I also have this problem. I checked for duplicate services, but I cannot find any... I looked at the rule definitions and ran some Prometheus queries, and found this:
```
prometheus_rule_evaluation_failures_total{container="prometheus", endpoint="web", instance="10.0.1.72:9090", job="kube-prometheus-stack-prometheus", namespace="monitoring", pod="prometheus-kube-prometheus-stack-prometheus-0", rule_group="/etc/prometheus/rules/prometheus-kube-prometheus-stack-prometheus-rulefiles-0/monitoring-kube-prometheus-stack-kube-apiserver-availability.rules.yaml;kube-apiserver-availability.rules", service="kube-prometheus-stack-prometheus"}   113
prometheus_rule_evaluation_failures_total{container="prometheus", endpoint="web", instance="10.0.1.72:9090", job="kube-prometheus-stack-prometheus", namespace="monitoring", pod="prometheus-kube-prometheus-stack-prometheus-0", rule_group="/etc/prometheus/rules/prometheus-kube-prometheus-stack-prometheus-rulefiles-0/monitoring-kube-prometheus-stack-kube-apiserver.rules.yaml;kube-apiserver.rules", service="kube-prometheus-stack-prometheus"}   134
```
It's those 2 that keep increasing, occasionally triggering the alert. Is this the same thing, or something completely different? Happens on 2 GKE clusters running the latest version of kube-prometheus-stack.
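One way to see which rules are actually failing, and why, is Prometheus's rules API, which reports per-rule health and the last evaluation error. A sketch (my addition), assuming a port-forward to Prometheus on `localhost:9090` and `jq` installed:

```shell
# List every rule whose last evaluation failed, with the underlying error;
# lastError would show e.g. the "found duplicate series" message from above.
curl -s 'http://localhost:9090/api/v1/rules' \
  | jq '.data.groups[].rules[] | select(.health == "err") | {name, lastError}'
```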
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.
This issue is being automatically closed due to inactivity.
Any news here? We have the same problem in an AWS EKS cluster. I cannot find any duplicate service but the alert is still firing.
I see this error in every Kubernetes installation I have had.
I think I have managed to fix it by setting `prometheusOperator.kubeletService.enabled` to `false` in the values.yaml. For example: https://github.com/WesleyKlop/infrastructure/commit/63ca4e667a6a3dfce20029aaeede2fa0e98f3d39.
You might also be able to fix it by setting `prometheusOperator.kubeletService.name` to just `kubelet`, since the other service that also says it is managed by the Prometheus Operator has that name. (I did not test that.)
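For reference, the same change applied via `helm upgrade` instead of editing values.yaml. A sketch only: it assumes the release is named `kube-prometheus-stack`, lives in the `monitoring` namespace, and uses the `prometheus-community` repo; adjust all three to your setup.

```shell
# Disable the operator-managed kubelet service so only one Service exposes
# kubelet_node_name per instance; --reuse-values keeps your other settings.
helm upgrade kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --reuse-values \
  --set prometheusOperator.kubeletService.enabled=false
```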
Describe the bug: After upgrading to 9.4.10, I am seeing PrometheusRuleFailures:
Version of Helm and Kubernetes:
Helm Version:
Kubernetes Version:
Which chart: kube-prometheus-stack
Which version of the chart: 9.4.10
What happened: After the upgrade, we have new alerts that probably aren't supposed to be there?
What you expected to happen: No PrometheusRuleFailures.
How to reproduce it (as minimally and precisely as possible): Upgrade to 9.4.10
Anything else we need to know:
We were not seeing these on 9.3.4.