Closed: junaid-ali closed this issue 2 years ago.
Hey @junaid-ali, I noticed that the query you executed inside the replica was slightly different from the one in the logs: it is missing &time=1621433298.929, which pins the timestamp at which prometheus-adapter gets metrics from Prometheus. Could you try again with it?
Also, if you try executing the query in the Prometheus UI, do you see any gaps in the graph that could indicate scrape failures on the Prometheus side?
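(For example, a range query over the window in question makes gaps easy to spot from inside the cluster too; this is just a sketch reusing the service URL from your logs, with placeholder start/end/step values to adjust:)
$ wget -qO- 'http://prometheus-kube-prometheus-prometheus.default.svc:9090/prometheus/api/v1/query_range?query=up%7Bjob%3D%22node-exporter%22%7D&start=1621430000&end=1621433600&step=30'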
@dgrisonnet I tried with the timestamp as well, and that did return the same result. Here's a recent one:
$ wget -qO- http://prometheus-kube-prometheus-prometheus.default.svc:9090/prometheus/api/v1/query?query=sum%281+-+irate%28node_cpu_seconds_total%7Bmode%3D%22idle%22%7D%5B5m%5D%29+%2A+on%28namespace%2C+pod%29+group_left%28node%29+node_namespace_pod%3Akube_pod_info%3A%7B%7D%29+by+%28node%29\&time=1621591878.122 # ampersand escaped with a backslash for the shell
{"status":"success","data":{"resultType":"vector","result":[{"metric":{"node":"ip-10-0-6-55.******"},"value":[1621591878.122,"0.05399999999935978"]},{"metric":{"node":"ip-10-0-4-20.******"},"value":[1621591878.122,"0.05233333333308099"]},{"metric":{"node":"ip-10-0-4-227.******"},"value":[1621591878.122,"0.1836666666630964"]},{"metric":{"node":"ip-10-0-5-164.******"},"value":[1621591878.122,"0.1860000000005433"]},{"metric":{"node":"ip-10-0-5-175.******"},"value":[1621591878.122,"1.7483333333337212"]},{"metric":{"node":"ip-10-0-6-138.******"},"value":[1621591878.122,"0.0543333333330035"]}]}}
$ wget -qO- http://prometheus-kube-prometheus-prometheus.default.svc:9090/prometheus/api/v1/query?query=sum%28%28node_memory_MemTotal_bytes%7Bjob%3D%22node-exporter%22%7D+-+node_memory_MemAvailable_bytes%7Bjob%3D%22node-exporter%22%7D%29+%2A+on+%28namespace%2C+pod%29+group_left%28node%29+node_namespace_pod%3Akube_pod_info%3A%7B%7D%29+by+%28node%29\&time=1621591880.903 # ampersand escaped with a backslash for the shell
{"status":"success","data":{"resultType":"vector","result":[{"metric":{"node":"ip-10-0-4-227.******"},"value":[1621591880.903,"1425924096"]},{"metric":{"node":"ip-10-0-5-164.******"},"value":[1621591880.903,"1008308224"]},{"metric":{"node":"ip-10-0-5-175.******"},"value":[1621591880.903,"1755410432"]},{"metric":{"node":"ip-10-0-6-138.******"},"value":[1621591880.903,"842174464"]},{"metric":{"node":"ip-10-0-6-55.******"},"value":[1621591880.903,"843423744"]},{"metric":{"node":"ip-10-0-4-20.******"},"value":[1621591880.903,"896737280"]}]}}
At the moment, there's only one replica, and I'm getting the following error in the logs (excerpt):
...
I0521 10:13:50.737655 1 handler.go:143] prometheus-metrics-adapter: GET "/apis/metrics.k8s.io/v1beta1/nodes" satisfied by gorestful with webservice /apis/metrics.k8s.io/v1beta1
I0521 10:13:50.739757 1 api.go:74] GET http://prometheus-kube-prometheus-prometheus.default.svc:9090/prometheus/api/v1/query?query=sum%28%28node_memory_MemTotal_bytes%7Bjob%3D%22node-exporter%22%7D+-+node_memory_MemAvailable_bytes%7Bjob%3D%22node-exporter%22%7D%29+%2A+on+%28namespace%2C+pod%29+group_left%28node%29+node_namespace_pod%3Akube_pod_info%3A%7B%7D%29+by+%28node%29&time=1621592030.737 200 OK
I0521 10:13:50.740106 1 api.go:74] GET http://prometheus-kube-prometheus-prometheus.default.svc:9090/prometheus/api/v1/query?query=sum%281+-+irate%28node_cpu_seconds_total%7Bmode%3D%22idle%22%7D%5B5m%5D%29+%2A+on%28namespace%2C+pod%29+group_left%28node%29+node_namespace_pod%3Akube_pod_info%3A%7B%7D%29+by+%28node%29&time=1621592030.737 200 OK
I0521 10:13:50.740283 1 provider.go:277] missing CPU for node "ip-10-0-4-20.******", skipping
I0521 10:13:50.740298 1 provider.go:277] missing CPU for node "ip-10-0-4-227.******", skipping
I0521 10:13:50.740306 1 provider.go:277] missing CPU for node "ip-10-0-5-164.******", skipping
I0521 10:13:50.740314 1 provider.go:277] missing CPU for node "ip-10-0-5-175.******", skipping
I0521 10:13:50.740321 1 provider.go:277] missing CPU for node "ip-10-0-6-138.******", skipping
I0521 10:13:50.740329 1 provider.go:277] missing CPU for node "ip-10-0-6-55.******", skipping
I0521 10:13:50.740420 1 httplog.go:89] "HTTP" verb="GET" URI="/apis/metrics.k8s.io/v1beta1/nodes" latency="3.400931ms" userAgent="kubectl/v1.18.0 (darwin/amd64) kubernetes/9e99141" srcIP="10.0.6.74:59194" resp=200
No issues with running the queries directly in the Prometheus UI:
CPU query
sum(1 - irate(node_cpu_seconds_total{mode="idle"}[5m]) * on(namespace, pod) group_left(node) node_namespace_pod:kube_pod_info:{}) by (node)
CPU query result:
{node="ip-10-0-6-138.******"} | 0.05699999999997085
{node="ip-10-0-6-55.******"} | 0.05233333333308099
{node="ip-10-0-4-20.******"} | 0.05433333333348844
{node="ip-10-0-4-227.******"} | 0.33099999999394636
{node="ip-10-0-5-164.******"} | 0.15966666666596818
{node="ip-10-0-5-175.******"} | 3.2099999999996123
Memory query:
sum((node_memory_MemTotal_bytes{job="node-exporter"} - node_memory_MemAvailable_bytes{job="node-exporter"}) * on (namespace, pod) group_left(node) node_namespace_pod:kube_pod_info:{}) by (node)
Memory query result:
{node="ip-10-0-5-164.******"} | 1000628224
{node="ip-10-0-5-175.******"} | 892575744
{node="ip-10-0-6-138.******"} | 845381632
{node="ip-10-0-6-55.******"} | 843325440
{node="ip-10-0-4-20.******"} | 896249856
{node="ip-10-0-4-227.******} | 1429225472
@dgrisonnet have you had a chance to look at the previous reply? Also, I wanted to confirm that the node names returned by the queries are exactly the same as what we get in kubectl get nodes.
I haven't had the chance to really look into this as I would most likely need a reproducer to further investigate this bug.
From the look of it, it seems that Prometheus is sometimes returning an empty response to prometheus-adapter. Do you perhaps also have this issue with pods? You should see the following error in the logs if something is going wrong for the pods: unable to fetch metrics for pods in namespace
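(For example, assuming the adapter Deployment is named prometheus-adapter and runs in the default namespace as in your commands above, something like:)
$ kubectl logs deploy/prometheus-adapter | grep -i 'unable to fetch metrics'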
@dgrisonnet it's only happening for nodes. Also, it always returns error: metrics not available yet for nodes (and it's not an intermittent issue); on re-creating the prometheus-adapter pod, the issue goes away.
Thank you for the clarification, I'll try to reproduce the bug.
@junaid-ali I was able to reproduce this bug with the default configuration from this repository, but it is very outdated so I would recommend using the following one instead: https://github.com/prometheus-operator/kube-prometheus/blob/main/manifests/prometheus-adapter-configMap.yaml.
@dgrisonnet I actually copied my nodeQuery from the link you shared; please check my Prometheus queries here: https://github.com/kubernetes-sigs/prometheus-adapter/issues/398#issuecomment-845847789. For example, for CPU I'm using this query:
sum(1 - irate(node_cpu_seconds_total{mode="idle"}[5m]) * on(namespace, pod) group_left(node) node_namespace_pod:kube_pod_info:{}) by (node)
The only difference is that I'm not using the LabelMatchers, since that was causing an issue (I was getting instance as the label key with the node name as the value, while prometheus-adapter was expecting node as the label key instead of instance).
For the cpu query, the labelMatchers should match node and not instance. As for memory, we have some relabeling in place in kube-prometheus for node-exporter: https://github.com/prometheus-operator/kube-prometheus/blob/main/manifests/node-exporter-serviceMonitor.yaml
I'll try to reproduce with your query, but with the one from kube-prometheus I wasn't able to so far.
@dgrisonnet I did print the cpuQuery and memQuery to debug this. So when this issue happens, e.g. with the error missing CPU info for node ... in the adapter logs, the cpuQuery value looks like this:
{%!q(*naming.resourceConverter=&{{{0 0} 0 0 0 0} map[instance:{ nodes} namespace:{ namespaces} node:{ nodes} pod:{ pods}] map[{ namespaces}:namespace { nodes}:instance { pods}:pod]....... "container"}
Compared to when the metrics work fine, the cpuQuery value looks like this:
{%!q(*naming.resourceConverter=&{{{0 0} 0 0 0 0} map[instance:{ nodes} namespace:{ namespaces} node:{ nodes} pod:{ pods}] map[{ namespaces}:namespace { nodes}:node { pods}:pod]..... "container"}
So, the difference is in the reverse mapping for { nodes} (please scroll right in the snippets above to see it): in the failing case it is { nodes}:instance, in the working case { nodes}:node. I see a similar difference in memQuery when this happens with memory metrics. Because of this, the adapter tries to get the node name using instance as the label while the actual label is node, so it fails to get the node name here and returns an empty value. So there seems to be some issue in this NewResourceConverter, which I couldn't figure out exactly.
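A hypothetical sketch of how that flip could happen (this is not the adapter's actual code, just an illustration): if the reverse map is built by ranging over a Go map in which both instance and node point to the nodes resource, the last writer wins, and Go randomizes map iteration order, so different runs can disagree:

package main

import "fmt"

func main() {
	// Forward mapping, as in the dumps above: two labels both point to "nodes".
	labelToResource := map[string]string{
		"instance":  "nodes",
		"node":      "nodes",
		"namespace": "namespaces",
		"pod":       "pods",
	}
	for run := 0; run < 5; run++ {
		resourceToLabel := make(map[string]string)
		for label, resource := range labelToResource {
			resourceToLabel[resource] = label // last writer wins; range order is randomized
		}
		// Can print "instance" on some iterations and "node" on others.
		fmt.Println("run", run, "nodes ->", resourceToLabel["nodes"])
	}
}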
NOTE: I'm overriding instance to node via overrides; the adapter does work without this issue most of the time, but sometimes when the adapter pod is recreated, or even when it is created the first time, we see the above issue.
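(For reference, the override I mean is this stanza of the resource rules; a rough sketch of my config, not the full file:)

resources:
  overrides:
    instance:
      resource: node
    namespace:
      resource: namespace
    pod:
      resource: pod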
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
I'm still seeing this, even after changing instance to node in overrides. I tried putting it back as well and restarting the pod; I still get the same result. Only top pods works. I provided some output (logs, values, etc.) in a similar issue:
https://github.com/kubernetes-sigs/prometheus-adapter/issues/385#issuecomment-924152312
I undid the node overrides and put it back to how the README has it; that seems to have resolved the issue for me.
I was facing the same problem on Amazon EKS version 1.21-eks.2, with both prometheus-server and prometheus-adapter installed using the community charts, following the example in the README. The versions are as below:
$ helm ls -n prometheus-system
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
prometheus prometheus-system 1 2021-10-12 01:27:02.127990141 +0000 UTC deployed prometheus-14.9.2 2.26.0
prometheus-adapter prometheus-system 6 2021-10-14 19:20:03.800011211 +0000 UTC deployed prometheus-adapter-2.17.0 v0.9.0
Following the workaround proposed by @junaid-ali, I was able to make it work by changing the association of the nodes resource to the instance label (instead of the original node). My values file is currently like this:
prometheus:
path: ""
port: 80
url: http://prometheus-server.prometheus-system.svc
rules:
resource:
cpu:
containerLabel: container
containerQuery: sum(rate(container_cpu_usage_seconds_total{<<.LabelMatchers>>,
container!=""}[3m])) by (<<.GroupBy>>)
nodeQuery: sum(rate(container_cpu_usage_seconds_total{<<.LabelMatchers>>, id='/'}[3m]))
by (<<.GroupBy>>)
resources:
overrides:
instance:
resource: node
namespace:
resource: namespace
pod:
resource: pod
memory:
containerLabel: container
containerQuery: sum(container_memory_working_set_bytes{<<.LabelMatchers>>, container!=""})
by (<<.GroupBy>>)
nodeQuery: sum(container_memory_working_set_bytes{<<.LabelMatchers>>,id='/'})
by (<<.GroupBy>>)
resources:
overrides:
instance:
resource: node
namespace:
resource: namespace
pod:
resource: pod
window: 3m
After that, I'm now able to query resource metrics with kubectl top nodes/pods:
$ kubectl top nodes
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
ip-10-1-10-60.ec2.internal 47m 2% 657Mi 9%
ip-10-1-14-239.ec2.internal 76m 3% 1121Mi 16%
$ kubectl top pods -A
NAMESPACE NAME CPU(cores) MEMORY(bytes)
kube-system aws-node-2h4k6 3m 47Mi
kube-system aws-node-fdspx 3m 48Mi
kube-system coredns-66cb55d4f4-7g7x4 0m 10Mi
kube-system coredns-66cb55d4f4-7wzsc 1m 9Mi
kube-system kube-proxy-fd9ps 0m 13Mi
kube-system kube-proxy-fsbwq 0m 13Mi
prometheus-system prometheus-adapter-8bcbbfb8b-gv8m8 10m 39Mi
prometheus-system prometheus-alertmanager-787f86875f-x9skk 0m 12Mi
prometheus-system prometheus-kube-state-metrics-58c5cd6ddb-666td 0m 11Mi
prometheus-system prometheus-node-exporter-4rh98 0m 7Mi
prometheus-system prometheus-node-exporter-5xv2s 0m 7Mi
prometheus-system prometheus-pushgateway-6bd6fcd9b8-m4nmg 0m 7Mi
prometheus-system prometheus-server-648c978678-9dbbx 13m 370Mi
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue or PR with /reopen
- Mark this issue or PR as fresh with /remove-lifecycle rotten
Please send feedback to sig-contributor-experience at kubernetes/community.
/close
@k8s-triage-robot: Closing this issue.
I don't see the issue in v0.9.1
We met the same issue on EKS 1.21 with prometheus-adapter version v0.9.1. The error logs of prometheus-adapter:
E0408 09:18:26.254304 1 provider.go:191] unable to fetch CPU metrics for pod monitoring/prometheus-adapter-7dc46dd46d-vs2zd, skipping
E0408 09:18:26.254309 1 provider.go:191] unable to fetch CPU metrics for pod monitoring/prometheus-k8s-0, skipping
E0408 09:18:26.254315 1 provider.go:191] unable to fetch CPU metrics for pod monitoring/prometheus-k8s-1, skipping
The command kubectl top node returns:
error: metrics not available yet
And the command kubectl top po -n monitoring returns:
error: Metrics not available for pod monitoring/grafana-79f58457b6-lr4jn, age: 1h19m39.326246s
The content of the prometheus-adapter container spec:
containers:
- args:
- --cert-dir=/var/run/serving-cert
- --config=/etc/adapter/config.yaml
- --logtostderr=true
- --metrics-relist-interval=1m
- --prometheus-url=http://prometheus-k8s.monitoring.svc:9090/
- --secure-port=6443
- --tls-cipher-suites=TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA,TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA,TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA,TLS_RSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA
image: k8s.gcr.io/prometheus-adapter/prometheus-adapter:v0.9.1
The content of the prometheus container spec:
spec:
containers:
- args:
- --web.console.templates=/etc/prometheus/consoles
- --web.console.libraries=/etc/prometheus/console_libraries
- --config.file=/etc/prometheus/config_out/prometheus.env.yaml
- --storage.tsdb.path=/prometheus
- --storage.tsdb.retention.time=30d
- --web.enable-lifecycle
- --web.route-prefix=/
- --web.config.file=/etc/prometheus/web_config/web-config.yaml
image: quay.io/prometheus/prometheus:v2.29.1
Could someone give us some clues on how to debug this issue?
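(One generic debugging step, independent of this particular setup: query the adapter through the apiserver's Metrics API directly and compare it with what Prometheus returns, e.g.:)
$ kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes
$ kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/monitoring/pods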
I had to patch both my Grafana datasource for Prometheus and the deployment/prometheus-adapter YAML to get this to work consistently when using the release-0.10 quickstart manifests, pointing both at http://prometheus-operated.monitoring.svc:9090/:
kubectl patch deployment/prometheus-adapter -n monitoring --type json -p='[
{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--prometheus-url=http://prometheus-operated.monitoring.svc:9090/"}
]'
export DATASOURCE=$(echo '{
"apiVersion": 1,
"datasources": [
{
"access": "proxy",
"editable": false,
"name": "prometheus",
"orgId": 1,
"type": "prometheus",
"url": "http://prometheus-operated.monitoring.svc:9090",
"version": 1
}
]
}' | base64 | tr -d '\n')
echo $DATASOURCE              # sanity check: the encoded payload
echo $DATASOURCE | base64 -d  # sanity check: round-trips back to the JSON above
kubectl patch secret/grafana-datasources -n monitoring --type json -p="[{\"op\": \"replace\", \"path\": \"/data/datasources.yaml\", \"value\": \"$DATASOURCE\"}]"
kubectl rollout restart -n monitoring deployment grafana
Hopefully this helps someone else!
@junaid-ali I was able to reproduce this bug with the default configuration from this repository, but it is very outdated so I would recommend using the following one instead: https://github.com/prometheus-operator/kube-prometheus/blob/main/manifests/prometheus-adapter-configMap.yaml.
Since kube-prometheus has released new versions, please refer to the new URLs:
v0.9: https://github.com/prometheus-operator/kube-prometheus/blob/release-0.9/manifests/node-exporter-serviceMonitor.yaml
v0.11: https://github.com/prometheus-operator/kube-prometheus/blob/release-0.11/manifests/nodeExporter-serviceMonitor.yaml
I know this issue has already been closed, but none of the fixes here helped in my case. I found a way to get metrics working again and thought it might help others in the future. In my setup I have a firewall with iptables enabled, dropping TCP packets that do not match open ports.
To make metrics work again, I had to allow TCP traffic on port 9090 on each Kubernetes node:
# iptables -I INPUT -p tcp -m state --state NEW -m tcp --dport 9090 -j ACCEPT
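(Note that a rule added this way does not persist across reboots; persist it with iptables-save or your distro's mechanism. To check reachability first, something like the following, where <node-ip> is a placeholder:)
# nc -zv <node-ip> 9090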
This issue is still present in v0.10.0.
edit: Got this working. Running v0.10.0 on EKS 1.23, using kube-prometheus-stack. I needed to add a relabeling config (https://github.com/prometheus-community/helm-charts/blob/0b928f341240c76d8513534035a825686ed28a4b/charts/kube-prometheus-stack/values.yaml#L471) to the ServiceMonitor for node-exporter:
prometheus-node-exporter:
prometheus:
monitor:
relabelings:
- sourceLabels: [__meta_kubernetes_pod_node_name]
separator: ;
regex: ^(.*)$
targetLabel: node
replacement: $1
action: replace
After that, I used this form of the query (https://github.com/kubernetes-sigs/prometheus-adapter/blob/master/deploy/manifests/config-map.yaml).
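(To verify the relabeling took effect, you can check that node-exporter series now carry a node label; the in-cluster URL below is the kube-prometheus default and may differ in your setup:)
$ wget -qO- 'http://prometheus-k8s.monitoring.svc:9090/api/v1/query?query=count+by+(node)+(node_cpu_seconds_total)'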
@FalconerTC (replying to your comment above) What is your rules: resource: section?
@nicraMarcin I don't declare any additional rules
What happened?
Deployed Prometheus Adapter (v0.8.4) via helm chart on EKS (v1.18.16-eks-7737de) with 2 replicas. One replica was returning a result for kubectl top nodes, but the other replica threw error: metrics not available yet (the logs from that replica are excerpted earlier in this thread). Manually running the same query (api.go @ 14:08:18.931997 from the logs) against the Prometheus server from inside both replicas did return the same result.
Did you expect to see something different?
Both replicas should be able to return the node metrics for kubectl top nodes when the node query is working fine.
How to reproduce it (as minimally and precisely as possible):
Not really sure; I deleted the pod and the issue went away, but it still happens every now and then (usually with new pods?).
Environment
AWS EKS