Closed: junaid-ali closed this issue 2 years ago.
Hey @junaid-ali, I noticed that the query you executed inside the replica was slightly different from the one in the logs: it is missing &time=1621433298.929, which pins the timestamp at which prometheus-adapter gets metrics from Prometheus. Could you try again with it?
Also, if you try executing the query in the Prometheus UI, do you see any gaps in the graph that could indicate scrape failures on the Prometheus side?
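(For example, a range query over the window in question makes gaps easy to spot from inside the cluster too; this is just a sketch reusing the service URL from your logs, with placeholder start/end/step values to adjust:)
$ wget -qO- 'http://prometheus-kube-prometheus-prometheus.default.svc:9090/prometheus/api/v1/query_range?query=up%7Bjob%3D%22node-exporter%22%7D&start=1621430000&end=1621433600&step=30'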
@dgrisonnet I tried with the timestamp as well, and that did return the same result. Here's a recent one:
$ wget -qO- http://prometheus-kube-prometheus-prometheus.default.svc:9090/prometheus/api/v1/query?query=sum%281+-+irate%28node_cpu_seconds_total%7Bmode%3D%22idle%22%7D%5B5m%5D%29+%2A+on%28namespace%2C+pod%29+group_left%28node%29+node_namespace_pod%3Akube_pod_info%3A%7B%7D%29+by+%28node%29\&time=1621591878.122 # ampersand escaped with a backslash for the shell
{"status":"success","data":{"resultType":"vector","result":[{"metric":{"node":"ip-10-0-6-55.******"},"value":[1621591878.122,"0.05399999999935978"]},{"metric":{"node":"ip-10-0-4-20.******"},"value":[1621591878.122,"0.05233333333308099"]},{"metric":{"node":"ip-10-0-4-227.******"},"value":[1621591878.122,"0.1836666666630964"]},{"metric":{"node":"ip-10-0-5-164.******"},"value":[1621591878.122,"0.1860000000005433"]},{"metric":{"node":"ip-10-0-5-175.******"},"value":[1621591878.122,"1.7483333333337212"]},{"metric":{"node":"ip-10-0-6-138.******"},"value":[1621591878.122,"0.0543333333330035"]}]}}
$ wget -qO- http://prometheus-kube-prometheus-prometheus.default.svc:9090/prometheus/api/v1/query?query=sum%28%28node_memory_MemTotal_bytes%7Bjob%3D%22node-exporter%22%7D+-+node_memory_MemAvailable_bytes%7Bjob%3D%22node-exporter%22%7D%29+%2A+on+%28namespace%2C+pod%29+group_left%28node%29+node_namespace_pod%3Akube_pod_info%3A%7B%7D%29+by+%28node%29\&time=1621591880.903 # ampersand escaped with a backslash for the shell
{"status":"success","data":{"resultType":"vector","result":[{"metric":{"node":"ip-10-0-4-227.******"},"value":[1621591880.903,"1425924096"]},{"metric":{"node":"ip-10-0-5-164.******"},"value":[1621591880.903,"1008308224"]},{"metric":{"node":"ip-10-0-5-175.******"},"value":[1621591880.903,"1755410432"]},{"metric":{"node":"ip-10-0-6-138.******"},"value":[1621591880.903,"842174464"]},{"metric":{"node":"ip-10-0-6-55.******"},"value":[1621591880.903,"843423744"]},{"metric":{"node":"ip-10-0-4-20.******"},"value":[1621591880.903,"896737280"]}]}}
At the moment, there's only one replica, and I'm getting the following error in the logs (excerpt):
...
I0521 10:13:50.737655 1 handler.go:143] prometheus-metrics-adapter: GET "/apis/metrics.k8s.io/v1beta1/nodes" satisfied by gorestful with webservice /apis/metrics.k8s.io/v1beta1
I0521 10:13:50.739757 1 api.go:74] GET http://prometheus-kube-prometheus-prometheus.default.svc:9090/prometheus/api/v1/query?query=sum%28%28node_memory_MemTotal_bytes%7Bjob%3D%22node-exporter%22%7D+-+node_memory_MemAvailable_bytes%7Bjob%3D%22node-exporter%22%7D%29+%2A+on+%28namespace%2C+pod%29+group_left%28node%29+node_namespace_pod%3Akube_pod_info%3A%7B%7D%29+by+%28node%29&time=1621592030.737 200 OK
I0521 10:13:50.740106 1 api.go:74] GET http://prometheus-kube-prometheus-prometheus.default.svc:9090/prometheus/api/v1/query?query=sum%281+-+irate%28node_cpu_seconds_total%7Bmode%3D%22idle%22%7D%5B5m%5D%29+%2A+on%28namespace%2C+pod%29+group_left%28node%29+node_namespace_pod%3Akube_pod_info%3A%7B%7D%29+by+%28node%29&time=1621592030.737 200 OK
I0521 10:13:50.740283 1 provider.go:277] missing CPU for node "ip-10-0-4-20.******", skipping
I0521 10:13:50.740298 1 provider.go:277] missing CPU for node "ip-10-0-4-227.******", skipping
I0521 10:13:50.740306 1 provider.go:277] missing CPU for node "ip-10-0-5-164.******", skipping
I0521 10:13:50.740314 1 provider.go:277] missing CPU for node "ip-10-0-5-175.******", skipping
I0521 10:13:50.740321 1 provider.go:277] missing CPU for node "ip-10-0-6-138.******", skipping
I0521 10:13:50.740329 1 provider.go:277] missing CPU for node "ip-10-0-6-55.******", skipping
I0521 10:13:50.740420 1 httplog.go:89] "HTTP" verb="GET" URI="/apis/metrics.k8s.io/v1beta1/nodes" latency="3.400931ms" userAgent="kubectl/v1.18.0 (darwin/amd64) kubernetes/9e99141" srcIP="10.0.6.74:59194" resp=200
No issues with running the queries directly in the Prometheus UI:
CPU query
sum(1 - irate(node_cpu_seconds_total{mode="idle"}[5m]) * on(namespace, pod) group_left(node) node_namespace_pod:kube_pod_info:{}) by (node)
CPU query result:
{node="ip-10-0-6-138.******"} | 0.05699999999997085
{node="ip-10-0-6-55.******"} | 0.05233333333308099
{node="ip-10-0-4-20.******"} | 0.05433333333348844
{node="ip-10-0-4-227.******"} | 0.33099999999394636
{node="ip-10-0-5-164.******"} | 0.15966666666596818
{node="ip-10-0-5-175.******"} | 3.2099999999996123
Memory query:
sum((node_memory_MemTotal_bytes{job="node-exporter"} - node_memory_MemAvailable_bytes{job="node-exporter"}) * on (namespace, pod) group_left(node) node_namespace_pod:kube_pod_info:{}) by (node)
Memory query result:
{node="ip-10-0-5-164.******"} | 1000628224
{node="ip-10-0-5-175.******"} | 892575744
{node="ip-10-0-6-138.******"} | 845381632
{node="ip-10-0-6-55.******"} | 843325440
{node="ip-10-0-4-20.******"} | 896249856
{node="ip-10-0-4-227.******} | 1429225472
@dgrisonnet have you had a chance to look at the previous reply? Also, I wanted to confirm that the node names returned by the queries are exactly the same as what we get in kubectl get nodes.
I haven't had the chance to really look into this as I would most likely need a reproducer to further investigate this bug.
From the look of it, it seems that Prometheus is sometimes returning an empty response to prometheus-adapter. Do you perhaps also have this issue with pods? You should see the following error in the logs if something is going wrong for the pods: unable to fetch metrics for pods in namespace
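(For example, assuming the adapter Deployment is named prometheus-adapter and runs in the default namespace as in your commands above, something like:)
$ kubectl logs deploy/prometheus-adapter | grep -i 'unable to fetch metrics'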
@dgrisonnet it's only happening for nodes. Also, it always returns error: metrics not available yet for nodes (and it's not an intermittent issue); on re-creating the prometheus-adapter pod, the issue goes away.
Thank you for the clarification, I'll try to reproduce the bug.
@junaid-ali I was able to reproduce this bug with the default configuration from this repository, but it is very outdated so I would recommend using the following one instead: https://github.com/prometheus-operator/kube-prometheus/blob/main/manifests/prometheus-adapter-configMap.yaml.
@dgrisonnet I actually copied my nodeQuery from the link you shared; please check my Prometheus queries here: https://github.com/kubernetes-sigs/prometheus-adapter/issues/398#issuecomment-845847789. For example, for CPU I'm using this query:
sum(1 - irate(node_cpu_seconds_total{mode="idle"}[5m]) * on(namespace, pod) group_left(node) node_namespace_pod:kube_pod_info:{}) by (node)
The only difference is that I'm not using the LabelMatchers, since that was causing an issue (I was getting instance as the label key with the node name as the value, while prometheus-adapter was expecting node as the label key instead of instance).
For the cpu query, the labelMatchers should match node and not instance. As for memory, we have some relabeling in place in kube-prometheus for node-exporter: https://github.com/prometheus-operator/kube-prometheus/blob/main/manifests/node-exporter-serviceMonitor.yaml
I'll try to reproduce with your query, but with the one from kube-prometheus I wasn't able to so far.
@dgrisonnet I did print the cpuQuery and memQuery to debug this. So when this issue happens, e.g. with the error missing CPU info for node ... in the adapter logs, the cpuQuery value looks like this:
{%!q(*naming.resourceConverter=&{{{0 0} 0 0 0 0} map[instance:{ nodes} namespace:{ namespaces} node:{ nodes} pod:{ pods}] map[{ namespaces}:namespace { nodes}:instance { pods}:pod]....... "container"}
Compared to when the metrics work fine, the cpuQuery value looks like this:
{%!q(*naming.resourceConverter=&{{{0 0} 0 0 0 0} map[instance:{ nodes} namespace:{ namespaces} node:{ nodes} pod:{ pods}] map[{ namespaces}:namespace { nodes}:node { pods}:pod]..... "container"}
So, the difference is in the reverse mapping for { nodes} (please scroll right in the snippets above to see it): in the failing case it is { nodes}:instance, in the working case { nodes}:node. I see a similar difference in memQuery when this happens with memory metrics. Because of this, the adapter tries to get the node name using instance as the label while the actual label is node, so it fails to get the node name here and returns an empty value. So there seems to be some issue in this NewResourceConverter, which I couldn't figure out exactly.
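A hypothetical sketch of how that flip could happen (this is not the adapter's actual code, just an illustration): if the reverse map is built by ranging over a Go map in which both instance and node point to the nodes resource, the last writer wins, and Go randomizes map iteration order, so different runs can disagree:

package main

import "fmt"

func main() {
	// Forward mapping, as in the dumps above: two labels both point to "nodes".
	labelToResource := map[string]string{
		"instance":  "nodes",
		"node":      "nodes",
		"namespace": "namespaces",
		"pod":       "pods",
	}
	for run := 0; run < 5; run++ {
		resourceToLabel := make(map[string]string)
		for label, resource := range labelToResource {
			resourceToLabel[resource] = label // last writer wins; range order is randomized
		}
		// Can print "instance" on some iterations and "node" on others.
		fmt.Println("run", run, "nodes ->", resourceToLabel["nodes"])
	}
}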
NOTE: I'm overriding instance to node via overrides; the adapter does work without this issue most of the time, but sometimes when the adapter pod is recreated, or even when it is created the first time, we see the above issue.
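(For reference, the override I mean is this stanza of the resource rules; a rough sketch of my config, not the full file:)

resources:
  overrides:
    instance:
      resource: node
    namespace:
      resource: namespace
    pod:
      resource: pod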
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
I'm still seeing this, even after changing instance to node in overrides. I tried putting it back as well and restarting the pod; I still get the same result. Only top pods works. I provided some output (logs, values, etc.) in a similar issue:
https://github.com/kubernetes-sigs/prometheus-adapter/issues/385#issuecomment-924152312
I undid the node overrides and put it back to how the README has it; that seems to have resolved the issue for me.
I was facing the same problem on Amazon EKS version 1.21-eks.2, with both prometheus-server and prometheus-adapter installed using the community charts, following the example in the README. The versions are as below:
$ helm ls -n prometheus-system
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
prometheus prometheus-system 1 2021-10-12 01:27:02.127990141 +0000 UTC deployed prometheus-14.9.2 2.26.0
prometheus-adapter prometheus-system 6 2021-10-14 19:20:03.800011211 +0000 UTC deployed prometheus-adapter-2.17.0 v0.9.0
Following the workaround proposed by @junaid-ali, I was able to make it work by changing the association of the nodes resource to the instance label (instead of the original node). My values file is currently like this:
prometheus:
path: ""
port: 80
url: http://prometheus-server.prometheus-system.svc
rules:
resource:
cpu:
containerLabel: container
containerQuery: sum(rate(container_cpu_usage_seconds_total{<<.LabelMatchers>>,
container!=""}[3m])) by (<<.GroupBy>>)
nodeQuery: sum(rate(container_cpu_usage_seconds_total{<<.LabelMatchers>>, id='/'}[3m]))
by (<<.GroupBy>>)
resources:
overrides:
instance:
resource: node
namespace:
resource: namespace
pod:
resource: pod
memory:
containerLabel: container
containerQuery: sum(container_memory_working_set_bytes{<<.LabelMatchers>>, container!=""})
by (<<.GroupBy>>)
nodeQuery: sum(container_memory_working_set_bytes{<<.LabelMatchers>>,id='/'})
by (<<.GroupBy>>)
resources:
overrides:
instance:
resource: node
namespace:
resource: namespace
pod:
resource: pod
window: 3m
After that, I'm now able to query resource metrics with kubectl top nodes/pods:
$ kubectl top nodes
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
ip-10-1-10-60.ec2.internal 47m 2% 657Mi 9%
ip-10-1-14-239.ec2.internal 76m 3% 1121Mi 16%
$ kubectl top pods -A
NAMESPACE NAME CPU(cores) MEMORY(bytes)
kube-system aws-node-2h4k6 3m 47Mi
kube-system aws-node-fdspx 3m 48Mi
kube-system coredns-66cb55d4f4-7g7x4 0m 10Mi
kube-system coredns-66cb55d4f4-7wzsc 1m 9Mi
kube-system kube-proxy-fd9ps 0m 13Mi
kube-system kube-proxy-fsbwq 0m 13Mi
prometheus-system prometheus-adapter-8bcbbfb8b-gv8m8 10m 39Mi
prometheus-system prometheus-alertmanager-787f86875f-x9skk 0m 12Mi
prometheus-system prometheus-kube-state-metrics-58c5cd6ddb-666td 0m 11Mi
prometheus-system prometheus-node-exporter-4rh98 0m 7Mi
prometheus-system prometheus-node-exporter-5xv2s 0m 7Mi
prometheus-system prometheus-pushgateway-6bd6fcd9b8-m4nmg 0m 7Mi
prometheus-system prometheus-server-648c978678-9dbbx 13m 370Mi
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue or PR with /reopen
- Mark this issue or PR as fresh with /remove-lifecycle rotten
Please send feedback to sig-contributor-experience at kubernetes/community.
/close
@k8s-triage-robot: Closing this issue.
I don't see the issue in v0.9.1
We met the same issue on EKS 1.21 with prometheus-adapter version v0.9.1. The error logs of prometheus-adapter:
E0408 09:18:26.254304 1 provider.go:191] unable to fetch CPU metrics for pod monitoring/prometheus-adapter-7dc46dd46d-vs2zd, skipping
E0408 09:18:26.254309 1 provider.go:191] unable to fetch CPU metrics for pod monitoring/prometheus-k8s-0, skipping
E0408 09:18:26.254315 1 provider.go:191] unable to fetch CPU metrics for pod monitoring/prometheus-k8s-1, skipping
The command kubectl top node returns:
error: metrics not available yet
And the command kubectl top po -n monitoring returns:
error: Metrics not available for pod monitoring/grafana-79f58457b6-lr4jn, age: 1h19m39.326246s
The content of the prometheus-adapter container spec:
containers:
- args:
- --cert-dir=/var/run/serving-cert
- --config=/etc/adapter/config.yaml
- --logtostderr=true
- --metrics-relist-interval=1m
- --prometheus-url=http://prometheus-k8s.monitoring.svc:9090/
- --secure-port=6443
- --tls-cipher-suites=TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA,TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA,TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA,TLS_RSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA
image: k8s.gcr.io/prometheus-adapter/prometheus-adapter:v0.9.1
The content of the prometheus container spec:
spec:
containers:
- args:
- --web.console.templates=/etc/prometheus/consoles
- --web.console.libraries=/etc/prometheus/console_libraries
- --config.file=/etc/prometheus/config_out/prometheus.env.yaml
- --storage.tsdb.path=/prometheus
- --storage.tsdb.retention.time=30d
- --web.enable-lifecycle
- --web.route-prefix=/
- --web.config.file=/etc/prometheus/web_config/web-config.yaml
image: quay.io/prometheus/prometheus:v2.29.1
Could someone give us some clues on how to debug this issue?
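(One generic debugging step, independent of this particular setup: query the adapter through the apiserver's Metrics API directly and compare it with what Prometheus returns, e.g.:)
$ kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes
$ kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/monitoring/pods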
I had to patch both my Grafana datasource for Prometheus and the deployment/prometheus-adapter YAML to get this to work consistently when using the release-0.10 quickstart manifests, pointing both at http://prometheus-operated.monitoring.svc:9090/:
kubectl patch deployment/prometheus-adapter -n monitoring --type json -p='[
{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--prometheus-url=http://prometheus-operated.monitoring.svc:9090/"}
]'
export DATASOURCE=$(echo '{
"apiVersion": 1,
"datasources": [
{
"access": "proxy",
"editable": false,
"name": "prometheus",
"orgId": 1,
"type": "prometheus",
"url": "http://prometheus-operated.monitoring.svc:9090",
"version": 1
}
]
}' | base64 | tr -d '\n')
echo $DATASOURCE              # sanity check: the encoded payload
echo $DATASOURCE | base64 -d  # sanity check: round-trips back to the JSON above
kubectl patch secret/grafana-datasources -n monitoring --type json -p="[{\"op\": \"replace\", \"path\": \"/data/datasources.yaml\", \"value\": \"$DATASOURCE\"}]"
kubectl rollout restart -n monitoring deployment grafana
Hopefully this helps someone else!
@junaid-ali I was able to reproduce this bug with the default configuration from this repository, but it is very outdated so I would recommend using the following one instead: https://github.com/prometheus-operator/kube-prometheus/blob/main/manifests/prometheus-adapter-configMap.yaml.
Since kube-prometheus has released new versions, please refer to the new URLs:
v0.9: https://github.com/prometheus-operator/kube-prometheus/blob/release-0.9/manifests/node-exporter-serviceMonitor.yaml
v0.11: https://github.com/prometheus-operator/kube-prometheus/blob/release-0.11/manifests/nodeExporter-serviceMonitor.yaml
I know this issue has already been closed, but none of the fixes here helped in my case. I found a way to get metrics working again and thought it might help others in the future. In my setup I have a firewall with iptables enabled, dropping TCP packets that do not match open ports.
To make metrics work again, I had to allow TCP traffic on port 9090 on each Kubernetes node:
# iptables -I INPUT -p tcp -m state --state NEW -m tcp --dport 9090 -j ACCEPT
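(Note that a rule added this way does not persist across reboots; persist it with iptables-save or your distro's mechanism. To check reachability first, something like the following, where <node-ip> is a placeholder:)
# nc -zv <node-ip> 9090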
This issue is still present in v0.10.0.
edit: Got this working. Running v0.10.0 on EKS 1.23, using kube-prometheus-stack. I needed to add a relabeling config (https://github.com/prometheus-community/helm-charts/blob/0b928f341240c76d8513534035a825686ed28a4b/charts/kube-prometheus-stack/values.yaml#L471) to the ServiceMonitor for node-exporter:
prometheus-node-exporter:
prometheus:
monitor:
relabelings:
- sourceLabels: [__meta_kubernetes_pod_node_name]
separator: ;
regex: ^(.*)$
targetLabel: node
replacement: $1
action: replace
After that, I used this form of the query (https://github.com/kubernetes-sigs/prometheus-adapter/blob/master/deploy/manifests/config-map.yaml).
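(To verify the relabeling took effect, you can check that node-exporter series now carry a node label; the in-cluster URL below is the kube-prometheus default and may differ in your setup:)
$ wget -qO- 'http://prometheus-k8s.monitoring.svc:9090/api/v1/query?query=count+by+(node)+(node_cpu_seconds_total)'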
@FalconerTC (replying to your comment above) What is your rules: resource: section?
@nicraMarcin I don't declare any additional rules
What happened?
Deployed Prometheus Adapter (v0.8.4) via helm chart on EKS (v1.18.16-eks-7737de) with 2 replicas. One replica was returning a result for kubectl top nodes, but the other replica threw error: metrics not available yet (the logs from that replica are excerpted earlier in this thread). Manually running the same query (api.go @ 14:08:18.931997 from the logs) against the Prometheus server from inside both replicas did return the same result.
Did you expect to see something different?
Both replicas should be able to return the node metrics for kubectl top nodes when the node query is working fine.
How to reproduce it (as minimally and precisely as possible):
Not really sure; I deleted the pod and the issue went away, but it still happens every now and then (usually with new pods?).
Environment
AWS EKS