prometheus-community / helm-charts

[prometheus-adapter] kubectl top pods works, but top node not #3613

Open pasztorl opened 1 year ago

pasztorl commented 1 year ago

Describe the bug (a clear and concise description of what the bug is).

I've installed kube-prometheus-stack, then prometheus-adapter. Now kubectl top pods works, but kubectl top node says "metrics not available yet".

in the log:

I0718 16:44:11.193123       1 provider.go:293] missing CPU for node "cp1.test", skipping
I0718 16:44:11.193149       1 provider.go:293] missing CPU for node "cp2.test", skipping
I0718 16:44:11.193156       1 provider.go:293] missing CPU for node "cp3test", skipping

and also:

E0718 16:45:21.402860       1 writers.go:118] apiserver was unable to write a JSON response: http2: stream closed
E0718 16:45:21.403015       1 writers.go:131] apiserver was unable to write a fallback JSON response: http2: stream closed
E0718 16:45:21.431657       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}: http: Handler timeout
E0718 16:45:21.432939       1 writers.go:131] apiserver was unable to write a fallback JSON response: http: Handler timeout
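
One way to narrow this down (independent of kubectl's output formatting) is to query the resource metrics API that prometheus-adapter serves. These are standard kubectl calls; nothing below is specific to this setup apart from the "default" namespace used as an example:

    # Node metrics from the metrics.k8s.io API; an empty "items" list matches
    # the "metrics not available yet" symptom from kubectl top node.
    kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes"

    # Pod metrics for comparison; these should return data, mirroring the
    # working kubectl top pods.
    kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/default/pods"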

What's your helm version?

not related

What's your kubectl version?

not related

Which chart?

prometheus-adapter

What's the chart version?

4.2.0

What happened?

No response

What you expected to happen?

No response

How to reproduce it?

No response

Enter the changed values of values.yaml?

rules:
  resource:
    cpu:
      containerQuery: |
        sum by (<<.GroupBy>>) (
          rate(container_cpu_usage_seconds_total{container!="",<<.LabelMatchers>>}[3m])
        )
      nodeQuery: |
        sum by (<<.GroupBy>>) (
          rate(node_cpu_seconds_total{mode!="idle",mode!="iowait",mode!="steal",<<.LabelMatchers>>}[3m])
        )
      resources:
        overrides:
          node:
            resource: node
          namespace:
            resource: namespace
          pod:
            resource: pod
      containerLabel: container
    memory:
      containerQuery: |
        sum by (<<.GroupBy>>) (
          avg_over_time(container_memory_working_set_bytes{container!="",<<.LabelMatchers>>}[3m])
        )
      nodeQuery: |
        sum by (<<.GroupBy>>) (
          avg_over_time(node_memory_MemTotal_bytes{<<.LabelMatchers>>}[3m])
          -
          avg_over_time(node_memory_MemAvailable_bytes{<<.LabelMatchers>>}[3m])
        )
      resources:
        overrides:
          node:
            resource: node
          namespace:
            resource: namespace
          pod:
            resource: pod
      containerLabel: container
    window: 3m
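
The adapter expands <<.LabelMatchers>> in nodeQuery into a matcher on the label mapped to the Kubernetes node resource (here the node label, per resources.overrides), so if node_cpu_seconds_total carries no node label the query returns nothing and the adapter logs "missing CPU for node ... skipping". As a rough sketch, the expanded CPU nodeQuery for one of the nodes from the log above would look something like:

    # If node_cpu_seconds_total has no "node" label, this matches no series
    # and the adapter skips the node.
    sum by (node) (
      rate(node_cpu_seconds_total{mode!="idle",mode!="iowait",mode!="steal",node="cp1.test"}[3m])
    )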

Enter the command that you execute that is failing/misfunctioning.

installed with ansible helm module (not related)

Anything else we need to know?

No response

ryanobjc commented 10 months ago

Looks like this may be due to a missing 'node' label on the node_cpu_seconds_total metric.
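
If it helps anyone checking this on their own cluster, one way to confirm is to look at the label set Prometheus actually stores for that metric. The namespace and service name below are the kube-prometheus-stack defaults and may differ in your install; jq is only used to print the label set of the first returned series:

    # Port-forward the Prometheus created by the operator (assumes the
    # "monitoring" namespace and the default prometheus-operated service).
    kubectl -n monitoring port-forward svc/prometheus-operated 9090 &

    # Inspect the labels on node_cpu_seconds_total; if there is no "node"
    # label here, the adapter's nodeQuery can never match a node.
    curl -s 'http://localhost:9090/api/v1/query?query=node_cpu_seconds_total' | jq '.data.result[0].metric'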

ryanobjc commented 10 months ago

I was running into this on EKS, kubernetes version 1.28.1, and I was able to fix it myself by adding this to the values of the kube-prometheus-stack helm deployment:

prometheus-node-exporter:
  prometheus:
    monitor:
      attachMetadata:
        node: true
      relabelings:
      - sourceLabels:
        - __meta_kubernetes_endpoint_node_name
        targetLabel: node
        action: replace
        regex: (.+)
        replacement: ${1}

I believe this bug actually lies in the prometheus-operator config reloader, since this is a default that should be included in the expansion of the ServiceMonitor configuration. In any case, this works like a charm, and now 'kubectl top nodes' works with prometheus-adapter (which I had to install separately from kube-prometheus-stack, despite the README saying it's included).
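
For anyone applying the same workaround, rolling it out is just a normal values update on the kube-prometheus-stack release; the release name, namespace, and values file below are placeholders for whatever your install uses:

    # Apply the updated values to the existing kube-prometheus-stack release
    # (release name, namespace, and values file are examples).
    helm upgrade kube-prometheus-stack prometheus-community/kube-prometheus-stack \
      -n monitoring -f values.yaml

    # After the node-exporter targets have been rescraped with the new "node"
    # label, node metrics should show up again.
    kubectl top nodes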

ganchkal commented 5 months ago

@ryanobjc Thanks a lot for your answer! Had the same issue with EKS as well.