prometheus / node_exporter

Exporter for machine metrics
https://prometheus.io/
Apache License 2.0

Slow response, high CPU usage by Prometheus due to high cardinality / series count on cpu, cpu_guest #3161

Closed rkusniri closed 3 weeks ago

rkusniri commented 3 weeks ago

Host operating system: output of uname -a

Linux 5.4.17-2136.327.2

node_exporter version: output of node_exporter --version

node_exporter, version 1.8.1 (branch: HEAD, revision: 400c3979931613db930ea035f39ce7b377cdbb5b)
  build date: 20240521-18:36:22
  go version: go1.22.3
  platform: linux/amd64
  tags: unknown

node_exporter command line flags

defaults

node_exporter log output

no errors

Are you running node_exporter in Docker?

no

What did you do that produced an error?

Ran the CPU collector with default settings.

What did you expect to see?

Overall node CPU usage for each CPU mode.

What did you see instead?

CPU stats per core only

With hundreds of instances, each with hundreds of CPU cores, a visualization (query) that shows only a basic whole-node view of utilization by mode is very slow and CPU-demanding at request time when it covers a period of several hours for all modes at once (sampled every 2 seconds, because the data is streamed live). Even when this processing is moved to recording rules, which require a very low evaluation_interval (seconds), it creates excessive CPU overhead on the Prometheus server because of the number of series and the frequency of the data.
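To make the requested aggregation concrete, here is a minimal Python sketch of collapsing per-core counters into one total per mode. It assumes the standard text exposition format and the `node_cpu_seconds_total` metric with `cpu` and `mode` labels; the function name `sum_cpu_seconds_by_mode` and the sample values are hypothetical, and this is an illustration only, not something node_exporter provides.

```python
import re
from collections import defaultdict

# Matches sample lines such as:
#   node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5
# Assumption: no trailing timestamp on the sample line.
SAMPLE_RE = re.compile(
    r'^node_cpu_seconds_total\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def sum_cpu_seconds_by_mode(exposition_text):
    """Sum per-core CPU seconds into one counter value per mode."""
    totals = defaultdict(float)
    for line in exposition_text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if match is None:
            continue  # comments (# HELP / # TYPE) and other metrics
        labels = dict(LABEL_RE.findall(match.group("labels")))
        mode = labels.get("mode")
        if mode is None:
            continue
        totals[mode] += float(match.group("value"))
    return dict(totals)


# Hypothetical two-core sample, for illustration only.
SAMPLE = """\
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 100.0
node_cpu_seconds_total{cpu="0",mode="user"} 20.0
node_cpu_seconds_total{cpu="1",mode="idle"} 90.0
node_cpu_seconds_total{cpu="1",mode="user"} 30.0
"""

if __name__ == "__main__":
    print(sum_cpu_seconds_by_mode(SAMPLE))
```

The result has one series per mode instead of one per core and mode, which is the reduction in series count being asked for.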

Please add explicit metrics for overall node CPU usage per mode.

Thank you

SuperQ commented 3 weeks ago

Sorry, this is not something we plan to support.

SuperQ commented 3 weeks ago

Please read about recording rules.

rkusniri commented 3 weeks ago

I tried to point out that issue as well. Because of the number of series involved (cpu, cpu_guest: 3500+ per single node), the required calculations (a rate computed from cpu_seconds), and the evaluation frequency (at least every 5 seconds is required), recording rules create excessive, unnecessary resource consumption on the Prometheus server. Is there any other efficient way, besides the infeasible direct queries or recording rules, to gather overall node CPU statistics by mode?
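A rough back-of-the-envelope sketch of where a figure like 3500+ series per node can come from. It assumes the Linux CPU collector exposes 8 modes on `node_cpu_seconds_total` and 2 modes on `node_cpu_guest_seconds_total` per core; the core count (350) and node count (100) are hypothetical values chosen for illustration, and only the 2-second interval comes from the report above.

```python
# Assumed per-core modes on Linux (labelled as an assumption, not measured here).
CPU_MODES = ("idle", "iowait", "irq", "nice", "softirq", "steal", "system", "user")
GUEST_MODES = ("user", "nice")


def per_node_series(cores):
    """Number of cpu + cpu_guest series one node exposes."""
    return cores * (len(CPU_MODES) + len(GUEST_MODES))


def samples_per_second(nodes, cores, scrape_interval_s):
    """Ingested CPU samples per second across all nodes."""
    return nodes * per_node_series(cores) / scrape_interval_s


if __name__ == "__main__":
    # Hypothetical fleet: 100 nodes with 350 cores each, scraped every 2 s.
    print(per_node_series(350))             # 3500
    print(samples_per_second(100, 350, 2))  # 175000.0
```

With per-mode totals only, the same hypothetical node would expose 10 series instead of 3500.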

discordianfish commented 2 weeks ago

For questions/help/support please use our community channels. There are more people available to potentially respond to your request and the whole community can benefit from the answers provided.