prometheus / node_exporter

Exporter for machine metrics
https://prometheus.io/
Apache License 2.0

Pressure metric collection fails on systems that do not expose a full CPU stat #3051

Closed (pahaeanx closed 2 weeks ago)

pahaeanx commented 3 weeks ago

Looks to me like https://github.com/prometheus/node_exporter/pull/3016 unfortunately broke pressure stats collection on systems that do not expose a full stat for CPU.

In my case this happens on Debian 11. There, /proc/pressure/cpu and /sys/fs/cgroup/cpu.pressure do not contain a full line, and the collector aborts after failing to collect the full pressure stats for CPU (see the log output further down).

# cat /sys/fs/cgroup/cpu.pressure /proc/pressure/cpu
some avg10=2.65 avg60=2.78 avg300=2.92 total=111749368752
some avg10=2.65 avg60=2.78 avg300=2.92 total=111749368752
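
To check quickly whether a kernel exposes the full line, the pressure file can be parsed per line kind. The following is a minimal sketch in Python, not node_exporter's Go code; the function name parse_psi is hypothetical, and the sample text is the output shown above.

```python
def parse_psi(text):
    """Parse PSI text into {"some": {...}, "full": {...}}.

    A kind that the kernel does not report is simply absent from the result.
    """
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        kind, pairs = fields[0], fields[1:]
        values = {}
        for pair in pairs:
            key, _, raw = pair.partition("=")
            # "total" is an integer counter; the avg* fields are decimals.
            values[key] = int(raw) if key == "total" else float(raw)
        stats[kind] = values
    return stats


# Sample taken from the Debian 11 output above: only a "some" line exists.
sample = "some avg10=2.65 avg60=2.78 avg300=2.92 total=111749368752\n"

parsed = parse_psi(sample)
print("kinds reported:", sorted(parsed))
print("has full:", "full" in parsed)
```

On a live system the same function could be fed the contents of /proc/pressure/cpu instead of the sample string.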

Host operating system: output of uname -a

Debian 11

Linux <snip> 5.10.0-30-amd64 #1 SMP Debian 5.10.218-1 (2024-06-01) x86_64 GNU/Linux

node_exporter version: output of node_exporter --version

node_exporter, version 1.8.1 (branch: HEAD, revision: 400c3979931613db930ea035f39ce7b377cdbb5b)
  build user:       root@7afbff271a3f
  build date:       20240521-18:36:22
  go version:       go1.22.3
  platform:         linux/amd64
  tags:             unknown

node_exporter command line flags

Pressure collector enabled.

node_exporter log output

Jun 14 06:51:23 <snip> node_exporter[2135275]: ts=2024-06-14T06:51:23.605Z caller=pressure_linux.go:92 level=debug collector=pressure msg="collecting statistics for resource" resource=cpu
Jun 14 06:51:23 <snip> node_exporter[2135275]: ts=2024-06-14T06:51:23.605Z caller=pressure_linux.go:110 level=debug collector=pressure msg="pressure information returned no 'full' data"
Jun 14 06:51:23 <snip> node_exporter[2135275]: ts=2024-06-14T06:51:23.605Z caller=collector.go:167 level=debug msg="collector returned no data" name=pressure duration_seconds=0.0001385 err="collector returned no data"

Are you running node_exporter in Docker?

No

What did you expect to see?

Same behavior we saw with node-exporter version <1.8.1 -- we still collected the rest of the pressure metrics.

Pressure stats collection should continue and simply skip the node_pressure_cpu_stalled_seconds_total metric (which I assume is the metric emitted for the CPU full stall).

What did you see instead?

No pressure metrics at all. The collector fails with the above error message.

node_scrape_collector_success{collector="pressure"} 0

SuperQ commented 3 weeks ago

Ahh, you're right. From the documentation:

CPU full is undefined at the system level, but has been reported since 5.13, so it is set to zero for backward compatibility.

We should allow only 'some' for CPU.
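
One way to read that suggestion is sketched below in Python. It is an illustration of the policy only, not the actual fix and not node_exporter's Go code. The function pressure_metrics and the KIND_TO_NAME table are hypothetical, the metric names are assumed to follow the pattern of node_pressure_cpu_stalled_seconds_total mentioned above, the total field is assumed to be in microseconds, and the memory values are made up for the example.

```python
MICROSECONDS_PER_SECOND = 1_000_000

# Hypothetical mapping from PSI line kind to a metric-name component.
KIND_TO_NAME = {"some": "waiting", "full": "stalled"}


def pressure_metrics(stats_by_resource):
    """Build metric values from parsed PSI stats.

    A missing "full" entry is tolerated for "cpu" only; any other missing
    entry is still treated as an error.
    """
    metrics = {}
    for resource, stats in stats_by_resource.items():
        for kind, name in KIND_TO_NAME.items():
            if kind not in stats:
                if resource == "cpu" and kind == "full":
                    continue  # skip only this metric, keep the rest
                raise ValueError(f"missing {kind!r} data for {resource}")
            metric = f"node_pressure_{resource}_{name}_seconds_total"
            metrics[metric] = stats[kind]["total"] / MICROSECONDS_PER_SECOND
    return metrics


# CPU total comes from the report above; the memory values are illustrative.
example = {
    "cpu": {"some": {"total": 111749368752}},
    "memory": {"some": {"total": 0}, "full": {"total": 0}},
}

for metric_name, seconds in sorted(pressure_metrics(example).items()):
    print(metric_name, seconds)
```

With this policy the CPU stalled metric is omitted on kernels that report no CPU full line, while the other pressure metrics are still produced.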