prometheus / procfs

procfs provides functions to retrieve system, kernel and process metrics from the pseudo-filesystem proc.
Apache License 2.0
782 stars 320 forks source link

cpu_freq metrics: use scaling or cpuinfo? #122

Closed fbegyn closed 5 years ago

fbegyn commented 5 years ago

Since PR #94 it has the default behaviour of reading scaling_cpu_freq in favor of cpuinfo_cur_freq. These 2 represent different metrics in my eyes. The scaling one reflects at which frequency the linux kernel thinks the cpu runs, while cpuinfo reflects the actual HW frequency of the CPU. source: https://www.kernel.org/doc/Documentation/cpu-freq/user-guide.txt

IMHO cpuinfo should still be the default value to be read here, or these 2 values should maybe represent 2 different metrics. I'd like to know the reasoning for picking the scaling over the cpuinfo statistic.

mjtrangoni commented 5 years ago

@fbegyn I checked it out, and found that cpuinfo_cur_freq is only readable by root, which is not allowed here, as the node_exporter should be run as user.

See,

# ll /sys/devices/system/cpu/cpu0/cpufreq/*
-r--r--r--. 1 root root 4096 Dec  6 09:46 /sys/devices/system/cpu/cpu0/cpufreq/affected_cpus
-r--------. 1 root root 4096 Dec  6 09:42 /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq
-r--r--r--. 1 root root 4096 Dec  6 09:42 /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq
-r--r--r--. 1 root root 4096 Dec  6 09:42 /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq
-r--r--r--. 1 root root 4096 Dec  6 09:42 /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_transition_latency
-r--r--r--. 1 root root 4096 Dec  6 09:46 /sys/devices/system/cpu/cpu0/cpufreq/related_cpus
-r--r--r--. 1 root root 4096 Dec  6 09:46 /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors
-r--r--r--. 1 root root 4096 Dec  6 09:46 /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
-r--r--r--. 1 root root 4096 Dec  6 09:46 /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver
-rw-r--r--. 1 root root 4096 Dec  6 09:46 /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
-rw-r--r--. 1 root root 4096 Dec  6 09:46 /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
-rw-r--r--. 1 root root 4096 Dec  6 09:46 /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq
-rw-r--r--. 1 root root 4096 Dec  6 09:46 /sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed
# grep . /sys/devices/system/cpu/cpu0/cpufreq/{cpuinfo_cur_freq,scaling_cur_freq}
/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq:1200103
/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:1200103
SuperQ commented 5 years ago

Interesting, honestly, I didn't look that closely at the documentation. I assumed they were the same, just a rename between different kernel versions. We originally changed the reading preference to work around bugs in the RHEL kernel permissions for the proc file.

@mjtrangoni That's an old redhat kernel bug, on the upstream kernel and more recent patch versions of the RHEL kernel, the permission problem is fixed.

I guess we need to re-read the kernel docs more carefully.

/cc @rtreffer

fbegyn commented 5 years ago

Discovered it while comparing node-exporter metrics from a few months ago with the current ones because we couldn't make sense of it. We changed cpu goverenor to perormance but didn't see the change on the metrics. While the expected behaviour did show up in the data a from a few months ago, it didn't now. Couldn't wrap my head around it, so dove into the kernel a bit and found this out.

I'm not fully comprehending the use for scaling_cur_freq, but definitely see a usecase for cpuinfo_cur_freq.

SuperQ commented 5 years ago

I'll have to review the docs, but after first glance, you're probably right. We shouldn't be using the scaling_ files. sigh

Totally unrelated, performance on Intel doesn't really do what you expect on modern cores. It still uses sleep states to throttle down the CPU. Basically it's a noop on sandy bridge and newer from my testing.

mjtrangoni commented 5 years ago

@SuperQ This is read-only upstream now. See,

$ nl -b a drivers/cpufreq/cpufreq.c|grep -B5 -A5 "cpuinfo_cur_freq, 0400"
   868                          return sprintf(buf, "%u\n", limit);
   869          }
   870          return sprintf(buf, "%u\n", policy->cpuinfo.max_freq);
   871  }
   872
   873  cpufreq_freq_attr_ro_perm(cpuinfo_cur_freq, 0400);
   874  cpufreq_freq_attr_ro(cpuinfo_min_freq);
   875  cpufreq_freq_attr_ro(cpuinfo_max_freq);
   876  cpufreq_freq_attr_ro(cpuinfo_transition_latency);
   877  cpufreq_freq_attr_ro(scaling_available_governors);
   878  cpufreq_freq_attr_ro(scaling_driver);

And here Dave Jones says why,

Reading this file causes reads from hardware on some cpufreq drivers. This can be a slow operation, so a user could degrade system performance for everyone else by repeatedly cat'ing it.

SuperQ commented 5 years ago

@mjtrangoni I think a number of distributions patch that file.

Looking at a couple of my machines, it appears cpuinfo_cur_freq doesn't even exist on on them (>= 4.15).

It seems like my systems have the pcc_cpufreq0 module loaded.

knweiss commented 5 years ago

@SuperQ FWIW, quoting Documentation/admin-guide/pm/cpufreq.rst:

cpuinfo_cur_freq
        Current frequency of the CPUs belonging to this policy as obtained from
        the hardware (in KHz).

        This is expected to be the frequency the hardware actually runs at.
!       If that frequency cannot be determined, this attribute should not
!       be present.
SuperQ commented 5 years ago

@fbegyn Out of curiosity, what platform/driver are you using?

SuperQ commented 5 years ago

Since the scaling_cur_freq and cpuinfo_cur_freq contain different data, I think we should change the library to read both, and return float pointers if values are found.

SuperQ commented 5 years ago

I've created https://github.com/prometheus/procfs/pull/123 as a possible solution.

fbegyn commented 5 years ago

@fbegyn Out of curiosity, what platform/driver are you using?

@SuperQ Westmere CPUs and Debian 9 (I believe on 4.18 kernel). Personal system is Dell XPS13 9360 on 4.19

pgier commented 5 years ago

Is there anything more to do on this issue since #123 has been merged?

fbegyn commented 5 years ago

@pgier I don't think so. I lost track of this issue, but I'm closing it in relation to #123

jberryman commented 4 years ago

@SuperQ

Totally unrelated, performance on Intel doesn't really do what you expect on modern cores. It still uses sleep states to throttle down the CPU. Basically it's a noop on sandy bridge and newer from my testing.

This is OT but do you have any more details here? On Ivy Bridge I've been trying to figure out how to get a stable cpu frequency for benchmarking (or even figure out how to know if I'm getting a stable clock speed).

I found setting the governor to performance stabilized performance somewhat in my benchmarks (and made things significantly faster), but the various metrics from /sys/... or /proc/cpuinfo are showing cpu frequency bouncing around wildly still.

There seems to be a lot of misinformation about all of this out there...

SuperQ commented 4 years ago

@jberryman Yes, it's very complicated, as there's a lot of trickery going on under the hood in the hardware to perform power and thermal envelope management between cores and other parts of the chip. I don't have any specific documentation links off the top of my head. But I would look around for documentation on the various kernel modules that implement the changes. Things like intel_cstate.