hubblo-org / scaphandre

⚡ Energy consumption metrology agent. Let "scaph" dive and bring back the metrics that will help you make your systems and applications more sustainable !
Apache License 2.0

MSR based sensor #78

Open bpetit opened 3 years ago

bpetit commented 3 years ago

Problem

We noticed that on AMD Zen CPUs with kernels older than 5.11, perf works fine but scaphandre does not. The reason is that the kernel drivers powercap needs to expose RAPL data on AMD CPUs are not present before 5.11, whereas perf relies on the MSRs.

Solution

We need an MSR-based sensor to be able to get metrics when powercap RAPL is not available.
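To make the idea concrete, here is a minimal sketch of what the core of such a sensor could look like on Linux. It is not scaphandre code: the function names are hypothetical, it assumes the `msr` kernel module is loaded and the process is allowed to read `/dev/cpu/N/msr` (typically root), and it assumes the usual RAPL layout where the energy status unit sits in bits 12:8 of the power-unit register and the energy counter is 32 bits wide. The register addresses are the ones documented for Intel and AMD Zen, but should be double-checked against the vendor manuals for a given CPU.

```rust
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt;

// RAPL register addresses (Intel, then AMD Zen). Verify for your CPU model.
pub const MSR_RAPL_POWER_UNIT: u64 = 0x606;
pub const MSR_PKG_ENERGY_STATUS: u64 = 0x611;
pub const MSR_AMD_RAPL_POWER_UNIT: u64 = 0xC001_0299;
pub const MSR_AMD_PKG_ENERGY_STATUS: u64 = 0xC001_029B;

/// Read one 64-bit MSR through the Linux `msr` driver.
/// The register address is used as the file offset.
pub fn read_msr(cpu: u32, register: u64) -> io::Result<u64> {
    let file = File::open(format!("/dev/cpu/{}/msr", cpu))?;
    let mut buf = [0u8; 8];
    file.read_exact_at(&mut buf, register)?;
    Ok(u64::from_le_bytes(buf))
}

/// Joules per counter increment: 1 / 2^ESU, where ESU is bits 12:8
/// of the power-unit register (assumed layout, see lead-in).
pub fn energy_unit_joules(power_unit_raw: u64) -> f64 {
    let esu = ((power_unit_raw >> 8) & 0x1F) as i32;
    0.5f64.powi(esu)
}

/// Energy consumed between two samples of the energy status register.
/// Only the low 32 bits count, and the counter wraps around.
pub fn energy_delta_joules(prev_raw: u64, cur_raw: u64, unit: f64) -> f64 {
    let prev = prev_raw as u32;
    let cur = cur_raw as u32;
    cur.wrapping_sub(prev) as f64 * unit
}
```

A sensor would read the power-unit register once with `read_msr`, then sample the energy status register periodically and divide each `energy_delta_joules` result by the sampling interval to get watts.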

Alternatives

None that I can think of, but discussions are welcome.

Additional context

None.

PierreRust commented 3 years ago

Is there any reason to want an MSR-based sensor instead of a sensor based on perf_event_open (https://www.man7.org/linux/man-pages/man2/perf_event_open.2.html)?

It seems to me that perf_event_open has lower privilege requirements: a /proc/sys/kernel/perf_event_paranoid value of less than 1, or the CAP_PERFMON (since Linux 5.8) or CAP_SYS_ADMIN capability. Reading MSRs, on the other hand, requires root, which will generally be quite hard to obtain on production systems.
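As an illustration of the first condition, a sensor could check the paranoid level at startup before trying to open any counter. This is only a sketch with hypothetical function names. It covers the sysctl value alone and does not look at capabilities, so a `false` result does not mean access is impossible for a process holding CAP_PERFMON or CAP_SYS_ADMIN.

```rust
use std::fs;

/// Parse the contents of /proc/sys/kernel/perf_event_paranoid.
pub fn parse_paranoid(contents: &str) -> Option<i32> {
    contents.trim().parse().ok()
}

/// Threshold taken from the comment above: a value of less than 1
/// lets unprivileged processes open CPU-wide events.
pub fn unprivileged_cpu_events_allowed(level: i32) -> bool {
    level < 1
}

/// Read the current level; None if the file is missing or unreadable.
pub fn read_paranoid() -> Option<i32> {
    let contents = fs::read_to_string("/proc/sys/kernel/perf_event_paranoid").ok()?;
    parse_paranoid(&contents)
}
```

Calling `read_paranoid().map(unprivileged_cpu_events_allowed)` gives `Some(true)` when the sysctl alone is permissive enough, `Some(false)` when a capability would be needed, and `None` when the level could not be determined.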

paulgay commented 3 years ago

Hi,

What do you think about this (maybe old) code?

https://github.com/kentcz/rapl-tools

I am not working with AMD, but I am facing a similar problem on an 11th Gen Intel(R) Core(TM) i9-11900, where the model ID is:

printf "0x%X\n" $(cat /proc/cpuinfo | grep model | grep -v name | uniq | cut -d: -f2)
0xA7 

and the OS is

Linux arriel 5.8.0-55-generic #62~20.04.1-Ubuntu

In this setup, the RAPL Linux modules do not seem to be supported. However, the code I mentioned is giving me some power measurements.

I am currently checking whether these are correct by comparing their values on another machine where the RAPL Linux modules are usable.

I am also having a look at variorum, but it does not seem to solve the support issue.

Were you looking for something like this? Or do you think that a portable solution dealing with the different cpu architectures would be much more complicated to implement?

Thanks for your input if you have any clue.

bpetit commented 2 years ago

@PierreRust you are right about the requirements for accessing the MSRs. I guess we should provide both approaches, if possible. I'm currently working on #74, and using the MSRs (indirectly, through a custom driver) is a solution that seems to work properly in that context. On Linux, however, we should probably give perf_event_open a try!

@paulgay thanks for sharing those tools. I didn't know variorum, and it seems worth following the project to see if it can help at some point.

Looking at your CPU, I think this is related to #131. You should follow that thread instead of the MSR one, which is more of an exploratory topic. In your case it seems to be caused by powercap changes on the most recent Intel CPUs, as discussed in that thread.

Regarding MSR-based sensors, I'm working on one for Windows, as mentioned above. It will probably not be Linux-compatible soon, but we could start from that draft to bootstrap something for Linux. Or we could start something using perf_event_open, as @PierreRust mentioned. The latter seems more appropriate for Linux, but I'm not ready to work on that personally, so it's a bit early to say which direction we will take.
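For whoever picks up the perf-based direction, a possible first step is discovering the RAPL events the kernel exposes under sysfs. The sketch below is an assumption about how that could be done, not a plan for scaphandre: the names are hypothetical, and it assumes the `power` PMU is present at `/sys/bus/event_source/devices/power` with an `energy-pkg` event described in the usual `event=0x..` form plus a `.scale` file giving joules per count.

```rust
use std::fs;
use std::path::Path;

pub const POWER_PMU_DIR: &str = "/sys/bus/event_source/devices/power";

/// Values needed to fill `perf_event_attr` for one RAPL event.
pub struct RaplEvent {
    pub pmu_type: u32, // goes into perf_event_attr.type
    pub config: u64,   // goes into perf_event_attr.config
    pub scale: f64,    // joules per raw count
}

/// Parse an event description such as "event=0x02" into 2.
pub fn parse_event_config(contents: &str) -> Option<u64> {
    let hex = contents.trim().strip_prefix("event=0x")?;
    u64::from_str_radix(hex, 16).ok()
}

/// Parse a scale file such as "2.3283064365386962890625e-10".
pub fn parse_scale(contents: &str) -> Option<f64> {
    contents.trim().parse().ok()
}

/// Convert a raw perf counter value to joules.
pub fn counts_to_joules(raw_count: u64, scale: f64) -> f64 {
    raw_count as f64 * scale
}

/// Look up the package energy event; None if the PMU or event is missing.
pub fn discover_energy_pkg() -> Option<RaplEvent> {
    let dir = Path::new(POWER_PMU_DIR);
    let pmu_type: u32 = fs::read_to_string(dir.join("type")).ok()?.trim().parse().ok()?;
    let config = parse_event_config(&fs::read_to_string(dir.join("events/energy-pkg")).ok()?)?;
    let scale = parse_scale(&fs::read_to_string(dir.join("events/energy-pkg.scale")).ok()?)?;
    Some(RaplEvent { pmu_type, config, scale })
}
```

The standard library does not wrap the perf_event_open syscall itself, so the actual counter would still have to be opened through a raw syscall binding. The result of `discover_energy_pkg` only supplies the `type` and `config` values for that call and the factor to convert the counter into joules.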