Open TheElectronWill opened 1 year ago
The LIKWID tool does take the overflows into account. Some info here: https://github.com/RRZE-HPC/likwid/issues/13
I saw this issue linked from the powercap project. FYI I think your proposed solution won't work correctly for two reasons:
MSR_RAPL_POWER_UNIT register. See Section 15.10 of the Intel Software Developer's Manual, Volume 3, March 2023 edition. The standard configuration encodes the energy unit as 1/2^ESU, but some processors differ (particularly some Intel Atom CPUs), as do some domains such as DRAM and PSYS, which on some processors have fixed ESU values that differ from the unit register.
I've found that detecting overflow in RAPL can be a challenge. At a minimum, you need to compute the actual max energy value that the MSR register can report and use that value when accounting for overflow, e.g., as done here [1] (full disclosure: my code). I'm not entirely convinced that this always works as expected though, even if you don't "miss" an overflow; I've seen quirky behavior in the past that resulted in overestimating power consumption. It could be that the register is not really guaranteed to reach its max logical value before it actually turns over, but this approach is at least logically correct modulo bad register behavior. I haven't conducted a rigorous experiment in a long time though, so I'm not sure how prevalent problems might be.
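To make the "max energy value" idea concrete, here is a minimal Rust sketch, not the code linked at [1]. It assumes the standard 1/2^ESU encoding, with the ESU field in bits 12:8 of MSR_RAPL_POWER_UNIT and a 32-bit energy status counter as described in the Intel SDM. It does not handle the Atom, DRAM, or PSYS exceptions mentioned above. All function names are hypothetical, and the register value `0xA0E03` used in the note below is only an illustrative input.

```rust
/// Extract the Energy Status Units (ESU) exponent.
/// Assumption: ESU lives in bits 12:8 of MSR_RAPL_POWER_UNIT (standard layout).
fn energy_status_units(power_unit_msr: u64) -> u32 {
    ((power_unit_msr >> 8) & 0x1F) as u32
}

/// Joules represented by one counter increment under the 1/2^ESU encoding.
fn joules_per_count(esu: u32) -> f64 {
    1.0 / (1u64 << esu) as f64
}

/// Energy span, in joules, that a 32-bit counter covers before wrapping.
/// Assumption: the energy status field is 32 bits wide.
fn max_energy_joules(esu: u32) -> f64 {
    (1u64 << 32) as f64 * joules_per_count(esu)
}

/// Energy consumed between two readings (both already converted to joules).
/// Only correct if the counter wrapped at most once between the readings.
fn energy_delta_joules(prev: f64, cur: f64, max: f64) -> f64 {
    if cur >= prev {
        cur - prev
    } else {
        // The counter wrapped: add what was left before the wrap to the new value.
        (max - prev) + cur
    }
}
```

With ESU = 14 (the value decoded from `0xA0E03`), the counter spans 2^32 / 2^14 = 262144 J, so a package drawing a steady 100 W would wrap roughly every 44 minutes. A sampler that polls more often than that should never miss a wrap under these assumptions.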
Cheers.
That's right, 64 bits is too much for the MSR counter:
I haven't seen aberrant values when correcting the overflows just after reading the counter, but I'll check that again :)
edit: of course, using the MSR directly requires taking into account the "quirks" of some platforms; that's what the Linux kernel does for perf and powercap (scaphandre uses powercap on Linux, for now). These interfaces return 64-bit values because they perform the unit conversion. I'll have to check the overflows in that case. Thanks for the info!
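For the powercap case, a sketch of what checking the overflow could look like in Rust is below. It assumes the zone exposes the `energy_uj` and `max_energy_range_uj` sysfs files (typically under a directory such as `/sys/class/powercap/intel-rapl:0/`, though zone names vary by machine), and that the counter restarts near zero after reaching the advertised range. The helper names are hypothetical and this is not scaphandre's actual code.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Read a single unsigned integer from a sysfs file such as `energy_uj`.
fn read_counter(path: &Path) -> io::Result<u64> {
    let text = fs::read_to_string(path)?;
    text.trim()
        .parse::<u64>()
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}

/// Microjoules consumed between two `energy_uj` readings.
/// `max_range_uj` is the value read from `max_energy_range_uj`.
/// Only correct if the counter wrapped at most once between the readings.
fn delta_uj(prev_uj: u64, cur_uj: u64, max_range_uj: u64) -> u64 {
    if cur_uj >= prev_uj {
        cur_uj - prev_uj
    } else {
        // Wrapped: energy left before the wrap plus energy counted since.
        // saturating_sub guards against a previous reading above the range.
        max_range_uj.saturating_sub(prev_uj) + cur_uj
    }
}
```

A caller would read `max_energy_range_uj` once with `read_counter`, then poll `energy_uj` and feed consecutive readings to `delta_uj`. The range is read from the file rather than hard-coded because it can differ between zones and platforms.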
Problem
The RAPL energy counter is incremented and can overflow. Currently, this overflow is not handled.
As a result, the energy measurements are "slightly" (potentially a lot?) wrong. Fixing that might also fix other issues where users complain about "wrong" power usage.
Solution
Instead of ignoring the value, the overflow should be corrected. Quoting @uggla:
Alternatives
Additional context
See https://github.com/powercap/powercap/issues/3#issuecomment-636256230