joular / powerjoular

PowerJoular allows monitoring power consumption of multiple platforms and processes.
https://www.noureddine.org/research/joular/powerjoular
GNU General Public License v3.0

Negative power values #45

Closed. gcorrall closed this issue 8 months ago

gcorrall commented 8 months ago

I've noticed that on Intel machines with RAPL, powerjoular will periodically log negative values for total power and cpu power. After some testing I think this is because powerjoular does not account for the energy counter (energy_uj) wrapping around once it reaches max_energy_range_uj. The calculation RAPL_After.total_energy - RAPL_Before.total_energy then becomes negative, as the RAPL_After value is smaller than the RAPL_Before value. On machines with a heavy load this can happen quite frequently.

You can observe this with:

cd /sys/class/powercap/intel-rapl/intel-rapl:0/
watch "cat energy_uj; cat max_energy_range_uj"

When the energy_uj value reaches the max_energy_range_uj value, powerjoular logs a negative value, then continues as normal until the next wraparound.
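The same observation can be scripted rather than watched by eye. The following is only a rough sketch: read_counter and find_wraps are hypothetical helper names, not part of powerjoular, and it assumes the powercap directory exposes energy_uj and max_energy_range_uj as plain integers (reading energy_uj may need elevated privileges on some kernels).

```python
from pathlib import Path


def read_counter(domain_dir):
    """Return (energy_uj, max_energy_range_uj) for one powercap domain directory.

    domain_dir is assumed to look like
    /sys/class/powercap/intel-rapl/intel-rapl:0/
    """
    domain = Path(domain_dir)
    energy = int((domain / "energy_uj").read_text().strip())
    max_range = int((domain / "max_energy_range_uj").read_text().strip())
    return energy, max_range


def find_wraps(readings):
    """Return the indices where a reading is lower than the previous one.

    An energy counter only counts upwards, so a drop between two
    consecutive samples indicates that it wrapped around in between.
    """
    return [i for i in range(1, len(readings)) if readings[i] < readings[i - 1]]
```

Collecting a series of energy_uj samples with read_counter and passing them to find_wraps should flag the same moments at which powerjoular logs a negative value.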

This is a similar issue to https://github.com/mlco2/codecarbon/issues/322, and I imagine it should be fixed in the same way (https://github.com/mlco2/codecarbon/pull/323). If the 'after' energy value is less than the 'before' energy value, then max_energy_range_uj should be added to RAPL_Energy (RAPL_After.total_energy - RAPL_Before.total_energy).
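To make the proposed correction concrete, here is a minimal sketch in Python. It is not the powerjoular (Ada) code or the codecarbon patch; energy_delta_uj and average_power_watts are hypothetical names, and it assumes the counter wraps at most once between two samples.

```python
def energy_delta_uj(before_uj, after_uj, max_range_uj):
    """Energy consumed between two samples, in microjoules.

    If the 'after' reading is smaller than the 'before' reading, the
    counter is assumed to have wrapped exactly once, so the counter
    range (max_energy_range_uj) is added back to the difference.
    """
    delta = after_uj - before_uj
    if delta < 0:
        delta += max_range_uj
    return delta


def average_power_watts(before_uj, after_uj, max_range_uj, interval_s):
    """Average power over the sampling interval, in watts."""
    joules = energy_delta_uj(before_uj, after_uj, max_range_uj) / 1_000_000
    return joules / interval_s
```

With a counter range of 1000, a 'before' reading of 900 and an 'after' reading of 50 give 150 rather than -850. Each RAPL domain would need its own max_energy_range_uj passed in, since the ranges may differ between domains.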

I think you would have to take into account three possible wraparounds:

/sys/class/powercap/intel-rapl/intel-rapl:1/max_energy_range_uj # for psys
/sys/class/powercap/intel-rapl/intel-rapl:0/max_energy_range_uj # for pkg
/sys/class/powercap/intel-rapl/intel-rapl:0/intel-rapl:0:2/max_energy_range_uj # for dram

adelnoureddine commented 8 months ago

Thanks @gcorrall for this issue and the proposed fix. I'll try to implement it soon, hopefully in the coming days if I can dedicate some time at work (or if you're willing to propose a PR).

gcorrall commented 8 months ago

I've had a quick first go at implementing this, and have made a pull request. It will certainly need checking, but I have tested it on machines using psys and pkg and it seems to handle the wrap around correctly for total power and cpu power.

adelnoureddine commented 8 months ago

Thanks a lot @gcorrall, I'll review the code and check it on my machines, and merge it after that.