Closed by gcorrall 8 months ago
Thanks @gcorrall for this issue and the proposed fix. I'll try to implement it soon, hopefully in the coming days if I can dedicate some time at work (or if you're willing to propose a PR).
I've had a quick first go at implementing this and have opened a pull request. It will certainly need checking, but I have tested it on machines using psys and pkg, and it seems to handle the wraparound correctly for total power and cpu power.
Thanks a lot @gcorrall, I'll review the code and check it on my machines, and merge it after that.
I've noticed that on Intel machines with RAPL, powerjoular will periodically log negative values for total power and cpu power. After some testing I think this is because powerjoular is not taking account of when the energy counter (energy_uj) wraps around (reaches max_energy_range_uj). The calculation RAPL_After.total_energy - RAPL_Before.total_energy then becomes negative, as the RAPL_After value is smaller than the RAPL_Before value. On machines with a heavy load this can happen quite frequently.
You can observe this with:
When the energy_uj value reaches the max_energy_range_uj value, powerjoular logs a negative value and then continues as normal until the next wraparound.
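As a rough illustration of how the wraparound can be spotted (a sketch, not the command I used above), the counter can be sampled and checked for any reading that is lower than the previous one. The sysfs path below is an assumption for the package-0 domain and may differ on your machine, the function names are made up for this example, and reading energy_uj may require root.

```python
from pathlib import Path

# Assumed sysfs location of the package-0 RAPL domain; adjust for your machine.
RAPL_DOMAIN = Path("/sys/class/powercap/intel-rapl:0")


def read_counter(domain: Path = RAPL_DOMAIN) -> int:
    """Read the current energy counter (microjoules) from energy_uj."""
    return int((domain / "energy_uj").read_text())


def find_wraparounds(samples):
    """Return the indices where a sample is lower than the one before it.

    The counter only ever increases, so a drop between two consecutive
    samples means it wrapped back to zero in between.
    """
    return [i for i in range(1, len(samples)) if samples[i] < samples[i - 1]]
```

For example, `find_wraparounds([10, 500, 990, 40, 300])` flags index 3, which is the sample where powerjoular would report a negative value.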
This is a similar issue to https://github.com/mlco2/codecarbon/issues/322, and I imagine it should be fixed in the same way (https://github.com/mlco2/codecarbon/pull/323). If the 'after' energy value is less than the 'before' energy value, the counter has wrapped, so max_energy_range_uj should be added to RAPL_Energy (RAPL_After.total_energy - RAPL_Before.total_energy).
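To make the correction concrete, here is a minimal sketch in Python (powerjoular itself is written in Ada, so this only shows the arithmetic). The function and parameter names are hypothetical, and it assumes the counter wraps at most once between two readings.

```python
def energy_delta_uj(before: int, after: int, max_energy_range_uj: int) -> int:
    """Energy consumed between two energy_uj readings, in microjoules.

    If the 'after' reading is smaller than the 'before' reading, the
    counter has wrapped around, so max_energy_range_uj is added to the
    (negative) difference. Assumes at most one wrap between readings.
    """
    delta = after - before
    if delta < 0:
        delta += max_energy_range_uj
    return delta
```

With a maximum range of 1000, readings of 900 and then 50 give a delta of 150 instead of -850. If sampling is slow enough for the counter to wrap more than once between readings, this correction undercounts, so the sampling interval needs to stay short on heavily loaded machines.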
I think you would have to take into account three possible wraparounds: