On using `nvmlDeviceGetTotalEnergyConsumption`

NVML provides a function called `nvmlDeviceGetTotalEnergyConsumption` for GPUs of the Volta architecture and later. With it, there is no need to poll `nvmlDeviceGetPowerUsage`; we can read the energy counter right before an operation starts and right after it finishes, then take the difference.
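As a minimal sketch of the read-twice-and-subtract idea, assuming the `pynvml` bindings (where `nvmlDeviceGetTotalEnergyConsumption` returns millijoules): the helper names `nvml_energy_reader` and `measure_energy_joules` are made up for illustration and are not part of Zeus.

```python
from typing import Callable


def nvml_energy_reader(gpu_index: int) -> Callable[[], int]:
    """Return a zero-argument function that reads the GPU energy counter (mJ)."""
    import pynvml  # imported lazily so the helper below works without a GPU

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
    return lambda: pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)


def measure_energy_joules(work: Callable[[], None], read_mj: Callable[[], int]) -> float:
    """Run `work` and return the energy it consumed, in joules."""
    before = read_mj()  # counter value right before the work starts
    work()
    after = read_mj()  # counter value right after the work finishes
    return (after - before) / 1000.0  # millijoules -> joules
```

On a Volta-or-later GPU this could be called as `measure_energy_joules(train_one_iteration, nvml_energy_reader(0))`, where `train_one_iteration` is a placeholder for whatever is being measured.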

I have independently verified that the two approaches (polling power and integrating it over time, versus reading the energy counter twice and subtracting) differ by at most 1%. Polling power and integrating inherently introduces approximation error, so letting the GPU account for energy itself is the better option.
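To make the source of that approximation error concrete, here is a rough sketch of what integrating polled power amounts to, using the trapezoidal rule over `(timestamp_seconds, power_watts)` samples. The function name and sample format are assumptions for illustration, not the actual `zeus_monitor` logic.

```python
def integrate_power(samples: list[tuple[float, float]]) -> float:
    """Estimate energy in joules from (time_s, power_w) samples via the trapezoidal rule."""
    energy = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        # Assume power changes linearly between two consecutive samples.
        energy += 0.5 * (p0 + p1) * (t1 - t0)
    return energy
```

Anything the GPU does between two samples, such as a short power spike, is invisible to this estimate, which is why a hardware energy counter should be at least as accurate.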

Given this,

- `zeus_monitor` should first query NVML for the microarchitecture of the GPU and, if it is Volta or later, use the energy method. Otherwise, it should fall back to polling.
- `ZeusMonitorContext` should not really need to spawn the Zeus monitor. If the energy method is available, it can just call it to get the energy consumption of each iteration.
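The architecture check could look roughly like the following, assuming `pynvml` exposes `nvmlDeviceGetArchitecture` together with the `NVML_DEVICE_ARCH_VOLTA` and `NVML_DEVICE_ARCH_UNKNOWN` constants (which recent versions appear to do). The function names and the returned strings are hypothetical.

```python
def supports_energy_counter(arch: int, volta: int, unknown: int) -> bool:
    """True if `arch` is Volta or newer, given NVML's architecture constants."""
    # NVML's "unknown" value is a large sentinel, so it must be excluded
    # explicitly before doing the ordered comparison.
    return arch != unknown and arch >= volta


def pick_measurement_method(gpu_index: int) -> str:
    """Return "energy_counter" or "power_polling" for the given GPU."""
    import pynvml  # imported lazily; only needed when actually querying a GPU

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
        arch = pynvml.nvmlDeviceGetArchitecture(handle)
        ok = supports_energy_counter(
            arch, pynvml.NVML_DEVICE_ARCH_VOLTA, pynvml.NVML_DEVICE_ARCH_UNKNOWN
        )
    finally:
        pynvml.nvmlShutdown()
    return "energy_counter" if ok else "power_polling"
```

An alternative design would be to simply call the energy function and fall back to polling if NVML reports that it is not supported; which of the two is more robust is an open question here.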

Thus, it would be beneficial to abstract away the functionality of "querying the energy consumption of the GPU" behind a class like `ZeusMonitorService` (the name can be better). This class would receive a GPU index (or a list of GPU indices to monitor simultaneously) in its constructor and expose methods such as `def start_profile_window(gpu_index: list[int]) -> None` and `def end_profile_window(gpu_index: list[int]) -> list[tuple[float, float]]`. This is just an example API I quickly came up with.
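One possible shape for such a class is sketched below. Several things here are assumptions rather than part of the proposal: the energy reader and clock are injected as callables (so the class does not care whether energy comes from the NVML counter or from polling), and each returned tuple is interpreted as `(elapsed_seconds, energy_joules)` for one GPU.

```python
import time
from typing import Callable


class ZeusMonitorService:
    """Sketch of the proposed abstraction; not actual Zeus code."""

    def __init__(
        self,
        gpu_indices: list[int],
        read_mj: Callable[[int], int],  # assumed: returns an energy counter in mJ
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._gpu_indices = set(gpu_indices)
        self._read_mj = read_mj
        self._clock = clock
        self._open: dict[int, tuple[float, int]] = {}  # gpu -> (start time, start mJ)

    def start_profile_window(self, gpu_index: list[int]) -> None:
        for i in gpu_index:
            if i not in self._gpu_indices:
                raise ValueError(f"GPU {i} is not monitored by this service")
            self._open[i] = (self._clock(), self._read_mj(i))

    def end_profile_window(self, gpu_index: list[int]) -> list[tuple[float, float]]:
        results = []
        for i in gpu_index:
            start_time, start_mj = self._open.pop(i)  # KeyError if never started
            elapsed = self._clock() - start_time
            joules = (self._read_mj(i) - start_mj) / 1000.0
            results.append((elapsed, joules))
        return results
```

With this shape, callers would only pair `start_profile_window` and `end_profile_window` around the region of interest and never touch a `zeus_monitor` process directly.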

Given `ZeusMonitorService`, `ZeusDataLoader` would be able to use the class without having to manage `zeus_monitor` processes itself. `ZeusMonitorContext` (now this name sounds a bit bad!) can use the class in the same way.
