ml-energy / zeus

Deep Learning Energy Measurement and Optimization
https://ml.energy/zeus
Apache License 2.0

On using `nvmlDeviceGetTotalEnergyConsumption` #12

Closed jaywonchung closed 1 year ago

jaywonchung commented 1 year ago

NVML provides `nvmlDeviceGetTotalEnergyConsumption` for GPUs of the Volta architecture and later. With it, there is no need to poll `nvmlDeviceGetPowerUsage`; we can read the energy counter right before an operation starts and right after it finishes, then take the difference.
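As a rough illustration of the "read twice and subtract" idea, here is a minimal sketch. It assumes the `pynvml` bindings are installed and relies on the counter being reported in millijoules; the helper names `read_energy_mj` and `measure_energy_joules` are hypothetical and not part of Zeus.

```python
from typing import Callable


def read_energy_mj(gpu_index: int) -> int:
    """Read the GPU's cumulative energy counter (millijoules) through NVML."""
    # Imported lazily so the rest of this sketch runs on machines without a GPU.
    import pynvml

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
        return pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)
    finally:
        pynvml.nvmlShutdown()


def measure_energy_joules(
    work: Callable[[], None],
    read_mj: Callable[[], int],
) -> float:
    """Run `work` and return the energy it consumed, in joules.

    `read_mj` is any zero-argument callable returning a cumulative
    millijoule counter (hypothetical hook, injected for testability).
    """
    before = read_mj()
    work()
    after = read_mj()
    return (after - before) / 1000.0
```

On a Volta-or-newer GPU this could be called as `measure_energy_joules(train_step, lambda: read_energy_mj(0))`, where `train_step` stands in for whatever workload is being measured.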

I have independently verified that the two approaches (polling power and integrating, versus reading the energy counter twice and taking the difference) agree to within 1%. Polling power and integrating inherently introduces approximation error, so having the GPU account for energy itself is the better option.
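To make the source of the approximation error concrete, here is a minimal sketch of the polling approach: power samples are integrated over time with the trapezoidal rule. The function name and the sample values are made up for illustration and are not measurements.

```python
def integrate_power(samples: list[tuple[float, float]]) -> float:
    """Estimate energy in joules from (timestamp_s, power_w) samples.

    Uses the trapezoidal rule, so any power fluctuation that happens
    between two polls is invisible to the estimate.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to integrate")

    energy_j = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        # Area of the trapezoid spanned by two consecutive samples.
        energy_j += (p0 + p1) / 2.0 * (t1 - t0)
    return energy_j


# Illustrative samples only: 100 W, then 200 W, then 100 W, one second apart.
estimate = integrate_power([(0.0, 100.0), (1.0, 200.0), (2.0, 100.0)])
```

The estimate depends on the polling interval: a shorter interval tracks the true power curve more closely but costs more polling overhead, which is the trade-off the energy counter avoids.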

Given this, it would be beneficial to abstract away the functionality of "querying the energy consumption of the GPU" behind a class like `ZeusMonitorService` (the name could be better). This class would receive a GPU index (or a list of GPU indices to monitor simultaneously) in its constructor and expose methods such as `def start_profile_window(gpu_index: list[int]) -> None` and `def end_profile_window(gpu_index: list[int]) -> list[tuple[float, float]]`. This is just an example API I quickly came up with.
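One way the example API above might look is sketched below. This is not an actual Zeus class: the name `EnergyWindowMonitor`, the injected `read_energy_mj` and `clock` callables, and the reading of each returned tuple as (elapsed seconds, energy in joules) are all assumptions, since the proposal does not say what the two floats mean.

```python
import time
from typing import Callable


class EnergyWindowMonitor:
    """Hypothetical stand-in for the proposed ZeusMonitorService."""

    def __init__(
        self,
        gpu_indices: list[int],
        read_energy_mj: Callable[[int], int],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.gpu_indices = list(gpu_indices)
        # Assumed hook: returns a GPU's cumulative energy counter in mJ.
        self._read_energy_mj = read_energy_mj
        self._clock = clock
        # GPU index -> (start time in s, start energy in mJ) for open windows.
        self._open: dict[int, tuple[float, int]] = {}

    def start_profile_window(self, gpu_index: list[int]) -> None:
        for i in gpu_index:
            if i not in self.gpu_indices:
                raise ValueError(f"GPU {i} is not monitored")
            self._open[i] = (self._clock(), self._read_energy_mj(i))

    def end_profile_window(self, gpu_index: list[int]) -> list[tuple[float, float]]:
        results = []
        for i in gpu_index:
            # Raises KeyError if no window was started for this GPU.
            start_time, start_mj = self._open.pop(i)
            elapsed_s = self._clock() - start_time
            energy_j = (self._read_energy_mj(i) - start_mj) / 1000.0
            results.append((elapsed_s, energy_j))
        return results
```

Injecting the counter reader keeps the class independent of NVML, so it can be exercised with fake counters and swapped to a real NVML reader on supported GPUs.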

Given `ZeusMonitorService`, `ZeusDataLoader` could use the class without having to manage `zeus_monitor` processes itself. `ZeusMonitorContext` (now this name sounds a bit off!) could use the class in the same way.