I've also verified that, while the GPU is doing something, the difference in energy reported by nvmlDeviceGetTotalEnergyConsumption over time (mJ/s) is within the same ballpark as the average power draw reported by nvmlDeviceGetPowerUsage (mW):
import time
import numpy as np
import pynvml as N

N.nvmlInit()

# Grab a handle for every GPU on the system.
handles = []
for index in range(N.nvmlDeviceGetCount()):
    handles.append(N.nvmlDeviceGetHandleByIndex(index))

# Total energy consumed so far (mJ), per device.
start_time = time.time()
start_energy = np.asarray([
    N.nvmlDeviceGetTotalEnergyConsumption(handle)
    for handle in handles])

# Sample the instantaneous power draw (mW) at ~10 Hz for ~10 seconds.
power_readings = []
for i in range(100):
    power_readings.append([
        N.nvmlDeviceGetPowerUsage(handle)
        for handle in handles])
    time.sleep(0.1)

stop_time = time.time()
stop_energy = np.asarray([
    N.nvmlDeviceGetTotalEnergyConsumption(handle)
    for handle in handles])

# Energy delta per second (mJ/s == mW) versus the mean power reading (mW).
print((stop_energy - start_energy) / (stop_time - start_time))
print(np.mean(power_readings, axis=0))
[12812.18567342]
[14048.87]
Thanks for adding this @arvoelke! The wrapper looks good to me.
One request: Can you add a simple pytest, similar to test_nvmlDeviceGetPowerUsage (at a minimum)? I know the existing test suite is far from robust, but it would be great to maintain/improve coverage with each new PR :)
SGTM. Wondering, though: is there hosted continuous integration testing somewhere, or some minimal GPU spec required for testing? I ran the tests locally on my notebook, but 17 failed, all with the message pynvml.nvml.NVMLError_NotSupported: Not Supported. Adding a similar unit test for nvmlDeviceGetTotalEnergyConsumption would likewise fail with this error on unsupported GPUs (which may be relatively common).
Maybe this is what you mean by "far from robust", and it suggests a future PR that marks unsupported tests as something other than failed?
That's exactly right. Right now, it is totally fine if a test fails with a "Not Supported" error. We do not have any CI in place at the moment, but we will need to explicitly skip non-applicable tests once we do.
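As a rough sketch of what that skip could look like once CI exists (the handles fixture below is hypothetical, not an existing helper in the test suite):

import pytest
import pynvml

def test_nvmlDeviceGetTotalEnergyConsumption(handles):
    for handle in handles:
        try:
            energy = pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)
        except pynvml.NVMLError_NotSupported:
            # Energy counters are only exposed on Volta and newer GPUs.
            pytest.skip("nvmlDeviceGetTotalEnergyConsumption not supported")
        assert energy >= 0  # millijoules since the driver was last reloaded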
I've added something that should be at least as good as test_nvmlDeviceGetPowerUsage. It assumes the GPU being tested is idling at some non-zero wattage, with energy measurements being exposed to the driver at a rate of at least 1 kHz. I hope these are fair assumptions. It passes for me, and I ran the tests 10 times and got the same pass+fail counts each time.
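Something along these lines, assuming a handles fixture and those two assumptions (an illustrative sketch, not necessarily the exact test added in this PR):

import time
import pynvml

def test_energy_consumption_increases_while_idle(handles):
    for handle in handles:
        # Assumption: the GPU idles above 0 W, so energy keeps accumulating.
        assert pynvml.nvmlDeviceGetPowerUsage(handle) > 0
        before = pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)
        time.sleep(0.1)  # counter is assumed to update at >= 1 kHz
        after = pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)
        assert after > before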
I was contemplating adding a better unit test such as the one in https://github.com/gpuopenanalytics/pynvml/pull/18#issuecomment-538912304. However, I wasn't sure of any good way to check whether the two values are consistent with each other. With no real workload on the GPU I'm getting a 1W difference between the two numbers. Do you have any sense of what the difference should be here, or know of a way to reliably control this? Similarly, is there a clean way to put a fixed workload on it in order to get some expected power draw and/or energy readings? This might help in the future to improve other unit tests as well (e.g., utilization, memory, etc).
It might not be difficult to come up with a ballpark number for idle/active energy consumption, but I'm not sure it is worth the effort to calculate these numbers for an arbitrary system. Perhaps it will prove worth the effort eventually, but I am hoping that relative comparisons (including temporal comparisons of the output) are sufficient here. With that said, I do think it makes sense to compare against specific numbers in cases that should be system-agnostic (e.g. memory consumption).
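For reference, a relative comparison of the two readings could look roughly like the following (a sketch only; the window length and the loose tolerance in the trailing comment are arbitrary guesses, not measured bounds):

import time
import numpy as np
import pynvml

def energy_vs_power_ratio(handle, duration=5.0, interval=0.1):
    # Rate of change of the energy counter (mJ/s == mW) divided by the
    # mean of the instantaneous power readings (mW) over the same window.
    start_time = time.time()
    start_energy = pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)
    readings = []
    while time.time() - start_time < duration:
        readings.append(pynvml.nvmlDeviceGetPowerUsage(handle))
        time.sleep(interval)
    stop_energy = pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)
    elapsed = time.time() - start_time
    return ((stop_energy - start_energy) / elapsed) / np.mean(readings)

# e.g. assert 0.75 < energy_vs_power_ratio(handle) < 1.25  (tolerance untested)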
Thanks @arvoelke!
Retrieves total energy consumption for this GPU in millijoules (mJ) since the driver was last reloaded. Requires Volta or newer.
NVML Documentation:
nvmlReturn_t nvmlDeviceGetTotalEnergyConsumption ( nvmlDevice_t device, unsigned long long* energy )
Note: I could not find where this function was added in the NVML Change Log or in the API Reference Manual; I'm unsure whether there was some underlying reason it had been omitted from the Python bindings.
Tested by running the following code on my notebook (GeForce GTX 1650 with Max-Q, driver version 430.09):
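A minimal version of that kind of check (not necessarily the exact snippet that was run; it assumes at least one NVML-visible GPU):

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# Total energy consumed by this GPU, in millijoules, since the driver
# was last reloaded.
print(pynvml.nvmlDeviceGetTotalEnergyConsumption(handle))

pynvml.nvmlShutdown()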
I've also verified that this returns a NOT_SUPPORTED error if running on a GPU/driver that does not support energy readings (even if it does support power readings).
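For callers that want to handle that case gracefully, a sketch of one possible fallback (what to fall back to is up to the application):

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

try:
    energy_mj = pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)
except pynvml.NVMLError_NotSupported:
    # Older GPUs/drivers raise NOT_SUPPORTED here even though
    # nvmlDeviceGetPowerUsage may still work.
    energy_mj = None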