Open jorgtied opened 1 year ago
As I mentioned on Mattermost, I think these are all available through nvidia-smi, e.g. nvidia-smi stats gives you a CSV stream of stats.
I also quickly checked LUMI, which has a similar utility called rocm-smi, and something like rocm-smi -fPtu --showmemuse --showvoltage --json seems to give you a snapshot at the moment of calling.
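Since rocm-smi only gives a snapshot, getting a time series out of it means calling it repeatedly. A rough sketch of that, assuming the command quoted above prints a single JSON object on stdout (I have not checked the exact output layout). The names read_snapshot and poll, and the run/sleep hooks, are made up for illustration:

```python
import json
import subprocess
import time

# Command taken verbatim from the comment above.
ROCM_SMI_CMD = ["rocm-smi", "-fPtu", "--showmemuse", "--showvoltage", "--json"]


def read_snapshot(run=None):
    """Return one rocm-smi snapshot as a dict.

    `run` is a hypothetical hook: a callable returning the command's
    stdout as text, so the parsing can be exercised without a GPU.
    """
    if run is None:
        def run():
            proc = subprocess.run(
                ROCM_SMI_CMD, capture_output=True, text=True, check=True
            )
            return proc.stdout
    # Assumption: the whole stdout is one JSON document.
    return json.loads(run())


def poll(n_samples, interval_s=1.0, run=None, sleep=time.sleep):
    """Collect n_samples timestamped snapshots, interval_s apart."""
    samples = []
    for i in range(n_samples):
        samples.append({"time": time.time(), "stats": read_snapshot(run)})
        if i < n_samples - 1:
            sleep(interval_s)
    return samples
```

Each sample keeps the raw dict as returned, so nothing here depends on which keys rocm-smi actually emits.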
Edit: in terms of how to integrate this with the rest… I was thinking of some sort of general event database/log, where we would also store things like "N lines passed onto trainer", "Restarted reading dataset X", and "marian validation score is now X BLEU". In the AWS/cloud world I think the ELK stack is commonly used for this. I don't think we'd need that kind of scale (and ideally I just want to have it all built into this repo…), but it might be a source of inspiration.
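To make the event log idea a bit more concrete, here is a minimal sketch using SQLite from the Python standard library, so nothing external is needed. The EventLog class, the table layout, and the event names are all hypothetical, not something that exists in this repo:

```python
import json
import sqlite3
import time


class EventLog:
    """Append-only event log backed by SQLite (hypothetical design)."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS events ("
            "time REAL NOT NULL, kind TEXT NOT NULL, payload TEXT NOT NULL)"
        )

    def log(self, kind, **payload):
        """Store one event; the payload is kept as a JSON string."""
        self.db.execute(
            "INSERT INTO events VALUES (?, ?, ?)",
            (time.time(), kind, json.dumps(payload)),
        )
        self.db.commit()

    def query(self, kind):
        """Return (time, payload) pairs of one kind, in insertion order."""
        rows = self.db.execute(
            "SELECT time, payload FROM events WHERE kind = ? ORDER BY rowid",
            (kind,),
        )
        return [(t, json.loads(p)) for t, p in rows]


if __name__ == "__main__":
    log = EventLog()
    log.log("lines_passed", dataset="X", lines=1000)
    log.log("dataset_restarted", dataset="X")
    log.log("validation", metric="bleu", score=23.4)
    print(log.query("validation"))
```

GPU stats could go into the same table as just another event kind, which keeps everything queryable in one place.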
For NVIDIA there is an energy consumption counter that can be checked before and after the training process:
#!/usr/bin/python3
from pynvml import (
    nvmlInit, nvmlDeviceGetCount, nvmlDeviceGetHandleByIndex,
    nvmlDeviceGetTotalEnergyConsumption, nvmlShutdown
)

nvmlInit()
deviceCount = nvmlDeviceGetCount()
for i in range(deviceCount):
    handle = nvmlDeviceGetHandleByIndex(i)
    # Cumulative counter in millijoules since the driver was last loaded
    energy = nvmlDeviceGetTotalEnergyConsumption(handle)
    print(f"GPU {i}: {energy} mJ")
nvmlShutdown()
requires:
pip install nvidia-ml-py --user
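The before/after check could be wrapped roughly like this. It is only a sketch: read_energy_mj and measure_energy are made-up names, and it assumes the counter does not reset during the run (e.g. through a driver reload):

```python
from contextlib import contextmanager


def read_energy_mj():
    """Return the per-GPU energy counters in millijoules, via pynvml."""
    # Imported lazily so the rest of the sketch loads without an NVIDIA driver.
    from pynvml import (
        nvmlInit, nvmlDeviceGetCount, nvmlDeviceGetHandleByIndex,
        nvmlDeviceGetTotalEnergyConsumption, nvmlShutdown
    )
    nvmlInit()
    try:
        return [
            nvmlDeviceGetTotalEnergyConsumption(nvmlDeviceGetHandleByIndex(i))
            for i in range(nvmlDeviceGetCount())
        ]
    finally:
        nvmlShutdown()


@contextmanager
def measure_energy(read=read_energy_mj):
    """Yield a dict that holds per-GPU energy use in joules after the block.

    `read` is a hypothetical hook so the counter source can be swapped out.
    """
    result = {}
    before = read()
    try:
        yield result
    finally:
        after = read()
        # Counters are in mJ; report the difference in J.
        result["joules"] = [(a - b) / 1000.0 for a, b in zip(after, before)]
```

Usage would be something like "with measure_energy() as used:" around the training call, then reading used["joules"] afterwards.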
add functionality to