hplt-project / OpusTrainer

Curriculum training
https://pypi.org/project/opustrainer/
MIT License

monitoring training costs #3

Open jorgtied opened 1 year ago

jorgtied commented 1 year ago

add functionality to

jelmervdl commented 1 year ago

As I mentioned on Mattermost, I think these are all available through nvidia-smi. For example, nvidia-smi stats gives you a CSV stream of stats.

I also quickly checked LUMI, which has a similar utility called rocm-smi. Something like rocm-smi -fPtu --showmemuse --showvoltage --json seems to give you a snapshot of the values at the moment it is called.
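A minimal sketch of taking such a snapshot from Python, assuming rocm-smi is on PATH and prints a single JSON document to stdout when called with the flags quoted above. The helper name snapshot is hypothetical, and the keys in the returned dictionary depend on the ROCm version, so they are not spelled out here.

```python
import json
import subprocess

# Flags copied from the comment above. The shape of the JSON they produce
# is an assumption and may differ between ROCm versions.
ROCM_SMI_CMD = ["rocm-smi", "-fPtu", "--showmemuse", "--showvoltage", "--json"]


def snapshot(cmd=ROCM_SMI_CMD, timeout=10.0):
    """Run `cmd` once and return its stdout parsed as JSON.

    Raises subprocess.CalledProcessError if the command exits non-zero
    and json.JSONDecodeError if the output is not valid JSON.
    """
    result = subprocess.run(
        cmd, capture_output=True, text=True, check=True, timeout=timeout
    )
    return json.loads(result.stdout)
```

Calling snapshot() at a fixed interval from the training wrapper would give a coarse time series. Any command that prints JSON can be passed as cmd, which also makes the helper easy to test without a GPU.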

Edit: in terms of how to integrate this with the rest… I was thinking of some sort of general event database/log that would also store things like "N lines passed onto trainer", "Restarted reading dataset X", and "marian validation score is now X BLEU". In the AWS/cloud world I think the ELK stack is commonly used for this. I don't think we'd need that kind of scale (and ideally I just want to have it all built into this repo…), but it might be a source of inspiration.
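To make the event log idea concrete, here is a rough sketch of an append-only JSON Lines log. Everything in it (the EventLog class, the emit and read methods, the field names) is hypothetical and not part of OpusTrainer; it only illustrates how trainer events and hardware stats could share one file.

```python
import json
import time
from pathlib import Path


class EventLog:
    """Append-only JSON Lines event log (hypothetical helper)."""

    def __init__(self, path):
        self.path = Path(path)

    def emit(self, event, **fields):
        """Append one event with a timestamp and return the stored record."""
        record = {"time": time.time(), "event": event, **fields}
        with self.path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(record) + "\n")
        return record

    def read(self):
        """Return all events logged so far, oldest first."""
        if not self.path.exists():
            return []
        with self.path.open(encoding="utf-8") as fh:
            return [json.loads(line) for line in fh if line.strip()]
```

Usage could look like log.emit("lines_passed", n=1000) or log.emit("validation", metric="bleu", score=23.4), with GPU snapshots emitted into the same file. One line per event keeps the file greppable and avoids needing a database server.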

jorgtied commented 1 year ago

For NVIDIA there is an energy consumption counter that can be checked before and after the training process:

#!/usr/bin/python3
from pynvml import (
    nvmlInit, nvmlDeviceGetCount, nvmlDeviceGetHandleByIndex,
    nvmlDeviceGetTotalEnergyConsumption, nvmlShutdown
)

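nvmlInit()
try:
    deviceCount = nvmlDeviceGetCount()
    for i in range(deviceCount):
        handle = nvmlDeviceGetHandleByIndex(i)
        # Cumulative energy counter for this GPU, reported in millijoules
        energy = nvmlDeviceGetTotalEnergyConsumption(handle)
        print(f"GPU {i}: {energy} mJ")
finally:
    # Release NVML even if one of the queries above fails
    nvmlShutdown()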

This requires the nvidia-ml-py package:

pip install nvidia-ml-py --user
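The before/after check could be wrapped roughly as follows. This is a sketch, not an existing OpusTrainer feature: read_energy_mj and measure_energy are hypothetical names, and it assumes the counter only increases while the training process runs (as far as I know it counts from the last driver reload, so a reload in between would break the difference).

```python
from contextlib import contextmanager


def read_energy_mj():
    """Return the per-GPU energy counters in millijoules via NVML."""
    # Imported lazily so this module can be loaded on machines without NVML.
    from pynvml import (
        nvmlInit, nvmlDeviceGetCount, nvmlDeviceGetHandleByIndex,
        nvmlDeviceGetTotalEnergyConsumption, nvmlShutdown
    )
    nvmlInit()
    try:
        return [
            nvmlDeviceGetTotalEnergyConsumption(nvmlDeviceGetHandleByIndex(i))
            for i in range(nvmlDeviceGetCount())
        ]
    finally:
        nvmlShutdown()


@contextmanager
def measure_energy(read=read_energy_mj):
    """Yield a dict that holds per-GPU joules once the with-block exits."""
    result = {}
    before = read()
    try:
        yield result
    finally:
        after = read()
        # Counter difference per GPU, converted from millijoules to joules
        result["joules"] = [(b - a) / 1000.0 for a, b in zip(before, after)]
```

Wrapping the trainer call in "with measure_energy() as usage:" leaves the per-GPU totals in usage["joules"] afterwards. The read parameter exists so that a different counter source (or a fake one in tests) can be swapped in.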