mlco2 / codecarbon

Track emissions from Compute and recommend ways to reduce their impact on the environment.
https://mlco2.github.io/codecarbon
MIT License

Catch error from pynvml.nvmlDeviceGetTotalEnergyConsumption #586

Open inimaz opened 1 week ago

inimaz commented 1 week ago

Description

This is a follow-up issue discovered in #578.

Any call to pynvml.nvmlDeviceGetTotalEnergyConsumption can raise an exception. Logs taken from #578:

[codecarbon WARNING @ 11:19:56] Invalid gpu_ids format. Expected a string or a list of ints.
[codecarbon INFO @ 11:19:56] [setup] RAM Tracking...
[codecarbon INFO @ 11:19:56] [setup] GPU Tracking...
[codecarbon INFO @ 11:19:56] Tracking Nvidia GPU via pynvml
Traceback (most recent call last):
  File "/home/ainhoa.vivel/TFM/transfer_learning/baseline.py", line 191, in <module>
    tracker = EmissionsTracker(project_name="baseline")
  File "/home/ainhoa.vivel/anaconda3/envs/tl/lib/python3.9/site-packages/codecarbon/emissions_tracker.py", line 296, in __init__
    gpu_devices = GPU.from_utils(self._gpu_ids)
  File "/home/ainhoa.vivel/anaconda3/envs/tl/lib/python3.9/site-packages/codecarbon/external/hardware.py", line 121, in from_utils
    return cls(gpu_ids=gpu_ids)
  File "<string>", line 4, in __init__
  File "/home/ainhoa.vivel/anaconda3/envs/tl/lib/python3.9/site-packages/codecarbon/external/hardware.py", line 63, in __post_init__
    self.devices = AllGPUDevices()
  File "/home/ainhoa.vivel/anaconda3/envs/tl/lib/python3.9/site-packages/codecarbon/core/gpu.py", line 186, in __init__
    gpu_device = GPUDevice(handle=handle, gpu_index=i)
  File "<string>", line 8, in __init__
  File "/home/ainhoa.vivel/anaconda3/envs/tl/lib/python3.9/site-packages/codecarbon/core/gpu.py", line 24, in __post_init__
    self.last_energy = self._get_energy_kwh()
  File "/home/ainhoa.vivel/anaconda3/envs/tl/lib/python3.9/site-packages/codecarbon/core/gpu.py", line 28, in _get_energy_kwh
    return Energy.from_millijoules(self._get_total_energy_consumption())
  File "/home/ainhoa.vivel/anaconda3/envs/tl/lib/python3.9/site-packages/codecarbon/core/gpu.py", line 95, in _get_total_energy_consumption
    return pynvml.nvmlDeviceGetTotalEnergyConsumption(self.handle)
  File "/home/ainhoa.vivel/anaconda3/envs/tl/lib/python3.9/site-packages/pynvml/nvml.py", line 2411, in nvmlDeviceGetTotalEnergyConsumption
    _nvmlCheckReturn(ret)
  File "/home/ainhoa.vivel/anaconda3/envs/tl/lib/python3.9/site-packages/pynvml/nvml.py", line 833, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.nvml.NVMLError: System is not in ready state

Goal

It looks like this error is not caught properly. If it occurs, we should catch it and skip the GPU measurements.
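
One possible shape for the fix, as a minimal sketch rather than the actual codecarbon change: wrap the energy read so that a failure is logged and the device is skipped instead of crashing the tracker. The helper names `read_energy_or_none` and `usable_gpu_handles` are hypothetical and do not exist in codecarbon. The reader function and the exception types are passed in so the sketch runs without a GPU.

```python
import logging
from typing import Callable, Iterable, List, Optional, Tuple, Type

logger = logging.getLogger(__name__)


def read_energy_or_none(
    read_energy: Callable[[object], int],
    handle: object,
    errors: Tuple[Type[BaseException], ...] = (Exception,),
) -> Optional[int]:
    """Return the energy counter for one device, or None if the read fails.

    Hypothetical helper: `read_energy` stands in for
    pynvml.nvmlDeviceGetTotalEnergyConsumption and `errors` for
    (pynvml.NVMLError,).
    """
    try:
        return read_energy(handle)
    except errors as exc:
        # Log and carry on so the tracker can still start without this GPU.
        logger.warning("GPU energy read failed, skipping this GPU: %s", exc)
        return None


def usable_gpu_handles(
    read_energy: Callable[[object], int],
    handles: Iterable[object],
    errors: Tuple[Type[BaseException], ...] = (Exception,),
) -> List[object]:
    """Keep only the handles whose energy counter can currently be read."""
    return [
        h for h in handles
        if read_energy_or_none(read_energy, h, errors) is not None
    ]
```

With pynvml installed, this would be called as `usable_gpu_handles(pynvml.nvmlDeviceGetTotalEnergyConsumption, handles, errors=(pynvml.NVMLError,))`. An empty result would mean GPU tracking is skipped entirely. Whether the check belongs at setup time, on every measurement, or both is a design choice left open here.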