mlco2 / codecarbon

Track emissions from Compute and recommend ways to reduce their impact on the environment.
https://mlco2.github.io/codecarbon
MIT License

Catch error from pynvml.nvmlDeviceGetTotalEnergyConsumption #586

Closed: inimaz closed this issue 3 months ago

inimaz commented 5 months ago

Description

This is a follow-up issue discovered in #578.

Any call to pynvml.nvmlDeviceGetTotalEnergyConsumption can raise an exception. Logs taken from the other issue:

[codecarbon WARNING @ 11:19:56] Invalid gpu_ids format. Expected a string or a list of ints.
[codecarbon INFO @ 11:19:56] [setup] RAM Tracking...
[codecarbon INFO @ 11:19:56] [setup] GPU Tracking...
[codecarbon INFO @ 11:19:56] Tracking Nvidia GPU via pynvml
Traceback (most recent call last):
  File "/home/ainhoa.vivel/TFM/transfer_learning/baseline.py", line 191, in <module>
    tracker = EmissionsTracker(project_name="baseline")
  File "/home/ainhoa.vivel/anaconda3/envs/tl/lib/python3.9/site-packages/codecarbon/emissions_tracker.py", line 296, in __init__
    gpu_devices = GPU.from_utils(self._gpu_ids)
  File "/home/ainhoa.vivel/anaconda3/envs/tl/lib/python3.9/site-packages/codecarbon/external/hardware.py", line 121, in from_utils
    return cls(gpu_ids=gpu_ids)
  File "<string>", line 4, in __init__
  File "/home/ainhoa.vivel/anaconda3/envs/tl/lib/python3.9/site-packages/codecarbon/external/hardware.py", line 63, in __post_init__
    self.devices = AllGPUDevices()
  File "/home/ainhoa.vivel/anaconda3/envs/tl/lib/python3.9/site-packages/codecarbon/core/gpu.py", line 186, in __init__
    gpu_device = GPUDevice(handle=handle, gpu_index=i)
  File "<string>", line 8, in __init__
  File "/home/ainhoa.vivel/anaconda3/envs/tl/lib/python3.9/site-packages/codecarbon/core/gpu.py", line 24, in __post_init__
    self.last_energy = self._get_energy_kwh()
  File "/home/ainhoa.vivel/anaconda3/envs/tl/lib/python3.9/site-packages/codecarbon/core/gpu.py", line 28, in _get_energy_kwh
    return Energy.from_millijoules(self._get_total_energy_consumption())
  File "/home/ainhoa.vivel/anaconda3/envs/tl/lib/python3.9/site-packages/codecarbon/core/gpu.py", line 95, in _get_total_energy_consumption
    return pynvml.nvmlDeviceGetTotalEnergyConsumption(self.handle)
  File "/home/ainhoa.vivel/anaconda3/envs/tl/lib/python3.9/site-packages/pynvml/nvml.py", line 2411, in nvmlDeviceGetTotalEnergyConsumption
    _nvmlCheckReturn(ret)
  File "/home/ainhoa.vivel/anaconda3/envs/tl/lib/python3.9/site-packages/pynvml/nvml.py", line 833, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.nvml.NVMLError: System is not in ready state

Goal

It looks like this error is not properly caught. If this error appears, we should catch it and skip the GPU measurements.
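
For illustration only, here is a minimal sketch of the "catch it and skip" behaviour. It is not the actual fix and not codecarbon's code: `safe_energy_reading`, `usable_handles` and `fake_reader` are hypothetical names, and the reader function and exception types are passed in as parameters so the example runs without a GPU. With real hardware one would presumably pass `pynvml.nvmlDeviceGetTotalEnergyConsumption` as the reader and `(pynvml.NVMLError,)` as the errors to catch.

```python
import logging
from typing import Callable, Iterable, List, Optional, Tuple, Type

logger = logging.getLogger("gpu_energy_sketch")


def safe_energy_reading(
    read_energy_mj: Callable[[object], int],
    handle: object,
    errors: Tuple[Type[BaseException], ...] = (Exception,),
) -> Optional[int]:
    """Return the energy counter in millijoules, or None if the read fails."""
    try:
        return read_energy_mj(handle)
    except errors as exc:
        # Log and carry on instead of letting the tracker crash at start-up.
        logger.warning("Skipping GPU measurement for %r: %s", handle, exc)
        return None


def usable_handles(
    handles: Iterable[object],
    read_energy_mj: Callable[[object], int],
    errors: Tuple[Type[BaseException], ...] = (Exception,),
) -> List[object]:
    """Keep only the handles whose energy counter can actually be read."""
    return [
        h for h in handles
        if safe_energy_reading(read_energy_mj, h, errors) is not None
    ]


def fake_reader(handle: int) -> int:
    # Hypothetical stand-in for the NVML call: handle 1 is "not ready".
    if handle == 1:
        raise RuntimeError("System is not in ready state")
    return 1000 * (handle + 1)
```

With the stand-in reader, `usable_handles([0, 1, 2], fake_reader, (RuntimeError,))` drops the failing device and keeps the other two, so tracking could continue without GPU numbers for the device that is not ready.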

rosekelly6400 commented 4 months ago

@inimaz Hi, I'd like to start contributing to this project and give this bug fix a shot. Is this issue being worked on by anyone else currently?

inimaz commented 4 months ago

Hi @rosekelly6400! This should be solved by #613, so it is already being taken care of.

Thanks anyway, and feel free to reach out if you would like to contribute to any other issue.