mlco2 / codecarbon

Track emissions from Compute and recommend ways to reduce their impact on the environment.
https://mlco2.github.io/codecarbon
MIT License

pynvml.nvml.NVMLError: System is not in ready state #578

Closed: ainhoaVivel closed this issue 1 week ago

ainhoaVivel commented 1 week ago

Description

I wanted to measure the energy consumption of some mt5-base training runs using CodeCarbon. Until now I had been using v2.2.0 for training without any issues. A few days ago I upgraded to v2.4.2 and it doesn't work anymore. I didn't make any changes to my code. I have tried going back to the previous version of CodeCarbon and I have also tried other versions. I have been able to verify that the error occurs from v2.3.0 onwards, when pynvml was introduced.

What I Did

I executed my script:

nohup python baseline.py > ./logs/baseline.out 2>&1 &

However, I got this error:

[codecarbon WARNING @ 11:19:56] Invalid gpu_ids format. Expected a string or a list of ints.
[codecarbon INFO @ 11:19:56] [setup] RAM Tracking...
[codecarbon INFO @ 11:19:56] [setup] GPU Tracking...
[codecarbon INFO @ 11:19:56] Tracking Nvidia GPU via pynvml
Traceback (most recent call last):
  File "/home/ainhoa.vivel/TFM/transfer_learning/baseline.py", line 191, in <module>
    tracker = EmissionsTracker(project_name="baseline")
  File "/home/ainhoa.vivel/anaconda3/envs/tl/lib/python3.9/site-packages/codecarbon/emissions_tracker.py", line 296, in __init__
    gpu_devices = GPU.from_utils(self._gpu_ids)
  File "/home/ainhoa.vivel/anaconda3/envs/tl/lib/python3.9/site-packages/codecarbon/external/hardware.py", line 121, in from_utils
    return cls(gpu_ids=gpu_ids)
  File "<string>", line 4, in __init__
  File "/home/ainhoa.vivel/anaconda3/envs/tl/lib/python3.9/site-packages/codecarbon/external/hardware.py", line 63, in __post_init__
    self.devices = AllGPUDevices()
  File "/home/ainhoa.vivel/anaconda3/envs/tl/lib/python3.9/site-packages/codecarbon/core/gpu.py", line 186, in __init__
    gpu_device = GPUDevice(handle=handle, gpu_index=i)
  File "<string>", line 8, in __init__
  File "/home/ainhoa.vivel/anaconda3/envs/tl/lib/python3.9/site-packages/codecarbon/core/gpu.py", line 24, in __post_init__
    self.last_energy = self._get_energy_kwh()
  File "/home/ainhoa.vivel/anaconda3/envs/tl/lib/python3.9/site-packages/codecarbon/core/gpu.py", line 28, in _get_energy_kwh
    return Energy.from_millijoules(self._get_total_energy_consumption())
  File "/home/ainhoa.vivel/anaconda3/envs/tl/lib/python3.9/site-packages/codecarbon/core/gpu.py", line 95, in _get_total_energy_consumption
    return pynvml.nvmlDeviceGetTotalEnergyConsumption(self.handle)
  File "/home/ainhoa.vivel/anaconda3/envs/tl/lib/python3.9/site-packages/pynvml/nvml.py", line 2411, in nvmlDeviceGetTotalEnergyConsumption
    _nvmlCheckReturn(ret)
  File "/home/ainhoa.vivel/anaconda3/envs/tl/lib/python3.9/site-packages/pynvml/nvml.py", line 833, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.nvml.NVMLError: System is not in ready state

My Python script looks like this:

if __name__ == "__main__":
    tracker = EmissionsTracker(project_name="baseline")
    tracker.start()
    try:
        baseline(tracker)
    finally:
        tracker.stop()

Inside baseline I have a few tracker.flush() calls, but nothing else related to CodeCarbon.

I have tried several versions of CodeCarbon and pynvml, but nothing helped. I can't find any additional information about the "System is not in ready state" error either. Any idea how to fix this or what causes it?

inimaz commented 1 week ago

Hello @ainhoaVivel! Thanks for using codecarbon and for reporting this.

In this case, if codecarbon worked with versions <2.3.0 but not with versions >=2.3.0, I suspect that the pynvml.nvmlDeviceGetTotalEnergyConsumption call never worked on your system. From 2.3.0 onwards we measure the energy of the GPU to calculate the emissions, whereas before we were using only the power.
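To illustrate the difference between the two approaches, here is a minimal sketch. It is not codecarbon's actual implementation: the function names, the 15-second sampling interval, and the readings are hypothetical. The first function estimates energy by summing periodic power readings; the second takes the difference between two readings of a cumulative energy counter in millijoules, the unit that the Energy.from_millijoules call in the traceback implies.

```python
JOULES_PER_KWH = 3_600_000.0


def energy_from_power_samples(power_watts, interval_s):
    """Estimate energy (kWh) from power readings taken every `interval_s` seconds."""
    joules = sum(p * interval_s for p in power_watts)
    return joules / JOULES_PER_KWH


def energy_from_counter(start_mj, end_mj):
    """Energy (kWh) between two readings of a cumulative millijoule counter."""
    return (end_mj - start_mj) / 1000.0 / JOULES_PER_KWH


# Hypothetical readings: a GPU drawing a steady 250 W for one hour.
samples = [250.0] * 240  # one power reading every 15 s
print(energy_from_power_samples(samples, 15.0))  # 0.25
print(energy_from_counter(0, 900_000_000))       # 0.25
```

The power-based estimate only needs the power query to work, while the counter-based one depends on the energy query, which is the call that fails in the traceback.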

Could you check that the drivers are set up correctly by running:

nvidia-smi

inimaz commented 1 week ago

Another option is that the drivers are OK but pynvml, for some reason, does not initialize correctly. You could try something like:

import pynvml

try:
    pynvml.nvmlInit()
    device_count = pynvml.nvmlDeviceGetCount()
    print(f"Number of GPUs available: {device_count}")
    for i in range(device_count):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        info = pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)
        print(f"id:{i}, info:{info}")
    pynvml.nvmlShutdown()
except pynvml.NVMLError as e:
    print(f"Failed to initialize NVML: {str(e)}")

ainhoaVivel commented 1 week ago

Hi @inimaz! Thank you very much for your answer.

I can confirm that the drivers are working fine. [image]

You are right about pynvml. I created a new file with the code you provided and got the same error as when I run my LM training script that uses CodeCarbon.

Number of GPUs available: 2
Failed to initialize NVML: System is not in ready state

Do you know how this problem could be solved?
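Note that the device count was printed before the error, so nvmlInit itself succeeded and the failure comes from a later call. A more granular probe, sketched below, could show which query fails on which GPU and whether the power query still works. probe_gpus is a hypothetical helper name; the pynvml functions it calls are nvmlInit, nvmlDeviceGetCount, nvmlDeviceGetHandleByIndex, nvmlDeviceGetPowerUsage, nvmlDeviceGetTotalEnergyConsumption, and nvmlShutdown. The module is passed in as a parameter, an assumption made here so the function can be exercised without a GPU.

```python
def probe_gpus(nvml):
    """Query power and energy per GPU; `nvml` is the imported pynvml module."""
    report = {}
    nvml.nvmlInit()
    try:
        for i in range(nvml.nvmlDeviceGetCount()):
            handle = nvml.nvmlDeviceGetHandleByIndex(i)
            entry = {}
            # Try each query separately so one failure does not hide the other.
            for name, query in (
                ("power_mw", nvml.nvmlDeviceGetPowerUsage),
                ("energy_mj", nvml.nvmlDeviceGetTotalEnergyConsumption),
            ):
                try:
                    entry[name] = query(handle)
                except nvml.NVMLError as err:
                    entry[name] = f"error: {err}"
            report[i] = entry
    finally:
        # Always release NVML, even if a call above raised.
        nvml.nvmlShutdown()
    return report


# Usage on a machine with NVIDIA drivers installed:
#   import pynvml
#   print(probe_gpus(pynvml))
```

If power_mw returns a number while energy_mj reports an error, that would be consistent with the older power-based versions working and the newer energy-based ones failing.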

inimaz commented 1 week ago

The good news is that you can reproduce it with that example. The bad news is that I don't know how to help any further... On the codecarbon side, what we might do is fall back to some constant mode if the call to pynvml.nvmlDeviceGetTotalEnergyConsumption is not successful. On the pynvml side... maybe you could open an issue in their repo? https://github.com/gpuopenanalytics/pynvml
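A fallback of that kind could look roughly like the sketch below. This is not codecarbon's code: GpuEnergySource, read_energy_mj, and the constant power value are hypothetical, and "constant mode" is interpreted here as assuming a fixed power draw over the elapsed time. A real version would catch pynvml.NVMLError rather than a bare Exception.

```python
import time


class GpuEnergySource:
    """Use a cumulative energy counter when available, else assume constant power."""

    def __init__(self, read_energy_mj, constant_power_w, clock=time.monotonic):
        self._read_energy_mj = read_energy_mj      # callable returning millijoules
        self._constant_power_w = constant_power_w  # assumed draw in fallback mode
        self._clock = clock
        self._last_time = clock()
        try:
            self._last_mj = read_energy_mj()
            self.counter_ok = True
        except Exception:
            # Counter unavailable at startup: start in constant mode.
            self._last_mj = None
            self.counter_ok = False

    def delta_joules(self):
        """Energy used since the previous call, in joules."""
        now = self._clock()
        elapsed = now - self._last_time
        self._last_time = now
        if self.counter_ok:
            try:
                current = self._read_energy_mj()
                delta = (current - self._last_mj) / 1000.0
                self._last_mj = current
                return delta
            except Exception:
                self.counter_ok = False  # switch to constant mode from now on
        return self._constant_power_w * elapsed
```

With this shape, a failing counter would degrade the measurement to an estimate instead of raising during tracker construction.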

ainhoaVivel commented 1 week ago

Okay! It is clear that this is not a codecarbon issue, so I'll ask the pynvml maintainers for more information about this error. Thank you very much for your help!