mlco2 / codecarbon

Track emissions from Compute and recommend ways to reduce their impact on the environment.
https://mlco2.github.io/codecarbon
MIT License

nvmlDeviceGetTotalEnergyConsumption not supported for Titan X #608

Closed: hreesteeyahn closed this issue 4 months ago

hreesteeyahn commented 4 months ago

Description

I am trying to reproduce the results of Luccioni, Sasha, et al. "Power Hungry Processing: Watts Driving the Cost of AI Deployment?" The 2024 ACM Conference on Fairness, Accountability, and Transparency, ACM, 2024. Crossref, https://doi.org/10.1145/3630106.3658542, as baseline results for my research thesis. When I run this script from the paper's code repo, https://github.com/sashavor/co2_inference/blob/main/code/qa/qa_squadv2.py, I get the following error:

{ "name": "NVMLError_NotSupported", "message": "Not Supported", "stack": "--------------------------------------------------------------------------- NVMLError_NotSupported Traceback (most recent call last) Cell In[7], line 38 36 for model in qa_models: 37 print(model) ---> 38 tracker = EmissionsTracker(project_name=model, measure_power_secs=1, logging_logger=_logger, output_file='./qa_squadv2.csv') 39 tracker.start() 40 tracker.start_task(\"load model\")

File /lib/python3.10/site-packages/codecarbon/emissions_tracker.py:284, in BaseEmissionsTracker.init(self, project_name, measure_power_secs, api_call_interval, api_endpoint, api_key, output_dir, output_file, save_to_file, save_to_api, save_to_logger, logging_logger, save_to_prometheus, prometheus_url, gpu_ids, emissions_endpoint, experiment_id, experiment_name, co2_signal_api_token, tracking_mode, log_level, on_csv_write, logger_preamble, default_cpu_power, pue) 282 if gpu.is_gpu_details_available(): 283 logger.info(\"Tracking Nvidia GPU via pynvml\") --> 284 gpu_devices = GPU.from_utils(self._gpu_ids) 285 self._hardware.append(gpu_devices) 286 gpu_names = [n[\"name\"] for n in gpu_devices.devices.get_gpu_static_info()]

File /lib/python3.10/site-packages/codecarbon/external/hardware.py:121, in GPU.from_utils(cls, gpu_ids) 119 @classmethod 120 def from_utils(cls, gpu_ids: Optional[List] = None) -> \"GPU\": --> 121 return cls(gpu_ids=gpu_ids)

File :4, in init(self, gpu_ids)

File /lib/python3.10/site-packages/codecarbon/external/hardware.py:63, in GPU.post_init(self) 62 def post_init(self): ---> 63 self.devices = AllGPUDevices() 64 self.num_gpus = self.devices.device_count 65 self._total_power = Power( 66 0 # It will be 0 until we call for the first time measure_power_and_energy 67 )

File /lib/python3.10/site-packages/codecarbon/core/gpu.py:208, in AllGPUDevices.init(self) 206 for i in range(self.device_count): 207 handle = pynvml.nvmlDeviceGetHandleByIndex(i) --> 208 gpu_device = GPUDevice(handle=handle, gpu_index=i) 209 self.devices.append(gpu_device)

File :8, in init(self, handle, gpu_index, energy_delta, power, last_energy)

File /lib/python3.10/site-packages/codecarbon/core/gpu.py:46, in GPUDevice.post_init(self) 45 def post_init(self): ---> 46 self.last_energy = self._get_energy_kwh() 47 self._init_static_details()

File /lib/python3.10/site-packages/codecarbon/core/gpu.py:50, in GPUDevice._get_energy_kwh(self) 49 def _get_energy_kwh(self): ---> 50 return Energy.from_millijoules(self._get_total_energy_consumption())

File /lib/python3.10/site-packages/codecarbon/core/gpu.py:117, in GPUDevice._get_total_energy_consumption(self) 113 def _get_total_energy_consumption(self): 114 \"\"\"Returns total energy consumption for this GPU in millijoules (mJ) since the driver was last reloaded 115 https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceQueries.html#group__nvmlDeviceQueries_1g732ab899b5bd18ac4bfb93c02de4900a 116 \"\"\" --> 117 return pynvml.nvmlDeviceGetTotalEnergyConsumption(self.handle)

File /lib/python3.10/site-packages/pynvml/nvml.py:2411, in nvmlDeviceGetTotalEnergyConsumption(handle) 2409 fn = _nvmlGetFunctionPointer(\"nvmlDeviceGetTotalEnergyConsumption\") 2410 ret = fn(handle, byref(c_millijoules)) -> 2411 _nvmlCheckReturn(ret) 2412 return c_millijoules.value

File /watts/lib/python3.10/site-packages/pynvml/nvml.py:833, in _nvmlCheckReturn(ret) 831 def _nvmlCheckReturn(ret): 832 if (ret != NVML_SUCCESS): --> 833 raise NVMLError(ret) 834 return ret

NVMLError_NotSupported: Not Supported" }

It seems the issue is that the GPU I am using (an NVIDIA GeForce GTX Titan X) does not support nvmlDeviceGetTotalEnergyConsumption.
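A small probe along these lines (written for this report, and assuming a single GPU at index 0) should confirm whether the total-energy query is supported on a given device, independently of codecarbon:

```python
# Check whether the NVML total-energy counter is supported on this GPU.
import pynvml

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # assumes a single GPU at index 0
    name = pynvml.nvmlDeviceGetName(handle)
    try:
        energy_mj = pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)
        print(f"{name}: total energy since driver reload = {energy_mj} mJ")
    except pynvml.NVMLError as err:
        # On the Titan X this should raise NVMLError_NotSupported, matching the traceback.
        print(f"{name}: nvmlDeviceGetTotalEnergyConsumption not supported ({err})")
finally:
    pynvml.nvmlShutdown()
```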

Are there any workarounds that you are aware of?

benoit-cty commented 4 months ago

You may have a look at https://forums.developer.nvidia.com/t/bug-nvml-incorrectly-detects-certain-gpus-as-unsupported/30165/4 (and https://github.com/CFSworks/nvml_fix).

There seems to be a way to patch the NVIDIA code to support GPUs that they apparently don't want to support :exploding_head:
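If patching the driver library is not an option, a rougher workaround could be to approximate energy by integrating power samples from nvmlDeviceGetPowerUsage, which is often supported even where the total-energy counter is not. This is only a sketch of the idea, separate from codecarbon's own tracking loop; the sampling interval and device index are arbitrary choices here:

```python
# Fallback sketch: approximate energy by sampling instantaneous power and
# integrating over time, instead of using the unsupported total-energy counter.
import time
import pynvml

def estimate_energy_kwh(duration_s: float, interval_s: float = 1.0, gpu_index: int = 0) -> float:
    """Sample nvmlDeviceGetPowerUsage (milliwatts) every interval_s seconds
    for duration_s seconds and return the integrated energy in kWh."""
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
        energy_j = 0.0
        elapsed = 0.0
        while elapsed < duration_s:
            power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # mW -> W
            energy_j += power_w * interval_s                           # W * s -> J
            time.sleep(interval_s)
            elapsed += interval_s
        return energy_j / 3.6e6  # J -> kWh
    finally:
        pynvml.nvmlShutdown()

# Example: sample for 10 seconds while a workload runs elsewhere.
# print(estimate_energy_kwh(10.0))
```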