mlco2 / codecarbon

Track emissions from Compute and recommend ways to reduce their impact on the environment.
https://mlco2.github.io/codecarbon
MIT License

GPU not found error. pynvml.nvml.NVMLError_NotSupported: Not Supported #518

Closed pankhuriverma closed 5 months ago

pankhuriverma commented 6 months ago

I want to measure the CPU energy consumption of my Python code using CodeCarbon. My GPU is an Nvidia GeForce MX250, which does not support energy monitoring. When I run the code, I get the error below because codecarbon tries to query the GPU. Screenshot from 2024-03-21 02-16-31


But when I run the same code in a Kaggle notebook, with the same codecarbon version and Python version, it is able to monitor only the CPU energy consumption when the GPU is disabled. Why is this the case?

This is the output from Kaggle. Screenshot from 2024-03-21 02-20-47

This is the code that I am running.

import tensorflow as tf
from codecarbon import EmissionsTracker

mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential(
    [
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(10),
    ]
)

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer="adam", loss=loss_fn, metrics=["accuracy"])

tracker = EmissionsTracker(gpu_ids=[])
tracker.start()
model.fit(x_train, y_train, epochs=10)
emissions: float = tracker.stop()
print(emissions)

Could you please help me identify what the issue is?

inimaz commented 6 months ago

Hello, thanks for using codecarbon! Indeed, we should look into this further; there should not be any difference. Can you provide the full error log from a run on your machine?

pankhuriverma commented 6 months ago

Hello @inimaz,

Thanks for your response. Below, you will find the complete error log.

2024-03-22 00:56:17.933449: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-22 00:56:19.940489: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/home/pankhuri/thesis/env_thesis/venv/lib/python3.10/site-packages/keras/src/layers/reshaping/flatten.py:37: UserWarning: Do not pass an input_shape/input_dim argument to a layer. When using Sequential models, prefer using an Input(shape) object as the first layer in the model instead.
  super().__init__(**kwargs)
2024-03-22 00:56:23.694224: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
[the NUMA node message above is repeated several more times between 00:56:23.825903 and 00:56:23.929133]
2024-03-22 00:56:23.929549: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1928] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 1739 MB memory: -> device: 0, name: NVIDIA GeForce MX250, pci bus id: 0000:3c:00.0, compute capability: 6.1
[codecarbon INFO @ 00:56:24] [setup] RAM Tracking...
[codecarbon INFO @ 00:56:24] [setup] GPU Tracking...
[codecarbon INFO @ 00:56:24] Tracking Nvidia GPU via pynvml
Traceback (most recent call last):
  File "/home/pankhuri/PycharmProjects/ThesisProject/models/codecarbon_emission_test_nn_model.py", line 25, in <module>
    tracker = EmissionsTracker(gpu_ids=[])
  File "/home/pankhuri/thesis/env_thesis/venv/lib/python3.10/site-packages/codecarbon/emissions_tracker.py", line 284, in __init__
    gpu_devices = GPU.from_utils(self._gpu_ids)
  File "/home/pankhuri/thesis/env_thesis/venv/lib/python3.10/site-packages/codecarbon/external/hardware.py", line 120, in from_utils
    return cls(gpu_ids=gpu_ids)
  File "<string>", line 4, in __init__
  File "/home/pankhuri/thesis/env_thesis/venv/lib/python3.10/site-packages/codecarbon/external/hardware.py", line 62, in __post_init__
    self.devices = AllGPUDevices()
  File "/home/pankhuri/thesis/env_thesis/venv/lib/python3.10/site-packages/codecarbon/core/gpu.py", line 186, in __init__
    gpu_device = GPUDevice(handle=handle, gpu_index=i)
  File "<string>", line 8, in __init__
  File "/home/pankhuri/thesis/env_thesis/venv/lib/python3.10/site-packages/codecarbon/core/gpu.py", line 24, in __post_init__
    self.last_energy = self._get_energy_kwh()
  File "/home/pankhuri/thesis/env_thesis/venv/lib/python3.10/site-packages/codecarbon/core/gpu.py", line 28, in _get_energy_kwh
    return Energy.from_millijoules(self._get_total_energy_consumption())
  File "/home/pankhuri/thesis/env_thesis/venv/lib/python3.10/site-packages/codecarbon/core/gpu.py", line 95, in _get_total_energy_consumption
    return pynvml.nvmlDeviceGetTotalEnergyConsumption(self.handle)
  File "/home/pankhuri/thesis/env_thesis/venv/lib/python3.10/site-packages/pynvml/nvml.py", line 2411, in nvmlDeviceGetTotalEnergyConsumption
    _nvmlCheckReturn(ret)
  File "/home/pankhuri/thesis/env_thesis/venv/lib/python3.10/site-packages/pynvml/nvml.py", line 833, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.nvml.NVMLError_NotSupported: Not Supported

Process finished with exit code 1
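The traceback shows the crash happens while GPUDevice is being initialized, when pynvml's nvmlDeviceGetTotalEnergyConsumption returns "Not Supported" on the MX250. A minimal, stdlib-only sketch of the defensive pattern that would avoid aborting setup (a hypothetical illustration, not codecarbon's actual code; query_total_energy_mj is a stand-in for the pynvml call):

```python
class NotSupportedError(Exception):
    """Stand-in for pynvml.NVMLError_NotSupported."""

def query_total_energy_mj(handle):
    # Stand-in for pynvml.nvmlDeviceGetTotalEnergyConsumption(handle);
    # consumer GPUs like the GeForce MX250 raise "Not Supported" here.
    raise NotSupportedError("Not Supported")

def safe_energy_kwh(handle):
    """Return the GPU's cumulative energy in kWh, or None when the
    device does not expose an energy counter."""
    try:
        millijoules = query_total_energy_mj(handle)
    except NotSupportedError:
        return None  # caller can skip this GPU or estimate from power draw
    return millijoules / (1000 * 3600 * 1000)  # mJ -> kWh

print(safe_energy_kwh(handle=0))  # -> None on an unsupported GPU
```

With a guard like this, an unsupported energy counter would degrade gracefully instead of raising out of the tracker's constructor.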

inimaz commented 6 months ago

Thanks! I am checking this further. Apparently, gpu_ids needs to be a string of comma-separated ids:

gpu_ids="" means no GPUs.
gpu_ids="1,3,5" means GPUs 1, 3, and 5.

For now, as a workaround for your case, you can pass an empty string:

tracker = EmissionsTracker(gpu_ids="")
tracker.start()

When time permits, we will allow passing a list of ints.
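The string semantics described above could be parsed along these lines (an illustrative sketch of the intended behavior, not codecarbon's actual parsing code):

```python
def parse_gpu_ids(gpu_ids: str) -> list[int]:
    """Turn a comma-separated id string into a list of GPU indices.
    An empty (or whitespace-only) string means "track no GPUs"."""
    if not gpu_ids.strip():
        return []
    return [int(part) for part in gpu_ids.split(",")]

print(parse_gpu_ids(""))       # -> []
print(parse_gpu_ids("1,3,5"))  # -> [1, 3, 5]
```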

pankhuriverma commented 6 months ago

Hello @inimaz,

I tried it on my system, but it gives the same error again. I also think it will not work, because gpu_ids only accepts a list or None. However, None is not working in my case either.

Screenshot from 2024-03-22 20-44-44

I also tried it on Kaggle, but a little differently this time. Screenshot from 2024-03-22 20-15-51 As you can see in the screenshot, I selected GPU T4 x 2 as the accelerator and passed gpu_ids="" as an input parameter, but codecarbon is still detecting the GPUs. Previously, when I ran the code on Kaggle, I had not selected any accelerator from the Kaggle menu, and that is why it was not detecting one. If I am correct, the gpu_ids input parameter is not working as expected.

Below you will find the logs from running the code with the GPU T4 x 2 accelerator and the gpu_ids="" parameter.

[codecarbon INFO @ 19:15:11] [setup] RAM Tracking...
[codecarbon INFO @ 19:15:11] [setup] GPU Tracking...
[codecarbon INFO @ 19:15:11] Tracking Nvidia GPU via pynvml
[codecarbon INFO @ 19:15:11] [setup] CPU Tracking...
[codecarbon WARNING @ 19:15:11] No CPU tracking mode found. Falling back on CPU constant mode.
[codecarbon WARNING @ 19:15:12] We saw that you have a Intel(R) Xeon(R) CPU @ 2.00GHz but we don't know it. Please contact us.
[codecarbon INFO @ 19:15:12] CPU Model on constant consumption mode: Intel(R) Xeon(R) CPU @ 2.00GHz
[codecarbon INFO @ 19:15:12] >>> Tracker's metadata:
[codecarbon INFO @ 19:15:12]   Platform system: Linux-5.15.133+-x86_64-with-glibc2.31
[codecarbon INFO @ 19:15:12]   Python version: 3.10.13
[codecarbon INFO @ 19:15:12]   CodeCarbon version: 2.3.4
[codecarbon INFO @ 19:15:12]   Available RAM : 31.358 GB
[codecarbon INFO @ 19:15:12]   CPU count: 4
[codecarbon INFO @ 19:15:12]   CPU model: Intel(R) Xeon(R) CPU @ 2.00GHz
[codecarbon INFO @ 19:15:12]   GPU count: 2
[codecarbon INFO @ 19:15:12]   GPU model: 2 x Tesla T4
[codecarbon INFO @ 19:15:15] Energy consumed for RAM : 0.000000 kWh. RAM Power : 11.759084701538086 W
[codecarbon INFO @ 19:15:15] Energy consumed for all GPUs : 0.000000 kWh. Total GPU Power : 0 W
[codecarbon INFO @ 19:15:15] Energy consumed for all CPUs : 0.000001 kWh. Total CPU Power : 42.5 W
[codecarbon INFO @ 19:15:15] 0.000001 kWh of electricity used since the beginning.

Logs from running the code with no accelerator and with the gpu_ids="" parameter.

Screenshot from 2024-03-22 20-28-04

Logs from running the code with no accelerator and without the gpu_ids="" parameter.

Screenshot from 2024-03-22 20-28-04

In the last two cases, you can see that it gives the same output on Kaggle.
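One plausible explanation for the discrepancy (an assumption on my part, not verified against the codecarbon source) is that naively splitting a string on commas never yields an empty list in Python, so a filter built that way would not treat gpu_ids="" as "no GPUs":

```python
# Splitting an empty string still yields one (empty) element, not zero:
print("".split(","))       # -> ['']
print("1,3,5".split(","))  # -> ['1', '3', '5']

# So id-filtering code written like this would misbehave for "":
ids = "".split(",")
print(len(ids))  # -> 1, not 0: the "no GPUs" case goes undetected
```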

inimaz commented 5 months ago

I see the confusion, let me explain:

Hope it clarifies!

pankhuriverma commented 5 months ago

Hi Inimaz,

The error screenshot that I provided, with 0 kWh energy consumed, is from Kaggle. All the screenshots with a red background are from Kaggle. kaggleError

All the screenshots below the statement "I also tried it on Kaggle, but a little differently this time." in my previous message are from Kaggle, covering different scenarios. Screenshot from 2024-04-10 02-37-49 This means that passing gpu_ids="" behaves differently locally and on Kaggle.

I hope my issue is clearer this time.

Thanks!