Measure GPU consumption

bpetit commented 3 years ago

Problem

Some power-hungry use cases rely on GPUs. It would be great to be able to measure GPU power consumption from the infrastructure point of view.

Solution

We could take inspiration from codecarbon, which uses pynvml.

Alternatives

Any other existing library would be worth a look.

Additional context

The idea is to make it easier to collect these metrics from the infrastructure, and thus to feed metrics pipelines that could make it easier to expose this impact to cloud providers' machine learning clients.

uggla commented 3 years ago

Hello, I did some investigation on this topic. There is a Rust wrapper for the NVML library here: https://crates.io/crates/nvml-wrapper so getting info from an Nvidia board does not look really complicated. I have extracted and updated the provided example to read the power usage: https://github.com/uggla/nvml-basic Unfortunately, I get the following output:

 uggla   main  ~  workspace  rust  nvml-basic  cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.01s
     Running `target/debug/basic`

Your NVIDIA GeForce GTX 1050 is currently sitting at 40 °C with a graphics clock of 139 MHz and a memory
clock of 405 MHz.
Memory usage is 4.92 MB out of an available 2.1 GB.
Right now the device is connected via a PCIe gen 1 x16 interface;
the max your hardware supports is PCIe gen 3 x16.
Power consumption is Not supported.

This device is not on a multi-GPU board.

System CUDA version: 11.3

So I managed to get data from my 1050 board, but power usage is not supported. :( I have read that it can be a driver limitation, but I rather suspect a limitation of my hardware. It would be great if someone could run this short code example on a different GPU before going ahead with the Scaphandre implementation.
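
A minimal sketch of how a caller might treat "Not supported" as a normal, non-fatal outcome rather than as a crash. The `PowerError` type and `power_metric` function are hypothetical names made up for this illustration; a real NVML binding has its own error type, and the actual NVML call is left out so that the example depends only on the standard library.

```rust
use std::fmt;

/// Hypothetical error type for this sketch; a real NVML binding has its own.
#[derive(Debug, Clone, PartialEq)]
pub enum PowerError {
    /// The device or driver does not expose a power reading.
    NotSupported,
    /// Any other failure, kept as text.
    Other(String),
}

impl fmt::Display for PowerError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            PowerError::NotSupported => write!(f, "Not supported"),
            PowerError::Other(msg) => write!(f, "error: {}", msg),
        }
    }
}

/// Turn a raw reading into an optional metric: unsupported hardware yields
/// `Ok(None)` (the metric is simply skipped), other failures are propagated.
pub fn power_metric(reading: Result<u32, PowerError>) -> Result<Option<u32>, PowerError> {
    match reading {
        Ok(raw) => Ok(Some(raw)),
        Err(PowerError::NotSupported) => Ok(None),
        Err(e) => Err(e),
    }
}
```

With this shape, a GTX 1050 like the one above would produce `Ok(None)`, and an exporter could omit the GPU power metric for that device instead of failing.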

demeringo commented 3 years ago

Hi, this is neat !

Your feedback made me curious enough to test nvml-wrapper on an AWS EC2 instance with an Nvidia GPU.

Disclaimer: my knowledge and experience of GPUs and their drivers is absolutely zero. So if you find anything below that does not make sense, please tell me ;-)

EC2 instance

g3.4xlarge
eu-west-1
all default settings
using the AWS-provided AMI that comes with the Nvidia Tesla driver preinstalled: amzn2-ami-graphics-hvm-2.0.20210427.0-x86_64-gp2-e6724620-3ffb-4cc9-9690-c310d8e794ef

First attempt: libnvidia-ml.so not found

It did not work out of the box (complaining about missing libnvidia-ml.so).

[root@ip-172-31-3-186 nvml-basic]# cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.01s
     Running `target/debug/basic`
Error: LibloadingError(DlOpen { desc: "libnvidia-ml.so: cannot open shared object file: No such file or directory" })

Second attempt: create a symlink to the lib

I did a couple of things to make it work:

created the LD_LIBRARY_PATH env variable (export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/lib64"), but it did not work either
created a symlink: ln -s /usr/lib64/libnvidia-ml.so.1 /usr/lib64/libnvidia-ml.so

Relaunched, and we have a measurement:

[root@ip-172-31-3-186 nvml-basic]# cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.01s
     Running `target/debug/basic`

Your Tesla M60 is currently sitting at 21 °C with a graphics clock of 405 MHz and a memory clock of 324 MHz. 
Memory usage is 0 B out of an available 7.99 GB. 
Right now the device is connected via a PCIe gen 1 x16 interface; the max your hardware supports is PCIe gen 3 x16. 
Power consumption is 14599.

This device is not on a multi-GPU board.

System CUDA version: 11.0
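
NVML documents its power reading in milliwatts, so, assuming the example prints the raw value, 14599 would be about 14.6 W. A small sketch of the conversion (the function names are made up for this illustration):

```rust
/// Convert a raw NVML-style power reading in milliwatts to watts.
pub fn milliwatts_to_watts(mw: u32) -> f64 {
    f64::from(mw) / 1000.0
}

/// Format a raw milliwatt reading for display, e.g. in a log line.
pub fn format_watts(mw: u32) -> String {
    format!("{:.2} W", milliwatts_to_watts(mw))
}
```

For example, `format_watts(14599)` gives `14.60 W`.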

In retrospect I am not sure if creating the LD_LIBRARY_PATH was of any use.
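
A possible reading of the above, stated as an assumption: the loader was asked for the unversioned name libnvidia-ml.so while only the versioned libnvidia-ml.so.1 was installed, which would explain why the symlink helped and the search path did not. The sketch below shows how a program could probe for either file name instead of requiring the symlink. `find_nvml_library` is a hypothetical helper, and how the resulting path is handed to the NVML binding is deliberately left out.

```rust
use std::path::{Path, PathBuf};

/// File names to try, in order. Both names appear in the report above.
const CANDIDATES: [&str; 2] = ["libnvidia-ml.so", "libnvidia-ml.so.1"];

/// Return the first candidate that exists in `dir`, if any.
pub fn find_nvml_library(dir: &Path) -> Option<PathBuf> {
    CANDIDATES
        .iter()
        .map(|name| dir.join(name))
        .find(|path| path.exists())
}
```

On the EC2 instance described here, calling it with `/usr/lib64` before creating the symlink would be expected to return the `.so.1` path.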

Using nvidia-smi command

While trying to get this to work I came across the nvidia-smi command (see https://serverfault.com/questions/395455/how-to-check-gpu-usages-on-aws-ec2-gpu-instance).

I tried running nvidia-smi -i 0 -l -q -d POWER, which returned results in the same range (around 14 W idle). I do not know how the calculation is done, but it displays a measurement summary at regular intervals (the 3 successive outputs included below are 5 seconds apart).

nvidia-smi -i 0 -l -q -d POWER

==============NVSMI LOG==============

Timestamp                                 : Mon May 10 22:25:27 2021
Driver Version                            : 450.119.01
CUDA Version                              : 11.0

Attached GPUs                             : 1
GPU 00000000:00:1E.0
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 13.88 W
        Power Limit                       : 150.00 W
        Default Power Limit               : 150.00 W
        Enforced Power Limit              : 150.00 W
        Min Power Limit                   : 112.50 W
        Max Power Limit                   : 162.00 W
    Power Samples
        Duration                          : 40.52 sec
        Number of Samples                 : 119
        Max                               : 14.73 W
        Min                               : 13.39 W
        Avg                               : 14.07 W

==============NVSMI LOG==============

Timestamp                                 : Mon May 10 22:25:32 2021
Driver Version                            : 450.119.01
CUDA Version                              : 11.0

Attached GPUs                             : 1
GPU 00000000:00:1E.0
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 15.08 W
        Power Limit                       : 150.00 W
        Default Power Limit               : 150.00 W
        Enforced Power Limit              : 150.00 W
        Min Power Limit                   : 112.50 W
        Max Power Limit                   : 162.00 W
    Power Samples
        Duration                          : 40.52 sec
        Number of Samples                 : 119
        Max                               : 14.73 W
        Min                               : 13.39 W
        Avg                               : 14.08 W

==============NVSMI LOG==============

Timestamp                                 : Mon May 10 22:25:37 2021
Driver Version                            : 450.119.01
CUDA Version                              : 11.0

Attached GPUs                             : 1
GPU 00000000:00:1E.0
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 13.88 W
        Power Limit                       : 150.00 W
        Default Power Limit               : 150.00 W
        Enforced Power Limit              : 150.00 W
        Min Power Limit                   : 112.50 W
        Max Power Limit                   : 162.00 W
    Power Samples
        Duration                          : 40.22 sec
        Number of Samples                 : 119
        Max                               : 14.73 W
        Min                               : 13.39 W
        Avg                               : 14.07 W

I have no idea whether these results are realistic for an idle machine. But I find it very exciting that we are able to probe something out of an AWS instance using a GPU ;-)
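
For anyone who wants to consume this text output without going through NVML, here is a rough sketch that extracts the "Power Draw" field from a report like the ones above. `parse_power_draw` is a made-up helper; it only looks at the first matching line, so it ignores additional GPUs and later snapshots, and it returns `None` for non-numeric values such as N/A.

```rust
/// Extract the first "Power Draw" value, in watts, from the text printed by
/// `nvidia-smi -q -d POWER`. Returns `None` when the field is missing or is
/// not a number followed by "W" (for example "N/A").
pub fn parse_power_draw(report: &str) -> Option<f64> {
    report
        .lines()
        // Split each line into "key : value" at the first colon.
        .filter_map(|line| line.split_once(':'))
        .find(|(key, _)| key.trim() == "Power Draw")
        // Drop the trailing unit, then parse what is left as a float.
        .and_then(|(_, value)| value.trim().strip_suffix('W'))
        .and_then(|number| number.trim().parse::<f64>().ok())
}
```

Applied to the first report above, this returns `Some(13.88)`.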

I think it would be interesting to redo the test with some kind of representative workload, and also verify if it works the same with other providers like azure or gcp.

uggla commented 3 years ago

Hello @demeringo,

Thank you very much, this is really helpful. Sorry, I knew about the missing libnvidia-ml.so but forgot to mention it in my previous post. 14 W idle for such a card seems plausible.

As I will not be able to fully test it on my laptop, I will mock the GPU results. Would you be able to run a test with Scaphandre once I have implemented GPU power reporting? That would be great.
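
One way such mocking could look, as a sketch only: the code that consumes readings depends on a small trait, and a mock implementation replays canned values on machines without a supported GPU. The trait `GpuPowerSensor`, the `MockSensor` type and `average_watts` are hypothetical names for this illustration and are not taken from Scaphandre.

```rust
/// Hypothetical abstraction over "something that reports GPU power".
pub trait GpuPowerSensor {
    /// Power in milliwatts, or `None` when the device exposes no reading.
    fn power_milliwatts(&mut self) -> Option<u32>;
}

/// Mock sensor that replays canned readings in a loop.
pub struct MockSensor {
    readings: Vec<Option<u32>>,
    next: usize,
}

impl MockSensor {
    pub fn new(readings: Vec<Option<u32>>) -> Self {
        Self { readings, next: 0 }
    }
}

impl GpuPowerSensor for MockSensor {
    fn power_milliwatts(&mut self) -> Option<u32> {
        if self.readings.is_empty() {
            return None;
        }
        let value = self.readings[self.next % self.readings.len()];
        self.next += 1;
        value
    }
}

/// Example consumer: average power in watts over `samples` readings,
/// skipping readings that are unavailable.
pub fn average_watts(sensor: &mut dyn GpuPowerSensor, samples: usize) -> Option<f64> {
    let values: Vec<u32> = (0..samples)
        .filter_map(|_| sensor.power_milliwatts())
        .collect();
    if values.is_empty() {
        return None;
    }
    let total: f64 = values.iter().map(|&mw| f64::from(mw)).sum();
    Some(total / values.len() as f64 / 1000.0)
}
```

A real implementation of the trait would wrap the NVML call; tests would use `MockSensor`, including a variant that always returns `None` to mimic hardware without a power sensor.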

I think it would be interesting to redo the test with some kind of representative workload, and also verify if it works the same with other providers like azure or gcp.

Absolutely, but I don't think there is any reason for it to differ between providers as long as it is Nvidia GPU hardware. Another interesting test would be with multiple GPUs, to see how the library reacts in that case.

demeringo commented 3 years ago

Yes, this would be perfect. I can set up different public cloud servers for testing during a limited time... but I lack the Rust skills to do the integration... so if you could take it, I would be more than happy to test a branch ;-)

mindrunner commented 3 years ago

I am happy to test on a bare metal box with a 1050 Ti (if testing is feasible in production mode)

However, it seems that power draw might not be supported by some cards :(

==============NVSMI LOG==============

Timestamp                                 : Tue May 11 10:24:09 2021
Driver Version                            : 465.27
CUDA Version                              : 11.3

Attached GPUs                             : 1
GPU 00000000:02:00.0
    Power Readings
        Power Management                  : Supported
        Power Draw                        : N/A
        Power Limit                       : 75.00 W
        Default Power Limit               : 75.00 W
        Enforced Power Limit              : 75.00 W
        Min Power Limit                   : 52.50 W
        Max Power Limit                   : 75.00 W
    Power Samples
        Duration                          : 18446744073707.55 sec
        Number of Samples                 : 119
        Max                               : 35.50 W
        Min                               : 35.50 W
        Avg                               : 0.00 W

uggla commented 3 years ago

However, it seems that power draw might not be supported by some cards :(

@mindrunner, yes, we have the same issue: I have a 1050 (not Ti) in my laptop and it is not supported. That's why I asked people with different hardware to check.

on a bare metal box

Is your 1050 Ti a chip embedded in a laptop or soldered onto a motherboard, or a "real" card plugged into a PCI Express slot? I understand it is the last option, but this is just to be sure.

uggla commented 3 years ago

@demeringo ,

Yes, this would be perfect. I can set up different public cloud servers for testing during a limited time... but I lack the Rust skills to do the integration... so if you could take it, I would be more than happy to test a branch ;-)

Super cool, I'll notify you as soon as I have something usable. I just need to find a bit of spare time to handle it....

mindrunner commented 3 years ago

Yeah, laptop cards have different power management (PM), also because they are usually run alongside an Intel GPU and so on...

The card in my laptop lets me read the power draw: (01:00.0 VGA compatible controller: NVIDIA Corporation TU106GLM [Quadro RTX 3000 Mobile / Max-Q] (rev a1))

==============NVSMI LOG==============

Timestamp                                 : Tue May 11 18:44:53 2021
Driver Version                            : 465.27
CUDA Version                              : 11.3

Attached GPUs                             : 1
GPU 00000000:01:00.0
    Power Readings
        Power Management                  : N/A
        Power Draw                        : 12.45 W
        Power Limit                       : N/A
        Default Power Limit               : N/A
        Enforced Power Limit              : N/A
        Min Power Limit                   : N/A
        Max Power Limit                   : N/A
    Power Samples
        Duration                          : Not Found
        Number of Samples                 : Not Found
        Max                               : Not Found
        Min                               : Not Found
        Avg                               : Not Found

The card I was talking about in my previous post is a "normal" PCIe card: (02:00.0 VGA compatible controller: NVIDIA Corporation GP107 [GeForce GTX 1050 Ti] (rev a1)) It is a PALIT GeForce® GTX 1050 Ti KalmX 4GB passive, just for reference.

uggla commented 3 years ago

@mindrunner, thank you. This is really helpful. I think the 1050 is entry-level hardware, so maybe that's why there is no power sensor. Or maybe there is one that is disabled by the driver...

Yahoo, you have a laptop with a Quadro chip. It seems to be a high-end laptop. I did not know that a laptop could have that kind of chip.

mindrunner commented 3 years ago

Yahoo, you have a laptop with a Quadro chip. It seems to be a high-end laptop. I did not know that a laptop could have that kind of chip.

I guess it is pretty high end. DELL Precision 5750, the business brother of the 2020 XPS 17....

Anyway, searching the internet about this issue creates even more confusion. Some say it is a driver issue and that it is supposed to work with an older driver version. Not sure about that. If I figure out more, I will get back in touch here. It would be nice to have GPU power included in my Grafana dashboard, but in my case it is really only eye candy and nothing urgent :D

itwars commented 3 years ago

Hi, you could perhaps have a look at https://pypi.org/project/pyJoules/. Maybe it works the same way as pynvml?

uggla commented 3 years ago

Hello @itwars, thank you. In fact, all these solutions rely on Nvidia's NVML library plus the appropriate driver and hardware. The Rust NVML wrapper (https://crates.io/crates/nvml-wrapper) works very well, so Scaphandre should soon be able to report Nvidia GPU consumption. It might take a bit longer than expected, as @bpetit and @PierreRust are currently changing some internals.

itwars commented 3 years ago

Excellent! I'm really excited about having GPU power monitoring for my GPU-powered AI lab. Any chance of having something similar for both AMD and Intel GPUs?

uggla commented 3 years ago

@itwars, it seems only a subset of Nvidia boards support this feature, mostly the high-end ones. Regarding Intel and AMD, I have not done extensive research, but it seems power data is not available. Libraries equivalent to NVML are really limited. The only good news is that the one from AMD is open source, if I remember correctly (not the case for NVML). It sounds like energy management was not really a priority for GPU suppliers. Hopefully that will change in the near future.

quantumsheep commented 1 year ago

Hi, is there any news on this issue?

uggla commented 1 year ago

@quantumsheep not really. Do you need this feature? If someone needs it, I could be motivated to implement it.

quantumsheep commented 1 year ago

@uggla We have some servers with multiple GPUs whose electrical consumption we want to measure. We can take some time to implement the feature, but if you can guide us on how to do it we would love it ❤️
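
If "electrical consumption" here means energy rather than instantaneous power, one common approach is to integrate power samples over time. The sketch below does this with the trapezoidal rule, per GPU and then summed across GPUs. The `Sample` type and the function names are made up for this illustration, and how the samples are collected (NVML, nvidia-smi, or something else) is out of scope.

```rust
/// One power sample: seconds since the start of the measurement, and watts.
#[derive(Debug, Clone, Copy)]
pub struct Sample {
    pub t_secs: f64,
    pub watts: f64,
}

/// Energy in watt-hours for one GPU, using the trapezoidal rule over
/// consecutive samples. Samples are assumed to be sorted by time.
pub fn energy_wh(samples: &[Sample]) -> f64 {
    let joules: f64 = samples
        .windows(2)
        .map(|pair| {
            let dt = pair[1].t_secs - pair[0].t_secs;
            0.5 * (pair[0].watts + pair[1].watts) * dt
        })
        .sum();
    // 1 Wh = 3600 J
    joules / 3600.0
}

/// Total energy across several GPUs, one sample series per GPU.
pub fn total_energy_wh(per_gpu: &[Vec<Sample>]) -> f64 {
    per_gpu.iter().map(|series| energy_wh(series)).sum()
}
```

As a sanity check, a GPU drawing a constant 100 W for one hour comes out at 100 Wh, and two such GPUs at 200 Wh.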

samuelrince commented 1 year ago

Hey @uggla and @quantumsheep I also need this feature! It would be perfect to have it in Scaphandre directly. Currently I rely on this project utkuozdemir/nvidia_gpu_exporter. But it is built around Prometheus and there is no other way to export data (to my knowledge). In Boavizta/boagent we use the JSON exporter from Scaphandre and would like to keep that workflow for GPU metrics as well. Happy to help if I can, but I don't think you can count on my Rust programming skills unfortunately 🙃

uggla commented 1 year ago

Ok, I need to discuss with @bpetit his plans for the next release. I also need to discuss how Benoit wants to deal with input plugins; I think this is the main difficulty with this issue. Then I will try to put this issue on the TODO list.

yuxin1234 commented 1 year ago

@uggla @bpetit Any update on this issue? Thanks @filga

bpetit commented 1 year ago

Hi !

I have a lot to catch up on in this thread, sorry!

@uggla don't hesitate to open a PR on dev. We are not doing many internal changes these days, mostly new features, so there shouldn't be too many conflicts.

I'll be more than happy to look at your PR soon after the next release.

hubblo-org / scaphandre