utkuozdemir / nvidia_gpu_exporter

Nvidia GPU exporter for prometheus using nvidia-smi binary
MIT License

Grafana customizable GPU uuid #124

Open Cyria7 opened 1 year ago

Cyria7 commented 1 year ago

First of all, I would like to say this repo is great work, and it partly solves my requirements. Since my lab's compute cards are spread across different hosts with different IPs, I can't clearly tell which GPU belongs to which server by UUID. So I'm wondering: is the GPU name shown in the selector in the top-left corner of the dashboard customizable?

abbottjlu commented 2 months ago

The command nvidia-smi -L outputs:

GPU 0: NVIDIA GeForce RTX 3080 Ti (UUID: GPU-0Af501e0-4fa6-bc45-5993-92f20a1f76Xe)
GPU 1: NVIDIA GeForce RTX 3080 Ti (UUID: GPU-1A61a9a-19f7-9670-bcc2-c0e3a6074cX8)
GPU 2: NVIDIA GeForce RTX 3080 Ti (UUID: GPU-aAc30fe0e-d40c-2fb3-a4d5-1136a76066X7)
GPU 3: NVIDIA GeForce RTX 3080 Ti (UUID: GPU-8Adeb867-b3ee-37b9-b81b-2c432e01cbX4)

To enhance readability, mapping the UUID to GPU indices (e.g., GPU 0, GPU 1) would be beneficial.

Furthermore, considering that GPUs might be distributed across multiple servers, a notation like GPU X@hostname would provide more context and clarity.
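As a rough illustration of the mapping proposed here, the following Python sketch parses nvidia-smi -L style output and builds a UUID to "GPU X@hostname" lookup. The function name label_gpus and the sample hostname gpu01 are hypothetical, and the regular expression assumes the line format shown in the listing above; it is not part of the exporter.

```python
import re
import socket

# Matches lines such as "GPU 0: NVIDIA GeForce RTX 3080 Ti (UUID: GPU-xxxx)".
GPU_LINE = re.compile(r"^GPU (\d+): (.+) \(UUID: (GPU-[^)]+)\)$")


def label_gpus(listing, hostname=None):
    """Map each GPU UUID in `nvidia-smi -L` style output to 'GPU X@hostname'.

    `label_gpus` is a hypothetical helper; the hostname defaults to the
    local machine's name when none is given.
    """
    host = hostname or socket.gethostname()
    labels = {}
    for line in listing.splitlines():
        match = GPU_LINE.match(line.strip())
        if match:
            index, _model, uuid = match.groups()
            labels[uuid] = f"GPU {index}@{host}"
    return labels


sample = """\
GPU 0: NVIDIA GeForce RTX 3080 Ti (UUID: GPU-0Af501e0-4fa6-bc45-5993-92f20a1f76Xe)
GPU 1: NVIDIA GeForce RTX 3080 Ti (UUID: GPU-1A61a9a-19f7-9670-bcc2-c0e3a6074cX8)
"""

for uuid, label in label_gpus(sample, hostname="gpu01").items():
    print(uuid, "->", label)
```

On a real host, the listing could come from running nvidia-smi -L via subprocess; the index in the label is only as stable as the ordering nvidia-smi reports.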

utkuozdemir commented 2 months ago

Those indexes are probably not consistent, i.e., they can come in a different order. Even if they don't most of the time, swap the PCI slots of 2 GPUs and then I bet they'd appear in a different order.

So I'm not sure tbh if this exporter should address this. I think you can achieve what you need with relabel_configs in Prometheus.
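One way the relabeling idea might look in practice: since the UUID is a label on the scraped metrics rather than on the scrape target, the rules would likely belong under metric_relabel_configs rather than relabel_configs. The Python sketch below generates such rules from a hand-maintained UUID to name mapping. The helper name uuid_relabel_rules, the target label gpu, and the assumption that the exporter's metrics carry a label called uuid are all illustrative and should be checked against your own metrics.

```python
import json


def uuid_relabel_rules(labels, target_label="gpu"):
    """Build one Prometheus relabel rule per UUID -> friendly-name pair.

    Each rule copies a fixed replacement string into `target_label` when the
    metric's `uuid` label (assumed name) matches the given UUID exactly.
    Prometheus anchors relabel regexes, and "replace" is the default action.
    """
    return [
        {
            "source_labels": ["uuid"],
            "regex": uuid,
            "target_label": target_label,
            "replacement": name,
        }
        for uuid, name in sorted(labels.items())
    ]


# Hypothetical mapping, maintained by hand or generated per host.
mapping = {
    "GPU-0Af501e0-4fa6-bc45-5993-92f20a1f76Xe": "GPU 0@gpu01",
    "GPU-1A61a9a-19f7-9670-bcc2-c0e3a6074cX8": "GPU 1@gpu01",
}

rules = uuid_relabel_rules(mapping)
# JSON is valid YAML, so this output can be pasted under a scrape job.
print(json.dumps({"metric_relabel_configs": rules}, indent=2))
```

The original uuid label is left untouched, so stored series stay keyed by UUID while the dashboard can display the extra label instead.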

abbottjlu commented 2 months ago

When I run the command sinfo -N -O NodeList,CPUsState,CPUsLoad,Memory,FreeMem,AllocMem,Partition,StateCompact,Gres:25,GresUsed:40 | column -t, I consistently see the following output:

NODELIST  CPUS(A/I/O/T)  CPU_LOAD  MEMORY   FREE_MEM  ALLOCMEM  PARTITION  STATE  GRES                  GRES_USED
gpu01     0/48/0/48      0.08      192676   181531    0         dev        idle   gpu:geforce:8(S:0-1)  gpu:geforce:0(IDX:N/A)
gpu02     31/17/0/48     8.29      192676   159145    62000     gpu*       mix    gpu:geforce:8(S:0-1)  gpu:geforce:8(IDX:0-7)
gpu03     48/0/0/48      2.26      192676   162526    96000     gpu*       alloc  gpu:geforce:8(S:0-1)  gpu:geforce:2(IDX:0,4)
gpu04     31/17/0/48     8.17      192676   160330    62000     gpu*       mix    gpu:geforce:8(S:0-1)  gpu:geforce:8(IDX:0-7)
gpu05     47/1/0/48      10.02     192676   157138    105552    gpu*       mix    gpu:geforce:8(S:0-1)  gpu:geforce:5(IDX:0,4-7)
gpu06     8/40/0/48      8.18      192676   169780    16000     gpu*       mix    gpu:geforce:8(S:0-1)  gpu:geforce:8(IDX:0-7)
gpu07     13/35/0/48     13.20     192676   169248    37552     gpu*       mix    gpu:geforce:8(S:0-1)  gpu:geforce:8(IDX:0-7)
gpu08     38/10/0/48     15.18     192676   159122    99104     gpu*       mix    gpu:geforce:8(S:0-1)  gpu:geforce:5(IDX:0-2,4,7)

I hypothesized that the IDX value would remain unchanged across reboots. Given the stability of my hardware setup, I believe the IDX should be deterministic.

I understand that the database currently uses GPU UUIDs as primary keys. Would it be feasible to retain UUID-based storage while dynamically resolving the corresponding IDX at query time?
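To make the IDX values in the GRES_USED column easier to line up with per-GPU labels, here is a small Python sketch that expands an index list such as 0-2,4,7 into individual indices. The function name expand_idx is hypothetical, and whether these Slurm indices correspond to the indices reported by nvidia-smi is an assumption that would need to be verified on each node.

```python
def expand_idx(spec):
    """Expand a Slurm-style index list such as '0-2,4,7' into a list of ints.

    'N/A' (no GPUs in use) and the empty string yield an empty list.
    """
    if spec in ("", "N/A"):
        return []
    indices = []
    for part in spec.split(","):
        if "-" in part:
            # A range like "4-7" is inclusive on both ends.
            start, end = part.split("-")
            indices.extend(range(int(start), int(end) + 1))
        else:
            indices.append(int(part))
    return indices


# Values taken from the GRES_USED column of the table above.
for spec in ["N/A", "0-7", "0,4", "0,4-7", "0-2,4,7"]:
    print(spec, "->", expand_idx(spec))
```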

Thanks for the suggestion to use Prometheus relabel_configs. I'm still relatively new to Grafana/Prometheus (about three days in), so I'll need to do some more research on relabel_configs to figure out how to implement it.