Open Cyria7 opened 1 year ago
The command nvidia-smi -L outputs:
GPU 0: NVIDIA GeForce RTX 3080 Ti (UUID: GPU-0Af501e0-4fa6-bc45-5993-92f20a1f76Xe)
GPU 1: NVIDIA GeForce RTX 3080 Ti (UUID: GPU-1A61a9a-19f7-9670-bcc2-c0e3a6074cX8)
GPU 2: NVIDIA GeForce RTX 3080 Ti (UUID: GPU-aAc30fe0e-d40c-2fb3-a4d5-1136a76066X7)
GPU 3: NVIDIA GeForce RTX 3080 Ti (UUID: GPU-8Adeb867-b3ee-37b9-b81b-2c432e01cbX4)
To enhance readability, mapping the UUID to GPU indices (e.g., GPU 0, GPU 1) would be beneficial.
Furthermore, considering that GPUs might be distributed across multiple servers, a notation like GPU X@hostname would provide more context and clarity.
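As a rough illustration of the requested mapping, here is a minimal Python sketch that turns nvidia-smi -L output into "GPU X@hostname" labels keyed by UUID. The function name uuid_labels, the SAMPLE text (two lines copied from the output above), and the hostname gpu01 are illustrative assumptions, not part of the exporter.

```python
import re
import socket

# Matches lines such as:
#   GPU 0: NVIDIA GeForce RTX 3080 Ti (UUID: GPU-xxxx)
GPU_LINE = re.compile(r"^GPU (\d+): (.+) \(UUID: (GPU-[^)]+)\)$")

# Two lines copied from the output above, used only as sample input.
SAMPLE = """\
GPU 0: NVIDIA GeForce RTX 3080 Ti (UUID: GPU-0Af501e0-4fa6-bc45-5993-92f20a1f76Xe)
GPU 1: NVIDIA GeForce RTX 3080 Ti (UUID: GPU-1A61a9a-19f7-9670-bcc2-c0e3a6074cX8)
"""


def uuid_labels(smi_output, hostname=None):
    """Map each GPU UUID to a 'GPU <index>@<hostname>' label (hypothetical helper)."""
    host = hostname or socket.gethostname()
    labels = {}
    for line in smi_output.splitlines():
        match = GPU_LINE.match(line.strip())
        if match:
            index, _name, uuid = match.groups()
            labels[uuid] = f"GPU {index}@{host}"
    return labels


for uuid, label in uuid_labels(SAMPLE, hostname="gpu01").items():
    print(uuid, "->", label)
```

On a real host the input would come from running nvidia-smi -L (for example via subprocess.run) instead of SAMPLE. The index is only as stable as the order in which nvidia-smi enumerates the GPUs.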
Those indexes are probably not stable, i.e., they can come in a different order. Even if they usually don't, swap the PCI slots of two GPUs and I bet they'd appear in a different order.
So I'm not sure tbh if this exporter should address this. I think you can achieve what you need with Prometheus's relabel_configs.
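One way the relabeling idea might look in practice, sketched in Python: build one Prometheus relabel rule per GPU that copies a readable name into a new label. This assumes the exporter puts the UUID in a metric label called uuid and that the rules go under metric_relabel_configs (since that label lives on the samples, not on the target); both are assumptions to check against your setup. The function name relabel_rules and the target label gpu_name are hypothetical.

```python
import json


def relabel_rules(uuid_to_label, source_label="uuid", target_label="gpu_name"):
    """Build one 'replace' relabel rule per UUID (hypothetical helper).

    Prometheus anchors the regex on both ends, and a UUID contains only
    letters, digits and hyphens, so it is used as the regex unescaped.
    """
    rules = []
    for uuid, label in uuid_to_label.items():
        rules.append({
            "source_labels": [source_label],  # assumed label name
            "regex": uuid,
            "target_label": target_label,     # illustrative label name
            "replacement": label,
        })
    return rules


# UUID copied from the nvidia-smi output in this thread; hostname is illustrative.
mapping = {"GPU-0Af501e0-4fa6-bc45-5993-92f20a1f76Xe": "GPU 0@gpu01"}
print(json.dumps(relabel_rules(mapping), indent=2))
```

The printed JSON should be usable as the value of a relabel list in the scrape config, since JSON is generally accepted as YAML, but I have not verified that against a running Prometheus. The mapping has to be regenerated whenever the GPU order on a host changes.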
When running the command sinfo -N -O NodeList,CPUsState,CPUsLoad,Memory,FreeMem,AllocMem,Partition,StateCompact,Gres:25,GresUsed:40 | column -t, I consistently see the following output:
NODELIST  CPUS(A/I/O/T)  CPU_LOAD  MEMORY  FREE_MEM  ALLOCMEM  PARTITION  STATE  GRES                  GRES_USED
gpu01     0/48/0/48      0.08      192676  181531    0         dev        idle   gpu:geforce:8(S:0-1)  gpu:geforce:0(IDX:N/A)
gpu02     31/17/0/48     8.29      192676  159145    62000     gpu*       mix    gpu:geforce:8(S:0-1)  gpu:geforce:8(IDX:0-7)
gpu03     48/0/0/48      2.26      192676  162526    96000     gpu*       alloc  gpu:geforce:8(S:0-1)  gpu:geforce:2(IDX:0,4)
gpu04     31/17/0/48     8.17      192676  160330    62000     gpu*       mix    gpu:geforce:8(S:0-1)  gpu:geforce:8(IDX:0-7)
gpu05     47/1/0/48      10.02     192676  157138    105552    gpu*       mix    gpu:geforce:8(S:0-1)  gpu:geforce:5(IDX:0,4-7)
gpu06     8/40/0/48      8.18      192676  169780    16000     gpu*       mix    gpu:geforce:8(S:0-1)  gpu:geforce:8(IDX:0-7)
gpu07     13/35/0/48     13.20     192676  169248    37552     gpu*       mix    gpu:geforce:8(S:0-1)  gpu:geforce:8(IDX:0-7)
gpu08     38/10/0/48     15.18     192676  159122    99104     gpu*       mix    gpu:geforce:8(S:0-1)  gpu:geforce:5(IDX:0-2,4,7)
I hypothesized that the IDX value would remain unchanged across reboots. Given the stability of my hardware setup, I believe the IDX should be deterministic.
I understand that the database currently uses GPU UUIDs as primary keys. Would it be feasible to retain UUID-based storage while dynamically resolving the corresponding IDX at query time?
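For reference, the IDX part of the GRES_USED column above can be expanded into a plain list of indices with a few lines of Python. This is only a sketch of the parsing step; the function name used_gpu_indices is hypothetical, and whether Slurm's IDX numbering matches the nvidia-smi index on a given node is something the thread does not establish.

```python
import re

# Captures the text between "(IDX:" and ")" in a GRES_USED value.
IDX_FIELD = re.compile(r"\(IDX:([^)]*)\)")


def used_gpu_indices(gres_used):
    """Expand e.g. 'gpu:geforce:5(IDX:0-2,4,7)' into [0, 1, 2, 4, 7]."""
    match = IDX_FIELD.search(gres_used)
    if not match or match.group(1) == "N/A":
        return []  # no IDX field, or no GPU allocated
    indices = []
    for part in match.group(1).split(","):
        if "-" in part:
            # A range such as "0-2" is inclusive on both ends.
            start, end = part.split("-")
            indices.extend(range(int(start), int(end) + 1))
        else:
            indices.append(int(part))
    return indices


print(used_gpu_indices("gpu:geforce:5(IDX:0-2,4,7)"))
```

Resolving the index at query time would then amount to joining these per-node indices with the stored UUIDs, using a UUID-to-index mapping collected on each host.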
> I think you can achieve what you need by relabel_configs of Prometheus.
Thanks for the suggestion. I'm still relatively new to Grafana/Prometheus (about three days in), so I'll need to do some more research on relabel_configs to figure out how to implement it.
First of all, I would like to say this repo is great work, and it partly solves my requirements. Since my lab's compute cards are spread across different hosts with different IPs, I can't easily tell from the UUID which GPU belongs to which server. So I'm wondering: is the name shown in the GPU selector in the top left corner of the dashboard customizable?