vpenso / prometheus-slurm-exporter

Prometheus exporter for performance metrics from Slurm.
GNU General Public License v3.0

Add GPU per Node Metric #83

Open martialblog opened 2 years ago

martialblog commented 2 years ago

Hi,

I reworked PR https://github.com/vpenso/prometheus-slurm-exporter/pull/57 to be compatible with the latest version. To keep this a minimal working version, I left out the GPU type; we can add it later.

Fixes #60

Tested on Slurm 20.11.9, both with and without GRES.

With GRES:

sinfo -h -N -O "NodeList: ,AllocMem: ,Memory: ,CPUsState: ,StateLong: ,Gres: ,GresUsed:"

gpu-01 113440 187000 34/22/0/56 mixed gpu:tesla:4 gpu:tesla:4(IDX:0-3)
gpu-02 80000 187000 8/48/0/56 mixed gpu:tesla:4 gpu:tesla:4(IDX:0-3)
gpu-03 64000 187000 8/48/0/56 mixed gpu:tesla:4 gpu:tesla:4(IDX:0-3)
gpu-04 36000 187000 6/50/0/56 mixed gpu:tesla:4 gpu:tesla:3(IDX:0,2-3)
gpu-05 0 187000 0/56/0/56 idle gpu:tesla:4 gpu:tesla:0(IDX:N/A)
gpu-06 12000 187000 2/54/0/56 mixed gpu:tesla:4 gpu:tesla:1(IDX:3)
gpu-07 24000 187000 4/52/0/56 mixed gpu:tesla:4 gpu:tesla:2(IDX:1-2)
gpu-08 48000 187000 8/48/0/56 mixed gpu:tesla:4 gpu:tesla:4(IDX:0-3)
cpu-01 0 502000 0/56/0/56 idle (null) gpu:0
cpu-02 0 502000 0/56/0/56 idle (null) gpu:0
cpu-03 0 502000 0/56/0/56 idle (null) gpu:0
cpu-04 0 502000 0/56/0/56 idle (null) gpu:0

curl localhost:8080/metrics | grep gpu

# HELP slurm_node_gpu_alloc Allocated GPUs per node
# TYPE slurm_node_gpu_alloc gauge
slurm_node_gpu_alloc{node="gpu-01",status="mixed"} 4
slurm_node_gpu_alloc{node="gpu-02",status="mixed"} 4
slurm_node_gpu_alloc{node="gpu-03",status="mixed"} 4
slurm_node_gpu_alloc{node="gpu-04",status="mixed"} 3
slurm_node_gpu_alloc{node="gpu-05",status="idle"} 0
slurm_node_gpu_alloc{node="gpu-06",status="mixed"} 1
slurm_node_gpu_alloc{node="gpu-07",status="mixed"} 2
slurm_node_gpu_alloc{node="gpu-08",status="mixed"} 4
# HELP slurm_node_gpu_total Total GPUs per node
# TYPE slurm_node_gpu_total gauge
slurm_node_gpu_total{node="gpu-01",status="mixed"} 4
slurm_node_gpu_total{node="gpu-02",status="mixed"} 4
slurm_node_gpu_total{node="gpu-03",status="mixed"} 4
slurm_node_gpu_total{node="gpu-04",status="mixed"} 4
slurm_node_gpu_total{node="gpu-05",status="idle"} 4
slurm_node_gpu_total{node="gpu-06",status="mixed"} 4
slurm_node_gpu_total{node="gpu-07",status="mixed"} 4
slurm_node_gpu_total{node="gpu-08",status="mixed"} 4

Without GRES:

sinfo -h -N -O "NodeList: ,AllocMem: ,Memory: ,CPUsState: ,StateLong: ,Gres: ,GresUsed:"

localhost 0 1 0/1/0/1 unknown* (null) (null)

curl localhost:8080/metrics | grep gpu
# empty
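
For reviewers, here is a rough sketch of how the Gres and GresUsed columns map to the two gauges. It is a simplified illustration, not the exact code in this PR: the names gpuCountRe, parseGPUCount, nodeGPU and parseLine are made up for the example, and skipping nodes with zero GPUs is an assumption based on the output above, where the cpu-* nodes and the GRES-less localhost produce no GPU metrics.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// gpuCountRe finds GRES entries such as "gpu:4", "gpu:tesla:4" or
// "gpu:tesla:3(IDX:0,2-3)" and captures the count after the optional type.
var gpuCountRe = regexp.MustCompile(`gpu(?::[^:,(]+)?:(\d+)`)

// parseGPUCount sums the GPU counts found in one Gres or GresUsed field.
// Fields without a GPU entry, for example "(null)", yield 0.
func parseGPUCount(field string) int {
	total := 0
	for _, m := range gpuCountRe.FindAllStringSubmatch(field, -1) {
		n, err := strconv.Atoi(m[1])
		if err != nil {
			continue // ignore counts that do not fit into an int
		}
		total += n
	}
	return total
}

// nodeGPU holds the values needed for the two per-node gauges.
type nodeGPU struct {
	Node   string
	Status string
	Total  int
	Alloc  int
}

// parseLine parses one line of the sinfo output shown above. It expects the
// seven whitespace-separated columns requested with -O and returns false
// for malformed lines and for nodes without GPUs (assumed behaviour).
func parseLine(line string) (nodeGPU, bool) {
	f := strings.Fields(line)
	if len(f) < 7 {
		return nodeGPU{}, false
	}
	n := nodeGPU{
		Node:   f[0],
		Status: f[4],
		Total:  parseGPUCount(f[5]),
		Alloc:  parseGPUCount(f[6]),
	}
	return n, n.Total > 0
}

func main() {
	lines := []string{
		"gpu-04 36000 187000 6/50/0/56 mixed gpu:tesla:4 gpu:tesla:3(IDX:0,2-3)",
		"cpu-01 0 502000 0/56/0/56 idle (null) gpu:0",
		"localhost 0 1 0/1/0/1 unknown* (null) (null)",
	}
	for _, l := range lines {
		n, ok := parseLine(l)
		if !ok {
			continue // no GPU metrics for this node
		}
		fmt.Printf("slurm_node_gpu_total{node=%q,status=%q} %d\n", n.Node, n.Status, n.Total)
		fmt.Printf("slurm_node_gpu_alloc{node=%q,status=%q} %d\n", n.Node, n.Status, n.Alloc)
	}
}
```

With the gpu-04 line this gives a total of 4 and 3 allocated, which matches the metrics above. The cpu-01 and localhost lines are skipped, so nothing is exported for them.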

Let me know if I should change anything.

Cheers, Markus