pitt-crc / wrappers

User focused command line wrappers around Slurm
https://crc-pages.pitt.edu/wrappers/
GNU General Public License v3.0

crc-idle.py reports GPUs that are in drain state as idle #6

Closed: burntyellow closed this issue 2 years ago

burntyellow commented 2 years ago

See the gtx1080 partition below. Likewise, when the k40 node was in a drain state, crc-idle.py reported its 2 GPUs as idle.

[kimwong@login1.crc.pitt.edu ~]$sinfo -M gpu
CLUSTER: gpu
PARTITION      AVAIL  TIMELIMIT  NODES  STATE NODELIST
gtx1080*          up   infinite      1  drain gpu-stage08
gtx1080*          up   infinite     16    mix gpu-n[16-25],gpu-stage[09-14]
titanx            up   infinite      7    mix gpu-stage[01-07]
k40               up   infinite      1   idle smpgpu-n0
isenocak          up   infinite      1   idle gpu-n26
v100              up   infinite      1    mix gpu-n27
isenocak-mpi      up   infinite      1    mix gpu-n27
isenocak-mpi      up   infinite      1   idle gpu-n26
eschneider        up   infinite      1   idle ppc-n0
power9            up   infinite      4   idle ppc-n[1-4]
eschneider-mpi    up   infinite      5   idle ppc-n[0-4]
scavenger         up   infinite      1  drain gpu-stage08
scavenger         up   infinite     24    mix gpu-n[16-25,27],gpu-stage[01-07,09-14]
scavenger         up   infinite      1   idle smpgpu-n0
a100              up   infinite      3    mix gpu-n[28-30]
[kimwong@login1.crc.pitt.edu ~]$crc-idle.py -g
Cluster: gpu, Partition: gtx1080
================================
  1 nodes w/   4 idle GPUs
Cluster: gpu, Partition: titanx
===============================
  1 nodes w/   2 idle GPUs
Cluster: gpu, Partition: k40
============================
  1 nodes w/   2 idle GPUs
Cluster: gpu, Partition: v100
=============================
  1 nodes w/   3 idle GPUs
[kimwong@login1.crc.pitt.edu ~]$
djperrefort commented 2 years ago

The current logic takes the total number of GPUs and subtracts the number of GPUs in use by running jobs. In doing so, it doesn't account for GPUs in a drain state.

It looks like a lot of effort was put into using squeue instead of sinfo for determining the number of idle resources. Is there any reason why the idle count from sinfo shouldn't be used? For example: sinfo -M gpu -p gtx1080 -N -o %N,%C.

If we can just use sinfo, that would be an easy way to fix the bug and simplify the code significantly.
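To make the suggestion concrete, here is a minimal sketch (not the wrapper's actual code) of parsing sinfo's %C field, which reports CPUs as Allocated/Idle/Other/Total per node. The helper name is hypothetical; the sample lines are adapted from the outputs shown in this issue. Note that a drained node reports its CPUs under "Other", so its idle count is 0.

```python
def parse_sinfo_cpu_counts(lines):
    """Parse '%N,%C' output lines into {node: (alloc, idle, other, total)}."""
    counts = {}
    for line in lines:
        node, cpu_field = line.strip().split(",")
        alloc, idle, other, total = (int(x) for x in cpu_field.split("/"))
        counts[node] = (alloc, idle, other, total)
    return counts

# Sample lines adapted from the sinfo output in this thread.
sample = [
    "gpu-stage08,0/0/12/12",  # drained: all 12 CPUs fall under 'Other'
    "gpu-stage09,4/8/0/12",   # mixed: 4 allocated, 8 idle
]

counts = parse_sinfo_cpu_counts(sample)
print(counts["gpu-stage08"][1])  # idle CPUs on the drained node -> 0
print(counts["gpu-stage09"][1])  # idle CPUs on the mixed node -> 8
```

Because the drained node contributes 0 to the idle column, counting idleness this way would avoid the bug automatically.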

Comeani commented 2 years ago

I can't find any documentation on where sinfo pulls gresUsed from when you request it (as in my additions to #14), but this issue appears to persist:

Reporting gpu-stage08 as 4 total, 0 allocated:

[nlc60@login0b ~] issue/16 : sinfo -h -M gpu -p gtx1080 -N --Format=NodeList:_,gres:5,gresUsed:12
...
gpu-stage08_gpu:4_gpu:(null):0_
...

This would report as 1 node with 4 GPUs available.

Kim's sinfo reporting gpu-stage08 in drain:

[nlc60@login0b ~] issue/16 : sinfo -M gpu -p gtx1080
CLUSTER: gpu
PARTITION      AVAIL  TIMELIMIT  NODES  STATE NODELIST
gtx1080*          up   infinite      1  drain gpu-stage08
gtx1080*          up   infinite     16    mix gpu-n[16-25],gpu-stage[09-14]

Comeani commented 2 years ago

My current guess is that gresUsed comes from Slurm job data about the GRES allocated to running jobs.

The way to access this as a user is via sacct.

[nlc60@login0b Node-Nanny] issue/16 : sacct -a -X -M gpu --format=JobID,AllocGRES --state=RUNNING
       JobID    AllocGRES
------------ ------------
311836              gpu:1
312365              gpu:1
312590              gpu:1
312591              gpu:1
312592              gpu:1
312593              gpu:1
312594              gpu:1
312595              gpu:1
312600              gpu:1
312601              gpu:1
312891              gpu:1
312894              gpu:1
312916              gpu:1
312921              gpu:1
312926              gpu:1
312942              gpu:4
312954              gpu:1
312959              gpu:1
312964              gpu:1
312969              gpu:1
312980              gpu:1
312985              gpu:1
312990              gpu:1
312995              gpu:1
313000              gpu:1
313017              gpu:4
313019             gpu:16
...
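As a rough illustration of this hypothesis, summing the AllocGRES column of the sacct output above would give the per-cluster GPU allocation. This is a hedged sketch, not code from the wrapper; the parsing assumes the two-column format shown in the listing.

```python
def total_allocated_gpus(sacct_lines):
    """Sum 'gpu:N' counts from two-column sacct JobID/AllocGRES output."""
    total = 0
    for line in sacct_lines:
        parts = line.split()
        # Skip headers, separators, and jobs without GPU allocations.
        if len(parts) == 2 and parts[1].startswith("gpu:"):
            total += int(parts[1].split(":")[1])
    return total

# A few rows copied from the sacct listing above.
sample = [
    "311836              gpu:1",
    "312942              gpu:4",
    "313019             gpu:16",
]
print(total_allocated_gpus(sample))  # 21
```

If gresUsed is built this way, it would explain why a drained node with no running jobs shows gpu:(null):0, i.e. zero GPUs in use but none actually available.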

I think the node state is derived from the host CPUs:

nlc60@login0b]: sinfo -M gpu -NO nodelist:15,cpusstate:15,gres:14,gresused:12,statelong:.10
CLUSTER: gpu
NODELIST       CPUS(A/I/O/T)  GRES          GRES_USED        STATE
gpu-n16        4/4/0/8        gpu:4(S:0-1)  gpu:(null):4     mixed
gpu-n16        4/4/0/8        gpu:4(S:0-1)  gpu:(null):4     mixed
gpu-n17        4/4/0/8        gpu:4(S:0-1)  gpu:(null):4     mixed
gpu-n17        4/4/0/8        gpu:4(S:0-1)  gpu:(null):4     mixed
gpu-n18        4/4/0/8        gpu:4(S:0-1)  gpu:(null):4     mixed
gpu-n18        4/4/0/8        gpu:4(S:0-1)  gpu:(null):4     mixed
gpu-n19        4/4/0/8        gpu:4(S:0-1)  gpu:(null):4     mixed
gpu-n19        4/4/0/8        gpu:4(S:0-1)  gpu:(null):4     mixed
gpu-n20        4/4/0/8        gpu:4(S:0-1)  gpu:(null):4     mixed
gpu-n20        4/4/0/8        gpu:4(S:0-1)  gpu:(null):4     mixed
gpu-n21        4/4/0/8        gpu:4(S:0-1)  gpu:(null):4     mixed
gpu-n21        4/4/0/8        gpu:4(S:0-1)  gpu:(null):4     mixed
gpu-n22        4/4/0/8        gpu:4(S:0-1)  gpu:(null):4     mixed
gpu-n22        4/4/0/8        gpu:4(S:0-1)  gpu:(null):4     mixed
gpu-n23        4/4/0/8        gpu:4(S:0-1)  gpu:(null):4     mixed
gpu-n23        4/4/0/8        gpu:4(S:0-1)  gpu:(null):4     mixed
gpu-n24        4/4/0/8        gpu:4(S:0-1)  gpu:(null):4     mixed
gpu-n24        4/4/0/8        gpu:4(S:0-1)  gpu:(null):4     mixed
gpu-n25        5/3/0/8        gpu:4(S:0-1)  gpu:(null):4     mixed
gpu-n25        5/3/0/8        gpu:4(S:0-1)  gpu:(null):4     mixed
gpu-n26        0/24/0/24      gpu:4(S:0-1)  gpu:(null):0      idle
gpu-n26        0/24/0/24      gpu:4(S:0-1)  gpu:(null):0      idle
gpu-n27        2/22/0/24      gpu:4(S:0-1)  gpu:(null):2     mixed
gpu-n27        2/22/0/24      gpu:4(S:0-1)  gpu:(null):2     mixed
gpu-n27        2/22/0/24      gpu:4(S:0-1)  gpu:(null):2     mixed
gpu-n28        0/0/128/128    gpu:8(S:0-1)  gpu:(null):0   drained
gpu-n29        8/120/0/128    gpu:8(S:0-1)  gpu:(null):8     mixed
gpu-n30        4/124/0/128    gpu:8(S:0-1)  gpu:(null):4     mixed
gpu-stage01    4/8/0/12       gpu:4(S:0-1)  gpu:(null):4     mixed
gpu-stage01    4/8/0/12       gpu:4(S:0-1)  gpu:(null):4     mixed
gpu-stage02    4/8/0/12       gpu:4(S:0-1)  gpu:(null):4     mixed
gpu-stage02    4/8/0/12       gpu:4(S:0-1)  gpu:(null):4     mixed
gpu-stage03    4/8/0/12       gpu:4(S:0-1)  gpu:(null):4     mixed
gpu-stage03    4/8/0/12       gpu:4(S:0-1)  gpu:(null):4     mixed
gpu-stage04    4/8/0/12       gpu:4(S:0-1)  gpu:(null):4     mixed
gpu-stage04    4/8/0/12       gpu:4(S:0-1)  gpu:(null):4     mixed
gpu-stage05    4/8/0/12       gpu:4(S:0-1)  gpu:(null):4     mixed
gpu-stage05    4/8/0/12       gpu:4(S:0-1)  gpu:(null):4     mixed
gpu-stage06    4/8/0/12       gpu:4(S:0-1)  gpu:(null):4     mixed
gpu-stage06    4/8/0/12       gpu:4(S:0-1)  gpu:(null):4     mixed
gpu-stage07    4/8/0/12       gpu:4(S:0-1)  gpu:(null):4     mixed
gpu-stage07    4/8/0/12       gpu:4(S:0-1)  gpu:(null):4     mixed
gpu-stage08    0/0/12/12      gpu:4(S:0-1)  gpu:(null):0   drained
gpu-stage08    0/0/12/12      gpu:4(S:0-1)  gpu:(null):0   drained
gpu-stage09    4/8/0/12       gpu:4(S:0-1)  gpu:(null):4     mixed
gpu-stage09    4/8/0/12       gpu:4(S:0-1)  gpu:(null):4     mixed
gpu-stage10    4/8/0/12       gpu:4(S:0-1)  gpu:(null):4     mixed
gpu-stage10    4/8/0/12       gpu:4(S:0-1)  gpu:(null):4     mixed
gpu-stage11    4/8/0/12       gpu:4(S:0-1)  gpu:(null):4     mixed
gpu-stage11    4/8/0/12       gpu:4(S:0-1)  gpu:(null):4     mixed
gpu-stage12    4/8/0/12       gpu:4(S:0-1)  gpu:(null):4     mixed
gpu-stage12    4/8/0/12       gpu:4(S:0-1)  gpu:(null):4     mixed
gpu-stage13    1/11/0/12      gpu:4(S:0-1)  gpu:(null):1     mixed
gpu-stage13    1/11/0/12      gpu:4(S:0-1)  gpu:(null):1     mixed
gpu-stage14    4/12/0/16      gpu:4(S:0-1)  gpu:(null):4     mixed
gpu-stage14    4/12/0/16      gpu:4(S:0-1)  gpu:(null):4     mixed
ppc-n0         4/124/0/128    gpu:4(S:0-1)  gpu:(null):4     mixed
ppc-n0         4/124/0/128    gpu:4(S:0-1)  gpu:(null):4     mixed
ppc-n1         4/124/0/128    gpu:4(S:0-1)  gpu:(null):4     mixed
ppc-n1         4/124/0/128    gpu:4(S:0-1)  gpu:(null):4     mixed
ppc-n2         4/124/0/128    gpu:4(S:0-1)  gpu:(null):4     mixed
ppc-n2         4/124/0/128    gpu:4(S:0-1)  gpu:(null):4     mixed
ppc-n3         4/124/0/128    gpu:4(S:0-1)  gpu:(null):4     mixed
ppc-n3         4/124/0/128    gpu:4(S:0-1)  gpu:(null):4     mixed
ppc-n4         0/128/0/128    gpu:4(S:0-1)  gpu:(null):0      idle
ppc-n4         0/128/0/128    gpu:4(S:0-1)  gpu:(null):0      idle
smpgpu-n0      0/20/0/20      gpu:2(S:0-1)  gpu:(null):0      idle
smpgpu-n0      0/20/0/20      gpu:2(S:0-1)  gpu:(null):0      idle

I'll look into adding something to #14 to take node state into account.
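A minimal sketch of that fix: exclude nodes whose state makes them unusable before counting unused GPUs as idle. The function and the set of excluded states are assumptions, not the wrapper's actual implementation; the sample data comes from the sinfo listing above.

```python
# States in which a node's unused GPUs should not be counted as idle.
# This set is an assumption; Slurm has additional states (down, maint, ...).
UNUSABLE_STATES = {"drain", "draining", "drained", "down"}

def idle_gpus(nodes):
    """nodes: iterable of (name, total_gpus, used_gpus, state).

    Returns {name: idle_gpu_count}, skipping unusable nodes.
    """
    idle = {}
    for name, total, used, state in nodes:
        if state.rstrip("*") in UNUSABLE_STATES:
            continue  # a drained node has no GPUs available, whatever sinfo's gresUsed says
        if total - used > 0:
            idle[name] = total - used
    return idle

# Sample rows from the sinfo listing above.
sample = [
    ("gpu-stage08", 4, 0, "drained"),  # would wrongly show 4 idle GPUs without the state check
    ("gpu-n26", 4, 0, "idle"),
    ("gpu-n27", 4, 2, "mixed"),
]
print(idle_gpus(sample))  # {'gpu-n26': 4, 'gpu-n27': 2}
```

With the state check in place, gpu-stage08 contributes nothing, matching what Kim's sinfo output says a user can actually request.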