Closed burntyellow closed 2 years ago
The current logic looks at the total number of GPUs and subtracts the number of running GPUs. In doing so, it doesn't take into account GPUs on nodes in a drain state.
It looks like a lot of effort was put into using squeue instead of sinfo for determining the number of idle resources. Is there any reason why the idle count from sinfo shouldn't be used? For example: sinfo -M gpu -p gtx1080 -N -o %N,%C.
If we can just use sinfo, that would be an easy way to fix the bug and simplify the code significantly.
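As a rough sketch of what "just use sinfo" could look like: the %C field expands to allocated/idle/other/total CPU counts per node, which can be parsed directly. The helper name and the sample input are made up for illustration; the node values mirror the outputs later in this thread.

```python
def parse_cpu_states(sinfo_output: str) -> dict:
    """Parse `sinfo -N -h -o %N,%C` lines into per-node CPU state counts.

    %C expands to Allocated/Idle/Other/Total CPU counts.
    """
    nodes = {}
    for line in sinfo_output.strip().splitlines():
        node, cpus = line.split(",")
        alloc, idle, other, total = map(int, cpus.split("/"))
        nodes[node] = {"alloc": alloc, "idle": idle, "other": other, "total": total}
    return nodes

# Hypothetical sample: a mixed node and a drained node.
sample = "gpu-n16,4/4/0/8\ngpu-stage08,0/0/12/12"
states = parse_cpu_states(sample)
# Note the drained node reports its CPUs under "other", not "idle",
# so a CPU-based idle count would already exclude it.
```

Note how the drained node's CPUs all land in the "other" bucket, which is part of why the CPU-state column (unlike GRES_USED) distinguishes drained nodes, as the outputs below show.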
I can't find any documentation on where sinfo pulls gresUsed from when you request it (as in my additions to #14), but this issue appears to persist:
Reporting gpu-stage08 as 4 total, 0 allocated:
[nlc60@login0b ~] issue/16 : sinfo -h -M gpu -p gtx1080 -N --Format=NodeList:_,gres:5,gresUsed:12
... gpu-stage08_gpu:4_gpu:(null):0_ ...
This would report as 1 node with 4 GPUs available.
Kim's sinfo reporting gpu-stage08 in drain:
[nlc60@login0b ~] issue/16 : sinfo -M gpu -p gtx1080
CLUSTER: gpu
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
gtx1080*  up    infinite      1 drain gpu-stage08
gtx1080*  up    infinite     16 mix   gpu-n[16-25],gpu-stage[09-14]
My current guess is that gresUsed comes from Slurm's job data about the GRES allocated to running jobs. The way to access that as a user is via sacct:
[nlc60@login0b Node-Nanny] issue/16 : sacct -a -X -M gpu --format=JobID,AllocGRES --state=RUNNING
JobID AllocGRES
------------ ------------
311836 gpu:1
312365 gpu:1
312590 gpu:1
312591 gpu:1
312592 gpu:1
312593 gpu:1
312594 gpu:1
312595 gpu:1
312600 gpu:1
312601 gpu:1
312891 gpu:1
312894 gpu:1
312916 gpu:1
312921 gpu:1
312926 gpu:1
312942 gpu:4
312954 gpu:1
312959 gpu:1
312964 gpu:1
312969 gpu:1
312980 gpu:1
312985 gpu:1
312990 gpu:1
312995 gpu:1
313000 gpu:1
313017 gpu:4
313019 gpu:16
...
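If gresUsed really is derived from per-job allocations, the cluster-wide allocated GPU count would just be the sum of the AllocGRES column. A minimal sketch (the helper name and sample rows are made up; the gpu:N format matches the sacct output above):

```python
def total_allocated_gpus(sacct_lines):
    """Sum gpu:N entries from `sacct --format=JobID,AllocGRES` data lines."""
    total = 0
    for line in sacct_lines:
        fields = line.split()
        if len(fields) == 2 and fields[1].startswith("gpu:"):
            total += int(fields[1].split(":")[1])
    return total

# A few rows from the output above: 1 + 4 + 16 = 21 GPUs allocated.
sample = ["311836 gpu:1", "312942 gpu:4", "313019 gpu:16"]
print(total_allocated_gpus(sample))  # 21
```

The point being: a job-derived count like this can never see a drained node, because drained nodes have no running jobs, so they report 0 allocated while also not being available.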
I think the state is reflected in the host CPU counts:
[nlc60@login0b]: sinfo -M gpu -NO nodelist:15,cpusstate:15,gres:14,gresused:12,statelong:.10
CLUSTER: gpu
NODELIST CPUS(A/I/O/T) GRES GRES_USED STATE
gpu-n16 4/4/0/8 gpu:4(S:0-1) gpu:(null):4 mixed
gpu-n16 4/4/0/8 gpu:4(S:0-1) gpu:(null):4 mixed
gpu-n17 4/4/0/8 gpu:4(S:0-1) gpu:(null):4 mixed
gpu-n17 4/4/0/8 gpu:4(S:0-1) gpu:(null):4 mixed
gpu-n18 4/4/0/8 gpu:4(S:0-1) gpu:(null):4 mixed
gpu-n18 4/4/0/8 gpu:4(S:0-1) gpu:(null):4 mixed
gpu-n19 4/4/0/8 gpu:4(S:0-1) gpu:(null):4 mixed
gpu-n19 4/4/0/8 gpu:4(S:0-1) gpu:(null):4 mixed
gpu-n20 4/4/0/8 gpu:4(S:0-1) gpu:(null):4 mixed
gpu-n20 4/4/0/8 gpu:4(S:0-1) gpu:(null):4 mixed
gpu-n21 4/4/0/8 gpu:4(S:0-1) gpu:(null):4 mixed
gpu-n21 4/4/0/8 gpu:4(S:0-1) gpu:(null):4 mixed
gpu-n22 4/4/0/8 gpu:4(S:0-1) gpu:(null):4 mixed
gpu-n22 4/4/0/8 gpu:4(S:0-1) gpu:(null):4 mixed
gpu-n23 4/4/0/8 gpu:4(S:0-1) gpu:(null):4 mixed
gpu-n23 4/4/0/8 gpu:4(S:0-1) gpu:(null):4 mixed
gpu-n24 4/4/0/8 gpu:4(S:0-1) gpu:(null):4 mixed
gpu-n24 4/4/0/8 gpu:4(S:0-1) gpu:(null):4 mixed
gpu-n25 5/3/0/8 gpu:4(S:0-1) gpu:(null):4 mixed
gpu-n25 5/3/0/8 gpu:4(S:0-1) gpu:(null):4 mixed
gpu-n26 0/24/0/24 gpu:4(S:0-1) gpu:(null):0 idle
gpu-n26 0/24/0/24 gpu:4(S:0-1) gpu:(null):0 idle
gpu-n27 2/22/0/24 gpu:4(S:0-1) gpu:(null):2 mixed
gpu-n27 2/22/0/24 gpu:4(S:0-1) gpu:(null):2 mixed
gpu-n27 2/22/0/24 gpu:4(S:0-1) gpu:(null):2 mixed
gpu-n28 0/0/128/128 gpu:8(S:0-1) gpu:(null):0 drained
gpu-n29 8/120/0/128 gpu:8(S:0-1) gpu:(null):8 mixed
gpu-n30 4/124/0/128 gpu:8(S:0-1) gpu:(null):4 mixed
gpu-stage01 4/8/0/12 gpu:4(S:0-1) gpu:(null):4 mixed
gpu-stage01 4/8/0/12 gpu:4(S:0-1) gpu:(null):4 mixed
gpu-stage02 4/8/0/12 gpu:4(S:0-1) gpu:(null):4 mixed
gpu-stage02 4/8/0/12 gpu:4(S:0-1) gpu:(null):4 mixed
gpu-stage03 4/8/0/12 gpu:4(S:0-1) gpu:(null):4 mixed
gpu-stage03 4/8/0/12 gpu:4(S:0-1) gpu:(null):4 mixed
gpu-stage04 4/8/0/12 gpu:4(S:0-1) gpu:(null):4 mixed
gpu-stage04 4/8/0/12 gpu:4(S:0-1) gpu:(null):4 mixed
gpu-stage05 4/8/0/12 gpu:4(S:0-1) gpu:(null):4 mixed
gpu-stage05 4/8/0/12 gpu:4(S:0-1) gpu:(null):4 mixed
gpu-stage06 4/8/0/12 gpu:4(S:0-1) gpu:(null):4 mixed
gpu-stage06 4/8/0/12 gpu:4(S:0-1) gpu:(null):4 mixed
gpu-stage07 4/8/0/12 gpu:4(S:0-1) gpu:(null):4 mixed
gpu-stage07 4/8/0/12 gpu:4(S:0-1) gpu:(null):4 mixed
gpu-stage08 0/0/12/12 gpu:4(S:0-1) gpu:(null):0 drained
gpu-stage08 0/0/12/12 gpu:4(S:0-1) gpu:(null):0 drained
gpu-stage09 4/8/0/12 gpu:4(S:0-1) gpu:(null):4 mixed
gpu-stage09 4/8/0/12 gpu:4(S:0-1) gpu:(null):4 mixed
gpu-stage10 4/8/0/12 gpu:4(S:0-1) gpu:(null):4 mixed
gpu-stage10 4/8/0/12 gpu:4(S:0-1) gpu:(null):4 mixed
gpu-stage11 4/8/0/12 gpu:4(S:0-1) gpu:(null):4 mixed
gpu-stage11 4/8/0/12 gpu:4(S:0-1) gpu:(null):4 mixed
gpu-stage12 4/8/0/12 gpu:4(S:0-1) gpu:(null):4 mixed
gpu-stage12 4/8/0/12 gpu:4(S:0-1) gpu:(null):4 mixed
gpu-stage13 1/11/0/12 gpu:4(S:0-1) gpu:(null):1 mixed
gpu-stage13 1/11/0/12 gpu:4(S:0-1) gpu:(null):1 mixed
gpu-stage14 4/12/0/16 gpu:4(S:0-1) gpu:(null):4 mixed
gpu-stage14 4/12/0/16 gpu:4(S:0-1) gpu:(null):4 mixed
ppc-n0 4/124/0/128 gpu:4(S:0-1) gpu:(null):4 mixed
ppc-n0 4/124/0/128 gpu:4(S:0-1) gpu:(null):4 mixed
ppc-n1 4/124/0/128 gpu:4(S:0-1) gpu:(null):4 mixed
ppc-n1 4/124/0/128 gpu:4(S:0-1) gpu:(null):4 mixed
ppc-n2 4/124/0/128 gpu:4(S:0-1) gpu:(null):4 mixed
ppc-n2 4/124/0/128 gpu:4(S:0-1) gpu:(null):4 mixed
ppc-n3 4/124/0/128 gpu:4(S:0-1) gpu:(null):4 mixed
ppc-n3 4/124/0/128 gpu:4(S:0-1) gpu:(null):4 mixed
ppc-n4 0/128/0/128 gpu:4(S:0-1) gpu:(null):0 idle
ppc-n4 0/128/0/128 gpu:4(S:0-1) gpu:(null):0 idle
smpgpu-n0 0/20/0/20 gpu:2(S:0-1) gpu:(null):0 idle
smpgpu-n0 0/20/0/20 gpu:2(S:0-1) gpu:(null):0 idle
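Given output like the table above, taking node state into account could be as simple as skipping unschedulable states when summing GRES. This is only a sketch (the helper name, the state set, and the sample rows are assumptions; duplicate rows per node, one per partition, are collapsed by node name):

```python
import re

# States in which a node's GPUs should not count as available.
DOWN_STATES = {"drain", "drained", "draining", "down", "fail", "maint"}

def available_gpus(sinfo_lines):
    """Count available GPUs from `sinfo -NO nodelist,cpusstate,gres,gresused,statelong`
    data lines, skipping nodes whose state marks them unschedulable."""
    per_node = {}
    for line in sinfo_lines:
        node, _cpus, gres, gres_used, state = line.split()
        total = int(re.search(r"gpu:(\d+)", gres).group(1))   # e.g. gpu:4(S:0-1) -> 4
        used = int(gres_used.rsplit(":", 1)[1])               # e.g. gpu:(null):0 -> 0
        idle = 0 if state.rstrip("*") in DOWN_STATES else total - used
        per_node[node] = idle  # duplicate partition rows overwrite, not double-count
    return sum(per_node.values())

sample = [
    "gpu-n26      0/24/0/24  gpu:4(S:0-1) gpu:(null):0 idle",
    "gpu-n26      0/24/0/24  gpu:4(S:0-1) gpu:(null):0 idle",
    "gpu-stage08  0/0/12/12  gpu:4(S:0-1) gpu:(null):0 drained",
]
# idle gpu-n26 contributes 4; drained gpu-stage08 contributes 0.
print(available_gpus(sample))  # 4
```

With a filter like this, gpu-stage08's 4 GPUs would no longer be reported as available.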
I'll look into adding something to #14 to take node state into account.
See the gtx1080 partition above: when k40 was in a drain state, it also reported its 2 GPUs as idle.