pitt-crc / wrappers

User-focused command line wrappers around Slurm
https://crc-pages.pitt.edu/wrappers/
GNU General Public License v3.0

GPU counting fix is broken #31

Closed · Comeani closed this issue 2 years ago

Comeani commented 2 years ago

https://github.com/pitt-crc/wrappers/blob/5f3c8e336f8157a72af4fb64af1a5633317df5c6/crc-idle.py#L57-L73

When building a per-node count of used GPU resources from the output of squeue, it's possible for a single job to span multiple nodes. The code that attempts to handle that case breaks and adds incorrect values to the dictionary:

```
ipdb> nodes
['gpu-n]', '19]']

KeyError: ('gpu-n]',)
> /ix/crc/nlc60/wrappers/github/wrappers/crc-idle.py(65)gpu_based_empty()
     64                 try:
---> 65                     used_counts[node] += count
     66                 except:

ipdb> used_counts
{'gpu-n22': 4, 'gpu-n23': 4, 'gpu-n20': 4, 'gpu-n21': 4, 'gpu-n24': 4, 'gpu-n25': 4, '19]': 4, 'gpu-n]': 4, 'gpu-n18': 3, 'gpu-stage13': 4, 'gpu-stage11': 4, 'gpu-stage10': 4, 'gpu-n17': 3, 'gpu-stage14': 4}
```
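
For context, here is a minimal sketch (not the repository's actual code) of one way to expand a bracketed Slurm node expression such as `gpu-n[19,20]` into individual hostnames before counting; the helper name `expand_node_list` is hypothetical:

```python
import re


def expand_node_list(node_spec):
    """Expand a Slurm node expression like 'gpu-n[19,20]' into
    ['gpu-n19', 'gpu-n20']; names without brackets pass through as-is.

    Hypothetical sketch: handles a single bracket group containing
    comma-separated indices and simple ranges like 17-19.
    """
    match = re.fullmatch(r"([^\[]*)\[([^\]]*)\](.*)", node_spec)
    if not match:
        return [node_spec]

    prefix, body, suffix = match.groups()
    names = []
    for part in body.split(","):
        if "-" in part:  # a range such as 17-19
            start, end = part.split("-")
            for i in range(int(start), int(end) + 1):
                names.append(f"{prefix}{str(i).zfill(len(start))}{suffix}")
        else:  # a single index such as 19
            names.append(f"{prefix}{part}{suffix}")

    return names


print(expand_node_list("gpu-n[19,20]"))  # ['gpu-n19', 'gpu-n20']
print(expand_node_list("gpu-n[17-19]"))  # ['gpu-n17', 'gpu-n18', 'gpu-n19']
print(expand_node_list("gpu-stage13"))   # ['gpu-stage13']
```

Splitting the raw string on commas without expanding the brackets first is exactly what leaves fragments like `'gpu-n]'` and `'19]'` in the dictionary above.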
Comeani commented 2 years ago

I'm not sure how to reproduce the situation where an output line from sinfo reports a bracketed list of nodes (e.g. gpu-n[19,20]), but the sinfo output is formatted differently in #14, which may prevent this from being an issue altogether. It also looks like that code reports more information, although I still need to verify that it's accurate.
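
As an aside, Slurm can expand these bracketed expressions itself via `scontrol show hostnames`. A minimal sketch of shelling out to it (the wrapper function name here is hypothetical):

```python
from subprocess import run


def expand_with_scontrol(node_spec):
    """Ask Slurm itself to expand a hostlist expression such as
    'gpu-n[19,20]' into individual hostnames, one per output line."""
    result = run(["scontrol", "show", "hostnames", node_spec],
                 capture_output=True, text=True, check=True)
    return result.stdout.split()


# expand_with_scontrol("gpu-n[19,20]")  ->  ['gpu-n19', 'gpu-n20']
```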

Output from crc-idle at v0.1.0 of the wrapper scripts:

```
[nlc60@login0b wrappers] issue/6 : ./crc-idle.py -g
Cluster: gpu, Partition: gtx1080
================================
 10 nodes w/   4 idle GPUs

  5 nodes w/   8 idle GPUs

  1 nodes w/  12 idle GPUs

Cluster: gpu, Partition: titanx
===============================
  6 nodes w/   8 idle GPUs

  1 nodes w/   9 idle GPUs

Cluster: gpu, Partition: k40
============================
  1 nodes w/  20 idle GPUs

Cluster: gpu, Partition: v100
=============================
  1 nodes w/  22 idle GPUs
```

Output from the current crc-idle:

```
[nlc60@login0b wrappers] issue/6 : crc-idle.py -g
Cluster: gpu, Partition: gtx1080
================================
  1 nodes w/   4 idle GPUs
Cluster: gpu, Partition: titanx
===============================
  1 nodes w/   1 idle GPUs
Cluster: gpu, Partition: k40
============================
  1 nodes w/   2 idle GPUs
Cluster: gpu, Partition: v100
=============================
  1 nodes w/   2 idle GPUs
```
djperrefort commented 2 years ago

I think there may be some code in the original version of the application to handle this.

The difference in the reported information may be related to an attempted fix for #6.