**iparask** opened 5 years ago
Hey @andre-merzky, do you remember how we said we were going to tackle this? I remember I was going to pick it up, but I missed it in my todo list.
It took me a second, but I remember it!
The solution was to execute

```
scontrol show partitions | grep -E 'PartitionName|TotalCPUs|TotalNodes'
```

instead of the command executed now.
On Bridges this returns:

```
PartitionName=RM
   State=UP TotalCPUs=20160 TotalNodes=720 SelectTypeParameters=NONE
PartitionName=RM-shared
   State=UP TotalCPUs=1932 TotalNodes=69 SelectTypeParameters=NONE
PartitionName=RM-small
   State=UP TotalCPUs=140 TotalNodes=5 SelectTypeParameters=NONE
PartitionName=GPU
   State=UP TotalCPUs=1344 TotalNodes=44 SelectTypeParameters=NONE
PartitionName=GPU-shared
   State=UP TotalCPUs=700 TotalNodes=23 SelectTypeParameters=NONE
PartitionName=GPU-small
   State=UP TotalCPUs=128 TotalNodes=4 SelectTypeParameters=NONE
PartitionName=GPU-AI
   State=UP TotalCPUs=456 TotalNodes=10 SelectTypeParameters=NONE
PartitionName=LM
   State=UP TotalCPUs=4512 TotalNodes=46 SelectTypeParameters=NONE
PartitionName=XLM
   State=UP TotalCPUs=1280 TotalNodes=4 SelectTypeParameters=NONE
PartitionName=DBMI
   State=UP TotalCPUs=256 TotalNodes=8 SelectTypeParameters=NONE
PartitionName=DBMI-GPU
   State=UP TotalCPUs=64 TotalNodes=2 SelectTypeParameters=NONE
```
If we take every partition and calculate its ppn by dividing `TotalCPUs` by `TotalNodes`, I get the following dictionary:
```python
{'DBMI': 32,
 'DBMI-GPU': 32,
 'GPU': 31,
 'GPU-AI': 46,
 'GPU-shared': 31,
 'GPU-small': 32,
 'LM': 99,
 'RM': 28,
 'RM-shared': 28,
 'RM-small': 28,
 'XLM': 320}
```
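For reference, the dictionary above can be reproduced with a short parser over the grep-filtered `scontrol` output. This is a minimal sketch, not the project's actual code; note that matching the values above (e.g. `LM`: 4512/46 = 98.09 → 99) requires *ceiling* division, which is itself one source of the skewed ppn values.

```python
import math

# Abbreviated sample of the grep-filtered `scontrol show partitions`
# output from Bridges, as shown above.
SAMPLE = """\
PartitionName=RM
State=UP TotalCPUs=20160 TotalNodes=720 SelectTypeParameters=NONE
PartitionName=LM
State=UP TotalCPUs=4512 TotalNodes=46 SelectTypeParameters=NONE
PartitionName=XLM
State=UP TotalCPUs=1280 TotalNodes=4 SelectTypeParameters=NONE
"""

def partition_ppn(scontrol_output):
    """Map each partition name to ceil(TotalCPUs / TotalNodes).

    A sketch only: it assumes the key=value tokens always appear in
    the order PartitionName, TotalCPUs, TotalNodes, as grep emits them.
    """
    ppn, name, cpus = {}, None, None
    for token in scontrol_output.split():
        key, _, val = token.partition('=')
        if key == 'PartitionName':
            name = val
        elif key == 'TotalCPUs':
            cpus = int(val)
        elif key == 'TotalNodes':
            ppn[name] = math.ceil(cpus / int(val))
    return ppn

print(partition_ppn(SAMPLE))   # {'RM': 28, 'LM': 99, 'XLM': 320}
```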
This is wrong, because the GPU nodes have either 28 or 32 cores depending on the type of GPU we select. I propose to keep the ppn check as it was, and to restrict the version check to Stampede2, as it already was. Is that okay with you?
On Bridges, for example, we have SLURM 17.11.7, and it offers at least two different cores-per-node counts: 28 for the RM queue, and 32 for the GPU queue when the nodes with the P100 GPUs are used.
Lines 389-391 will select 28 cores per node, because that is the output of line 388.
We need to come up with a way to select the correct value. In addition, Bridges does not require the `-N` flag in the SLURM script, while Stampede2 with SLURM 18.08.3 does.
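To illustrate the version-dependent `-N` behavior, here is a hedged sketch; the function names and the 18.x cutoff are assumptions for illustration (based only on the two versions mentioned above), not the project's actual logic, which would more likely key this on the resource configuration.

```python
def slurm_version_tuple(version):
    """Parse a SLURM version string like '18.08.3' into a comparable tuple."""
    return tuple(int(part) for part in version.split('.'))

def sbatch_node_flag(version, n_nodes):
    """Return the '#SBATCH -N' line only for SLURM versions that require it.

    Hypothetical cutoff: 18.x (Stampede2, 18.08.3) requires the flag,
    while 17.x (Bridges, 17.11.7) does not.
    """
    if slurm_version_tuple(version) >= (18,):
        return '#SBATCH -N %d' % n_nodes
    return ''

print(sbatch_node_flag('18.08.3', 4))   # '#SBATCH -N 4'
print(sbatch_node_flag('17.11.7', 4))   # ''
```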