pcchen / scopion

Scopion cluster

Slurm: Nodes cannot be oversubscribed after the shutdown on 1/28/2022 #2

Open aronton opened 2 years ago

aronton commented 2 years ago

Description

After the shutdown on 1/28/2022, the OverSubscribe parameter changed back to EXCLUSIVE, so only one job can run on each node:

(base) [aronton@scopion ~]$ scontrol show partition scopion1
PartitionName=scopion1
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=scopion1[01-09]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=432 TotalNodes=9 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
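A quick way to check just this setting for every partition (a small sketch using standard scontrol and grep, not part of the original report):

# list each partition name together with the line carrying its OverSubscribe value
scontrol show partition | grep -E 'PartitionName|OverSubscribe'

Right now each partition reports OverSubscribe=EXCLUSIVE; after the fix it should read OverSubscribe=NO.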

Probable solution

On 11/19/2021 we changed the OverSubscribe parameter to "NO", but we did not change it in /etc/slurm/slurm.conf. I think we also have to change the setting in /etc/slurm/slurm.conf.

Reference

https://slurm.schedmd.com/cons_res_share.html
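Per that page, node sharing also depends on the select plugin. A minimal sketch of the related slurm.conf lines (illustrative values, not taken from the actual scopion configuration):

# consumable-resource scheduling: track cores and memory per job
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory

With a consumable-resource select type and OverSubscribe=NO, several jobs can share a node as long as each gets its own CPUs, whereas OverSubscribe=EXCLUSIVE hands a whole node to each job.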

pcchen commented 2 years ago

/etc/slurm/slurm.conf is updated.

diff slurm.conf slurm.conf.02_07_2022
119,122c119,122
< PartitionName=scopion0 Nodes=scopion0[01-08] Default=YES State=UP Oversubscribe=NO
< PartitionName=scopion1 Nodes=scopion1[01-09] Default=YES State=UP Oversubscribe=NO
< PartitionName=scopion2 Nodes=scopion2[01-06] Default=YES State=UP Oversubscribe=NO
< PartitionName=scopion3 Nodes=scopion3[01-06] Default=YES State=UP Oversubscribe=NO
---
> PartitionName=scopion0 Nodes=scopion0[01-08] Default=YES State=UP Oversubscribe=EXCLUSIVE
> PartitionName=scopion1 Nodes=scopion1[01-09] Default=YES State=UP Oversubscribe=EXCLUSIVE
> PartitionName=scopion2 Nodes=scopion2[01-06] Default=YES State=UP Oversubscribe=EXCLUSIVE
> PartitionName=scopion3 Nodes=scopion3[01-06] Default=YES State=UP Oversubscribe=EXCLUSIVE

Slurm is restarted:

sudo systemctl restart slurmctld
sudo systemctl restart slurmd.service
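As an aside (not what was done here), partition parameter changes like this can often be picked up without a full restart by asking the controller to re-read slurm.conf:

# re-read /etc/slurm/slurm.conf on the running controller
sudo scontrol reconfigure

Restarting slurmctld and slurmd as above is the more conservative option.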
aronton commented 2 years ago

I found something interesting. The jobs already in the queue were not interrupted this time, which is different from what happened at the shutdown on 1/28/2022. But the original jobs do not get oversubscribed automatically; I still need to resubmit them.
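For the record, the resubmission itself is just the usual cancel-and-submit cycle (the job id and script name below are placeholders):

# cancel the old exclusive-allocation job, then submit it again under the new setting
scancel 12345
sbatch job.sh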

pcchen commented 2 years ago

If you modify slurm.conf and restart the daemon, the jobs already in the queue will NOT be affected. The changes only apply to newly submitted jobs.
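A simple way to confirm the new behaviour with freshly submitted jobs (a sketch; the partition name, CPU count, and sleep command are placeholders):

# two small jobs that should now fit on the same node
sbatch -p scopion1 -c 4 --wrap="sleep 300"
sbatch -p scopion1 -c 4 --wrap="sleep 300"
# check which nodes they landed on
squeue -u $USER -o "%.10i %.9P %.8T %R"

With OverSubscribe=NO both jobs should be able to run on the same node at the same time; with EXCLUSIVE the second one would have to wait or take another node.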