Open aronton opened 2 years ago
/etc/slurm/slurm.conf is updated.
diff slurm.conf slurm.conf.02_07_2022
119,122c119,122
< PartitionName=scopion0 Nodes=scopion0[01-08] Default=YES State=UP Oversubscribe=NO
< PartitionName=scopion1 Nodes=scopion1[01-09] Default=YES State=UP Oversubscribe=NO
< PartitionName=scopion2 Nodes=scopion2[01-06] Default=YES State=UP Oversubscribe=NO
< PartitionName=scopion3 Nodes=scopion3[01-06] Default=YES State=UP Oversubscribe=NO
---
> PartitionName=scopion0 Nodes=scopion0[01-08] Default=YES State=UP Oversubscribe=EXCLUSIVE
> PartitionName=scopion1 Nodes=scopion1[01-09] Default=YES State=UP Oversubscribe=EXCLUSIVE
> PartitionName=scopion2 Nodes=scopion2[01-06] Default=YES State=UP Oversubscribe=EXCLUSIVE
> PartitionName=scopion3 Nodes=scopion3[01-06] Default=YES State=UP Oversubscribe=EXCLUSIVE
Slurm is restarted:
sudo systemctl restart slurmctld
sudo systemctl restart slurmd.service
I found something interesting. The jobs already in squeue were not interrupted this time, unlike during the shutdown on 1/28/2022. However, those existing jobs are not oversubscribed automatically; I still need to resubmit them.
If you modify slurm.conf and restart the daemons, the jobs already in the queue will NOT be affected. The changes apply only to newly submitted jobs.
Description
After the shutdown on 1/28/2022, the OverSubscribe parameter changed back to EXCLUSIVE, so only one job can run on each node:
(base) [aronton@scopion ~]$ scontrol show partition scopion1
PartitionName=scopion1
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=scopion1[01-09]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=432 TotalNodes=9 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
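The only field that matters in that output is OverSubscribe, so a small helper can pull it out for every partition at once. This is a minimal sketch, not part of the original report: the function name oversubscribe_of is made up for illustration, and the partition names are the four from this cluster's slurm.conf.

```shell
#!/bin/sh
# Hypothetical helper: read `scontrol show partition` output on stdin
# and print only the value of the OverSubscribe field.
oversubscribe_of() {
    grep -o 'OverSubscribe=[^ ]*' | cut -d= -f2
}

# Only query the controller when scontrol is actually installed.
if command -v scontrol >/dev/null 2>&1; then
    for p in scopion0 scopion1 scopion2 scopion3; do
        printf '%s: %s\n' "$p" "$(scontrol show partition "$p" | oversubscribe_of)"
    done
fi
```

Running it before and after a restart makes it easy to see whether the live value matches what was intended.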
Probable solution
On 11/19/2021 we changed the OverSubscribe parameter to "NO", but we didn't change it in /etc/slurm/slurm.conf. I think we have to change the setting in /etc/slurm/slurm.conf as well.
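To confirm what the file itself says, independent of the running controller, the partition lines can be read directly. The sketch below is an illustration only: conf_oversubscribe is a hypothetical name, and it assumes each partition is defined on a single line starting with PartitionName= (capitalized as in this cluster's file), with no line continuations.

```shell
#!/bin/sh
# Hypothetical helper: print "<partition> <OverSubscribe value>" for each
# PartitionName= line in the given slurm.conf, or "(unset)" if the
# option is not present on that line.
conf_oversubscribe() {
    awk '/^PartitionName=/ {
        name = ""; over = "(unset)"
        for (i = 1; i <= NF; i++) {
            split($i, kv, "=")
            key = tolower(kv[1])   # tolerate Oversubscribe vs OverSubscribe
            if (key == "partitionname") name = kv[2]
            if (key == "oversubscribe") over = kv[2]
        }
        print name, over
    }' "$1"
}

# Path taken from this issue; skipped when the file is not readable.
if [ -r /etc/slurm/slurm.conf ]; then
    conf_oversubscribe /etc/slurm/slurm.conf
fi
```

If the file still shows EXCLUSIVE while the running partition shows NO, the change has not been made persistent yet.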
Reference
https://slurm.schedmd.com/cons_res_share.html