Open Mxx-001 opened 1 year ago
Hello,
Is /etc/slurm-llnl/suspend.sh executable? If not:
cd /etc/slurm-llnl/
chmod ugo+x suspend.sh
But change the ugo to match whoever needs to run the program.
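A quick way to confirm the permission change took effect is sketched below. The helper name check_executable is hypothetical, and the check is made for the user running it; slurmctld runs the suspend and resume programs as SlurmUser, so it is worth running the check as that user.

```shell
#!/bin/sh
# Report whether a script exists and is executable by the current user.
# check_executable is a hypothetical helper name, not part of Slurm.
check_executable() {
    script="$1"
    if [ ! -f "$script" ]; then
        echo "missing: $script"
        return 2
    fi
    if [ -x "$script" ]; then
        echo "executable: $script"
    else
        echo "not executable: $script"
        return 1
    fi
}
```

For example: check_executable /etc/slurm-llnl/suspend.sh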
Thanks @jrwellshpc for the response.
This is my first time using Slurm, and I have another problem. Have you ever encountered Reason=ResumeTimeout
when resuming a node? It causes the node state to become down.
Thank you for considering my question!
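When chasing this kind of problem it helps to look at just the state and reason fields of the node record. The filter below is a minimal sketch; node_state_summary is a hypothetical helper name, and it only assumes that scontrol show node prints State= and Reason= fields, as quoted in this thread.

```shell
#!/bin/sh
# Extract the State= and Reason= fields from `scontrol show node` output
# read on stdin. Only the first word of a multi-word Reason is kept.
node_state_summary() {
    grep -oE '(State|Reason)=[^[:space:]]+'
}
```

For example: scontrol show node localhost | node_state_summary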
Hello - Yes! The basic advice I was given when writing this was to increase all of the timeouts to something like 10 minutes. It will take a long time for things to happen but that should give you a working configuration that you can then use to test bringing down the timeouts. 2 minute resume/suspend timeouts seemed to be ok for my nodes but others may take longer/shorter.
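Since slurm.conf expresses SuspendTimeout and ResumeTimeout in seconds, ten minutes corresponds to a value of 600. Before and after editing, it can be useful to list what the file currently sets. The following is a sketch only: show_power_settings is a hypothetical helper, and the default path is an assumption based on the directory mentioned earlier in this thread.

```shell
#!/bin/sh
# Print the active (uncommented) power-saving settings from a
# slurm.conf-style file. The default path is an assumption; pass
# another path as the first argument.
show_power_settings() {
    conf="${1:-/etc/slurm-llnl/slurm.conf}"
    grep -E '^[[:space:]]*(SuspendTimeout|ResumeTimeout|SuspendTime|SuspendRate|ResumeRate)=' "$conf"
}
```

For example: show_power_settings /etc/slurm-llnl/slurm.conf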
Thanks @jrwellshpc for the response.
I set the suspend and resume timeouts to 15 minutes, but it still times out. In fact, my suspend.sh and resume.sh finish within 1 second.
scontrol show node displays State=MIXED+NOT_RESPONDING+POWERING_UP.
The following is my slurm.conf:
SlurmctldHost=localhost
#
#MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/linuxproc
ReturnToService=2
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=root
#SlurmdUser=root
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/affinity
#
#
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
#
#
AccountingStorageType=accounting_storage/none
ClusterName=cluster
JobAcctGatherType=jobacct_gather/none
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdLogFile=/var/log/slurmd.log
#
SuspendProgram=/etc/slurm/suspend.sh
ResumeProgram=/etc/slurm/resume.sh
SuspendTimeout=120
ResumeTimeout=1200
ResumeRate=60
SuspendRate=50
SuspendTime=60
NodeName=localhost CPUs=12 RealMemory=1000 Sockets=1 CoresPerSocket=6 ThreadsPerCore=2 State=UNKNOWN
PartitionName=compute Nodes=localhost Default=YES MaxTime=INFINITE State=UP
SlurmctldParameters=enable_configless,idle_on_node_suspend
Please let me know if there are any mistakes, thanks a lot!
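One point worth keeping in mind with the state shown above: ResumeTimeout covers the time from the resume request until the node is actually responding, not the run time of resume.sh. A script that exits in a second but does not lead to slurmd starting and contacting the controller would leave the node in POWERING_UP until the timeout expires. The sketch below shows the general shape of a ResumeProgram, which receives the nodes to resume as a hostlist expression in its first argument. The function names are hypothetical, power_on_node is a placeholder that only prints, and the comma-splitting fallback exists purely so the sketch runs on a machine without Slurm installed.

```shell
#!/bin/sh
# Sketch of a ResumeProgram body. Slurm passes the nodes to resume as a
# single hostlist expression (for example "node[1-3]") in the first argument.

# Expand a hostlist into one name per line. Uses scontrol when available;
# the comma-splitting fallback is a simplification for trying the sketch
# without Slurm and does not understand bracket ranges.
expand_hosts() {
    if command -v scontrol >/dev/null 2>&1; then
        scontrol show hostnames "$1"
    else
        printf '%s\n' "$1" | tr ',' '\n'
    fi
}

# Hypothetical placeholder: replace with whatever really powers on a node
# and results in slurmd starting there.
power_on_node() {
    echo "would power on $1"
}

# Power on every node named in the hostlist expression.
resume_nodes() {
    expand_hosts "$1" | while read -r host; do
        power_on_node "$host"
    done
}
```

A real resume.sh built this way would end with a call such as resume_nodes "$1".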
Hi @Mxx-001,
Could you try adding these:
BatchStartTimeout=360
MessageTimeout=100
And then update this one:
SuspendTime=300
Lastly, I think I missed this in my documentation, but set the NodeName State to CLOUD:
NodeName=localhost CPUs=12 RealMemory=1000 Sockets=1 CoresPerSocket=6 ThreadsPerCore=2 State=CLOUD
We had to do that as well.
Let's see if all of this helps.
Jason
Thanks @jrwellshpc for the response.
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute*  up    infinite  0     n/a
And after I tried to resume the node with srun -l sleep 10 &, the error ResumeTimeout was still reported.
You shouldn't need to tell a node to sleep. It should do this automatically behind the scenes. Doing so will convince the headnode that something has gone wrong with that node. Try shutting down everything and then boot the headnode. Wait 10 minutes and then boot the child nodes. The child nodes should go to sleep after the suspend timeout.
I can't find a solution to this error message elsewhere. Does anyone know how to fix it?