Mxx-001 commented 1 year ago

System: Ubuntu 20.04 LTS

log:

[2022-11-01T23:53:02.328] error: power_save program /etc/slurm-llnl/suspend.sh not executable
[2022-11-01T23:53:02.328] error: power_save module disabled, invalid SuspendProgram /etc/slurm-llnl/suspend.sh

I can't find a solution to this error message elsewhere. Does anyone know how to fix it?

jrwellshpc commented 1 year ago

Hello,

Is /etc/slurm-llnl/suspend.sh executable? If not:

cd /etc/slurm-llnl/
chmod ugo+x suspend.sh

But please change ugo to only those who need to run the program (u = owner, g = group, o = others).
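
A quick way to confirm the permission problem, and to fix it for the file's owner only rather than for everyone, might look like the sketch below. The ensure_executable helper is a hypothetical name, not part of Slurm, and the sketch assumes it is run as the user that slurmctld runs as (SlurmUser) and that this user owns the script.

```shell
# Hypothetical helper (not part of Slurm): make a script executable for its
# owner only, and report what was done.
# Note: [ -x ] tests the *current* user, so run this as the SlurmUser.
ensure_executable() {
    script="$1"
    if [ ! -f "$script" ]; then
        echo "missing: $script"
        return 1
    fi
    if [ -x "$script" ]; then
        echo "ok: $script"
    else
        chmod u+x "$script" && echo "fixed: $script"
    fi
}

# Usage (path taken from the log above):
# ensure_executable /etc/slurm-llnl/suspend.sh
```

After a change like this, slurmctld needs to be restarted (or reconfigured) before it re-checks SuspendProgram.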

Mxx-001 commented 1 year ago

Thanks @jrwellshpc for the response.

This is my first time using Slurm and I have another problem. Have you ever encountered Reason=ResumeTimeout when resuming a node? It causes the node state to become DOWN.

Thank you for considering my question!

jrwellshpc commented 1 year ago

Hello - Yes! The basic advice I was given when writing this was to increase all of the timeouts to something like 10 minutes. Everything will take a long time to happen, but that should give you a working configuration that you can then use to test bringing the timeouts down. Two-minute resume/suspend timeouts seemed to be OK for my nodes, but others may need longer or shorter ones.
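
Before raising the timeouts, it can help to see what they are currently set to. The sketch below only reads a slurm.conf-style file and prints the power-save timing settings it finds; show_power_timings is a hypothetical helper name, and the file path in the usage line is an assumption. For the values the running controller actually uses, scontrol show config is the more reliable source.

```shell
# Hypothetical helper: print the power-save timing settings from a
# slurm.conf-style file, one per line. Full-line comments are skipped and
# several settings on one line are split apart.
show_power_timings() {
    grep -v '^[[:space:]]*#' "$1" |
        tr -s ' \t' '\n\n' |
        grep -iE '^(Suspend(Time|Timeout|Rate)|Resume(Timeout|Rate))='
}

# Usage (path is an assumption; adjust to your installation):
# show_power_timings /etc/slurm/slurm.conf
```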

Mxx-001 commented 1 year ago

Thanks @jrwellshpc for the response.

I set the suspend and resume timeouts to 15 minutes, but the resume still times out. In fact, my suspend.sh and resume.sh each finish executing within 1 second.
scontrol show node displays State=MIXED+NOT_RESPONDING+POWERING_UP.

The following is my slurm.conf,


SlurmctldHost=localhost
#
#MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/linuxproc
ReturnToService=2
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=root
#SlurmdUser=root
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/affinity

KillWait=30

MinJobAge=300

SlurmctldTimeout=120

SlurmdTimeout=300

#
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
ClusterName=cluster

JobAcctGatherFrequency=30

JobAcctGatherType=jobacct_gather/none

SlurmctldDebug=info

SlurmctldLogFile=/var/log/slurmctld.log

SlurmdDebug=info

SlurmdLogFile=/var/log/slurmd.log
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
SuspendProgram=/etc/slurm/suspend.sh
ResumeProgram=/etc/slurm/resume.sh
SuspendTimeout=120
ResumeTimeout=1200
ResumeRate=60

SuspendExcNodes=

SuspendExcParts=

SuspendRate=50
SuspendTime=60
#
# COMPUTE NODES
NodeName=localhost CPUs=12 RealMemory=1000 Sockets=1 CoresPerSocket=6 ThreadsPerCore=2 State=UNKNOWN
PartitionName=compute Nodes=localhost Default=YES MaxTime=INFINITE State=UP
SlurmctldParameters=enable_configless,idle_on_node_suspend



Please let me know if there are any mistakes, thanks a lot!
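
The State=MIXED+NOT_RESPONDING+POWERING_UP value reported above is a base state followed by "+"-separated flags. A small sketch for picking those flags apart is shown below; has_flag and diagnose_state are hypothetical helper names, and the message reflects one reading of the flags (the controller has started the resume but has not yet heard back from slurmd on the node), not an official Slurm diagnosis. If that reading is right, how quickly resume.sh itself returns does not matter; what counts toward ResumeTimeout is how long the node takes to respond.

```shell
# has_flag STATE FLAG: succeed if the "+"-separated STATE contains FLAG
# as a whole component (no partial matches).
has_flag() {
    case "+$1+" in
        *"+$2+"*) return 0 ;;
        *) return 1 ;;
    esac
}

# diagnose_state STATE: print a hint for the combination seen in this thread.
diagnose_state() {
    if has_flag "$1" POWERING_UP && has_flag "$1" NOT_RESPONDING; then
        echo "resume started, but the node has not responded yet"
    else
        echo "no resume-in-progress flags found"
    fi
}

# Usage with the state string from this thread:
# diagnose_state "MIXED+NOT_RESPONDING+POWERING_UP"
```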

jrwellshpc commented 1 year ago

Hi @Mxx-001,

Could you try adding these:
BatchStartTimeout=360
MessageTimeout=100

And then update this one:
SuspendTime=300

Lastly, I think I missed this in my documentation, but set State=CLOUD on the NodeName line:
NodeName=localhost CPUs=12 RealMemory=1000 Sockets=1 CoresPerSocket=6 ThreadsPerCore=2 State=CLOUD

We had to do that as well.

Let's see if all of this helps.

Jason

Mxx-001 commented 1 year ago

Thanks @jrwellshpc for the response.

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up   infinite      0    n/a

And after I tried to resume the node with srun -l sleep 10 &, the ResumeTimeout error was still reported.

jrwellshpc commented 1 year ago

You shouldn't need to tell a node to sleep. It should do this automatically behind the scenes. Doing so will convince the headnode that something has gone wrong with that node. Try shutting down everything and then boot the headnode. Wait 10 minutes and then boot the child nodes. The child nodes should go to sleep after the suspend timeout.

jrwellshpc / slurm_power_saving

Script for the power_save #1
