jrwellshpc / slurm_power_saving

Example scripts for SLURM's power saving feature. Useful for HPC installations large and small.
GNU General Public License v3.0

Script for the power_save #1

Open Mxx-001 opened 1 year ago

Mxx-001 commented 1 year ago
jrwellshpc commented 1 year ago

Hello,

Is /etc/slurm-llnl/suspend.sh executable? If not, run:

cd /etc/slurm-llnl/
chmod ugo+x suspend.sh

But please replace ugo with only the classes (user, group, other) that actually need to run the script.
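A quick way to answer that question is to test the execute bit directly. The sketch below is a minimal example and not part of this repository: `is_executable` is a hypothetical helper name, and the path in the comments is the one quoted above. Slurm runs SuspendProgram and ResumeProgram as SlurmUser, so that is the account that needs execute permission.

```shell
#!/bin/sh
# Hypothetical helper: report whether a power-save script can be executed
# by the account running this check.
is_executable() {
    # $1 = path to the script
    if [ -x "$1" ]; then
        echo "executable"
    else
        echo "not executable"
    fi
}

# Example usage (path taken from the thread):
#   is_executable /etc/slurm-llnl/suspend.sh
# A narrower alternative to ugo+x, granting execute to owner and group only:
#   chmod ug+x /etc/slurm-llnl/suspend.sh
```

Running the check as SlurmUser (for example via `sudo -u`, with the user name being whatever your slurm.conf sets) gives the most relevant answer.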

Mxx-001 commented 1 year ago

Thanks @jrwellshpc for the response.

This is my first time using Slurm, and I have another problem. Have you ever encountered Reason=ResumeTimeout when resuming a node? It causes the node state to become down.

Thank you for considering my question!

jrwellshpc commented 1 year ago

Hello - Yes! The basic advice I was given when writing this was to increase all of the timeouts to something like 10 minutes. Everything will be slow to happen, but that should give you a working configuration that you can then use to test lowering the timeouts. Two-minute resume/suspend timeouts seemed to be OK for my nodes, but other nodes may need longer or shorter values.
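To see how far the current settings are from that 10-minute starting point, the values can be read straight out of slurm.conf. The following is a rough sketch under stated assumptions: `get_setting` and `check_timeout` are hypothetical helper names, the file has one `Key=Value` per line, keys are matched with exact case, and 600 seconds stands in for "something like 10 minutes".

```shell
#!/bin/sh
# Hypothetical helper: print the value of one parameter from a
# slurm.conf-style file (last occurrence wins, trailing comments ignored).
get_setting() {
    # $1 = config file, $2 = parameter name
    sed -n "s/^[[:space:]]*$2=\([^[:space:]#]*\).*/\1/p" "$1" | tail -n 1
}

# Hypothetical helper: compare a timeout (in seconds) with 600 s (10 minutes).
check_timeout() {
    # $1 = config file, $2 = parameter name
    value=$(get_setting "$1" "$2")
    if [ -z "$value" ]; then
        echo "$2 is not set"
    elif [ "$value" -lt 600 ]; then
        echo "$2=$value is below 600"
    else
        echo "$2=$value ok"
    fi
}

# Example usage (the config path is an assumption):
#   check_timeout /etc/slurm/slurm.conf ResumeTimeout
#   check_timeout /etc/slurm/slurm.conf SuspendTimeout
```

On a running cluster, `scontrol show config` prints the values slurmctld is actually using, which can differ from the file if the daemon has not been reconfigured.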

Mxx-001 commented 1 year ago

Thanks @jrwellshpc for the response.

KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
#
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
ClusterName=cluster
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurmd.log
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
SuspendProgram=/etc/slurm/suspend.sh
ResumeProgram=/etc/slurm/resume.sh
SuspendTimeout=120
ResumeTimeout=1200
ResumeRate=60
SuspendExcNodes=
SuspendExcParts=
SuspendRate=50
SuspendTime=60
#
# COMPUTE NODES
NodeName=localhost CPUs=12 RealMemory=1000 Sockets=1 CoresPerSocket=6 ThreadsPerCore=2 State=UNKNOWN
PartitionName=compute Nodes=localhost Default=YES MaxTime=INFINITE State=UP
SlurmctldParameters=enable_configless,idle_on_node_suspend



Please let me know if there are any mistakes, thanks a lot!
jrwellshpc commented 1 year ago

Hi @Mxx-001,

Could you try adding these:

BatchStartTimeout=360
MessageTimeout=100

And then update this one:

SuspendTime=300

Lastly, I think I missed this in my documentation, but set State=CLOUD on the NodeName line:

NodeName=localhost CPUs=12 RealMemory=1000 Sockets=1 CoresPerSocket=6 ThreadsPerCore=2 State=CLOUD

We had to do that as well.
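If a cluster has more than one NodeName line, it can be easy to miss one when making this change. The sketch below lists node definitions that do not yet carry State=CLOUD. It assumes one definition per line; `nodes_without_cloud_state` is a hypothetical helper name and the config path in the comment is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: print NodeName lines that lack State=CLOUD.
# Matching is case-insensitive; an empty result means every node
# definition in the file already has State=CLOUD.
nodes_without_cloud_state() {
    # $1 = path to a slurm.conf-style file
    awk 'tolower($0) ~ /^[ \t]*nodename=/ && tolower($0) !~ /state=cloud/' "$1"
}

# Example usage (path is an assumption):
#   nodes_without_cloud_state /etc/slurm/slurm.conf
```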

Let's see if all of this helps.

Jason

Mxx-001 commented 1 year ago

Thanks @jrwellshpc for the response.

jrwellshpc commented 1 year ago

You shouldn't need to tell a node to sleep. Slurm should do this automatically behind the scenes, and putting a node to sleep by hand will convince the head node that something has gone wrong with that node. Try shutting everything down and then booting the head node. Wait 10 minutes and then boot the child nodes. The child nodes should go to sleep on their own once they have been idle for SuspendTime.
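If a node has already been marked down, for example with Reason=ResumeTimeout as earlier in this thread, it usually has to be returned to service by hand before Slurm will manage it again. The sketch below only prints the commands involved so they can be reviewed before being run. `down_node_commands` is a hypothetical helper name, and `localhost` is the node name from the configuration posted above.

```shell
#!/bin/sh
# Sketch: print (not execute) the commands for inspecting and clearing
# a node that slurmctld has marked DOWN.
down_node_commands() {
    # $1 = node name
    echo "sinfo -R"                                  # list down/drained nodes with their reasons
    echo "scontrol show node $1"                     # full state and Reason for one node
    echo "scontrol update NodeName=$1 State=RESUME"  # return the node to service
}

# Example usage:
#   down_node_commands localhost
```

Printing rather than executing is a deliberate choice for this sketch, since `scontrol update` changes cluster state and normally requires administrator privileges.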