List of schedulers that AiiDA supports out of the box: SLURM, LSF, PBSPro, and Torque:
https://aiida.readthedocs.io/projects/aiida-core/en/latest/topics/schedulers.html
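For context, once a scheduler is running, pointing AiiDA at it is just a matter of the computer setup. A hypothetical example (label and work dir are placeholders; the plugin names are spelled `local`/`slurm` in aiida-core 1.x and `core.local`/`core.slurm` in 2.x):

```
# Hypothetical: register a "computer" that submits to SLURM on localhost.
verdi computer setup \
    --non-interactive \
    --label localhost-slurm \
    --hostname localhost \
    --transport core.local \
    --scheduler core.slurm \
    --work-dir /home/aiida/aiida_run
verdi computer configure core.local localhost-slurm --non-interactive
```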
Installation instructions for SLURM:
The AiiDAlab image should be based on Ubuntu 20.04, so we should be able to just install the slurm-wlm package:
```
apt install slurm-wlm
```
http://docs.nanomatch.de/technical/SimStackRequirements/SingleNodeSlurm.html
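In a Docker build the install has to run non-interactively; a sketch, assuming a Debian/Ubuntu base and root:

```
# Sketch: non-interactive install for an image build (Ubuntu 20.04 assumed).
export DEBIAN_FRONTEND=noninteractive
apt-get update
apt-get install -y --no-install-recommends slurm-wlm
rm -rf /var/lib/apt/lists/*
```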
Examples of running SLURM in Docker: https://github.com/SciDAS/slurm-in-docker and https://xtreme-d.net/en/news/techblog-en/1245
SLURM configuration tool: https://slurm.schedmd.com/configurator.html
NOTE: For slurm-wlm 17.11, the SlurmctldHost parameter needs to be renamed to ControlMachine, since SlurmctldHost was only introduced in Slurm 18.08.
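The rename can be scripted when installing the generated config, for example:

```
# Apply the rename for pre-18.08 slurm-wlm; the Ubuntu package reads
# its config from /etc/slurm-llnl/slurm.conf.
sed -i 's/^SlurmctldHost=/ControlMachine=/' /etc/slurm-llnl/slurm.conf
```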
I managed to install and test SLURM in a running container. :tada: Here's the configuration file /etc/slurm-llnl/slurm.conf:

```
# slurm.conf file generated by configurator.html.
# See the slurm.conf man page for more information.
#
ClusterName=aiidalab
#SlurmctldHost=localhost
ControlMachine=localhost
#SlurmctldHost=
#
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=67043328
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=lua
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=10000
#MaxStepCount=40000
#MaxTasksPerNode=512
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=aiida
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
#StateSaveLocation=/var/spool/slurmctld
StateSaveLocation=/var/lib/slurm-llnl/slurmctld
SwitchType=switch/none
TaskPlugin=task/affinity
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
# SCHEDULING
#DefMemPerCPU=0
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core
#
# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageUser=
#AccountingStoreFlags=
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
#JobContainerType=job_container/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurmd.log
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#DebugFlags=
# COMPUTE NODES
NodeName=localhost RealMemory=128000 ThreadsPerCore=1 Sockets=1 CoresPerSocket=16 State=UNKNOWN
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
```
For production, we will definitely need to tweak this more.

A couple of things I ran into:

- Permissions: some of the state and log locations need to be owned by the slurm user, and perhaps the aiida user should be added to the slurm group, which is created automatically when slurm is installed (see the sketch below).
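The exact fixes aren't recorded above; a sketch of the kind of ownership changes involved, with paths taken from the slurm.conf and owners as assumptions:

```
# Sketch only; exact ownership depends on the config above, where
# SlurmUser=aiida runs slurmctld, so its state dir and log file must be
# writable by aiida. The slurm user/group comes from the Ubuntu package.
mkdir -p /var/lib/slurm-llnl/slurmctld /var/spool/slurmd
chown aiida:slurm /var/lib/slurm-llnl/slurmctld
touch /var/log/slurmctld.log /var/log/slurmd.log
chown aiida:slurm /var/log/slurmctld.log
# perhaps also add the aiida user to the slurm group
usermod -aG slurm aiida
```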
To test things, open three terminals in the container, logged in as root.

First start `munged`, the authentication daemon. Run it in the foreground and with debug flags to verify that it started; the `--force` flag is needed to skip a warning about permissions:

```
$ munged --force
```

Then `slurmctld`, the central management daemon:

```
$ slurmctld -D -v
```

Then `slurmd`, the compute node daemon:

```
$ slurmd -D -v
```

Finally, submit a test job:

```
$ srun cat /etc/hostname
```
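If the daemons came up correctly, the standard client tools can be used to inspect the cluster state, for example:

```
$ sinfo        # the debug partition should be listed, with the node idle
$ squeue       # currently queued/running jobs
$ scontrol show node localhost    # detailed node state
```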
Relatedly, `slurmd -C` prints the node's actual hardware configuration in slurm.conf format, which is useful for filling in the COMPUTE NODES section:

```
$ slurmd -C
```
The next step is to include the whole setup in our own Dockerfile.
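A rough sketch of what that might look like; the base image tag, file names, and start-up wiring are all assumptions, not a tested recipe:

```
# Sketch: extend the AiiDAlab image with a single-node SLURM install.
FROM aiidalab/aiidalab-docker-stack:latest

USER root
RUN export DEBIAN_FRONTEND=noninteractive && \
    apt-get update && \
    apt-get install -y --no-install-recommends slurm-wlm && \
    rm -rf /var/lib/apt/lists/*

# slurm.conf as generated above (with SlurmctldHost renamed if needed).
COPY slurm.conf /etc/slurm-llnl/slurm.conf

# A start script would need to launch munged, slurmctld and slurmd at
# container start-up, e.g. hooked into the image's existing init mechanism.
COPY start-slurm.sh /usr/local/bin/start-slurm.sh
RUN chmod +x /usr/local/bin/start-slurm.sh
```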
AiiDA currently cannot limit the number of calculations launched in parallel by itself; it is designed to talk to a queuing system. Either we install a queuing system inside the Docker container (in which case we need to build our own Docker image), or we install one on Jackdaw. We should probably try SLURM first, but perhaps there are simpler systems?