ispg-group / aiidalab-ispg

ATMOSPEC: ab initio UV/vis spectroscopy for everyone
MIT License

Queuing system #40

Closed: danielhollas closed this issue 2 years ago

danielhollas commented 2 years ago

AiiDA currently cannot by itself limit the number of calculations that are launched in parallel; it is designed to talk to a queuing system. Either we need to install a queuing system inside the Docker container (in which case we need to build our own Docker image), or install a queuing system on Jackdaw. We should probably try SLURM first, but perhaps there are simpler systems?
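
For context, once a scheduler is available inside (or next to) the container, AiiDA would be pointed at it through a computer setup roughly like the sketch below. This is only an illustration: the core.local / core.slurm entry-point names assume AiiDA 2.x, and the label and work directory are placeholders.

```
# Hypothetical AiiDA computer that submits through a SLURM instance
# running on the same host (entry-point names assume AiiDA 2.x).
verdi computer setup \
    --non-interactive \
    --label slurm-localhost \
    --hostname localhost \
    --transport core.local \
    --scheduler core.slurm \
    --work-dir /home/aiida/aiida_run \
    --mpiprocs-per-machine 1

verdi computer configure core.local slurm-localhost --non-interactive --safe-interval 0
```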

danielhollas commented 2 years ago

List of schedulers that AiiDA supports out of the box: SLURM, LSF, PBSPro, Torque.

https://aiida.readthedocs.io/projects/aiida-core/en/latest/topics/schedulers.html

Installation instructions for SLURM: the AiiDAlab image should be based on Ubuntu 20.04, so we should be able to simply install the slurm-wlm package:

apt install slurm-wlm

http://docs.nanomatch.de/technical/SimStackRequirements/SingleNodeSlurm.html
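
In the Dockerfile this would presumably translate into a non-interactive install along these lines (a sketch only; listing munge explicitly is an assumption, it is probably already pulled in as a dependency of slurm-wlm):

```
# Install SLURM and the munge authentication service (Ubuntu 20.04 packages).
apt-get update
apt-get install -y --no-install-recommends slurm-wlm munge
# The Debian packaging should create /etc/munge/munge.key during install;
# if not, it can be generated with create-munge-key.
```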

Examples of SLURM in Docker: https://github.com/SciDAS/slurm-in-docker, https://xtreme-d.net/en/news/techblog-en/1245

SLURM configuration tool https://slurm.schedmd.com/configurator.html

NOTE: The SlurmctldHost parameter needs to be renamed to ControlMachine for slurm-wlm 17.1.

Example configuration from the configurator:

```
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ClusterName=aiidalab
#SlurmctldHost=cec96e08018e
ControlMachine=cec96e08018e
#SlurmctldHost=
#
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=67043328
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=lua
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=10000
#MaxStepCount=40000
#MaxTasksPerNode=512
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=aiida
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
#StateSaveLocation=/var/spool/slurmctld
StateSaveLocation=/home/aiida/slurmctld
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/affinity
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core
#
#
# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageUser=
#AccountingStoreFlags=
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
#JobContainerType=job_container/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurmd.log
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#DebugFlags=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=cec96e08018e CPUs=16 State=UNKNOWN
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
```

danielhollas commented 2 years ago

I managed to install and test SLURM in a running container. :tada: Here's the configuration file /etc/slurm-llnl/slurm.conf:

```
# slurm.conf file generated by configurator.html.
# See the slurm.conf man page for more information.
#
ClusterName=aiidalab
#SlurmctldHost=localhost
ControlMachine=localhost
#SlurmctldHost=
#
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=67043328
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=lua
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=10000
#MaxStepCount=40000
#MaxTasksPerNode=512
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs

ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=aiida
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
#StateSaveLocation=/var/spool/slurmctld
StateSaveLocation=/var/lib/slurm-llnl/slurmctld
SwitchType=switch/none
TaskPlugin=task/affinity

# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
# SCHEDULING
#DefMemPerCPU=0
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core
#
# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=

# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageUser=
#AccountingStoreFlags=
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
#JobContainerType=job_container/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurmd.log
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#DebugFlags=

# COMPUTE NODES
NodeName=localhost RealMemory=128000 ThreadsPerCore=1 Sockets=1 CoresPerSocket=16 State=UNKNOWN
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
```

For production, we will definitely need to tweak this more.

A couple of things I ran into: we may have a PID file permission issue; check all the warning messages and see https://stackoverflow.com/questions/56553665/how-to-fix-slurmd-service-cant-open-pid-file-error-in-slurm/64439176#64439176

To test things, open three terminals in the container, logged in as root:

  1. Run the munge authentication daemon: $ munged --force. Run it in the foreground with debug flags first to verify that it starts; the --force flag is needed to skip a warning about permissions.
  2. Run the main SLURM control daemon: $ slurmctld -D -v
  3. Run the compute daemon: $ slurmd -D -v
  4. Execute a test job: $ srun cat /etc/hostname
  5. To check the node configuration, execute $ slurmd -C
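
To check that the scheduler actually throttles concurrency (the original motivation for this issue), something along the following lines should work; the numbers simply assume the 16-CPU node defined above.

```
# Submit more work than the node can run at once: with CPUs=16 and
# 4 CPUs per job, only 4 jobs should run concurrently and the rest
# should sit in the queue as pending (PD).
for i in $(seq 1 8); do
    sbatch --job-name="test-$i" --ntasks=1 --cpus-per-task=4 --wrap "sleep 60"
done
squeue    # running vs. pending jobs
sinfo     # partition and node state
```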

The next step is to include the whole setup in our own Dockerfile.
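
As a starting point, the Dockerfile could install the packages and configuration shown above and launch the daemons from a start script. Below is a minimal sketch of such a script, simply mirroring the manual steps; how exactly it hooks into the AiiDAlab image startup is still to be decided.

```
#!/bin/bash
# Hypothetical container start script; all three daemons fork into
# the background by default.
munged --force    # munge authentication daemon
slurmctld         # SLURM control daemon
slurmd            # SLURM compute daemon
sinfo             # sanity check: the debug partition should show up as "up"
```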