ulissigroup / vasp-interactive

GNU Lesser General Public License v2.1

Pausing jobs on Slurm systems #29

Closed alchem0x2A closed 1 year ago

alchem0x2A commented 1 year ago

Related to #25 and #26. When VASP is launched with srun in a Slurm environment, the MPI interface may or may not be exposed to the end user, so sending signals directly to srun does not work. The Slurm manual describes the preferred way to pause / resume an srun step as follows:

NOTE: A suspended job releases its CPUs for allocation to other jobs. Resuming a previously suspended job may result in multiple jobs being allocated the same CPUs, which could trigger gang scheduling with some configurations or severe degradation in performance with other configurations. Use of the scancel command to send SIGSTOP and SIGCONT signals would stop a job without releasing its CPUs for allocation to other jobs and would be a preferable mechanism in many cases. If performing system maintenance you may want to use suspend/resume in the following way. Before suspending set all nodes to draining or set all partitions to down so that no new jobs can be scheduled. Then suspend jobs. Once maintenance is done resume jobs then resume nodes and/or set all partitions back to up. Use with caution.

A simple test shows that this can be done with the following steps (thanks to Mark Glines for the hint):

  1. Determine whether the current VASP_COMMAND contains an srun directive and whether we are actually in a Slurm environment, by checking the environment variables.
  2. Find the SLURM_JOB_ID of the current job (i.e. the one submitted by sbatch or salloc).
  3. Find the VASP step of the current job with squeue -s --job <jobid>, which may show something like the following:
    STEPID           NAME      PARTITION  USER     TIME   NODELIST
    62793872.1       vasp_std  interacti  ttian20  12:06  nid02338
    62793872.intera  interact  interacti  ttian20  15:13  nid02338
    62793872.extern  extern    interacti  ttian20  15:13  nid02338

    Step ID 62793872.1 is the VASP job step we want to pause / resume.

  4. Send the TSTP signal to this step with scancel -s SIGTSTP 62793872.1; top should then show CPU usage dropping to 0.
  5. Send the CONT signal to this step with scancel -s SIGCONT 62793872.1 to resume.
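
Steps 1-3 could be sketched roughly as below. This is only an illustration, not the code merged into the package: the helper names (uses_srun, find_step_id, query_steps) are hypothetical, and it assumes the VASP binary name starts with "vasp" and that squeue prints the default column layout shown above.

```python
import os
import shlex
import subprocess


def uses_srun(vasp_command, env=None):
    """Step 1: True if the command invokes srun and we are inside a Slurm job."""
    env = os.environ if env is None else env
    tokens = shlex.split(vasp_command)
    has_srun = any(os.path.basename(tok) == "srun" for tok in tokens)
    # Step 2 relies on the same variable: SLURM_JOB_ID holds the job ID.
    return has_srun and "SLURM_JOB_ID" in env


def find_step_id(squeue_output, name_prefix="vasp"):
    """Step 3: return the first STEPID whose NAME column starts with name_prefix.

    name_prefix="vasp" is an assumption about how the binary is named.
    """
    for line in squeue_output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 2 and fields[1].startswith(name_prefix):
            return fields[0]
    return None


def query_steps(job_id):
    """Run squeue -s --job <jobid> and return its text output (needs Slurm)."""
    result = subprocess.run(
        ["squeue", "-s", "--job", str(job_id)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Inside a job, find_step_id(query_steps(os.environ["SLURM_JOB_ID"])) would then give the step ID to pass to scancel. The parsing is kept separate from the squeue call so it can be checked against captured output without a running cluster.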

We may want to repeat steps 3-5 every time a pause / resume is involved, as the step ID may change.
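
One way to make sure the step ID is looked up again on every pause / resume is to pass the lookup in as a callable. The sketch below is a hedged illustration: signal_step, pause and resume are hypothetical names, lookup_step_id stands for whatever function resolves the current VASP step (for example by parsing squeue -s output), and the run parameter exists only so the scancel call can be swapped out when no Slurm is available.

```python
import subprocess


def signal_step(lookup_step_id, signal_name, run=subprocess.run):
    """Resolve the current step ID, then send signal_name to it via scancel."""
    step_id = lookup_step_id()  # re-resolved on every call, never cached
    if step_id is None:
        raise RuntimeError("no matching job step found")
    cmd = ["scancel", "-s", signal_name, step_id]
    run(cmd, check=True)
    return cmd


def pause(lookup_step_id, run=subprocess.run):
    """Step 4: stop the VASP step without releasing its CPUs."""
    return signal_step(lookup_step_id, "SIGTSTP", run)


def resume(lookup_step_id, run=subprocess.run):
    """Step 5: let the VASP step continue."""
    return signal_step(lookup_step_id, "SIGCONT", run)
```

Because the lookup runs inside signal_step, a step ID that changed between a pause and the following resume is picked up automatically.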

alchem0x2A commented 1 year ago

Closing, as all related PRs have been merged.