Automatic resubmission of "soft" VASP errors

ligerzero-ai commented 2 years ago

The option to enable automatic reruns of "soft" errors in VASP, which are easily resolved by restarting the run with cp CONTCAR->POSCAR. (e.g. ZBRENT errors, running out of ionic steps)

When running high-throughput studies for my GB systems, I find that it is often useful to have automated resubmission and handling of these soft errors. This saves the user the trouble of having to handle these soft errors manually through resubmission and reduces noise in the job-postprocessing. This is easily handled by the addition of an if-statement in the actual shell script that is submitted to the system. See below for my script that I have macgyver'd to do this for me:

#!/bin/bash -l
#SBATCH --nodes=4
#SBATCH --account=pawsey0380
#SBATCH --job-name=2-element-calc
#SBATCH --time=24:00:00
#SBATCH --partition=workq
#SBATCH --export=NONE

module swap PrgEnv-cray PrgEnv-intel
module load vasp/5.4.4
cd "$PBS_O_WORKDIR"

ulimit -s unlimited
run_cmd="srun --export=ALL -N 4 -n 96 vasp_std"

cont_calc () {
date_str=$(date +"%d-%m-%y_%H-%M")
if [ -f CONTCAR ]
then
    if [ -s CONTCAR ]
    then
        echo "CONTCAR exists and not empty: restarting from CONTCAR"
          cp POSCAR "$date_str-POSCAR"
        cp CONTCAR POSCAR
        $run_cmd &> vasp.log
    else
        echo "CONTCAR exists but is empty: restarting from POSCAR"
        $run_cmd &> vasp.log
    fi
else
    echo "CONTCAR does not exist: starting from POSCAR"
    $run_cmd &> vasp.log
fi
}
# job name is the folder name
job_name=$(basename $PWD)
# while counter
i=0
# The max iteration steps from INCAR file
iteration_nsw=$(grep NSW INCAR | awk '{print $3}')
# 
date_str=$(date +"%d-%m-%y_%H")

$run_cmd &> vasp.log

# initiate the while loop max 6 iterations
while [ $i -le 4 ]; do
# Run the job
i=$(( $i + 1 ))
# Copy contcar to basename.vasp
cp CONTCAR "$job_name.vasp"
# date string 
# Check for convergence
if grep -q "reached required accuracy - stopping structural energy minimisation" vasp.log ; then
    # Set counter to be greater than break condition
    i=100
    echo "$i : $job_name is converged"
else
    # Add one to the counter
    iteration=$(grep Iter OUTCAR | tail -1 | awk '{ print $3 }' | sed 's/(.*//')
    # If # ionic iteration loops is = max specified in INCAR
    if [ $iteration == $iteration_nsw ]; then
        cp POSCAR "POSCAR-$date_str-$i"
        cp CONTCAR "CONTCAR-$date_str-$i"
        cp vasp.log "vasp.log-$date_str-$i"
        cp OUTCAR "OUTCAR-$date_str-$i"
        cp CONTCAR POSCAR
        cont_calc $job_name
        echo "$i : $job_name ran out of iterations: restarting"
    elif grep -q "fatal error in bracketing" vasp.log; then
        cp POSCAR "POSCAR-$date_str-$i"
        cp CONTCAR "CONTCAR-$date_str-$i"
        cp vasp.log "vasp.log-$date_str-$i"
        cp OUTCAR "OUTCAR-$date_str-$i"
        cp CONTCAR POSCAR
        cont_calc $job_name
        echo "$i : $job_name experienced ZBRENT error, needs refinement: restarting..."
    else
        i=600
        echo -e "$i : error: $job_name either crashed or some other error; check the old.vasp.log\!"
    fi
fi
done

I am not exactly sure how to implement this in pyiron, but have discussed with @niklassiemer.

I think the key here is that we want to keep this as a single job in the job scheduler, but also preserve the in-between output if there are resubmissions so that we can leverage that data for later as well. Idk what is the best way to do this

ligerzero-ai commented 2 years ago

Here I note that I have preserved the output files via iteration naming, but I am not sure how to handle this properly in pyiron.

ligerzero-ai commented 2 years ago

Happy to handle this, but I need to know the best/most elegant way to handle this with the pyiron internals. Some guidance and handholding may be necessary as to where it is most appropriate to change/add this functionality.

pmrv commented 2 years ago

Check the {get,set}_eddrmm_handling methods on the Vasp job. I think this should be the model system for automatic job submission. Basically you check the output whether an soft error was triggered and, if the user configured it, fire off a new job with the appropriate fixes applied.

Just from my gut I am against keeping it one job in the scheduler, because

this will most likely conflict with the compute time you asked for the job
it will confuse pyiron output parsing
it'll be less transparent/configurable to users

pyiron / pyiron_atomistics

Automatic resubmission of "soft" VASP errors #751