hpcugent / csub

Generate a wrapper script around DMTCP and the job submission system to auto checkpoint certain jobs.
GNU General Public License v3.0

Simple script runs ad infinitum #11

Open epauwels opened 6 years ago

epauwels commented 6 years ago

csub keeps running in an endless loop when I submit the script below via

csub -s Sleep.sh --job_time=0:1:0

#!/bin/bash
#
#PBS -N Sleep
#PBS -o RUN.log
#PBS -e RUN.err
#PBS -q default
#PBS -l walltime=00:2:00
#PBS -l nodes=1:ppn=1
#PBS -m ae
#
ulimit -s unlimited
module load scripts

echo Hostname: $(hostname)

z=0
while [ $z -le 72 ]
do
    echo $z
    date
    sleep 5
    z=`expr $z + 1`
done
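
For reference, a rough sketch of how long the loop above needs compared with the time limits in this report. It assumes that both `--job_time=0:1:0` and `walltime=00:2:00` are in h:m:s format; the helper `to_seconds` and the variable names are made up for this illustration and are not part of csub.

```shell
#!/bin/bash
# Lower bound on the loop's runtime: z counts from 0 to 72 inclusive,
# and every iteration sleeps for 5 seconds.
iterations=$((72 - 0 + 1))
sleep_seconds=5
loop_seconds=$((iterations * sleep_seconds))

# Hypothetical helper: convert an h:m:s string to seconds.
# Assumption: the limits are given in h:m:s format.
to_seconds() {
    local h m s
    IFS=: read -r h m s <<< "$1"
    # 10# forces base 10 so that fields like "08" are not read as octal.
    echo $((10#$h * 3600 + 10#$m * 60 + 10#$s))
}

job_time_seconds=$(to_seconds "0:1:0")     # value passed to --job_time
walltime_seconds=$(to_seconds "00:2:00")   # value of the PBS walltime line

echo "loop needs at least ${loop_seconds}s"
echo "--job_time allows ${job_time_seconds}s, walltime allows ${walltime_seconds}s"
```

Under these assumptions the loop needs at least 365 seconds, which is longer than either limit, so the job cannot finish in a single run and depends on checkpoint and restart working.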

boegel commented 6 years ago

Some more info, error output from DMTCP (2.5.1) on restart from checkpoint:

gzip: stdout: Broken pipe
[4398] mtcp_restart.c:589 restorememoryareas:
  error restoring brk: 0
[74000] NOTE at processinfo.cpp:372 in restoreHeap; REASON='Area between saved_break and curr_break not mapped, mapping it now'
     _savedBrk = 6459392
     curBrk = 6467584
/var/spool/pbs/mom_priv/jobs/4007352.master15.delcatty.gent.vsc.SC: line 425: 112982 Segmentation fault      $DMTCP_RESTART --coord-port $coord_port `find $chkdir -name '*.dmtcp'`