hpcugent / csub

Generate a wrapper script around DMTCP and the job submission system to auto checkpoint certain jobs.
GNU General Public License v3.0

This repository contains code to generate a csub script, a wrapper script around qsub and blcr that takes a command and automatically checkpoints it. If a job is about to run out of its wall time, the script uses blcr to checkpoint all of its information and resubmits it, repeating this until the command is done. This currently does not work very well for multithreaded jobs, and not at all for MPI jobs. We could switch to dmtcp and test whether it works as advertised; see https://github.com/hpcugent/csub/issues/2
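The checkpoint-and-resubmit cycle described above can be illustrated with a small simulation. This is only a sketch of the idea, not the code csub generates: the names Checkpoint, run_subjob and run_with_resubmission are hypothetical, the work is modelled as a count of abstract steps, and the wall time limit is expressed as the number of steps one subjob may run.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    """Hypothetical stand-in for the process state a checkpointer would save."""
    steps_done: int = 0


def run_subjob(state: Checkpoint, total_steps: int, walltime_steps: int) -> Checkpoint:
    """Resume from `state` and work until done or until the wall time budget is spent."""
    done = state.steps_done
    budget = walltime_steps
    while done < total_steps and budget > 0:
        done += 1      # one unit of the real command's work
        budget -= 1    # one unit of wall time consumed
    # Returning a new Checkpoint models "checkpoint just before the wall time runs out".
    return Checkpoint(steps_done=done)


def run_with_resubmission(total_steps: int, walltime_steps: int) -> int:
    """Keep resubmitting from the latest checkpoint; return the number of subjobs used."""
    if walltime_steps <= 0:
        raise ValueError("walltime_steps must be positive")
    state = Checkpoint()
    subjobs = 0
    while state.steps_done < total_steps:
        state = run_subjob(state, total_steps, walltime_steps)
        subjobs += 1   # each loop iteration models one resubmitted subjob
    return subjobs


if __name__ == "__main__":
    # 10 steps of work with room for 4 steps per subjob needs 3 subjobs (4 + 4 + 2).
    print(run_with_resubmission(total_steps=10, walltime_steps=4))
```

In this model a job that fits within one wall time window runs as a single subjob, and longer jobs are split into as many subjobs as needed.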

Generate a csub for your environment

Generate a wrapper script around blcr and the job submission system to auto checkpoint certain jobs.

Edit the base.sh and epologue.sh files to your liking. Edit the constant variables at the top of the csub.py file to match your environment. Then run python makecsub.py

This will generate a csub executable script which can be used to submit jobs that will be automatically checkpointed using blcr (blcr must be installed on the worker nodes; it is not required on the job submission nodes).

Using csub

One important caveat is that the job script (or the applications run in the script) should not create their own local temporary directories.

Also note that adding PBS directives (#PBS) in the job script is useless, as they will be ignored by csub. Controlling job parameters should be done via the csub command line.

Help on the various command line parameters supported by csub can be obtained using csub -h.

Some notable options:

Array jobs

csub has support for checkpointing array jobs. Just specify -t <spec> on the csub command line (see qsub for details).

MPI support

The BLCR checkpointing mechanism behind csub has support for checkpointing MPI applications. However, checkpointing MPI applications is largely untested so far. If you would like to use csub with your MPI applications, you should help us replace blcr with dmtcp (see http://mug.mvapich.cse.ohio-state.edu/static/media/mug/presentations/2014/cooperman.pdf).

Notes

If you would like to time how long the complete job executes, just prepend the main command in your job script with time. The real time will not make sense, as it also includes the time that passes between two checkpointed subjobs. However, the user time should give a good indication of the actual time it took to run your command, even if multiple checkpoints were performed.
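The gap between real time and CPU time can be seen in a small single-process analogy. This is a sketch under assumptions, not a measurement of csub itself: run_segment and measure are hypothetical names, time.sleep stands in for the time a job spends waiting between two checkpointed subjobs, and time.process_time reports user plus system CPU time of the current process, which is close to, but not identical to, the user time printed by time.

```python
import time


def run_segment(n: int) -> int:
    """CPU-bound work standing in for one checkpointed subjob."""
    return sum(i * i for i in range(n))


def measure(segments: int, queue_wait: float) -> tuple[float, float]:
    """Return (wall_seconds, cpu_seconds) for a run interrupted by idle gaps."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    for _ in range(segments):
        run_segment(50_000)
        time.sleep(queue_wait)  # simulated wait between subjobs; uses no CPU
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu


if __name__ == "__main__":
    wall, cpu = measure(segments=2, queue_wait=0.1)
    print(f"wall clock: {wall:.3f}s, CPU: {cpu:.3f}s")
```

The wall-clock figure grows with every simulated wait, while the CPU figure only reflects the work actually done, which is why the user time is the more meaningful number for a checkpointed job.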