Checkpointing - Githubissues

HPC clusters with queuing systems often set time limits that are too tight for longer-running mTE-IDTxl analyses. If these analyses occasionally wrote checkpoint files containing the information necessary to resume the computation and to finally store the results where they belong, i.e. under the desired results file name ('DRFN'), then just about any queue limit would do.

Information necessary for successfully resuming and finalizing the computation

DRFN --> this could easily be obtained by writing checkpoint files with the name {DRFN}.ckp; resuming from a checkpoint then only requires reconstructing the DRFN from the name (and path) of the checkpoint file that was read in.
analysis settings & configuration
data object (!) <-- this is an issue, because writing this to a checkpoint file will take time, OR
raw data location and import routine information
current state of the computation (current state of candidates and already accepted sources)
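
To make the list above concrete, here is a rough sketch of what a checkpoint could hold and how the DRFN could be recovered from the checkpoint file name. None of this is existing IDTxl code: the `Checkpoint` class, its field names, and the two helper functions are hypothetical and only illustrate the idea, using the "raw data location and import routine" variant rather than storing the data object itself.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Dict, List

CKP_SUFFIX = ".ckp"  # naming convention proposed above: {DRFN}.ckp


@dataclass
class Checkpoint:
    """Hypothetical container for the information listed above."""
    settings: Dict[str, Any]   # analysis settings & configuration
    data_location: str         # where the raw data lives
    import_routine: str        # name of the routine used to import the data
    candidates: List[Any] = field(default_factory=list)        # current candidates
    accepted_sources: List[Any] = field(default_factory=list)  # already accepted sources


def checkpoint_path(drfn):
    """Return the checkpoint file path for a desired results file name."""
    return Path(str(drfn) + CKP_SUFFIX)


def drfn_from_checkpoint(ckp_path):
    """Reconstruct the DRFN (including its path) from a checkpoint file path."""
    ckp_path = Path(ckp_path)
    if ckp_path.suffix != CKP_SUFFIX:
        raise ValueError(f"not a checkpoint file: {ckp_path}")
    # Strip only the trailing '.ckp'; any earlier extension is kept.
    return ckp_path.with_suffix("")
```

With this convention nothing about the DRFN needs to be stored inside the checkpoint itself; the file name and location carry it.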

Handling of checkpoint files

Problem: One does not want 50 versioned checkpoint files for a single analysis cluttering the disk. So once a new checkpoint file has been successfully written, the old one should be deleted. However, if writing the new checkpoint file fails (because the process is killed by the queue-management system), there should be a way to recover from this situation.

I would suggest: (1) move the old checkpoint file (if any) to .ckp.old, (2) start writing the new checkpoint file, (3) if (2) was successful, remove the .ckp.old file. This ensures we'll always have a way to recover.
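
The three steps could look roughly like the sketch below. It is not IDTxl code: `write_checkpoint` and `load_checkpoint` are hypothetical names, and using `pickle` as the file format is an assumption. One detail goes beyond the suggestion above: an existing .ckp.old is not overwritten in step (1), because its presence means the previous write may have been interrupted and the current .ckp could be incomplete.

```python
import os
import pickle
from pathlib import Path


def write_checkpoint(state, ckp_path):
    """Write `state` to `ckp_path`, keeping a fallback copy until the write is done."""
    ckp_path = Path(ckp_path)
    old_path = Path(str(ckp_path) + ".old")
    # (1) Move the old checkpoint aside. Assumption beyond the original proposal:
    # a leftover .ckp.old is kept, since the current .ckp may be a partial file.
    if ckp_path.exists() and not old_path.exists():
        os.replace(ckp_path, old_path)
    # (2) Write the new checkpoint and force it to disk.
    with open(ckp_path, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # (3) Only after a successful write, remove the fallback copy.
    if old_path.exists():
        old_path.unlink()


def load_checkpoint(ckp_path):
    """Load the newest readable checkpoint, or return None if there is none."""
    ckp_path = Path(ckp_path)
    old_path = Path(str(ckp_path) + ".old")
    for candidate in (ckp_path, old_path):
        try:
            with open(candidate, "rb") as f:
                return pickle.load(f)
        except (OSError, EOFError, pickle.UnpicklingError):
            # Missing or truncated file: fall back to the next candidate.
            continue
    return None
```

On resume, `load_checkpoint` tries the .ckp file first and falls back to .ckp.old only if the .ckp file is missing or unreadable, which covers a job killed at any of the three steps.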

pwollstadt / IDTxl

Checkpointing #23