nipy / nipype

Workflows and interfaces for neuroimaging packages
https://nipype.readthedocs.org/en/latest/

Checkpoint-restart with dmtcp #1134

Open oesteban opened 9 years ago

oesteban commented 9 years ago

Hi there,

How would you feel about integrating dmtcp (https://conference.scipy.org/scipy2013/presentation_detail.php?id=201) with nipype?

I'm currently missing it for the following reason. When using mrtrix3 or any other multithreaded tool on Debian Jessie, there seems to be a bug causing a deadlock on my processor. It happens somewhat randomly, and it would be a great relief for long-running interfaces if you could resume them from a recent checkpoint instead of starting from the beginning.

This is just a thought, and I don't clearly see how it would be implemented in distributed environments (maybe by keeping track of which unit the job was sent to, and using dmtcp locally there?).

What do you think?

satra commented 9 years ago

in nipype, interfaces already have an option to continue from where they left off. perhaps that bit can check if there is a dmtcp file to continue with?

oesteban commented 9 years ago

I wasn't aware of this. How do you set that option? Does it mean that all the registers and everything else are restored to the saved state?

Regarding dmtcp, I'm not sure we can incorporate it at the interface level. If we can, that's the way to go.

satra commented 9 years ago

an interface can have a property that allows resuming using partial results.

https://github.com/nipy/nipype/blob/master/nipype/interfaces/base.py#L646

this is used in the pipeline engine to determine whether the node's entire working directory gets removed or not. something like this could be used in conjunction with dmtcp, i think: if a dmtcp checkpoint file is found in the node's working directory and the node needs to rerun, it would restart from the checkpoint.
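A minimal sketch of that check, for illustration only and not part of nipype. It assumes DMTCP's `dmtcp_launch` and `dmtcp_restart` commands are on the path and that checkpoint images in the node's working directory match `ckpt_*.dmtcp`. The helper name `dmtcp_argv` and the default interval are hypothetical.

```python
from pathlib import Path


def dmtcp_argv(workdir, cmd, interval=600):
    """Build the argv that starts or resumes `cmd` under DMTCP.

    Hypothetical helper: `workdir` stands in for the node's working
    directory, `cmd` is the interface's command line as a list.
    """
    workdir = Path(workdir)
    # Assumption: DMTCP checkpoint images are named ckpt_*.dmtcp.
    checkpoints = sorted(workdir.glob("ckpt_*.dmtcp"))
    if checkpoints:
        # A checkpoint exists, so resume instead of starting over.
        return ["dmtcp_restart"] + [str(p) for p in checkpoints]
    # No checkpoint yet: launch under DMTCP with periodic checkpoints
    # (interval in seconds) written into the working directory.
    return [
        "dmtcp_launch",
        "--ckptdir", str(workdir),
        "--interval", str(interval),
    ] + list(cmd)
```

The returned list could be handed to `subprocess.run(argv, cwd=workdir)`. Where such a check would live (interface or plugin) is still open.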

now some of this checkpointing can also be done at the plugin level, leaving it to the scheduler rather than the interface. i haven't thought this through well enough to figure out which one is better, or whether both would be useful.

djarecka commented 6 years ago

@oesteban @satra do you have any new comments regarding this issue?

oesteban commented 6 years ago

This would be great to try, but I honestly can't do it right now. It would be a great project for someone with a good working knowledge of nipype (definitely not for a beginner).