upperwal / EntangledMPI

Fault Tolerance framework for High Performance Computing [Supports ULFM, replication and checkpointing]
MIT License

Uncoordinated Checkpoints #38

Closed by bethven 6 years ago

bethven commented 6 years ago

Hi, I would like to know whether upperwal/EntangledMPI can take uncoordinated checkpoints of MPI applications. I need to do this and I do not know which tool to use. Thank you very much.

upperwal commented 6 years ago

EntangledMPI does support uncoordinated checkpointing, but since the library is under heavy development, I don't think checkpointing is stable right now. Sorry about that.

You can, however, look at other well-established frameworks such as Condor checkpoint/restart.

bethven commented 6 years ago

Hello, thank you very much for your quick response. Can Condor do an uncoordinated checkpoint? I need to take uncoordinated checkpoints of an MPI program (it could be one of the NAS benchmarks), but I do not know which tools I should use. I have worked with the DMTCP library to do a coordinated checkpoint, but now I need to do an uncoordinated checkpoint, using any tool, and I am a bit confused about the steps I must take. Thank you very much.

upperwal commented 6 years ago

Condor does support uncoordinated checkpointing, although I haven't used it in any of my work. You can also try the SRS library, which is a user-level checkpointing library: you define checkpoints inside your code, and checkpointing happens at those points.
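
To give a rough idea of what "defining checkpoints inside your code" can look like, here is a minimal sketch in plain Python. It is not the SRS, Condor, or EntangledMPI API; all function names (`save_checkpoint`, `load_checkpoint`, `run`) are hypothetical, and `rank` only stands in for an MPI rank. Each process writes its own checkpoint file whenever it decides to, without synchronizing with any other process, which is the basic idea behind uncoordinated checkpointing.

```python
import os
import pickle
import tempfile


def checkpoint_path(directory, rank):
    # One file per process, so no process ever waits on another to checkpoint.
    return os.path.join(directory, f"ckpt_rank{rank}.pkl")


def save_checkpoint(directory, rank, state):
    # Write to a temporary file first, then rename it into place, so a crash
    # in the middle of a write never leaves a half-written checkpoint behind.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, checkpoint_path(directory, rank))


def load_checkpoint(directory, rank):
    # Return the last saved state for this rank, or None if there is none yet.
    path = checkpoint_path(directory, rank)
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(directory, rank, total_steps, every, stop_after=None):
    # Toy iterative computation: accumulate 0 + 1 + ... + (total_steps - 1).
    # `stop_after` is a hypothetical knob that simulates a failure after that
    # many iterations of the current run.
    state = load_checkpoint(directory, rank) or {"step": 0, "acc": 0}
    executed = 0
    while state["step"] < total_steps:
        if stop_after is not None and executed >= stop_after:
            return None  # simulated crash; progress since the last checkpoint is lost
        state["acc"] += state["step"]
        state["step"] += 1
        executed += 1
        if state["step"] % every == 0:
            # User-defined checkpoint location inside the application loop.
            save_checkpoint(directory, rank, state)
    return state["acc"]
```

Calling `run` again after a simulated crash resumes from the last file written by that rank instead of from step 0. Note that this sketch ignores messages in flight between processes; in a real MPI program, uncoordinated checkpoints are usually combined with message logging so that a consistent global state can be rebuilt on restart.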