Closed: hatmer closed this issue 2 years ago
I'm afraid that will not work - OMPI v4 has no concept of continuing in that situation.
You will need to install your own version of OMPI 4.x with resilient capabilities as indicated here.
Just to clarify: both answers are technically correct. 😉
I installed Open MPI 5.0.0 (configured with the --with-ft=mpi flag) and it works perfectly. Thank you!
Background information
I am implementing a fault-tolerant version of a large software project (ArgoDSM) that relies on MPI for managing nodes.
What version of Open MPI are you using?
v4.1.2
Describe how Open MPI was installed
tarball
Please describe the system on which you are running
Details of the problem
I have a two-node system. I want the individual nodes to continue running after the network link between them is severed.
When I simulate a network failure (by cutting a node off from the network using iptables), mpirun crashes and I get the following error:
I understand this to mean that mpirun sends SIGKILL (kill -9) to the local processes when it detects that it cannot reach the remote host. How do I prevent mpirun from terminating the job? Ideally, instead of the processes being killed, I could set the MPI_ERRORS_RETURN error handler and handle the "node unreachable" event as an ordinary MPI error.
I am aware of ftmpi, but as far as I can tell it does not prevent mpirun from terminating due to an unreachable node.