Open nathanweeks opened 5 years ago
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
ULFM2 is a fork of Open-MPI that implements ULFM-based fault tolerance.
When OpenCoarrays 2.4.0 is configured with CAF_ENABLE_FAILED_IMAGES=TRUE, the cafrun wrapper script adds the --disable-auto-cleanup option to mpiexec to allow (an MPICH-based) MPI to continue execution in the event of an MPI process failure. If the user doesn't want fault tolerance, the user can specify the (MPICH-mpiexec-specific) --reenable-auto-cleanup option.
The ULFM2 mpiexec has neither of these options; rather, the equivalent of --disable-auto-cleanup is assumed by default. Fault tolerance can be disabled with the --disable-recovery mpiexec option.
It would be beneficial if cafrun could accommodate the ULFM2 mpiexec syntax. One possible approach would be to change the --reenable-auto-cleanup cafrun option to something more generic & descriptive (like --disable-failed-images), and select an appropriate mpiexec option based on the MPI implementation.
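To make the proposal concrete, here is a rough sketch of the selection logic, written in Python for illustration only (cafrun itself is a shell script). The function name `fault_tolerance_args` and the implementation labels "mpich" and "ulfm2" are hypothetical; only the mpiexec flags come from the description above. It also assumes that omitting --disable-auto-cleanup for MPICH has the same effect as passing --reenable-auto-cleanup, and it leaves out how the MPI implementation would actually be detected.

```python
def fault_tolerance_args(mpi_impl, disable_failed_images=False):
    """Return the extra mpiexec arguments for one MPI implementation.

    mpi_impl: hypothetical label, "mpich" or "ulfm2" (detection not shown).
    disable_failed_images: True if the user passed the proposed generic
        --disable-failed-images option to cafrun.
    """
    impl = mpi_impl.lower()
    if impl == "mpich":
        # MPICH mpiexec cleans up on failure unless told otherwise, so
        # failed-image support needs an explicit flag.
        # Assumption: leaving the flag out is enough to turn it off.
        return [] if disable_failed_images else ["--disable-auto-cleanup"]
    if impl == "ulfm2":
        # ULFM2 mpiexec is fault tolerant by default, so only the
        # opt-out case needs a flag.
        return ["--disable-recovery"] if disable_failed_images else []
    raise ValueError(f"unsupported MPI implementation: {mpi_impl!r}")


if __name__ == "__main__":
    for impl in ("mpich", "ulfm2"):
        for disabled in (False, True):
            print(impl, disabled, fault_tolerance_args(impl, disabled))
```

With a mapping like this, the user-facing cafrun option stays the same across MPI implementations and only the arguments forwarded to mpiexec differ.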