The MPI framework is unstable: when a failure occurs it leaves hanging mpiexec and MATLAB processes behind. These orphaned processes cause errors on subsequent code execution (locked files) and leak system resources.
These failures are observed regularly during test execution and can only be resolved by manually identifying and killing the offending processes.
This is not sustainable in a production environment.

Output:
[ ] MPI framework that guarantees any MPI process it launches terminates cleanly or is killed (see the cleanup sketch below the checklist)
[ ] MPI framework that guarantees any MATLAB (Java) process it launches terminates cleanly or is killed
[ ] Tests created that demonstrate this resilience across a range of "exceptional" situations (a sketch of one such test follows this list):
-- unending node tasks (timeout)
-- non-responsive tasks (timeout)
-- massive data files written (file)
-- files not written completely (file)
-- tiny data files written (file)
-- messages not received (comms)
-- nodes not starting (comms)
-- termination of parent node before worker nodes (process)
-- termination of worker node before completion of MPI task (process)
-- etc...
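
A minimal sketch of one way the first two items could be satisfied, assuming the framework launches mpiexec from the controlling MATLAB session; the function name, command string and cleanup helper below are illustrative placeholders, not the framework's actual API:

```matlab
% Sketch: hold a Java process handle for the launched mpiexec and tie its
% destruction to an onCleanup guard, so the child is killed even if the
% controlling MATLAB session errors out or is interrupted before shutdown.
function exit_code = run_mpi_job_guarded(mpiexec_cmd)
    % mpiexec_cmd is a hypothetical command string, e.g. 'mpiexec -n 4 worker'
    proc = java.lang.Runtime.getRuntime().exec(mpiexec_cmd);
    guard = onCleanup(@() kill_if_alive(proc)); %#ok<NASGU> fires on error/Ctrl-C too
    exit_code = proc.waitFor();                 % block until the job completes
end

function kill_if_alive(proc)
    % Force-kill the child if it is still running when the guard is destroyed.
    if proc.isAlive()
        proc.destroy();          % polite termination request
        proc.destroyForcibly();  % force-kill if the request is ignored
    end
end
```

The same pattern (a process handle owned by the launcher plus a cleanup guard) would apply to MATLAB worker processes started directly rather than via mpiexec.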
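
A hedged sketch of what one of the resilience tests might look like, here for the "unending node task" / timeout case; `ParallelJobDispatcher`, the task name, the timeout parameter and the error identifier are hypothetical placeholders, and the process-counting helper is POSIX-only:

```matlab
% Sketch of a timeout-resilience test: launch a deliberately hanging task,
% expect the framework to raise a timeout error, and check that no orphaned
% mpiexec processes survive the failure.
classdef test_hanging_task_timeout < matlab.unittest.TestCase
    methods (Test)
        function no_processes_left_after_timeout(tc)
            before = tc.count_mpiexec();
            job = ParallelJobDispatcher();                      % hypothetical class
            run_job = @() job.run('hang_forever_task', 'Timeout', 10);
            tc.verifyError(run_job, 'HERBERT:MPI:job_timeout'); % hypothetical id
            tc.verifyEqual(tc.count_mpiexec(), before, ...
                'framework left orphaned mpiexec processes behind');
        end
    end
    methods (Static)
        function n = count_mpiexec()
            % Crude POSIX-only helper: count live mpiexec processes via pgrep.
            [status, out] = system('pgrep -c mpiexec');
            if status == 0
                n = str2double(strtrim(out));
            else
                n = 0;   % pgrep exits non-zero when nothing matches
            end
        end
    end
end
```

Equivalent tests would cover the file, comms and process-termination cases listed above.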
Related to: https://github.com/pace-neutrons/Horace/issues/161