Open zachmprince opened 1 month ago
Another way to solve the core issue of "database isn't available when a worker process fails" is to get the `syncDbAfterWrite` setting to work. This would let us avoid the complicated MPI communication, outlined in the potential solutions below, that would be required for the worker to notify the root process when it fails.

To get `syncDbAfterWrite` to work, we would need to close the HDF5 file handle before copying the file from the fast path to the working directory, and then re-open the file. This is very simple in terms of implementation, but it is a "brute-force" solution in terms of bandwidth -- we're copying a file from one location to another multiple times during a run, and that file might be a few GB.

However, if we are running on a platform that has a relatively high-bandwidth connection between the fast path and the working directory, the `syncDbAfterWrite` solution would be attractive. We might expect on the order of 5-10 s for the database copy, which happens once per time node. That would be a negligible time penalty for a large, complex ARMI app that runs for anywhere from tens of minutes to a few hours at each time node.
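The close-copy-reopen step is simple to sketch. The function and method names below are illustrative, not ARMI's actual database API; `db` is assumed to be an object that can release and re-acquire its HDF5 file handle:

```python
import shutil


def syncDbAfterWrite(db, fastPathFile, workingDirFile):
    """Brute-force sync: close the HDF5 handle, copy the file, then reopen.

    ``db`` is a hypothetical database object assumed to expose ``close()``
    and ``open()`` methods that release and re-acquire the underlying
    h5py file handle.
    """
    db.close()  # release the handle so the OS permits the copy
    shutil.copy(fastPathFile, workingDirFile)
    db.open()  # resume writing to the fast-path copy
```

Because the handle is closed before the copy, the `PermissionError` described above is avoided; the cost is the repeated full-file copy.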
Description
The database is written to the fast path (a temporary directory) during a run. At the end of the run, the file is copied to the working directory. When an error is raised on the main process (`context.MPI_RANK == 0`), the database is copied back to the working directory. When an error is raised on a worker process, the database is not copied back to the working directory, and thus the file is lost.

This happens because the failing process calls `context.MPI_COMM.Abort(errorcode=-1)`, which forces all of the other processes to abort immediately. If the main process is the first to fail, it copies the database back before calling the abort, via `DatabaseInterface::interactError()` -- i.e., a "graceful" failure and exit. However, if the failing process is not the main process, the main process will be forced to abort before getting a chance to copy the DB back to the working directory. This `MPI_COMM.Abort` is called by ARMI's `__main__.py`:

https://github.com/terrapower/armi/blob/31befeb1c749177ec61e9f7cb1104a6df0e66892/armi/__main__.py#L66
Additionally, using the `syncDbAfterWrite` setting does not work. You can't copy an HDF5 file from one location to another while the file handle is open; this raises a `PermissionError`.

Reproduction of Problem
Below is a script that imposes an exception during `DatabaseInterface::interactEveryNode`. I will name this file `runDatabaseWriteOnFailure.py`. To show the behavior, run this script as follows:
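The script itself is not reproduced in the issue. A minimal sketch of the idea is a monkey-patch that makes a chosen rank raise; everything here beyond ARMI's `DatabaseInterface.interactEveryNode` name is an assumption:

```python
def imposeFailure(interfaceCls, methodName, failRank, myRank):
    """Wrap ``interfaceCls.methodName`` so it raises on rank ``failRank``.

    In runDatabaseWriteOnFailure.py this would be applied to ARMI's
    DatabaseInterface.interactEveryNode, with ``myRank = context.MPI_RANK``.
    """
    orig = getattr(interfaceCls, methodName)

    def failing(self, *args, **kwargs):
        if myRank == failRank:
            raise RuntimeError(f"imposed failure on rank {myRank}")
        return orig(self, *args, **kwargs)

    setattr(interfaceCls, methodName, failing)
```

Launching under MPI with `failRank=0` versus a nonzero rank then exercises the two cases below (the exact launch command, e.g. `mpiexec -n 3 python runDatabaseWriteOnFailure.py`, depends on your MPI setup and is an assumption, not taken from the issue).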
Failing on main processor:

This will produce an `MPI_ABORT` message, and if you `ls` in `failedDatabase/` you will see that `failedDatabase.h5` has been written.

Failing on worker processor:
You will see a similar `MPI_ABORT` message, but doing an `ls` in `failedDatabase/` shows that the h5 file was not written.

Potential Solutions

To solve this issue, the failing worker needs to send a message to the main processor that it is failing, so the main processor can output the database before `MPI_ABORT` is called. However, the main processor needs to know to receive this message at some point. We can't probe for the message at prescribed locations, since there might be a `bcast`/`gather`/`scatter` before the receive command that causes the process to hang. With this in mind, I've come up with the following potential solutions.

Wrapper Around `MPI_COMM`
This solution involves having a wrapper class around `context.MPI_COMM`. Basically, when an exception is thrown on a worker process, it sends a designated message to the main process that it is failing. The wrapper probes for this message before every `bcast`/`gather`/`scatter` and throws an exception if the message is received. This is extremely intrusive and not future-proof if someone wants to do a communication not supported by the wrapper.

Failure Check Around Troublesome Calls
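A minimal sketch of such a wrapper, assuming an mpi4py-style comm object (duck-typed here so it runs without MPI); the tag value and all class names are assumptions:

```python
FAILURE_TAG = 117  # assumed tag reserved for worker failure notifications


class WorkerFailure(Exception):
    """Raised on the main process when a worker has announced a failure."""


class FailureAwareComm:
    """Wraps an mpi4py-like comm, probing for failures before collectives."""

    def __init__(self, comm):
        self._comm = comm

    def _checkForFailure(self):
        # Non-blocking probe: has any worker announced a failure?
        if self._comm.iprobe(tag=FAILURE_TAG):
            raise WorkerFailure("a worker process reported a failure")

    def bcast(self, obj, root=0):
        self._checkForFailure()
        return self._comm.bcast(obj, root=root)

    # gather/scatter/etc. would be wrapped the same way. Any collective
    # NOT wrapped here is a hole in the safety net, which is why this
    # approach is intrusive and not future-proof.
    def __getattr__(self, name):
        return getattr(self._comm, name)
```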
Basically, this solution wraps the troublesome call (call it `commonFailingCode()`) in a context manager whose `__exit__()` method sends/receives a message to/from the main/worker processors indicating whether the wrapped code threw an exception. I'm not positive about the exact details, but the idea is there. Essentially, `CheckFailure.__exit__()` sends the main process a status message; if the main process receives a non-zero status, it raises its own exception, goes through `interactError`, and calls `MPI_ABORT` itself. The failing worker raises a `SystemExit`, which is handled by `__main__.py` so that `MPI_ABORT` is not called and the process is gracefully finalized.

This solution is much less intrusive, but it requires developers to put in their own checks, or else the original behavior returns. The major drawback is that if `commonFailingCode()` has its own MPI calls and an exception is thrown before them on a worker process, the run will hang. So developers need to be careful when inserting this into their code, making this a dangerous capability.
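A sketch of the `CheckFailure` idea described above. The real thing would use `context.MPI_COMM`; here the comm, rank, and process count are passed in explicitly (mpi4py-style `send`/`recv`, duck-typed so the logic runs without MPI), and the tag value and names are assumptions:

```python
STATUS_TAG = 118  # assumed tag reserved for exit-status messages


class CheckFailure:
    """Context manager: workers report exit status to the main process."""

    def __init__(self, comm, rank, numProcs):
        self.comm = comm
        self.rank = rank
        self.numProcs = numProcs

    def __enter__(self):
        return self

    def __exit__(self, excType, excValue, traceback):
        if self.rank == 0:
            # Main process: collect a status from every worker. A non-zero
            # status means a worker failed, so raise here and let the normal
            # interactError -> MPI_ABORT path run on the main process.
            for worker in range(1, self.numProcs):
                status = self.comm.recv(source=worker, tag=STATUS_TAG)
                if status != 0:
                    raise RuntimeError(f"worker {worker} failed")
            return False  # re-raise any main-process exception normally
        else:
            # Worker: report failure (1) or success (0) to the main process.
            self.comm.send(1 if excType else 0, dest=0, tag=STATUS_TAG)
            if excType:
                # Exit gracefully so __main__.py does not call MPI_ABORT.
                raise SystemExit(1)
            return False
```

Usage on all ranks would look like `with CheckFailure(context.MPI_COMM, context.MPI_RANK, context.MPI_SIZE): commonFailingCode()` -- and, per the drawback above, `commonFailingCode()` must not hang on its own MPI calls after a worker-side exception.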