Open zachmprince opened 1 month ago
Another way to solve the core issue of "database isn't available when a worker process fails" is to get the `syncDbAfterWrite` setting to work. This would let us avoid the complicated MPI communication, outlined in the potential solutions below, that would be required for the worker to notify the root process when it fails.

To get `syncDbAfterWrite` to work, we would need to close the HDF5 file handle before copying the file from the fast path to the working directory, and then re-open the file. This is very simple in terms of implementation, but it is a "brute-force" solution in terms of bandwidth -- we're copying a file from one location to another multiple times during a run, and that file might be a few GB.

However, if we are running on a platform that has a relatively high-bandwidth connection between the fast path and the working directory, the `syncDbAfterWrite` solution would be attractive. We might expect on the order of 5-10 s for the database copy, which happens once per time node. That would be a negligible time penalty for a large, complex ARMI app that runs for anywhere from tens of minutes to a few hours at each time node.
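The close-copy-reopen step is simple to sketch. The function and method names below are illustrative, not ARMI's actual database API; `db` is assumed to be an object that can release and re-acquire its HDF5 file handle:

```python
import shutil


def syncDbAfterWrite(db, fastPathFile, workingDirFile):
    """Brute-force sync: close the HDF5 handle, copy the file, then reopen.

    ``db`` is a hypothetical database object assumed to expose ``close()``
    and ``open()`` methods that release and re-acquire the underlying
    h5py file handle.
    """
    db.close()  # release the handle so the OS permits the copy
    shutil.copy(fastPathFile, workingDirFile)
    db.open()  # resume writing to the fast-path copy
```

Because the handle is closed before the copy, the `PermissionError` described above is avoided; the cost is the repeated full-file copy.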
Description
The database is written to the fast path (a temporary directory) during a run. At the end of the run, the file is copied to the working directory. When an error is raised on the main process (`context.MPI_RANK == 0`), the database is copied back to the working directory. When an error is raised on a worker process, the database is not copied back to the working directory, and thus the file is lost.

This happens because the failing process calls `context.MPI_COMM.Abort(errorcode=-1)`, which forces all of the other processes to abort immediately. If the main process is the first to fail, it copies the database back before calling the abort, via `DatabaseInterface::interactError()` -- i.e., a "graceful" failure and exit. However, if the failing process is not the main process, the main process will be forced to abort before getting a chance to copy the DB back to the working directory. This `MPI_COMM.Abort` is called by ARMI's `__main__.py`:

https://github.com/terrapower/armi/blob/31befeb1c749177ec61e9f7cb1104a6df0e66892/armi/__main__.py#L66
Additionally, using the `syncDbAfterWrite` setting does not work. You can't copy an HDF5 file from one location to another while the file handle is open; this raises a `PermissionError`.

Reproduction of Problem
Below is a script that imposes an exception during `DatabaseInterface::interactEveryNode`. I will name this file `runDatabaseWriteOnFailure.py`. To show the behavior, run this script as follows:
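The script itself is not reproduced in the issue. A minimal sketch of the idea is a monkey-patch that makes a chosen rank raise; everything here beyond ARMI's `DatabaseInterface.interactEveryNode` name is an assumption:

```python
def imposeFailure(interfaceCls, methodName, failRank, myRank):
    """Wrap ``interfaceCls.methodName`` so it raises on rank ``failRank``.

    In runDatabaseWriteOnFailure.py this would be applied to ARMI's
    DatabaseInterface.interactEveryNode, with ``myRank = context.MPI_RANK``.
    """
    orig = getattr(interfaceCls, methodName)

    def failing(self, *args, **kwargs):
        if myRank == failRank:
            raise RuntimeError(f"imposed failure on rank {myRank}")
        return orig(self, *args, **kwargs)

    setattr(interfaceCls, methodName, failing)
```

Launching under MPI with `failRank=0` versus a nonzero rank then exercises the two cases below (the exact launch command, e.g. `mpiexec -n 3 python runDatabaseWriteOnFailure.py`, depends on your MPI setup and is an assumption, not taken from the issue).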
Failing on main processor:

This will produce an `MPI_ABORT` message, and if you `ls` in `failedDatabase/` you will see that `failedDatabase.h5` has been written.

Failing on worker processor:
You will see a similar `MPI_ABORT` message, but doing an `ls` in `failedDatabase/` shows that the h5 file was not written.

Potential Solutions

To solve this issue, the failing worker needs to send a message to the main processor that it is failing, so the main processor can output the database before `MPI_ABORT` is called. However, the main processor needs to know to receive this message at some point. We can't probe for the message at prescribed locations, since there might be a `bcast`/`gather`/`scatter` before the receive command that causes the process to hang. With this in mind, I've come up with the following potential solutions.

Wrapper Around `MPI_COMM`
This solution involves having a wrapper class around `context.MPI_COMM`. Basically, when an exception is thrown on a worker process, it sends a designated message to the main process that it is failing. The wrapper probes for this message before every `bcast`/`gather`/`scatter` and throws an exception if the message is received. This is extremely intrusive and not future-proof if someone wants to do a communication not supported by the wrapper.

Failure Check Around Troublesome Calls
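A minimal sketch of such a wrapper, assuming an mpi4py-style comm object (duck-typed here so it runs without MPI); the tag value and all class names are assumptions:

```python
FAILURE_TAG = 117  # assumed tag reserved for worker failure notifications


class WorkerFailure(Exception):
    """Raised on the main process when a worker has announced a failure."""


class FailureAwareComm:
    """Wraps an mpi4py-like comm, probing for failures before collectives."""

    def __init__(self, comm):
        self._comm = comm

    def _checkForFailure(self):
        # Non-blocking probe: has any worker announced a failure?
        if self._comm.iprobe(tag=FAILURE_TAG):
            raise WorkerFailure("a worker process reported a failure")

    def bcast(self, obj, root=0):
        self._checkForFailure()
        return self._comm.bcast(obj, root=root)

    # gather/scatter/etc. would be wrapped the same way. Any collective
    # NOT wrapped here is a hole in the safety net, which is why this
    # approach is intrusive and not future-proof.
    def __getattr__(self, name):
        return getattr(self._comm, name)
```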
Basically, this solution wraps the troublesome call (call it `commonFailingCode()`) in a context manager whose `__exit__()` method sends/receives a message to/from the main/worker processors indicating whether the wrapped code threw an exception. I'm not positive about the exact details, but the idea is there. Essentially, `CheckFailure.__exit__()` sends the main process a status message; if the main process receives a non-zero status, it raises its own exception, goes through `interactError`, and calls `MPI_ABORT` itself. The failing worker raises a `SystemExit`, which is handled by `__main__.py` so that `MPI_ABORT` is not called and the process is gracefully finalized.

This solution is much less intrusive, but it requires developers to put in their own checks, or else the original behavior returns. The major drawback is that if `commonFailingCode()` has its own MPI calls and an exception is thrown before them on a worker process, the run will hang. So developers need to be careful when inserting this into their code, making this a dangerous capability.
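A sketch of the `CheckFailure` idea described above. The real thing would use `context.MPI_COMM`; here the comm, rank, and process count are passed in explicitly (mpi4py-style `send`/`recv`, duck-typed so the logic runs without MPI), and the tag value and names are assumptions:

```python
STATUS_TAG = 118  # assumed tag reserved for exit-status messages


class CheckFailure:
    """Context manager: workers report exit status to the main process."""

    def __init__(self, comm, rank, numProcs):
        self.comm = comm
        self.rank = rank
        self.numProcs = numProcs

    def __enter__(self):
        return self

    def __exit__(self, excType, excValue, traceback):
        if self.rank == 0:
            # Main process: collect a status from every worker. A non-zero
            # status means a worker failed, so raise here and let the normal
            # interactError -> MPI_ABORT path run on the main process.
            for worker in range(1, self.numProcs):
                status = self.comm.recv(source=worker, tag=STATUS_TAG)
                if status != 0:
                    raise RuntimeError(f"worker {worker} failed")
            return False  # re-raise any main-process exception normally
        else:
            # Worker: report failure (1) or success (0) to the main process.
            self.comm.send(1 if excType else 0, dest=0, tag=STATUS_TAG)
            if excType:
                # Exit gracefully so __main__.py does not call MPI_ABORT.
                raise SystemExit(1)
            return False
```

Usage on all ranks would look like `with CheckFailure(context.MPI_COMM, context.MPI_RANK, context.MPI_SIZE): commonFailingCode()` -- and, per the drawback above, `commonFailingCode()` must not hang on its own MPI calls after a worker-side exception.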