**Open** · mobernabeu opened this issue 3 years ago
Hard to tell without knowing what MPI calls might be in flight when the exception triggers. Can you drop a link to the line(s) that throw in here please?
Is it that control never reaches the top-level catch, or that the call to `MPI_Abort` doesn't stop the sim?

We had deadlocks with the `throw` statement in

https://github.com/UCL-CCS/hemelb-dev/blob/ea7a49a561277ba7aa5d275d3413e6dc4d71d0ec/Code/redblood/CellArmy.h#L269
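For reference, the failure mode we're worried about looks roughly like this (a minimal standalone sketch, not HemeLB's actual main.cc): one rank throws and reaches a top-level handler that calls `MPI_Abort`, while the remaining ranks sit in a collective and rely on the abort to kill them.

```cpp
#include <mpi.h>
#include <iostream>
#include <stdexcept>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  try {
    if (rank == 0)
      throw std::runtime_error("numerical instability");  // only one rank throws
    // The other ranks block here in a collective and rely on MPI_Abort to kill them.
    MPI_Barrier(MPI_COMM_WORLD);
  } catch (std::exception const& e) {
    std::cerr << "rank " << rank << ": " << e.what() << std::endl;
    MPI_Abort(MPI_COMM_WORLD, -1);
  }
  MPI_Finalize();
  return 0;
}
```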
@CharlesQiZhou could you please send me stdout/stderr for one of those cases? I need to check whether https://github.com/UCL-CCS/hemelb-dev/blob/ea7a49a561277ba7aa5d275d3413e6dc4d71d0ec/Code/main.cc#L53 is being logged at all. That should allow me to answer @rupertnash's second question.
Closed by mistake, sorry. Reopening.
Nothing stands out to me.

I immediately notice that just above, at

https://github.com/UCL-CCS/hemelb-dev/blob/ea7a49a561277ba7aa5d275d3413e6dc4d71d0ec/Code/redblood/CellArmy.h#L253

you are iterating over copies of the elements in `cells`, which isn't usually what you want. Does this matter? Does it cost performance?
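To illustrate the difference (toy types, not HemeLB's actual containers; if the elements are shared pointers the copy is cheaper but still there):

```cpp
#include <string>
#include <vector>

struct Cell { std::string tag; /* imagine heavy per-cell state */ };

void iterate(std::vector<Cell> const& cells) {
  for (auto cell : cells) { /* ... */ }         // copies every element
  for (auto const& cell : cells) { /* ... */ }  // binds a reference instead
}
```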
I also see that you're calling `std::map<>::at` - is it this that's throwing? You can check with a temporary + an IIFE ("iffy"):
```cpp
auto&& tmp = [&]() -> decltype(auto) {  // deduce a reference, not a copy
  try {
    return nodeDistributions.at(cell->GetTag());
  } catch (std::out_of_range& e) {
    // Log an error and re-throw
    throw;
  }
}();
try {
  tmp.template Reindex<Stencil>(globalCoordsToProcMap, cell);
} catch (std::exception& e) {
  // log and rethrow as before
  throw;
}
```
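(The IIFE just gives the initializer its own try/catch, so you can tell whether it's the `at` lookup or the `Reindex` call that throws, without restructuring the surrounding code.)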
Good points, @rupertnash. That loop should be over `const&`. The `std::map<>::at` was defensive programming while debugging something that turned out to be unrelated; we can drop the bounds check now. I'll make those changes. Thanks for the iffy trick, I didn't know it.
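Presumably something along these lines (a sketch of the intended change, not the final code; the use of `find` plus an `assert` is my illustration):

```cpp
#include <cassert>

// ...
for (auto const& cell : cells) {  // const reference instead of a copy
  // find() avoids at()'s bounds-checking exception; operator[] would
  // default-construct an entry on a miss, so it is not a drop-in either.
  auto found = nodeDistributions.find(cell->GetTag());
  assert(found != nodeDistributions.end());
  found->second.template Reindex<Stencil>(globalCoordsToProcMap, cell);
}
```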
@CharlesQiZhou I don't know if you were notified about my previous message, because I added your name in an edit. Please send me those files if you can.
@mobernabeu Sorry for missing your message. I just saw it in my personal mailbox. Please find the stderr/stdout files below: stderr.txt stdout.txt
This may be useful for debugging purposes. A recent benchmark test of mine that triggered the CellArmy exception successfully invoked MPI_ABORT and immediately terminated the simulation when run locally on a desktop (4 cores, about 10 min of running). The same simulation running on ARCHER failed to invoke MPI_ABORT and ended in deadlock.
> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode -1.
>
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes. You may or may not see output from other processes, depending on exactly when Open MPI kills them.
Thanks @CharlesQiZhou. This is very bizarre indeed and needs more investigation to understand which part is broken (the exception not being thrown, not being caught, or MPI_ABORT deadlocking or otherwise not doing its job). I suggest that you try to replicate on Cirrus or ARCHER2 once the code is running there, and add a bit more tracing to see which of the above is the culprit. Running it through a parallel debugger may be necessary if print statements are not sufficient.
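For the tracing, even something as blunt as rank-stamped, flushed writes at each suspect point would tell us how far each rank gets (a sketch; the `trace` helper below is hypothetical, and HemeLB's own logger could serve instead):

```cpp
#include <mpi.h>
#include <cstdio>

// Hypothetical helper: unbuffered, rank-stamped stderr writes, so the last
// line each rank printed survives an abort or deadlock.
void trace(char const* where) {
  int rank = -1, initialized = 0;
  MPI_Initialized(&initialized);
  if (initialized)
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  std::fprintf(stderr, "[rank %d] %s\n", rank, where);
  std::fflush(stderr);  // don't let buffering hide the last step reached
}

// Usage at the suspect points:
//   trace("about to throw in CellArmy");        // just before the throw
//   trace("caught in main.cc, calling Abort");  // inside the top-level catch
//   trace("MPI_Abort returned?!");              // should never print
```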
`CellArmy::Fluid2CellInteractions` throws exceptions for a couple of anomalous situations (e.g. numerical instability) that can potentially occur in just a subset of the MPI ranks (in a single rank most of the time). I'm puzzled about why the catch blocks in `main.cc` are not picking them up and `MPI_Abort`ing the whole simulation on ARCHER. This has led to some costly deadlocks in production! Any thoughts @rupertnash?