hemelb-codes / hemelb

A high performance parallel lattice-Boltzmann code for large scale fluid flow in complex geometries
GNU Lesser General Public License v3.0

CellArmy exceptions deadlocking simulations #753

Open mobernabeu opened 3 years ago

mobernabeu commented 3 years ago

CellArmy::Fluid2CellInteractions throws exceptions for a couple of anomalous situations (i.e. numerical instability) that can potentially occur in just a subset of the MPI ranks (most of the time in a single rank). I'm puzzled about why the catch blocks in main.cc are not picking them up and MPI_Aborting the whole simulation on ARCHER. This has led to some costly deadlocks in production!
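For context, a minimal standalone sketch of the failure mode (invented names, not hemeLB code): if one rank throws and the exception never results in an MPI_Abort, every other rank blocks forever in its next collective.

#include <mpi.h>
#include <cstdio>
#include <stdexcept>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  try {
    // Pretend only rank 0 hits the numerical instability.
    if (rank == 0)
      throw std::runtime_error("instability on a single rank");
    // All other ranks enter a collective and wait for rank 0.
    MPI_Barrier(MPI_COMM_WORLD);
  } catch (std::exception const& e) {
    std::fprintf(stderr, "rank %d: %s\n", rank, e.what());
    // If this abort is not reached (the symptom reported above), the barrier
    // never completes and the job hangs until the wallclock limit.
    MPI_Abort(MPI_COMM_WORLD, -1);
  }
  MPI_Finalize();
  return 0;
}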

Any thoughts @rupertnash?

rupertnash commented 3 years ago

Hard to tell without knowing what MPI calls might be in flight when the exception triggers. Can you drop a link to the line(s) that throw in here please?

Is it that control never reaches the top level catch or that the call to MPI_Abort doesn't stop the sim?

mobernabeu commented 3 years ago

We had deadlocks with the throw statement in https://github.com/UCL-CCS/hemelb-dev/blob/ea7a49a561277ba7aa5d275d3413e6dc4d71d0ec/Code/redblood/CellArmy.h#L269

@CharlesQiZhou could you please send me the stdout/stderr for one of those cases? I need to check whether https://github.com/UCL-CCS/hemelb-dev/blob/ea7a49a561277ba7aa5d275d3413e6dc4d71d0ec/Code/main.cc#L53 is being logged at all. That should allow me to answer @rupertnash's second question.

mobernabeu commented 3 years ago

Closed by mistake, sorry. Reopening.

rupertnash commented 3 years ago

Nothing stands out to me.

I immediately notice that just above, at https://github.com/UCL-CCS/hemelb-dev/blob/ea7a49a561277ba7aa5d275d3413e6dc4d71d0ec/Code/redblood/CellArmy.h#L253, you are iterating over copies of the elements in cells, which isn't usually what you want.

Does this matter? Does it cost performance?
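For what it's worth, a standalone illustration of the by-value vs by-reference point (generic types, not the actual CellContainer): if the elements are, say, shared pointers, each by-value iteration copies the handle, which costs an atomic ref-count increment and decrement per element.

#include <memory>
#include <vector>

struct Cell { int tag = 0; };

void Iterate(std::vector<std::shared_ptr<Cell>> const& cells) {
  // By value: every iteration copies the shared_ptr (atomic ref-count bump),
  // and `cell` is a handle local to the loop body.
  for (auto cell : cells) {
    (void) cell;
  }
  // By const reference: no copies, which is usually what is wanted here.
  for (auto const& cell : cells) {
    (void) cell;
  }
}

int main() {
  Iterate({std::make_shared<Cell>(), std::make_shared<Cell>()});
  return 0;
}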

I also see that you're calling std::map<>::at - is it this that's throwing? You can check with a temporary + IIFE ("iffy"):

auto&& tmp = [&]() -> decltype(auto) {
  // decltype(auto) preserves the reference returned by at() instead of copying
  try {
    return nodeDistributions.at(cell->GetTag());
  } catch (std::out_of_range& e) {
    // Log an error and re-throw
    throw;
  }
}();
try {
  tmp.template Reindex<Stencil>(globalCoordsToProcMap, cell);
} catch (std::exception& e) {
  // log and re-throw as before
  throw;
}
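(The point of the immediately-invoked lambda is that the map lookup gets its own try block, so an out_of_range thrown by at() can be logged and told apart from anything thrown later by Reindex, while tmp is still initialised in a single expression.)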
mobernabeu commented 3 years ago

Good points, @rupertnash. That loop should be over const&. std::map<>::at was defensive programming while debugging something that turned out to be unrelated, so we can spare the bounds check now. I'll make those changes. Thanks for the iffy trick, I didn't know it.

@CharlesQiZhou I don't know if you were notified about my previous message because I added your name in an edit. Please send me those files if you can.

CharlesQiZhou commented 3 years ago

@mobernabeu Sorry for missing your message. I've just seen it in my personal mailbox. Please find the stderr/stdout files below: stderr.txt stdout.txt

CharlesQiZhou commented 3 years ago

This may be useful for debugging purposes. One recent benchmark test of mine that triggered the CellArmy exception successfully invoked MPI_ABORT and immediately terminated the simulation in local runs on a desktop (4 cores, about 10 min of running). The same simulation running on ARCHER failed to invoke MPI_ABORT and ended in a deadlock.

Enclosed please find the log file from one of my local runs triggering the exception. log.txt

MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode -1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes. You may or may not see output from other processes, depending on exactly when Open MPI kills them.

mobernabeu commented 3 years ago

Thanks @CharlesQiZhou. This is very bizarre indeed and needs more investigation to understand what part is broken (exception not being thrown, not being caught, MPI_ABORT deadlocking or not doing its job). I suggest that you try to replicate on Cirrus or ARCHER2 once the code is running there and add a bit more tracing to see which of the above is the culprit. Running it through a parallel debugger may be necessary if print statements are not sufficient.
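As a sketch of the kind of tracing meant here (plain stderr for illustration; hemeLB's own logger would be the real choice), reporting the rank immediately before and after the MPI_Abort call would separate "catch never reached" from "MPI_Abort not doing its job":

#include <mpi.h>
#include <exception>
#include <iostream>

// Hypothetical helper for the top-level catch block.
void AbortWithTrace(std::exception const& e) {
  int rank = -1;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  // std::endl flushes, so the message survives even if the process is killed next.
  std::cerr << "Rank " << rank << " caught '" << e.what()
            << "', calling MPI_Abort" << std::endl;
  MPI_Abort(MPI_COMM_WORLD, -1);
  // If this line ever appears, MPI_Abort returned without terminating the job.
  std::cerr << "Rank " << rank << ": MPI_Abort returned" << std::endl;
}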