libqueso / queso

QUESO is a C++ library for doing uncertainty quantification. QUESO stands for Quantification of Uncertainty for Estimation, Simulation and Optimization.

Multiple chains case is not resilient when one of the forward problems in one of the chains fails #662

Open dmcdougall opened 5 years ago

dmcdougall commented 5 years ago

Problem Context: I've been running some chemical kinetics inadequacy work lately using Queso. I built it successfully on our machine, and runs with a single core seem to go pretty well.

I want to run with many cores. I've read through the documentation, and I think I set everything up okay. In fact, I am able to run successfully with 16 cores. Just to be clear, my forward solves are serial. When I say "run with many cores" I mean that I'm generating N_cores chains with the serial forward solves (section 4.3 in the manual).

Here's my problem: one of the jobs gets stuck. This is not a Queso problem; it's a known issue with the ODE solver I'm using and my mathematical formulation. It does impact my simulations, though. The other 15 jobs complete successfully, but because one job doesn't finish within my requested time, Queso never generates a combined chain from the 15 successful ones.

Here's what I've tried: I thought I saw a way to have Queso write out the chain from each job that finishes successfully. For example, if I'm using 3 cores and 2 of them finish their jobs while 1 fails, the two successful ones should still write out their chains. Unfortunately, this isn't working for me. I might have mis-specified something in my input file.

In section 5.3.7 of the Queso manual, it says that specifying ip_mh_rawChain_dataOutputAllowedSet in the input file should allow each individual chain to be written out when its job completes.
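For reference, a hypothetical input-file fragment along these lines might look as follows. Only ip_mh_rawChain_dataOutputAllowedSet comes from the manual section cited above; the file-name option value and the specific subenvironment IDs are illustrative assumptions.

```
# Hedged sketch of an input-file fragment; the output path and the
# specific subenvironment IDs below are illustrative assumptions.
ip_mh_rawChain_dataOutputFileName   = outputData/ip_raw_chain
# Allow subenvironments 0, 1, and 2 to write out their own raw chains:
ip_mh_rawChain_dataOutputAllowedSet = 0 1 2
```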

Why do you think that, if one forward evaluation fails, the others should still produce chains? I feel like this will depend on the MPI stack; if one MPI process fails, I don't know whether MPI guarantees that the other processes can continue in a fault-tolerant way. I understand conceptually what you're asking; I'm just not sure MPI is happy about it.

If MPI isn't happy about it, I wonder if there's something we (QUESO) can do to deal with it. This strikes me as an excellent use-case for fault-tolerant parallel software.

My particular use-case is embarrassingly parallel: the forward solve isn't parallel. My understanding was that, in this situation, the full chain is distributed across the processes, but the processes don't communicate chain information until the very end, when the entire chain is reconstructed. In this special case, some of the chains will complete before others. So I was wondering if QUESO could have each process write out its chain when it's done; the user could then construct a partial chain from whatever was written to disk.
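To make that concrete, here is a minimal sketch of the pattern being requested, written against plain MPI rather than QUESO's actual internals; everything in it (the draw_sample stand-in, the file naming, the chain length) is an illustrative assumption. Each rank writes its own chain file as soon as its sampling loop finishes, so a partial combined chain can later be assembled from whichever files exist.

```cpp
// Sketch only: illustrates per-rank "write what you have" output.
// This is NOT QUESO's implementation. Compile with: mpicxx sketch.cpp
#include <mpi.h>
#include <fstream>
#include <sstream>
#include <vector>

// Stand-in for a serial forward solve plus one MH step; returns a sample.
double draw_sample(int step) { return static_cast<double>(step); }

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  const int n_steps = 1000;
  std::vector<double> chain;
  chain.reserve(n_steps);
  for (int step = 0; step < n_steps; ++step) {
    chain.push_back(draw_sample(step));
  }

  // Each rank writes its own chain immediately, independent of the
  // other ranks, so these files exist even if another rank never
  // finishes its sampling loop.
  std::ostringstream name;
  name << "raw_chain_sub" << rank << ".txt";
  std::ofstream out(name.str());
  for (double x : chain) out << x << "\n";
  out.close();

  // Only now do the ranks synchronize; a hung rank stalls the job
  // here, but the per-rank files above are already on disk.
  MPI_Finalize();
  return 0;
}
```

In this sketch, a rank that hangs inside draw_sample() would stall the job at the end, but the files written by the finished ranks would already be on disk for a user to post-process.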

Embarrassingly parallel or not doesn't really matter; MPI doesn't care. MPI sees a process failure (presumably the process exits with a nonzero exit code), and then what happens next is... I don't know. Is it up to the MPI implementation to decide what to do when a single MPI process fails? Perhaps I need to look at the MPI standard.

The process isn't failing, and MPI isn't exiting. The forward solve just hangs; calculations are (presumably) still being made. Eventually I hit a time-out.

Then I presume the application hangs at MPI_Finalize(), since there's an implicit barrier there. If that were the case, though, your other chains would already have written their respective outputs; since they haven't, I'm starting to think this is a QUESO bug.

Perhaps I can work on an MWE where one of the forward solves just calls sleep().
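A minimal sketch of such an MWE, again in plain MPI rather than through QUESO's API (the structure and names here are assumptions): every rank but one finishes its fake forward solve and reaches MPI_Finalize(), while rank 0 sleeps indefinitely, so the job hangs the way the original report describes.

```cpp
// Sketch of the proposed MWE: rank 0's "forward solve" never returns,
// so the job stalls at MPI_Finalize() even though every other rank is
// done. Compile and run with: mpicxx mwe.cpp && mpirun -np 4 ./a.out
#include <mpi.h>
#include <unistd.h>  // sleep()
#include <cstdio>

// Stand-in forward solve: hangs on rank 0, returns instantly elsewhere.
void forward_solve(int rank) {
  if (rank == 0) {
    while (true) sleep(1);  // mimics the stuck ODE solver
  }
}

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  forward_solve(rank);
  std::printf("rank %d finished its forward solves\n", rank);

  // Every rank except 0 arrives here; the implicit synchronization in
  // MPI_Finalize() then stalls the whole job until the time-out.
  MPI_Finalize();
  return 0;
}
```

Run under a wall-clock limit, this should reproduce the reported behavior: N-1 ranks print their completion message and the job then sits until the scheduler kills it.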