Matthew-Whitlock opened this issue 2 weeks ago
One very straightforward integration is to make a checkpointing backend in KokkosResilience that uses Fenix's checkpoint API for in-memory checkpoints.
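As a rough sketch of what that backend could look like (the `FenixBackend` class and its hooks are hypothetical, not the real KokkosResilience backend interface, and the `Fenix_Data_*` calls are only referenced in comments since their exact signatures should be checked against the Fenix headers):

```cpp
#include <mpi.h>
#include <map>
#include <utility>

// Hypothetical shape of a Fenix-backed in-memory checkpoint backend for
// KokkosResilience. Names are illustrative only.
class FenixBackend {
public:
  explicit FenixBackend(MPI_Comm resilient_comm) : comm_(resilient_comm) {
    // Create a Fenix redundancy group on the resilient communicator, e.g. with
    // Fenix_Data_group_create(...). Fenix keeps redundant copies of each group
    // member on peer ranks, so the checkpoint survives rank failures.
  }

  // Register one application buffer with Fenix, e.g. Fenix_Data_member_create(...).
  int register_buffer(void* data, int count, MPI_Datatype type) {
    int id = next_member_id_++;
    members_[id] = {data, count};
    (void)type;
    return id;
  }

  // Store every registered member and commit a consistent in-memory version,
  // e.g. Fenix_Data_member_store(...) per member followed by Fenix_Data_commit(...).
  void checkpoint(int iteration) { (void)iteration; }

  // After a failure, copy the last committed version back into the application
  // buffers, e.g. Fenix_Data_member_restore(...).
  void restore(int iteration) { (void)iteration; }

private:
  MPI_Comm comm_;
  int next_member_id_ = 0;
  std::map<int, std::pair<void*, int>> members_;
};
```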
One flaw with the current design is the use of `longjmp`, which is how Fenix returns to `Fenix_Init` after rebuilding a failed communicator. `longjmp` has a lot of gross undefined behavior, particularly with C++ code. Fenix also has an exception-based recovery mode, which we can work to integrate with automatically.
So instead of this:
```
If initial_rank:
    /**** Init ****/
Elif recovered_rank:
    /**** Recovery Re-Init ****/
Elif survivor_rank:
    /**** Survivor Re-Init ****/

for i:
    kr_ctx.checkpoint(i, {
        // Application work
    });
```
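For reference, that longjmp-style flow corresponds roughly to the plain Fenix usage pattern, where `Fenix_Init` reports each rank's role on return. The snippet below is a sketch from memory of that pattern; the `Fenix_Init` argument list should be double-checked against `fenix.h`, and the role constants are the ones documented by Fenix.

```cpp
#include <fenix.h>
#include <mpi.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);

  int role = 0, error = 0;
  MPI_Comm res_comm = MPI_COMM_NULL;
  const int spare_ranks = 2;   // ranks held in reserve to replace failed ones

  // Control returns here (via longjmp) whenever a failure is recovered.
  Fenix_Init(&role, MPI_COMM_WORLD, &res_comm, &argc, &argv,
             spare_ranks, 0, MPI_INFO_NULL, &error);

  if (role == FENIX_ROLE_INITIAL_RANK) {
    /**** Init ****/
  } else if (role == FENIX_ROLE_RECOVERED_RANK) {
    /**** Recovery Re-Init ****/
  } else if (role == FENIX_ROLE_SURVIVOR_RANK) {
    /**** Survivor Re-Init ****/
  }

  // ... checkpointed iteration loop on res_comm ...

  Fenix_Finalize();
  MPI_Finalize();
  return 0;
}
```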
We could use something like this:
```
/**** Init ****/
// Failures may have happened, but only recovered ranks start back at Fenix_Init.
// The application does some MPI-dependent init, regardless of initial_rank or recovered_rank.

kr_ctx = KokkosResilience::make_context();

kr_ctx.register_replacement_callback({
    /**** Recovery Re-Init ****/
    // In this case, recovered ranks have already gone through Init,
    // so this is just for any application-specific recovery logic after we reinit.
});

kr_ctx.register_survivor_callback({
    /**** Survivor Re-Init ****/
    // Survivor Re-Init still needs to participate in the collective Init MPI calls,
    // in addition to any app-specific logic.
    kr_ctx.reset(res_comm);
});

for i:
    kr_ctx.checkpoint(i, {
        // Application work
        // Any failures that happen in this region are caught by an exception handler in the
        // kr_ctx.checkpoint function. This handler would call the survivor callback, then
        // recover to the last good checkpoint iteration and proceed.
    });
```
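A minimal sketch of the control flow that `kr_ctx.checkpoint` would need for this, assuming Fenix's exception-based mode. The `Context` class, `fenix_failure` exception name, and `commit`/`restore` helpers are hypothetical stand-ins; only the try/catch flow is the point.

```cpp
#include <exception>
#include <functional>
#include <utility>

// Stand-in for whatever exception type Fenix's exception-based recovery mode
// throws when a communicator failure is detected (name is a placeholder).
struct fenix_failure : std::exception {};

// Hypothetical context showing only the control flow described above; this is
// not the real KokkosResilience context.
class Context {
public:
  void register_replacement_callback(std::function<void()> cb) { replacement_cb_ = std::move(cb); }
  void register_survivor_callback(std::function<void()> cb)    { survivor_cb_ = std::move(cb); }

  template <typename Body>
  void checkpoint(int& iteration, Body&& body) {
    try {
      body();                         // application work for this iteration
      commit(iteration);              // store registered data in the backend
      last_good_ = iteration;
    } catch (const fenix_failure&) {
      // By the time the exception reaches us, Fenix has already repaired the
      // communicator. Run the survivor hook, roll back, and resume the loop.
      if (survivor_cb_) survivor_cb_();
      restore(last_good_);            // reload the last committed checkpoint
      iteration = last_good_;
    }
  }

private:
  void commit(int /*iteration*/)  { /* backend store + commit */ }
  void restore(int /*iteration*/) { /* backend restore */ }

  std::function<void()> replacement_cb_;  // run on recovered ranks after re-init
  std::function<void()> survivor_cb_;     // run on survivors after a failure
  int last_good_ = -1;
};
```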
We could also manage Fenix's initialization for the user, in one of two ways:

1. `Fenix_Init` inside of `kr::make_context`, and `Fenix_Finalize` in the context's destructor.
2. `Fenix_Init`/`Fenix_Finalize` wrapped around a scoped recovery region managed by the context.

There are pros and cons. For both options we can simplify things for end users, but unless we maintain feature parity with Fenix's initialization options we're limiting what the user can do. Additionally, Fenix (currently) only supports a single Fenix_Init/Finalize per application, so we'd be limited to one kr_ctx or scoped region.
Option 1 is the simplest, but I worry about not knowing the specific ordering of the Fenix_Finalize call, since it would depend on when the automatic context object's destructor runs. It also doesn't really give us any new capability in KokkosResilience; it's mostly just hiding the Fenix function calls.
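For concreteness, Option 1 is essentially an RAII wrapper. The sketch below uses hypothetical names and leaves the Fenix calls in comments; it mainly illustrates the destructor-ordering concern.

```cpp
#include <memory>
#include <mpi.h>

// Hypothetical RAII wrapper illustrating Option 1. The real kr::make_context
// and context types differ; the point is that Fenix_Finalize's timing is tied
// to whenever this object happens to be destroyed.
class FenixScopedContext {
public:
  FenixScopedContext(MPI_Comm comm, int* argc, char*** argv) {
    // Fenix_Init(&role_, comm, &res_comm_, argc, argv, spares, 0, MPI_INFO_NULL, &error_);
    (void)comm; (void)argc; (void)argv;
  }
  ~FenixScopedContext() {
    // Fenix_Finalize();  // ordering depends on when the context is destroyed
  }
  MPI_Comm comm() const { return res_comm_; }

private:
  MPI_Comm res_comm_ = MPI_COMM_NULL;
  int role_ = 0, error_ = 0;
};

// Factory mirroring the proposed kr::make_context behavior.
inline std::shared_ptr<FenixScopedContext> make_context(MPI_Comm comm, int* argc, char*** argv) {
  return std::make_shared<FenixScopedContext>(comm, argc, argv);
}
```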
Option 2 might give us more room to automate recovery. I'm imagining something like this:
```
MPI_Init();

kr_ctx = KokkosResilience::make_context(MPI_COMM_WORLD);

// KokkosResilience handles Fenix_Init/Finalize around this lambda
kr_ctx.online_recovery([](MPI_Comm res_comm){
    /**** Init ****/

    for i:
        kr_ctx.checkpoint(i, {
            // Application work
        });
});
```
When failures happen inside the online-recovery lambda, KR has full control over updating the MPI_Comm. When exceptions are caught, KR can simply restart the whole region. This limits the ability to localize recovery, which is ultimately the end goal, since all survivor ranks do a full re-init. We could probably mix this with the callbacks from the Fenix exceptions integration above to restore that functionality, though. Since we'd have a scoped region for recovery, we could also integrate with a message logger in the future to help automate localization.
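A minimal sketch of what `online_recovery` could do internally under those assumptions (exception-based Fenix mode; `fenix_failure` is a placeholder name, and the Fenix calls are left as comments):

```cpp
#include <exception>
#include <functional>
#include <mpi.h>

// Placeholder for Fenix's exception-mode failure exception.
struct fenix_failure : std::exception {};

// Wrap the user region with Fenix_Init/Finalize and, whenever a failure
// exception escapes the region, simply re-run the whole region.
void online_recovery(MPI_Comm base_comm, const std::function<void(MPI_Comm)>& region) {
  MPI_Comm res_comm = MPI_COMM_NULL;
  // Fenix_Init(&role, base_comm, &res_comm, ..., &error);  // details elided
  (void)base_comm;

  bool finished = false;
  while (!finished) {
    try {
      region(res_comm);   // user Init + checkpointed loop; checkpoint() calls
      finished = true;    // inside fast-forward to the last committed iteration
    } catch (const fenix_failure&) {
      // Fenix has already repaired res_comm; every survivor re-enters the
      // region, which is a full re-init (this is what limits localization).
    }
  }

  // Fenix_Finalize();
}
```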
A lot of the problems above relate to resetting state across the survivor/failed ranks. This is necessary in MiniMD since many non-checkpointed objects contain state variables. For the purposes of global recovery, though, we could just checkpoint those state variables as well. So we might be able to skip all of that complexity by using the Magistrate checkpointing work to expand to checkpointing non-KokkosView variables. This limits our automation, though, since we can't automatically identify what to checkpoint/recover, so the user needs to define the serializer as well as manually register the objects to checkpoint.
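The user-facing cost might look something like the sketch below, using a Magistrate-style intrusive serializer. The header path and the registration call are assumptions; in particular `register_serializable` is a hypothetical name for whatever hook KokkosResilience ends up exposing for non-View data.

```cpp
#include <checkpoint/checkpoint.h>   // Magistrate ("checkpoint") header, path assumed

// Non-View state that would otherwise live in un-checkpointed objects.
struct SimState {
  int    step        = 0;
  double dt          = 0.005;
  double temperature = 1.44;

  // Magistrate-style intrusive serializer: the same function describes both
  // packing and unpacking of the listed members.
  template <typename SerializerT>
  void serialize(SerializerT& s) {
    s | step | dt | temperature;
  }
};

// Usage sketch: the user still registers the object explicitly, since we can't
// detect it the way we detect Kokkos Views captured in lambdas.
//
//   SimState state;
//   kr_ctx->register_serializable("sim_state", state);  // hypothetical API
```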
Maybe we could serialize the lambdas themselves? Our view detection works by copying the lambda, since the views are stored inside it. If we could figure out some way of serializing the whole lambda along with all the data stored inside, that would help us out. No idea if that's possible, though, since each lambda is its own implementation-defined closure type.
We want to simplify more of the Fenix+KokkosResilience process, focusing for now on implementing the global recovery approach but keeping future localized recovery flows in mind.
Here's the current basic flow for MiniMD:
I'll leave some thoughts on directions to go as comments.