mpickpt / mana

MANA for MPI

MANA prints timings to stderr #360

Closed gc00 closed 1 year ago

gc00 commented 1 year ago

See the FIXME for things that still need to be done.

This should also wait until dev/gdc0/simplifyCopyBits has been pushed in; this can then be pushed in on top of it.

This should make it easy to get reliable timings with MANA, from DMTCP_EVENT_INIT to DMTCP_EVENT_EXIT.

aeblyve commented 1 year ago

As it stands, though, this doesn't let you measure checkpoint-restart times. If I start mana_coordinator, run an application with srun -n 4 mana_launch ./wave_mpi.mana.exe, and checkpoint with mana_status -kc, the output only tells me when presuspend starts:

12:30:01: *** MANA: EVENT_INIT
12:30:01: *** MANA: EVENT_INIT
12:30:01: *** MANA: EVENT_INIT
12:30:01: *** MANA: EVENT_INIT
11 September 2023  12:30:01.362 PM

MPI_WAVE:
  FORTRAN90 version.
  Estimate a solution of the wave equation using MPI.

  Using  4 processes.
  Using a total of  40001 points.
  Using ****** time steps of size   0.125000E-04
  Computing final solution at time    12.5000
12:30:03: *** MANA: EVENT_PRESUSPEND (before checkpoint)
              Elapsed time since INIT/RESTART: 2 seconds

Restarting does not display the time required /to/ restart.

So, more timings should be added in the appropriate places if measuring checkpoint-restart (i.e., the time /to/ checkpoint and the time /to/ restart) is a goal.
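
In the meantime, the interval between any two events that are already logged can be derived from their timestamps. The following is a minimal sketch in C, assuming each event line begins with an HH:MM:SS stamp as in the output above. The helper names stamp_to_seconds and seconds_between are hypothetical and not part of MANA, and the result is only as precise as the one-second resolution of the log.

```c
#include <stdio.h>

/* Hypothetical helper (not part of MANA): parse a leading "HH:MM:SS"
 * stamp and return seconds since midnight, or -1 if the line does not
 * start with a valid stamp. */
static long stamp_to_seconds(const char *line) {
  int h, m, s;
  if (sscanf(line, "%2d:%2d:%2d", &h, &m, &s) != 3) {
    return -1;
  }
  if (h < 0 || h > 23 || m < 0 || m > 59 || s < 0 || s > 60) {
    return -1;
  }
  return h * 3600L + m * 60L + s;
}

/* Hypothetical helper: seconds between two logged event lines.
 * Assumes the second event happened less than 24 hours after the first,
 * so a negative difference means the clock passed midnight once. */
static long seconds_between(const char *first, const char *second) {
  long a = stamp_to_seconds(first);
  long b = stamp_to_seconds(second);
  if (a < 0 || b < 0) {
    return -1;
  }
  long diff = b - a;
  return diff >= 0 ? diff : diff + 24L * 3600L;
}
```

Applied to the EVENT_INIT and EVENT_PRESUSPEND lines above, seconds_between yields the same 2 seconds that MANA reports as elapsed time since INIT/RESTART.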

I can now move on to test the variance doing it this way.

aeblyve commented 1 year ago

This lets me measure PRESUSPEND and PRECHECKPOINT times, which I suppose is the meaningfully different part between main and virtual-ids.

But it does not let me measure restart time (which we would expect to be meaningfully different between the two), as mana_restart does not parse --timing.

gc00 commented 1 year ago

@leonidbelyaev, this does allow you to measure restart times. Please launch with the --timing flag, and then restart from that checkpoint.

This works because it uses a MANA_TIMING environment variable. The environment variable is saved within the ckpt image, and so the restart also sees the MANA_TIMING environment variable.
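
To illustrate the idea, here is a minimal sketch in C of gating timing output on an environment variable. It is not MANA's actual code: the helper names timing_requested, elapsed_seconds, and report_event are hypothetical, and treating any value of MANA_TIMING as "enabled" is an assumption. The output format imitates the log lines shown earlier in this thread.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper: timings are requested when the variable is set.
 * Assumption: any value, even an empty string, enables timings. */
static int timing_requested(const char *env_value) {
  return env_value != NULL;
}

/* Hypothetical helper: whole seconds elapsed since INIT or RESTART. */
static long elapsed_seconds(time_t start, time_t now) {
  return (long)difftime(now, start);
}

/* Hypothetical helper: print one event report to `out` (stderr in
 * practice). Returns 1 if a report was printed, 0 if timings are off. */
static int report_event(FILE *out, const char *env_value,
                        const char *event, time_t start, time_t now) {
  char stamp[16];
  struct tm *local;

  if (!timing_requested(env_value)) {
    return 0;
  }
  local = localtime(&now);
  if (local == NULL ||
      strftime(stamp, sizeof stamp, "%H:%M:%S", local) == 0) {
    return 0;
  }
  fprintf(out, "%s: *** MANA: %s\n", stamp, event);
  fprintf(out, "              Elapsed time since INIT/RESTART: %ld seconds\n",
          elapsed_seconds(start, now));
  return 1;
}
```

A caller would pass getenv("MANA_TIMING") as env_value. Because the process environment is part of what the checkpoint image preserves, the same lookup would succeed after restart without any extra flag on the restart command.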

I suppose I should document this in the help message for bin/mana_launch --timing. It might be possible to have an independent --timing flag only for bin/mana_restart, but that would be complex to implement. Doing it this way is simpler: one environment variable triggers timings at both launch and restart.