Closed by gc00 1 year ago
As it stands, though, this doesn't let you measure checkpoint-restart times.
If I run an application with
mana_coordinator
srun -n 4 mana_launch ./wave_mpi.mana.exe
and checkpoint with mana_status -kc, the output only tells me when presuspend starts:
12:30:01: *** MANA: EVENT_INIT
12:30:01: *** MANA: EVENT_INIT
12:30:01: *** MANA: EVENT_INIT
12:30:01: *** MANA: EVENT_INIT
11 September 2023 12:30:01.362 PM
MPI_WAVE:
FORTRAN90 version.
Estimate a solution of the wave equation using MPI.
Using 4 processes.
Using a total of 40001 points.
Using ****** time steps of size 0.125000E-04
Computing final solution at time 12.5000
12:30:03: *** MANA: EVENT_PRESUSPEND (before checkpoint)
Elapsed time since INIT/RESTART: 2 seconds
Restarting does not display the time required /to/ restart.
So, more timings should be added in the appropriate places if measuring checkpoint-restart (i.e., the time /to/ checkpoint and the time /to/ restart) is a goal.
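In the meantime, the event timestamps that the output does contain can be differenced with a short script. This is only a sketch: it assumes the line format shown in the log above (HH:MM:SS: *** MANA: EVENT_NAME), the helper names parse_events and elapsed_seconds are hypothetical and not part of MANA, and the resolution is limited to the one-second timestamps MANA prints.

```python
import re
from datetime import datetime

# Assumed format, copied from the log above: "HH:MM:SS: *** MANA: EVENT_NAME ..."
EVENT_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}): \*\*\* MANA: (EVENT_\w+)")


def parse_events(lines):
    """Return (time, event_name) pairs for every line that looks like a MANA event."""
    events = []
    for line in lines:
        match = EVENT_RE.match(line.strip())
        if match:
            stamp = datetime.strptime(match.group(1), "%H:%M:%S")
            events.append((stamp, match.group(2)))
    return events


def elapsed_seconds(events, start_event, end_event):
    """Seconds from the first start_event to the first end_event at or after it.

    Returns None if either event is missing. Runs that cross midnight are
    not handled, because the log lines carry no date.
    """
    start = next((t for t, name in events if name == start_event), None)
    if start is None:
        return None
    for stamp, name in events:
        if name == end_event and stamp >= start:
            return int((stamp - start).total_seconds())
    return None


log = [
    "12:30:01: *** MANA: EVENT_INIT",
    "MPI_WAVE:",
    "12:30:03: *** MANA: EVENT_PRESUSPEND (before checkpoint)",
]
print(elapsed_seconds(parse_events(log), "EVENT_INIT", "EVENT_PRESUSPEND"))  # prints 2
```

This only recovers intervals between events that are already logged; it cannot supply the time to checkpoint or the time to restart until those events are printed as well.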
I can now move on to test the variance doing it this way.
This lets me measure PRESUSPEND and PRECHECKPOINT times, which I suppose is the meaningfully different part between main and virtual-ids.
But it does not let me measure restart time (which we would expect to be meaningfully different between the two), as mana_restart does not parse --timing.
@leonidbelyaev, this does allow you to measure restart times. Please launch with the --timing flag, and then restart from that checkpoint.
This works because it uses a MANA_TIMING environment variable. The environment variable is saved within the checkpoint image, and so the restart also sees the MANA_TIMING environment variable.
I suppose I should document this within a help message for bin/mana_launch --timing. It might be possible to have an independent --timing flag only for bin/mana_restart, but that would be complex to implement. Doing it this way is simpler (one environment variable triggering timings at both launch and restart).
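To illustrate why a single environment variable covers both launch and restart, here is a toy model in Python. It is not MANA's implementation: ToyProcess, checkpoint, restart, and report are hypothetical names, the value "1" for MANA_TIMING is an assumption, and the only behavior taken from the discussion is that the environment is saved in the checkpoint image and seen again on restart.

```python
import time

TIMING_VAR = "MANA_TIMING"  # variable name taken from the discussion above


def timing_enabled(env):
    """Timing is on whenever the variable is present (its value is not inspected)."""
    return TIMING_VAR in env


class ToyProcess:
    """Toy stand-in for a checkpointed process; not a MANA data structure."""

    def __init__(self, env):
        self.env = dict(env)
        self.start = time.monotonic()  # reference point for INIT or RESTART

    def checkpoint(self):
        # The image carries a copy of the environment, as described above.
        return {"env": dict(self.env)}

    @classmethod
    def restart(cls, image):
        # The restarted process gets the saved environment back, so no
        # separate restart-time flag is needed.
        return cls(image["env"])

    def report(self, event):
        """Return a timing line if timing is enabled, else None."""
        if not timing_enabled(self.env):
            return None
        elapsed = time.monotonic() - self.start
        return f"{event}: {elapsed:.3f} s since INIT/RESTART"


launched = ToyProcess({TIMING_VAR: "1"})  # models launching with --timing
restarted = ToyProcess.restart(launched.checkpoint())
print(restarted.report("EVENT_RESTART") is not None)  # prints True
```

The trade-off in this design is that timing cannot be switched on for a restart of an image that was launched without it, which is what an independent restart-time flag would have to address.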
See the FIXME for things that still need to be done.
This should also wait until we push in dev/gdc0/simplifyCopyBits, and then push this in on top of it.
This should make it easy to get reliable timings with MANA, from DMTCP_EVENT_INIT to DMTCP_EVENT_EXIT.