tarunsmalviya closed 1 year ago
I have tested this branch with multiple checkpoint-resume and checkpoint-restart scenarios, using the same configuration that Tom uses in his automated testing. The job finished without any issues.
To test for correctness, I followed these steps:
To test the runtime overhead of the reproducible reduction operation, I followed these steps:
This version, `MPI_Allreduce_reproducible`, can be called from the `MPI_Allreduce` wrapper, which then returns its result. If desired, it could be called selectively for certain sizes, certain types, or certain ops. This function is only meaningful for floating-point datatypes, as floating-point operations are non-associative.

On the Reproducibility of MPI Reduction Operations
An optimisation of allreduce communication in message-passing systems
MPI standard:
Note: `MPI_Waitany`, `MPI_Scan`, and `MPI_Allreduce` can receive messages non-deterministically.

Set the `MANA_USE_ALLREDUCE_REPRODUCIBLE` environment variable to enable (>0) or disable (=0) the reproducible reduction operation.
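
To illustrate why the order of a floating-point reduction matters, and how the environment variable could gate the reproducible path, here is a minimal sketch in plain C (no MPI calls). The helper names `parse_reproducible_flag`, `use_reproducible_allreduce`, and `sum_in_order` are hypothetical and are not taken from this branch; treating an unset variable as "disabled" is also an assumption.

```c
#include <stdlib.h>

/* Hypothetical helper (not from this branch): interpret the value of
 * MANA_USE_ALLREDUCE_REPRODUCIBLE. A value > 0 enables the reproducible
 * path and 0 disables it. Treating an unset variable as disabled is an
 * assumption made for this sketch. */
static int parse_reproducible_flag(const char *value)
{
    if (value == NULL) {
        return 0;
    }
    return strtol(value, NULL, 10) > 0 ? 1 : 0;
}

/* Hypothetical helper: read the flag from the process environment. */
static int use_reproducible_allreduce(void)
{
    return parse_reproducible_flag(getenv("MANA_USE_ALLREDUCE_REPRODUCIBLE"));
}

/* Sum the per-rank contributions in the sequence given by `order`.
 * Passing 0, 1, ..., n-1 models a fixed rank order; any other
 * permutation models contributions being combined as they arrive. */
static double sum_in_order(const double *contrib, const int *order, int n)
{
    double total = 0.0;
    for (int i = 0; i < n; i++) {
        total += contrib[order[i]];
    }
    return total;
}
```

With the contributions `{1e30, 1.0, -1e30}`, summing in rank order `{0, 1, 2}` yields `0.0`, because the `1.0` is absorbed when it is added to `1e30`. Summing in the order `{0, 2, 1}` yields `1.0`. A reduction that always combines contributions in the same order avoids this run-to-run difference.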