mpickpt / mana

MANA for MPI

Correctness - Reproducible reduction operation. #313

Closed tarunsmalviya closed 1 year ago

tarunsmalviya commented 1 year ago

This version, MPI_Allreduce_reproducible, can be called from the MPI_Allreduce wrapper, which then returns its result directly. If desired, it could be called selectively for certain sizes, types, or ops. The function is only meaningful for floating-point datatypes, because floating-point operations are non-associative and the result therefore depends on the order of evaluation.
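To illustrate why evaluation order matters, here is a minimal, MPI-free sketch in C. The function names `sum_left_to_right` and `sum_right_to_left` are hypothetical and are not part of MANA; they only show that the same three doubles can sum to different results depending on the order of the additions.

```c
#include <stddef.h>

/* Hypothetical helper: adds values[0], values[1], ... in index order. */
double sum_left_to_right(const double *values, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++) {
        acc += values[i];
    }
    return acc;
}

/* Hypothetical helper: adds values[n-1], values[n-2], ... in reverse order. */
double sum_right_to_left(const double *values, size_t n)
{
    double acc = 0.0;
    for (size_t i = n; i > 0; i--) {
        acc += values[i - 1];
    }
    return acc;
}

/*
 * For the operands {1e30, -1e30, 1.0}:
 *   left to right:  (1e30 + -1e30) + 1.0  ==  1.0
 *   right to left:  (1.0 + -1e30) + 1e30  ==  0.0   (the 1.0 is absorbed)
 */
```

A reduction whose internal order can change between runs can therefore produce different floating-point results from identical inputs.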

On the Reproducibility of MPI Reduction Operations

An optimisation of allreduce communication in message-passing systems

MPI standard:

Advice to users

Some applications may not be able to ignore the non-associative nature of floating-point operations or may use user-defined operations (see Section 5.9.5) that require a special reduction order and cannot be treated as associative. Such applications should enforce the order of evaluation explicitly. For example, in the case of operations that require a strict left-to-right (or right-to-left) evaluation order, this could be done by gathering all operands at a single process (e.g., with MPI_GATHER), applying the reduction operation in the desired order (e.g., with MPI_REDUCE_LOCAL), and if needed, broadcast or scatter the result to the other processes (e.g., with MPI_BCAST).

End of advice to users
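The middle step of that advice (applying the reduction in a fixed order at a single process) can be sketched in plain C. This is an illustration under stated assumptions, not the code in this PR: it assumes the operands were already gathered into a rank-major buffer (rank 0's elements first, as MPI_Gather would lay them out), it handles only summation of doubles, and the name `sum_by_rank_order` is hypothetical. In a real MPI program the inner loop would correspond to repeated MPI_Reduce_local calls, followed by MPI_Bcast of `out`.

```c
#include <stddef.h>

/*
 * Hypothetical helper: element-wise sum of `nranks` contributions, each
 * holding `count` doubles, stored rank-major in `gathered`.
 * Contributions are always added strictly in rank order 0, 1, 2, ...,
 * so the result does not depend on message arrival order.
 * `out` must have room for `count` doubles.
 */
void sum_by_rank_order(const double *gathered, int nranks, int count,
                       double *out)
{
    if (nranks <= 0 || count <= 0) {
        return; /* nothing to reduce */
    }
    for (int i = 0; i < count; i++) {
        double acc = gathered[i]; /* rank 0's operand for element i */
        for (int r = 1; r < nranks; r++) {
            acc += gathered[(size_t)r * (size_t)count + (size_t)i];
        }
        out[i] = acc;
    }
}
```

The trade-off of this approach is that all operands pass through one process, which is why the standard describes it as an option for applications that cannot tolerate order-dependent results.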

Note: MPI_Waitany, MPI_Scan, and MPI_Allreduce can receive messages non-deterministically.

Set the MANA_USE_ALLREDUCE_REPRODUCIBLE environment variable to enable (>0) or disable (=0) the reproducible reduction operation.
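A minimal sketch of how such a switch could be read in C follows. The function name `use_allreduce_reproducible` is hypothetical, and treating an unset or non-numeric value as "disabled" is an assumption made for this example; the actual parsing and default in MANA may differ.

```c
#include <stdlib.h>

/*
 * Hypothetical helper: returns 1 if MANA_USE_ALLREDUCE_REPRODUCIBLE is set
 * to an integer greater than 0, and 0 otherwise.
 * Assumption: an unset or non-numeric value means "disabled".
 */
int use_allreduce_reproducible(void)
{
    const char *value = getenv("MANA_USE_ALLREDUCE_REPRODUCIBLE");
    if (value == NULL) {
        return 0; /* variable not set */
    }
    char *end = NULL;
    long parsed = strtol(value, &end, 10);
    if (end == value) {
        return 0; /* no digits could be parsed */
    }
    return parsed > 0 ? 1 : 0;
}
```

A wrapper could call such a check once and cache the result, so that the environment is not consulted on every reduction.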

tarunsmalviya commented 1 year ago

I have tested this branch with multiple checkpoint-resume and checkpoint-restart scenarios, using the same configuration that Tom uses in his automated testing. The job finished without any issues.

To test for correctness, I followed these steps:

  1. Generated a reference VASP5 output file for 128 ranks using the configuration that Tom uses in his automated testing, without checkpoint-resume/restart and with the reproducible reduction operation enabled. Output: OSZICAR.128.reproducible.ref.txt
  2. Generated a VASP5 output file using the same configuration, but with checkpoint-resume and checkpoint-restart. Output: OSZICAR.128.reproducible.txt
  3. Compared the two output files and found no mismatches in the RMM values. Diff: diff_OSZICAR.128.reproducible.ref_OSZICAR.128.reproducible.csv

To test the runtime overhead of the reproducible reduction operation, I followed these steps:

  1. Generated a reference VASP5 output file using the same configuration as above (without checkpoint-resume/restart), but with the reproducible reduction operation disabled. Output: OSZICAR.128.non-reproducible.ref.txt
  2. Compared the number of RMM values generated in the two scenarios (reproducible and non-reproducible) and found no runtime overhead. In fact, the run with the reproducible reduction operation generated more RMM values (192) than the run without it (177).