paboyle / Grid

Data parallel C++ mathematical object library
GNU General Public License v2.0

Deadlock of RHMC caused by MPI_File_open #331

Closed i-kanamori closed 3 years ago

i-kanamori commented 3 years ago

In some implementations of MPI-IO, MPI_File_open internally calls srand() on the master rank, so we cannot assume that rand() returns the same value on every MPI process. This causes a deadlock in RHMC, where the check of the eigenvalue range is triggered using rand(): if the ranks draw different values, they can disagree on whether to run the check.

It seems romio321 is responsible for this behaviour.
https://github.com/open-mpi/ompi/blob/master/ompi/mca/io/romio321/romio/adio/common/shfp_fname.c#L32

A workaround is to pass --mca io ompio to mpiexec. We can also modify Grid/qcd/action/pseudofermion/OneFlavourEvenOddRationalRatio.h etc. to use rand_r() instead of rand() (or to broadcast the result of rand() from the master rank).

paboyle commented 3 years ago

Hi - I committed a patch to broadcast the result of rand(). Sorry about that.