Bug: Can't restrict parallel output to image 1

In our parallel version of Zgoubi, we can restrict each image to working on those particles it “owns”, and we can gather those results correctly to image 1. The output, however, is problematic:

All Images Write: We adjust filenames so that image N writes to zgoubi.res_imageN, and that works correctly. Image 1 writes the full correct data for the simulation.

Only Image 1 Writes: Code crashes, reporting

Fatal error in PMPI_Reduce: Message truncated, error stack:
PMPI_Reduce(1258)..............: MPI_Reduce(sbuf=0x1f146b08, rbuf=(nil), count=40000, dtype=0x4c000829, MPI_SUM, root=0, comm=0x84000002) failed
MPIR_Reduce_impl(1070).........: 
MPIR_Reduce_intra(843).........: 
MPIR_Reduce_impl(1070).........: 
MPIR_Reduce_intra(868).........: 
MPIR_Reduce_redscat_gather(469): 
do_cts(548)....................: Message truncated; 280000 bytes received but buffer size is 160000
Fatal error in PMPI_Reduce: Other MPI error, error stack:
PMPI_Reduce(1258)..............: MPI_Reduce(sbuf=MPI_IN_PLACE, rbuf=0x5a26a0, count=70000, dtype=0x4c000829, MPI_SUM, root=0, comm=0x84000004) failed
MPIR_Reduce_impl(1070).........: 
MPIR_Reduce_intra(843).........: 
MPIR_Reduce_impl(1070).........: 
MPIR_Reduce_intra(868).........: 
MPIR_Reduce_redscat_gather(648): Failure during collective
Error: Command:
`/usr/lib64/mpich/bin/mpiexec -n 2 ../../zgoubi -in ePSR.dat`
failed to run.

OpenCoarrays uses MPI under the hood (in this case it's mpich, I think). Hence the error message Fatal error in PMPI_Reduce suggests that the gather (implemented using co_sum) is problematic. I tried instrumenting the code to see which of the gather's is crashing, but in spite of the fact that I put a flush after each of my write statements, I don't see all the reports I ought to.

Zaak built Zgoubi with array-bounds-checking turned on. He reports that Zgoubi is clean: No array-bounds violations.

At the moment, Damian and I suspect a compiler bug: but absent a M(n)WE, we can't tell for sure. Damian has tested this under gfortran 8.3.1, 9.2.0, and trunk—all with similar results. Damian has communicated this puzzle to the GFortran developers: fortran@gcc.gnu.org.

One next step is trying a completely different compiler—say, intel and/or NAG. Not sure what else to do at this point, so happy to entertain suggestions.

@dtabell good summary above. Thanks for closing this. I confirm that MPICH is the default MPI used with OpenCoarrays, but other MPI implementations work also. For example, if you use homebrew to install OpenCoarrays on macOS, the resulting installation will use OpenMPI.

FWIW, switching compilers requires modifying the cmake scripts. In the case of NAG, it also requires modifying the code to work around two of NAG's missing features: submodule and co_sum. Eliminating submodule should be trivial. To replace co_sum, you might borrow the co_sum emulators that I wrote for working with Intel compiler versions before the current version 19.1, which now supports co_sum. The won't be as fast as the OpenCoarrays co_sum, but they have been tested considerably and should work.

Also, replacing MPI all-together might well be trivial because the ability to use a range of parallel programming models was one of the main motivations behind OpenCoarrays. If it works out as I suspect, not even a single line of zgoubi would have to be rewritten to swap OpenSHMEM in for MPI. Even better, the unmodified Fortran source doesn't even have to be recompiled. The object files can just be relinked to the OpenCoarrays OpenSHMEM library instead of our MPI library. We did so for the OpenSHMEM results in a 2017 paper and it was a life-saver because of problems with the MPI implementation we were using. And it just so happens that the person who wrote the OpenSHMEM library works across the street from you!

Compiling with NAG is another way to eliminate MPI. NAG uses multithreading (presumably POSIX threads) instead of MPI so there will be zero chance of an MPI error in an executable program compiled with NAG.

I hope these ideas are helpful.

radiasoft / zgoubi

Bug: Can't restrict parallel output to image 1 #80