openmc-dev / openmc

OpenMC Monte Carlo Code
https://docs.openmc.org

Broadcasts potentially fail with large tallies #914

Open paulromano opened 7 years ago

paulromano commented 7 years ago

Some users have noticed that when running a model with very large tallies (using the current develop branch), they get an error at the end of the simulation when tally results are broadcast:

Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(1584)........: MPI_Bcast(buf=0x7f0cdedf2010, count=363076560, dtype=0x4c000829, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast_impl(1436)...: 
MPIR_Bcast(1460)........: 
MPIR_Bcast_intra(1241)..: 
MPIR_SMP_Bcast(1085)....: 
MPIR_Bcast_binomial(250): message sizes do not match across processes in the collective routine: Received -32766 but expected -1390354816

After investigating a little bit, this appears to be caused by a bug in MPICH. Not much we can do other than wait for MPICH to be fixed (there is a PR proposing a fix), or tell users to use OpenMPI if they have large tallies.
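The "expected" value in the error stack is consistent with a byte count overflowing a 32-bit signed integer. The sketch below redoes that arithmetic. The count is taken from the MPI_Bcast line above; the 8-byte element size is an assumption (it is not stated in the log), and `wrap_int32` is a hypothetical helper written for illustration.

```python
def wrap_int32(value):
    """Reinterpret an integer as a 32-bit two's-complement signed value."""
    return (value + 2**31) % 2**32 - 2**31


count = 363076560   # count argument shown in the MPI_Bcast error line
elem_size = 8       # assumption: each tally result element is 8 bytes

n_bytes = count * elem_size      # message size in bytes
wrapped = wrap_int32(n_bytes)    # what a 32-bit signed int would hold

print(n_bytes)   # 2904612480, which is larger than 2**31 - 1
print(wrapped)   # -1390354816, the "expected" value in the error stack
```

Under that assumption the message is about 2.9 GB, which exceeds what a signed 32-bit byte count can represent.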

Note that this only occurs on the develop branch (not on version 0.9.0). Broadcasting of tally results was added in #903.

paulromano commented 6 years ago

@liangjg pointed out to me that the problem is a little more general. If we try to broadcast a tally results array with 2^31 or more elements, the call to MPI_BCAST will fail because the count argument is expected to be an int. One reasonable approach is to create a new contiguous type that aggregates multiple results so that the count is reduced below 2^31. I'll take a shot at implementing this.
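A minimal sketch of the count arithmetic behind that approach, not the actual implementation. It assumes a 32-bit int and a derived type built with MPI_Type_contiguous that bundles `block_len` elements; `plan_broadcast` and `block_len` are hypothetical names, and no MPI calls are made here.

```python
INT_MAX = 2**31 - 1  # largest count a 32-bit C int can hold (assumed width)


def plan_broadcast(n_elements, block_len):
    """Split n_elements into whole blocks of block_len plus a remainder.

    Hypothetical helper: block_len stands for the number of elements bundled
    into one contiguous derived type. The blocks would go out in one broadcast
    with count=n_blocks, and any leftover elements in a second broadcast with
    count=remainder using the base element type.
    """
    if not 1 <= block_len <= INT_MAX:
        raise ValueError("block_len must fit in a positive int")
    n_blocks, remainder = divmod(n_elements, block_len)
    if n_blocks > INT_MAX:
        raise ValueError("block_len too small: block count overflows int")
    return n_blocks, remainder


n = 3 * 2**31 + 5  # an element count that does not fit in an int
n_blocks, remainder = plan_broadcast(n, block_len=2**20)
print(n_blocks, remainder)  # 6144 5
```

Both counts now fit in an int, and `n_blocks * block_len + remainder` recovers the original element count.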

paulromano commented 4 years ago

I've confirmed that the original bug in MPICH which prevented broadcasting more than 2 GB of data is now fixed. However, a tally with more filter combinations than can be represented with an int will still fail. One way we could get around this is to aggregate all results for a given filter into a contiguous type to reduce the count below 2^31.