Open paulromano opened 7 years ago
@liangjg pointed out to me that the problem is a little more general. If we try to broadcast a tally results array with more than 2^31 elements, the call to `MPI_BCAST` will fail because the `count` argument is expected to be an `int`. It looks like one reasonable approach is to create a new contiguous type that aggregates multiple results such that the count is reduced below 2^31. I'll take a shot at implementing this.
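To make the count arithmetic concrete, here is a minimal sketch of how the block layout for such a contiguous type could be planned. It does not call MPI; `plan_blocks` and `block_len` are hypothetical names used only for illustration, and it assumes a 32-bit signed `int`. In an actual implementation the block type would presumably be built with `MPI_Type_contiguous` and `MPI_Type_commit`, broadcast with a count of `n_blocks`, and followed by a second broadcast for the leftover elements.

```python
INT_MAX = 2**31 - 1  # largest value of a 32-bit signed C int (assumed width)


def plan_blocks(n_elements, block_len):
    """Split n_elements into full blocks of block_len plus a remainder.

    Returns (n_blocks, remainder). n_blocks would be the count passed to
    the broadcast of the contiguous block type; remainder is the number of
    trailing elements that need a separate broadcast of the base type.
    """
    if n_elements < 0 or block_len <= 0:
        raise ValueError("n_elements must be >= 0 and block_len must be > 0")
    if block_len > INT_MAX:
        # The block length is itself passed as an int when building the type.
        raise OverflowError("block_len does not fit in an int")

    n_blocks, remainder = divmod(n_elements, block_len)
    if n_blocks > INT_MAX:
        raise OverflowError("block_len is too small: n_blocks exceeds INT_MAX")
    return n_blocks, remainder


# An array slightly larger than 2^31 elements, grouped into blocks of 1024.
n_blocks, remainder = plan_blocks(2**31 + 5, 1024)
print(n_blocks, remainder)
```

Both returned values stay within `int` range here, whereas the raw element count does not. Choosing the block length to match a natural unit of the results array keeps the remainder at zero.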
I've confirmed that the original bug in MPICH which prevented broadcasting more than 2 GB of data is now fixed. However, a tally with more filter combinations than can be represented with an `int` will still fail. One way we could get around this is to aggregate all results for a given filter into a contiguous type to reduce the count below 2^31.
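The two limits are distinct, which a quick calculation shows. The sketch below assumes results are stored as 8-byte doubles (an assumption, not stated in this thread) and uses hypothetical helper names; it only illustrates that a 2 GB buffer holds far fewer elements than the `int` count limit.

```python
INT_MAX = 2**31 - 1   # largest value of a 32-bit signed C int (assumed width)
BYTES_PER_DOUBLE = 8  # assumption: results are stored as 8-byte doubles


def elements_in(n_bytes):
    """Number of whole doubles that fit in n_bytes."""
    return n_bytes // BYTES_PER_DOUBLE


def fits_in_int_count(n_elements):
    """True if n_elements can be passed as an int count argument."""
    return n_elements <= INT_MAX


two_gib = 2 * 1024**3
n = elements_in(two_gib)              # 2**28 elements
print(n, fits_in_int_count(n))        # count is well below the int limit
print(fits_in_int_count(2**31))       # an array of 2^31 elements is not
```

Under these assumptions, the byte-size bug could be hit with about 2^28 elements, while the count limit only matters once the array reaches 2^31 elements (16 GiB of doubles).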
Some users have noticed that when running a model with very large tallies (using the current develop branch), they get an error at the end of the simulation when tally results are broadcast:
After investigating a little bit, this appears to be caused by a bug in MPICH. Not much we can do other than wait for MPICH to be fixed (there is a PR proposing a fix), or tell users to use OpenMPI if they have large tallies.
Note that this only occurs on the develop branch (not on version 0.9.0). Broadcasting of tally results was added in #903.