Open AndrewGaspar opened 3 years ago
Also pinging @jmstone here regarding the use of the physics id piece in Athena++
With respect to mitigation, how about removing the physics ID entirely and using different communicators for different physics? I'm still concerned that the tag limit we are guaranteed to work with is only 32k (especially for global block IDs). Even the larger, non-standard tag limit of 2^21-1 would cap the total number of blocks at 32k (with 6 bits used for the buffer ID), which is very much in range of, or even below, the targeted application scale.
Also tagging @felker as he actually left a note on that in Athena++ :smile:
```c++
//! Note, the MPI standard requires signed int tag,
//! with MPI_TAG_UB >= 2^15-1 = 32,767 (inclusive)
//!
//! \todo (felker):
//! - Consider adding safety check: if (tag > MPI_TAG_UB) ATHENA_ERROR()
```
I remember talking to Jeff Hammond about this topic and his BigMPI project after going down this MPI rabbit hole via Athena++.
Cray MPICH is of course the oddball here with such a small MPI_TAG_UB, but I thought @jeffhammond mentioned that this limit was being increased. But the minimum imposed by the MPI standard will never change.
@tomidakn @jmstone and I had a conversation at the time for choosing up to 32 physics tags. We were already using around 10 with hydro vars, hydro flux, shearing box variables and fluxes, face field variables and fluxes, and scalar variables and fluxes. We estimated the future needs for radiation variables, etc.
> Cray MPICH is of course the oddball here with such a small MPI_TAG_UB, but I thought @jeffhammond mentioned that this limit was being increased. But the minimum imposed by the MPI standard will never change.
According to TACC support "Intel restricted the max value of a tag to 1048576 in impi 19 versions", which would make this potentially an issue beyond Cray MPICH.
https://cug.org/5-publications/proceedings_attendee_lists/CUG09CD/S09_Proceedings/pages/authors/01-5Monday/3C-Pagel/pagel-paper.pdf explains why Cray has the limit it does.
You need to figure out how to use tag bits more efficiently. It's not practical to assume more than MPI_TAG_UB. In my experience, applications that need more than 32Ki tags are using a bad design.
@jeffhammond what alternatives for encoding metadata that identifies the appropriate recv buffer would you suggest, in general?
The need to encode so many bits of destination block and buffer metadata in the MPI tag is a consequence of the original Athena++ MeshBlock parallelism design. During the halo exchanges, there can be many point-to-point messages in flight between two ranks at the same time, i.e. (from what I recall) there is no strict message ordering from which the block/buffer info could be implicitly derived, even though the MPI_Test attempts are made in neighbor order during the receive task on each MeshBlock.
You cannot store this metadata inside the message without forcing an additional deep copy from a temporary receive buffer used to inspect the block and neighbor buffer metadata. Maybe the 6 buffer-ID bits could be moved from the tag to the message contents, and the deep copy avoided if the recv buffers for each MeshBlock were unified, at the cost of a short search during the unpack?
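To make the trade-off concrete, here is a minimal sketch of carrying the routing metadata in the message body instead of the tag. The `MsgHeader` struct, field names, and pack/peek helpers are hypothetical illustrations, not Parthenon code; the point is only that the receiver can read a fixed-size header from a staging buffer and then dispatch the payload.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical header carrying the metadata that would otherwise be
// packed into the MPI tag.
struct MsgHeader {
  std::int32_t block_gid;  // destination MeshBlock global id
  std::int32_t buffer_id;  // neighbor buffer index (formerly 6 tag bits)
};

// Prepend the header to the payload; the receiver would MPI_Recv the
// whole message into a shared staging buffer and route it afterwards.
std::vector<char> PackMessage(const MsgHeader &h,
                              const std::vector<double> &payload) {
  std::vector<char> msg(sizeof(MsgHeader) + payload.size() * sizeof(double));
  std::memcpy(msg.data(), &h, sizeof(MsgHeader));
  std::memcpy(msg.data() + sizeof(MsgHeader), payload.data(),
              payload.size() * sizeof(double));
  return msg;
}

// Read back just the header to decide which recv buffer to unpack into.
MsgHeader PeekHeader(const std::vector<char> &msg) {
  MsgHeader h;
  std::memcpy(&h, msg.data(), sizeof(MsgHeader));
  return h;
}
```

This is exactly the extra copy the comment above describes: the payload lands in a temporary buffer first, and only after `PeekHeader` can it be moved to its final destination.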
Setting up many sub-communicators is an option, but the maximum number of MPI communicators is implementation-defined, and was only 8189 in https://www.mcs.anl.gov/uploads/cels/papers/P1621.pdf. So obviously you couldn't have a communicator per MeshBlock with all its neighbors, or anything that extreme.
As @pgrete noted, this has only become a problem when using GPUs, since only 1 MPI rank per GPU is used, and each rank requires many MeshBlocks to achieve good efficiency. Reducing the physics ID down to 1 or 2 bits by unifying all cell-centered, face-centered, etc. boundary buffers will probably forestall the impact of the MPI_TAG_UB on problems of interest (until the next generation of devices, perhaps).
Remember that the limit on communicators is per process, not global. Every process can create a communicator that includes all of its neighbors and that counts as one communicator against the limit of 8000+.
Edit: it might require O(1) communicators, i.e. max(number of neighbors) in the worst case. Ideally, you would figure out how to create batches of non-overlapping cliques for your calls to MPI_Comm_split, but in the worst case, recall that the output communicator is MPI_COMM_NULL when the color is MPI_UNDEFINED. This means that, in the worst case, you can create a communicator for every process and its neighbors by doing the following:
```
for all processes p:                          // collective loop; every rank participates
    neighborhood = get list of p's neighbors  // known on every rank
    color = (me in neighborhood) ? 0 : MPI_UNDEFINED
    MPI_Comm_split(comm, color, key, &newcomm)  // newcomm is MPI_COMM_NULL where color is MPI_UNDEFINED
end
```
This is going to do O(np) allgathers, which is not ideal but it's better than nothing.
It might be possible to add the metadata that encodes the MeshBlock gid, buffer, and variables for which the corresponding data is intended directly to the message itself instead of trying to encode it in the tag. Just a thought.
@jeffhammond thanks for the insight! It would require a fairly invasive set of changes to the current code, but will be a good option if the reduction of the physics ID is insufficient long term.
@jmstone but then the MPI_Recv_init() calls would not know which MeshBlock and neighbor recv buffer to use until the message is recv'd.
Reviving this issue given the discussion on the parthenon call today.
Is there some reason not to create a duplicate communicator of MPI_COMM_WORLD for each field that has MetaData::FillGhost and use the tag for just identifying the lids of the sending and receiving processes? It seems like the communicator limit is probably not too much of a limitation in this case. I am not aware of any use case that would have ~8000 separate fields.
> Is there some reason not to create a duplicate communicator of MPI_COMM_WORLD for each field that has MetaData::FillGhost and use the tag for just identifying the lids of the sending and receiving processes? It seems like the communicator limit is probably not too much of a limitation in this case. I am not aware of any use case that would have ~8000 separate fields.
I like this idea.
I am just not sure if that has any performance implications. Running some simple tests suggests that it is almost exactly the same, but the tests were pretty idealized.
We just got bit by this issue with the "advection_performance" test running on 4 ranks (yeah, I know, we're not supposed to do that) on Trinitite using cray-mpich: https://github.com/lanl/SNAP/issues/6
Our current communication tags are structured as:
So 5 bits (range [0,31]) for physics ID, 6 bits (range [0,63]) for buffer ID, and 21 bits (range [0,2097151]) for the block ID. So we should have room in our tag for over 2 million blocks per rank, which seems reasonable enough, right?
Unfortunately, with our current tag logic, where the block ID is shifted up by 11 bits, cray-mpich's tag limit leaves us no more than 2^10 (1024) blocks per rank.
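A minimal arithmetic sketch of the layout just described, showing where the 1024-block ceiling comes from. The `MakeTag` helper is illustrative (the shift amounts are taken from the bit counts above: physics ID in the low 5 bits, buffer ID in the next 6, block ID above those).

```cpp
// Illustrative packing of the tag layout described above.
int MakeTag(int block_id, int buffer_id, int physics_id) {
  return (block_id << 11) | (buffer_id << 5) | physics_id;
}

// cray-mpich's MPI_TAG_UB of 2^21 - 1 leaves only 21 - 11 = 10 bits
// for the block ID, i.e. at most 1024 blocks per rank.
constexpr int kCrayTagUB = (1 << 21) - 1;
```

Note that the largest representable combination (block 1023, buffer 63, physics 31) lands exactly on the 2^21-1 ceiling; block 1024 already overflows it.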
Targeting cray mpich is something that is important to us at LANL. We should brainstorm methods for improving this situation.
A few (brief) ideas off the top of my head:
AdvanceCounterPhysID. But maybe apps do? @pgrete @jdolencecc @jlippuner