parthenon-hpc-lab / parthenon

Parthenon AMR infrastructure
https://parthenon-hpc-lab.github.io/parthenon/

Consider Alternative MPI Communication Tagging Strategies #449

Open AndrewGaspar opened 3 years ago

AndrewGaspar commented 3 years ago

We just got bit by this issue with the "advection_performance" test running on 4 ranks (yeah, I know, we're not supposed to do that) on Trinitite using cray-mpich: https://github.com/lanl/SNAP/issues/6

Our current communication tags are structured as:

32|----block ID----|11|----buffer ID----|5|----physics ID----|0

So 5 bits (range [0,31]) for physics ID, 6 bits (range [0,63]) for buffer ID, and 21 bits (range [0,2097151]) for the block ID. So we should have room in our tag for over 2 million blocks per rank, which seems reasonable enough, right?

Unfortunately, cray-mpich's smaller tag limit, combined with our current tag logic where the block ID is shifted up by 11 bits, limits us to no more than 2^10 (1024) blocks per rank.

Targeting cray mpich is something that is important to us at LANL. We should brainstorm methods for improving this situation.

A few (brief) ideas off the top of my head:

cc @jlippuner

pgrete commented 3 years ago

Also pinging @jmstone here regarding the use of the physics id piece in Athena++

pgrete commented 3 years ago

With respect to mitigation, how about removing the physics ID entirely and using different communicators for different physics? I'm still concerned that the tag limit we are guaranteed to work with is only 32k (especially for global block IDs). Similarly, even the larger, non-standard tag limit of 2^21-1 would limit the total number of blocks to 32k (using 6 bits for the buffer ID), which is very much in range of (or even below) the targeted application scale.

pgrete commented 3 years ago

Also tagging @felker as he actually left a note on that in Athena++ :smile:

```cpp
//! Note, the MPI standard requires signed int tag,
//! with MPI_TAG_UB >= 2^15-1 = 32,767 (inclusive)
//!
//! \todo (felker):
//! - Consider adding safety check: if (tag > MPI_TAG_UB) ATHENA_ERROR()
```
felker commented 3 years ago

I remember talking to Jeff Hammond about this topic and his BigMPI project after going down this MPI rabbit hole via Athena++.

Cray MPICH is of course the oddball here with such a small MPI_TAG_UB, but I thought @jeffhammond mentioned that this limit was being increased. But the minimum imposed by the MPI standard will never change.

@tomidakn @jmstone and I had a conversation at the time for choosing up to 32 physics tags. We were already using around 10 with hydro vars, hydro flux, shearing box variables and fluxes, face field variables and fluxes, and scalar variables and fluxes. We estimated the future needs for radiation variables, etc.

pgrete commented 3 years ago

> Cray MPICH is of course the oddball here with such a small MPI_TAG_UB, but I thought @jeffhammond mentioned that this limit was being increased. But the minimum imposed by the MPI standard will never change.

According to TACC support "Intel restricted the max value of a tag to 1048576 in impi 19 versions", which would make this potentially an issue beyond Cray MPICH.

jeffhammond commented 3 years ago

https://cug.org/5-publications/proceedings_attendee_lists/CUG09CD/S09_Proceedings/pages/authors/01-5Monday/3C-Pagel/pagel-paper.pdf explains why Cray has the limit it does.

You need to figure out how to use tag bits more efficiently. It's not practical to assume more than MPI_TAG_UB. In my experience, applications that need more than 32Ki tags are using a bad design.

felker commented 3 years ago

@jeffhammond what alternatives for encoding metadata that identifies the appropriate recv buffer would you suggest, in general?

felker commented 3 years ago

The need to encode so many bits of destination block and buffer metadata in the MPI tag is a consequence of the original Athena++ MeshBlock parallelism design. During the halo exchanges, there can be many point-to-point messages between two ranks posted at the same time, i.e. (from what I recall) there is no strict ordering from which the block/buffer info can be implicitly derived, although the MPI_Test attempts are made in neighbor ordering during the receive task on each MeshBlock.

You cannot store this metadata inside the message without forcing an additional deep copy from a temporary receive buffer, which is used for inspecting the block and neighbor buffer metadata. Maybe the buffer ID's 6 bits could be moved from the tag to the message contents, and a deep copy avoided if the recv buffers for each MeshBlock were unified, at the cost of a short search during the unpack?

Setting up many sub-communicators is an option, but the maximum number of MPI communicators is implementation-defined, and was only 8189 in https://www.mcs.anl.gov/uploads/cels/papers/P1621.pdf . So obviously you couldn't have a communicator per MeshBlock with all its neighbors, or something extreme like that.

As @pgrete noted, this has only become a problem when using GPUs, since only 1 MPI rank per GPU is used, and each rank requires many MeshBlocks to achieve good efficiency. And reducing the physics ID down to 1 or 2 bits by unifying all cell-centered, face-centered, etc. boundary buffers will probably forestall the impact of the MPI_TAG_UB on problems of interest (until the next generation of devices, perhaps).

jeffhammond commented 3 years ago

Remember that the limit on communicators is per process, not global. Every process can create a communicator that includes all of its neighbors and that counts as one communicator against the limit of 8000+.

Edit: it might require O(1) communicators, i.e. max(number of neighbors) in the worst case. Ideally, you will figure out how to create batches of nonoverlapping cliques for your calls to MPI_Comm_split but in the worst case, recall that the output communicator is MPI_COMM_NULL when the color is MPI_UNDEFINED. This means -- in the worst case -- you can create a communicator for every process and its neighbors doing the following:

```
for all processes:
    neighborhood = get list of process neighbors
    color = (me in neighborhood) ? 0 : MPI_UNDEFINED
    MPI_Comm_split(.., color, .., newcomm)
end
```

This is going to do O(np) allgathers, which is not ideal but it's better than nothing.

jmstone commented 3 years ago

It might be possible to add the metadata that encodes the MeshBlock gid, buffer, and variables for which the corresponding data is intended directly to the message itself instead of trying to encode it in the tag. Just a thought.

felker commented 3 years ago

@jeffhammond thanks for the insight! It would require a fairly invasive set of changes to the current code, but will be a good option if the reduction of the physics ID is insufficient long term.

@jmstone but then the MPI_Recv_init() calls would not know which MeshBlock and neighbor recv buffer to use until the message is recv'd.

Yurlungur commented 2 years ago

Reviving this issue given the discussion on the parthenon call today.

lroberts36 commented 2 years ago

Is there some reason not to create a duplicate communicator of MPI_COMM_WORLD for each field that has MetaData::FillGhost and use the tag for just identifying the lids of the sending and receiving processes? It seems like the communicator limit is probably not too much of a limitation in this case. I am not aware of any use case that would have ~8000 separate fields.

Yurlungur commented 2 years ago

> Is there some reason not to create a duplicate communicator of MPI_COMM_WORLD for each field that has MetaData::FillGhost and use the tag for just identifying the lids of the sending and receiving processes? It seems like the communicator limit is probably not too much of a limitation in this case. I am not aware of any use case that would have ~8000 separate fields.

I like this idea.

lroberts36 commented 2 years ago

I am just not sure if that has any performance implications. Running some simple tests suggests that it is almost exactly the same, but the tests were pretty idealized.