
RFC: A hierarchical alltoall and alltoallv #12001

lrbison opened 11 months ago

lrbison commented 11 months ago

I'm planning to implement a hierarchical alltoall within the coll/han module. I've collected my thoughts in one place to get relevant feedback from the Open MPI community before submitting the patch. I welcome any thoughts you all may have.

RingLeaders: A hierarchical, scalable, alltoall and alltoallv algorithm

Motivation and Related Work

Modern HPC has seen a dramatic increase in core count. Many researchers have realized the potential to optimize collectives by arranging communication into a topology which recognizes the difference in performance between low-latency shared-memory communicators and higher-latency network communicators. Open MPI has capitalized on this via the HAN module, but HAN currently has no dedicated alltoall algorithm.

Incorporating shared memory into alltoall communications has been proposed before. Xu et al., 2013 proposed SLOAVx, which uses a special data buffer to minimize data reordering while allowing alltoallv, and a tree structure to gather data on-node. This work was tested on a 16-core system.

Prisacari et al., 2013 also explored general hierarchical exchanges, although from the standpoint of networks which may have an inherent topology themselves. They point out that the two most popular exchange patterns are the XOR exchange and the linear shift exchange. XOR exchanges are used in the classic Bruck algorithm, but our approach takes the linear shift exchange, which can be visualized as each rank exchanging with all other ranks in a ring-like order.

We propose a new algorithm called RingLeaders. This algorithm attempts to batch sends into fewer, larger sends. The motivating observation is that an alltoall on four distinct nodes with a message size of 1 MiB takes only 0.3 ms, but it exchanges the same amount of data as an alltoall on the same four nodes with 64 processes per node and 1/64th the message size (16 KiB), and that exchange takes 10 ms. While there is some additional memory movement being done for the 256-rank case, this work should not cause a 20x performance penalty. Instead the author suspects that existing algorithms suffer from latency imbalance, and that the networking hardware and the middleware library can be overloaded handling too many small messages. RingLeaders attempts to be more efficient by sending fewer, larger messages over the network, and by utilizing fast shared-memory exchanges without introducing communication imbalance.

Algorithm Description

Let us start with a simple test case. We assume that we wish to perform an alltoall on $N_l=4$ nodes with $N_f=64$ processes per node ($P=256$) with a message size of $m=1$. Let us also assume we can easily split the communicator into intranode groups, and that each intranode group can elect a leader to perform internode communication on behalf of its intranode group (its "local followers"). The basic algorithm occurs in 3 steps:

  1. Gather data from local followers to each leader
  2. Exchange data between leaders
  3. Scatter data from leader to local followers.

A simple out-of-place implementation is to have the leader allocate two buffers: enough to hold all of its followers' send and receive buffers. This means the node will require $4 N_f P$ memory during the exchange: the send/receive buffers distributed across all ranks, plus the extra send/receive buffers within the leader, while an in-place algorithm would only require $N_f P$. For this simple case, step one is a trivial Gather and step three is a trivial Scatter. For the moment, let us assume the leader data exchange (step two) is a naïve, fully-connected set of sends and receives.
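As a concrete illustration, here is a minimal sketch of this simple out-of-place variant. The `local_comm`/`leader_comm` sub-communicators, the `simple_hier_alltoall` name, and the byte-based counts are all illustrative assumptions, not the proposed coll/han code:

```c
#include <stdlib.h>
#include <mpi.h>

/* Sketch of the simple out-of-place variant (names are illustrative):
 *   comm        - the full communicator (P ranks)
 *   local_comm  - the on-node ranks (size Nf, leader == local rank 0)
 *   leader_comm - one leader per node; MPI_COMM_NULL on followers
 *   m           - bytes sent from each rank to each other rank
 * Packing/unpacking, error handling, and int-overflow checks on the
 * counts are omitted for brevity. */
static int simple_hier_alltoall(const void *sendbuf, void *recvbuf, int m,
                                MPI_Comm comm, MPI_Comm local_comm,
                                MPI_Comm leader_comm)
{
    int P, Nf, local_rank;
    MPI_Comm_size(comm, &P);
    MPI_Comm_size(local_comm, &Nf);
    MPI_Comm_rank(local_comm, &local_rank);
    int block = P * m;                  /* one follower's full buffer */

    char *gbuf = NULL, *xbuf = NULL;
    if (local_rank == 0) {              /* the two extra leader buffers */
        gbuf = malloc((size_t)Nf * block);
        xbuf = malloc((size_t)Nf * block);
    }

    /* Step 1: gather every follower's send buffer to the leader. */
    MPI_Gather(sendbuf, block, MPI_BYTE, gbuf, block, MPI_BYTE,
               0, local_comm);

    if (local_rank == 0) {
        /* (Reorder gbuf here so the Nf*Nf*m bytes bound for each
         * destination node are contiguous.) */

        /* Step 2: naive fully-connected leader exchange. */
        MPI_Alltoall(gbuf, Nf * Nf * m, MPI_BYTE,
                     xbuf, Nf * Nf * m, MPI_BYTE, leader_comm);

        /* (Reorder xbuf here so each follower's P*m receive bytes
         * are contiguous.) */
    }

    /* Step 3: scatter each follower's receive buffer back down. */
    MPI_Scatter(xbuf, block, MPI_BYTE, recvbuf, block, MPI_BYTE,
                0, local_comm);

    if (local_rank == 0) { free(gbuf); free(xbuf); }
    return MPI_SUCCESS;
}
```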

This simple implementation has two shortfalls: (1) holding 4x memory may not be feasible, and (2) our naïve leader exchange doesn't scale efficiently.

We can tackle both of these at once by introducing rounds outside of our three steps. We propose a linear shift exchange, visualized as a ring topology composed of the leaders. The exchange function is symmetric, so that after each exchange no additional buffers are needed on either the leader or the followers. This reduces memory pressure, allows a nearly in-place exchange, and reduces the number of concurrent tasks on the NIC. We also introduce a fan-out parameter, so that we can tune to a good balance between memory consumption and concurrent exchanges on the NIC.

The partner selection function is constructed so that all exchanges are reciprocal. The minimum number of pair-wise exchanges required among the $N_l$ leaders is the size of the off-diagonal upper triangle of an $N_l \times N_l$ matrix: $\frac{N_l^2 - N_l}{2}$. The minimum number of rounds required to execute that many exchanges is $N_l - 1$; however, if we require that each leader have only one partner per round, it is easy to see that $N_l - 1 = 2$ rounds is not sufficient to complete a 3-way alltoall (with an odd number of leaders, one leader must sit out each round). Our partner selection algorithm instead completes all exchanges in $N_l$ rounds, using the spare slot for the leader's own (local) block. The fan-out parameter $f$ is implemented simply by completing $f$ exchanges during a single round, so the number of rounds is $\lceil \frac{N_l}{f} \rceil$. Partner selection for leader $r$ in exchange $j$ is simply: $n = (j - r) \bmod N_l$.
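As a minimal sketch (the `partner` helper is illustrative, not the proposed implementation), the selection rule and its reciprocity look like this in C: if leader $r$ selects $n$ in exchange $j$, then leader $n$ selects $r$ in the same exchange.

```c
/* Partner of leader r in exchange j, among Nl leaders. Reciprocal by
 * construction: partner(j, partner(j, r, Nl), Nl) == r for all j, r.
 * As j runs over 0..Nl-1, each leader meets every leader exactly once,
 * itself included (when j == 2r mod Nl), which is the local-copy slot. */
static inline int partner(int j, int r, int Nl)
{
    return ((j - r) % Nl + Nl) % Nl;    /* C's % may be negative */
}
```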

The choice of $f$ is a tradeoff between memory usage and network utilization. At each round the leader needs only to reserve enough memory to exchange during that round. For an in-place exchange the followers need no additional buffers. Total node memory usage during such an exchange can be reduced to $N_f P (1 + 2\frac{f}{N_l})$.
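Plugging the running example into this expression: with $N_f=64$, $P=256$, $N_l=4$, a fan-out of $f=1$ gives $N_f P (1 + \frac{2}{4}) = 1.5\,N_f P$ per node, versus $4 N_f P$ for the simple out-of-place scheme above; even $f=N_l$ (all exchanges in a single round) needs only $3 N_f P$.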

Additional considerations:

To help network devices directly access memory in multi-round exchanges, the leader should allocate the send/receive buffers from a registered memory pool, or attempt to register them after allocation. This way each round re-uses the registration from the previous round.
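At the MPI level the effect can be approximated as in the sketch below; the real implementation would presumably go through Open MPI's internal registration machinery instead, and `round_bytes`, `num_rounds`, `f`, `Nf`, and `m` are illustrative names:

```c
/* Allocate the per-round exchange buffer once, outside the round loop;
 * the largest round needs roughly 2*f*Nf*Nf*m bytes in this design.
 * Many MPI implementations can return pre-registered memory here. */
MPI_Aint round_bytes = 2 * (MPI_Aint)f * Nf * Nf * m;
char *round_buf;
MPI_Alloc_mem(round_bytes, MPI_INFO_NULL, &round_buf);

for (int round = 0; round < num_rounds; round++) {
    /* ... post isend/irecv out of round_buf; the registration from the
     * previous round is reused rather than re-established ... */
}

MPI_Free_mem(round_buf);
```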

In situations where not all leaders have the same number of followers, the exchange completed by the leaders is effectively an alltoallv exchange, even if the ranks each call alltoall. This situation must be detected, and more generally, each leader must know the ranks of (a) all the other leaders and (b) the followers each leader has. This may introduce an additional (small) network exchange amongst the leaders in both alltoallv and alltoall collectives.
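That bootstrap can be as small as a single allgather on the leader communicator; a sketch, reusing the illustrative `local_comm`/`leader_comm` from above:

```c
/* Each leader learns every other leader's follower count with one small
 * collective. A follower-rank list can be exchanged the same way with
 * MPI_Allgatherv, using nf_per_leader as the receive counts. */
int my_nf, Nl;
MPI_Comm_size(local_comm, &my_nf);
MPI_Comm_size(leader_comm, &Nl);
int *nf_per_leader = malloc(Nl * sizeof(int));
MPI_Allgather(&my_nf, 1, MPI_INT, nf_per_leader, 1, MPI_INT, leader_comm);
```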

Alltoallv support can be added with additional data being passed from followers to the leader, without needing to add extra network exchanges. The followers report the send and receive counts to the leader, who aggregates them before the leader exchange. In this case the leader may need to determine which round will have the maximum send and receive size, and allocate buffers accordingly.
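A sketch of that count aggregation, assuming each follower holds a `sendcounts[]` vector indexed by global destination rank, and a hypothetical `node_of()` helper that maps a global rank to its node index:

```c
/* Followers deliver their per-destination send counts to the leader;
 * this rides the existing gather, adding no extra network exchange. */
int *all_counts = NULL;                 /* Nf x P count matrix, leader only */
if (local_rank == 0)
    all_counts = malloc((size_t)Nf * P * sizeof(int));
MPI_Gather(sendcounts, P, MPI_INT, all_counts, P, MPI_INT, 0, local_comm);

if (local_rank == 0) {
    /* Sum the traffic bound for each destination node; the maximum over
     * nodes determines the largest round and sizes the buffers once. */
    long *to_node = calloc(Nl, sizeof(long));
    for (int i = 0; i < Nf; i++)
        for (int dest = 0; dest < P; dest++)
            to_node[node_of(dest)] += all_counts[(size_t)i * P + dest];
}
```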

Pseudocode:

    execute comm split to find local ranks
    leader is the local rank with the minimum rank id
    if self is leader:
        determine the other leaders
        send list of local followers to the other leaders
        broadcast remote leaders and followers to local followers
        allocate send/recv buffers
        rounds = ceil(Nl / f)
        for each round:
            for each of the f exchanges in this round:
                gather data from followers
                isend to partner
                irecv from partner
            wait for send/receive completions
            scatter received data to followers
        free send/recv buffers
    if self is not leader:
        receive broadcast of remote leaders/followers
        for each round:
            for each of the f exchanges in this round:
                gather data to leader
            receive scattered data from leader
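For reference, the first two lines of the pseudocode map naturally onto standard MPI communicator operations; a sketch (coll/han builds its own sub-communicators internally, so this is only illustrative):

```c
/* Illustrative leader election via standard communicator splits. */
int world_rank;
MPI_Comm_rank(comm, &world_rank);

/* Group the ranks that share a node; keying by world_rank makes the
 * lowest-ranked process on each node local rank 0, i.e. the leader. */
MPI_Comm local_comm;
MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, world_rank,
                    MPI_INFO_NULL, &local_comm);

int local_rank;
MPI_Comm_rank(local_comm, &local_rank);

/* Leaders form leader_comm; everyone else gets MPI_COMM_NULL. */
MPI_Comm leader_comm;
int color = (local_rank == 0) ? 0 : MPI_UNDEFINED;
MPI_Comm_split(comm, color, world_rank, &leader_comm);
```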
lrbison commented 11 months ago

As I'm starting to implement this, I realized that HAN will disqualify itself in cases where ranks are not equally distributed across nodes. This simplifies the logic considerably.

I also realized the "HAN way" is to have an upper and a lower module each implementing part of the collective. However, since alltoall performs only pairwise (or a few concurrent pairwise) exchanges at the upper level, I don't think the upper module is applicable.

lrbison commented 11 months ago

@bosilca Do you have any feedback on this idea?

lrbison commented 11 months ago

I talked to both @bosilca and @qkoziol about this offline. They've both given me some things to think about: