huttered40 / capital

Distributed-memory implementations of novel Cholesky and QR matrix factorizations

Experiment with nonblocking collectives #22

Closed huttered40 closed 4 years ago

huttered40 commented 4 years ago

Here are a few scenarios that I'd like to try that leverage pipelined nonblocking collectives in an effort to increase network utilization and avoid synchronization overhead:

1) SUMMA with a 3D schedule: I'd like to try the following approaches:

2) In the base case of cholinv, I'd like to pipeline the MPI_Allgather with the computation that moves the data into the right place (see the sketch after this list). Those loops are nasty, and I think this might yield a good benefit if the code is written carefully enough to maintain correctness.

3) In general, if cholinv is at recursion depth k, I'd like to try a few pipeline depth progressions:

4) I want to try pipelining distribute_allgather in matmult::summa, because I think it has more overlap potential than the other variant: it has reshuffling work to do after the MPI_Allreduce that can be done chunk by chunk while it is waiting on the remaining MPI_Allreduces to complete (again, see the sketch after this list). Combined with its 2x smaller number of bytes communicated, this might yield a faster primitive. I need to compare both correctness and scalability against the non-pipelined bcast variant. And then what about forcing the Bcasts to take place on-node (contiguous ranks) and the MPI_Allreduce to take place non-contiguously? Would that improve the benefit of the pipeline, because the synchronizations would take longer?

5) There are probably more general questions about using nonblocking collectives to overlap work in cholinv, but those only make sense to pursue if the primitives discussed above show promise.

  1. Try to overlap all reductions as follows: when first calculating L_21, we can overlap the broadcast of A_11 with the point-to-point communication for the transpose of L_11^-1, and then overlap the broadcast of the communicated piece of L_11^-1. After the local computation, we initiate the AllReduce while also initiating the Bcast of L_11^-1. Then, when we complete the AllReduce, we also initiate the broadcast of L_21 before we recurse. This ensures that this communication routine stays active as we recurse; whether that is for better or for worse, I am not sure yet. Finally, we perform the AllReduce at the end of the recursive level to attain L_21^-1, which will be overlapped with other communication at all but the last recursive level.
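
To make items 2 and 4 concrete, here is a rough sketch of the chunked MPI_Iallgather plus overlapped-reshuffle pattern I have in mind. This is illustrative only; the function name, staging buffer, and layout bookkeeping are placeholders, not the code in this repo:

```cpp
#include <mpi.h>
#include <vector>

// Pipeline an allgather over `num_chunks` contiguous chunks and overlap the
// data movement into the standard allgather layout with the remaining
// communication. Hypothetical sketch, not the routine in capital.
void pipelined_allgather(const double* sendbuf, int sendcount, double* recvbuf,
                         int num_chunks, MPI_Comm comm) {
  int p; MPI_Comm_size(comm, &p);
  if (num_chunks <= 1) {  // no pipelining requested: fall back to the blocking collective
    MPI_Allgather(sendbuf, sendcount, MPI_DOUBLE,
                  recvbuf, sendcount, MPI_DOUBLE, comm);
    return;
  }
  int chunk = sendcount / num_chunks;
  int rem   = sendcount - chunk * num_chunks;  // last chunk absorbs the remainder
  std::vector<double> staging((size_t)sendcount * p);
  std::vector<MPI_Request> reqs(num_chunks);
  // Initiate one nonblocking allgather per chunk.
  for (int c = 0; c < num_chunks; ++c) {
    int off = c * chunk;
    int len = (c == num_chunks - 1) ? chunk + rem : chunk;
    MPI_Iallgather(sendbuf + off, len, MPI_DOUBLE,
                   staging.data() + (size_t)off * p, len, MPI_DOUBLE,
                   comm, &reqs[c]);
  }
  // As each chunk arrives, do its local data movement while the other
  // chunks are still in flight.
  for (int done = 0; done < num_chunks; ++done) {
    int c;
    MPI_Waitany(num_chunks, reqs.data(), &c, MPI_STATUS_IGNORE);
    int off = c * chunk;
    int len = (c == num_chunks - 1) ? chunk + rem : chunk;
    // Move this chunk into the layout a single MPI_Allgather would produce.
    for (int r = 0; r < p; ++r)
      for (int i = 0; i < len; ++i)
        recvbuf[(size_t)r * sendcount + off + i] =
            staging[(size_t)off * p + (size_t)r * len + i];
  }
}
```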

As for software infrastructure, I'd like each of the topology classes to take a default argument for the pipeline depth. The default will be 0, which indicates that no pipelining is to occur and blocking collectives are to be used. A positive value k indicates that the message should be chopped into k contiguous chunks (taking care of the case where the size is not evenly divisible) and that a pipeline of depth k should be used.
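
Roughly, the interface I'm picturing looks like the following (the names here are placeholders, not the actual classes in this repo):

```cpp
#include <mpi.h>
#include <utility>
#include <vector>

// Hypothetical topology wrapper: the pipeline depth defaults to 0,
// meaning "use blocking collectives".
struct topology {
  MPI_Comm comm;
  int num_chunks;  // 0 => blocking collectives, k > 0 => pipeline of depth k
  explicit topology(MPI_Comm c, int chunks = 0) : comm(c), num_chunks(chunks) {}
};

// Split `count` elements into k contiguous (offset, length) chunks; the last
// chunk absorbs the remainder when count is not evenly divisible by k.
std::vector<std::pair<int,int>> chunk_offsets(int count, int k) {
  std::vector<std::pair<int,int>> out;
  int base = count / k, rem = count % k;
  for (int c = 0, off = 0; c < k; ++c) {
    int len = base + (c == k - 1 ? rem : 0);
    out.emplace_back(off, len);
    off += len;
  }
  return out;
}
```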

huttered40 commented 4 years ago

The current status here is: I have implemented nonblocking collective support for summa and the base case of cholinv. The base case of cholinv is not correct for num_chunks>1, but I am also not sure whether to invest time in fixing this bug, because the performance might simply be worse with more than one chunk. This comes back to the question of whether a memory bandwidth-bound code segment can be effectively overlapped with interprocessor communication.

huttered40 commented 4 years ago

I'm currently thinking that the summa calls in cholinv should be encapsulated inside the policy classes, and each should return an MPI_Request if necessary.

We first need to identify the places where initiations need to take place, and then define a method inside the policy classes for each of them; the methods that do not apply to a given policy will just be empty, zero-cost implementations.
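
Something along these lines is what I have in mind (hypothetical names, not the existing policy classes in this repo):

```cpp
#include <mpi.h>

// Blocking policy: nothing to initiate, nothing to wait on, so the stubs
// are empty and cost nothing.
struct NoOverlapPolicy {
  MPI_Request initiate_summa_allgather(const double*, int, double*, MPI_Comm) {
    return MPI_REQUEST_NULL;                     // empty, zero-cost stub
  }
  void complete(MPI_Request&) {}                 // nothing outstanding
};

// Nonblocking policy: start the collective and hand the request back to the
// caller so the enclosing algorithm can overlap it with other work.
struct OverlapPolicy {
  MPI_Request initiate_summa_allgather(const double* send, int count,
                                       double* recv, MPI_Comm comm) {
    MPI_Request req;
    MPI_Iallgather(send, count, MPI_DOUBLE, recv, count, MPI_DOUBLE, comm, &req);
    return req;
  }
  void complete(MPI_Request& req) { MPI_Wait(&req, MPI_STATUS_IGNORE); }
};
```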

huttered40 commented 4 years ago

I want to first note that we see significant speedups by using num_chunks=2 in cholinv for an 8192x8192 matrix (from 1.25 to 1.05).

huttered40 commented 4 years ago

Note that there is an overlap opportunity even when complete_inv is not set, if we use the Cholesky factorization as a subroutine in cacqr. I'm not sure how we could support this nicely, though, because the Cholesky factorization shouldn't really know that it was called by a higher-level algorithm.

huttered40 commented 4 years ago

No benefit. Abandon.