The current status here is: I have implemented nonblocking collective support for summa and the base case of cholinv. The base case of cholinv is not correct for num_chunks>1, but I am also not sure whether to invest time in fixing this bug, because the performance might just be worse with more than one chunk. This relates to the question of whether a memory-bandwidth-bound code segment can be effectively overlapped with interprocessor communication.
I'm currently thinking that the summa calls in cholinv should be encapsulated inside the policy classes, and each should return an MPI_Request if necessary.
We first need to identify the places where initiations need to take place, and then define a method inside the policy classes for each of them; for policies where a given initiation does not apply, the method will just be an empty, zero-cost implementation. A rough sketch of this interface is below.
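As a minimal sketch of the shape I have in mind (the names NoOverlapPolicy, OverlapPolicy, and initiate_summa_allgather are hypothetical, just to illustrate the idea; the real initiation points and argument lists will differ):

```cpp
#include <mpi.h>

// Hypothetical sketch: one initiation method per overlap point. A policy for
// which a given initiation point does not apply provides an empty, zero-cost
// implementation that returns MPI_REQUEST_NULL (waiting on MPI_REQUEST_NULL
// is free, so call sites can stay uniform).
struct NoOverlapPolicy {
  static MPI_Request initiate_summa_allgather(const double*, double*, int, MPI_Comm) {
    // Nothing is initiated here; the blocking collective runs at its
    // original place in the algorithm.
    return MPI_REQUEST_NULL;
  }
};

struct OverlapPolicy {
  static MPI_Request initiate_summa_allgather(const double* sendbuf, double* recvbuf,
                                              int count, MPI_Comm comm) {
    // Start the collective early; the caller overlaps independent work and
    // waits on the returned request before touching the gathered data.
    MPI_Request req;
    MPI_Iallgather(sendbuf, count, MPI_DOUBLE,
                   recvbuf, count, MPI_DOUBLE, comm, &req);
    return req;
  }
};
```

A call site templated on the policy would then call `Policy::initiate_summa_allgather(...)`, do whatever independent work it can, and `MPI_Wait` on the returned request before using the gathered data.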
I first want to note that we see significant speedups by using num_chunks=2 in cholinv for an 8192x8192 matrix (from 1.25 to 1.05).
Notice that there is an overlap opportunity even if we do not set complete_inv, if we use Cholesky factorization as a subroutine in cacqr. I'm not sure how we could support this nicely though, because the Cholesky factorization shouldn't really know that it was called by a higher-level algorithm.
No benefit. Abandon.
Here are a few scenarios that I'd like to try that leverage pipelined nonblocking collectives in an effort to increase network utilization and avoid synchronization overhead:
1) Summa with a 3d schedule: I'd like to try the following approaches:
2) In the base case of cholinv, I'd like to pipeline the MPI_Allgather along with the computation of moving the data to the right place. Those loops are nasty, and I think this might yield a good benefit if the code is carefully written so as to maintain correctness.
3) In general, if cholinv is at recursion depth k, I'd like to try a few pipeline depth progressions:
4) I want to try pipelining distribute_allgather in matmult::summa, because I think it has more overlap potential than the bcast variant: it has reshuffling work to do after the MPI_Allreduce that can be done step by step as it's waiting on the MPI_Allreduces to complete. This, combined with its 2x smaller number of bytes communicated, might yield a faster primitive. I need to compare both correctness and scalability against the non-pipelined bcast variant. And then what about forcing the Bcasts to take place on-node (ranks contiguous), and the MPI_Allreduce to take place non-contiguously? Would that improve the benefit of the pipeline, because the synchronizations would take longer?
5) I think there might be more general questions about using nonblocking collectives to overlap things in cholinv, but this only makes sense if the primitives discussed above show promise.

As for software infrastructure, I'd like each of the topology classes to take a default argument for the pipeline depth. The default will be 0, which indicates that no pipelining is to occur and blocking collectives are to be used. A positive value k indicates that the message should be chopped into k contiguous chunks (checking for the size not being evenly divisible) and a pipeline of length k used. A sketch of this chunked-pipeline mechanism is below.
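To make the pipelining concrete, here is a minimal sketch of the chunked mechanism, using the summa allreduce from item 4 as the example (the function name pipelined_allreduce, the reshuffle callback, and the exact argument list are hypothetical; the real version would live behind the topology classes). Pipeline depth 0 falls back to the blocking collective; depth k posts one nonblocking collective per contiguous chunk, with the last chunk absorbing the remainder when the count is not evenly divisible, and overlaps the per-chunk reshuffling work with the still-in-flight chunks.

```cpp
#include <mpi.h>
#include <vector>
#include <functional>

// Hypothetical sketch of a pipelined allreduce primitive. pipeline_depth == 0
// means no pipelining: one blocking MPI_Allreduce followed by the reshuffle.
// pipeline_depth == k chops the buffer into k contiguous chunks, posts one
// MPI_Iallreduce per chunk (all ranks post them in the same order, as MPI
// requires), and performs each chunk's reshuffle as it completes, overlapping
// that work with the reductions still on the network.
void pipelined_allreduce(double* buf, int count, MPI_Comm comm,
                         int pipeline_depth,
                         const std::function<void(int offset, int len)>& reshuffle) {
  if (pipeline_depth <= 0) {
    MPI_Allreduce(MPI_IN_PLACE, buf, count, MPI_DOUBLE, MPI_SUM, comm);
    reshuffle(0, count);  // one pass over the whole buffer
    return;
  }
  const int k = pipeline_depth;
  const int base = count / k;
  std::vector<MPI_Request> reqs(k);
  std::vector<int> offsets(k), lens(k);
  for (int c = 0; c < k; ++c) {
    offsets[c] = c * base;
    lens[c] = (c == k - 1) ? count - offsets[c] : base;  // remainder goes to last chunk
    MPI_Iallreduce(MPI_IN_PLACE, buf + offsets[c], lens[c], MPI_DOUBLE,
                   MPI_SUM, comm, &reqs[c]);
  }
  // Overlap: as each chunk's reduction completes, do its reshuffle work while
  // the remaining chunks are still in flight.
  for (int done = 0; done < k; ++done) {
    int idx;
    MPI_Waitany(k, reqs.data(), &idx, MPI_STATUS_IGNORE);
    reshuffle(offsets[idx], lens[idx]);
  }
}
```

The same chunked pattern should apply to the MPI_Allgather pipeline in the cholinv base case from item 2, except that the per-chunk receive layout (and therefore the data-movement loops) is more involved there.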