huttered40 closed this issue 4 years ago.
Note that there are also two other `invoke` methods that take pointers. What are these used for again? Can I also merge these?
I commented these two methods out, and apparently they are only called by `diaginvert` as a convenience. We need to think of a way for `diaginvert` to call the regular `summa::invoke` methods.
See #21 for more information
For new overlap potential in cholesky inverse, add methods `initiate_stage1`, etc., for initiating nonblocking broadcasts along the row and column, and summing along the depth via MPI_Iallreduce. Have them return an MPI_Request handle (by reference?).
What about loops for waiting on one while initiating another? Should that go simply in cholesky inverse, with summa providing a method (or methods) `close_stage1`, etc.?
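A minimal sketch of how those pieces might fit together, purely as an assumption about the shape of the interface (the names `initiate_stage1`/`close_stage1`, the row communicator, and the panel arguments are all hypothetical, not the existing summa API):

```cpp
#include <mpi.h>

// Hypothetical: start the stage-1 nonblocking broadcast and hand the
// request back by reference, as suggested above.
void initiate_stage1(double* panel, int count, int root,
                     MPI_Comm rowComm, MPI_Request& req) {
  MPI_Ibcast(panel, count, MPI_DOUBLE, root, rowComm, &req);
}

// Hypothetical: the matching close call is just a wait on that handle.
void close_stage1(MPI_Request& req) {
  MPI_Wait(&req, MPI_STATUS_IGNORE);
}

// Sketch of the caller-side loop in cholesky inverse: initiate stage k+1
// before closing stage k, so the next broadcast overlaps the wait.
void overlap_loop(double** panels, int count, int numStages,
                  int root, MPI_Comm rowComm) {
  MPI_Request cur = MPI_REQUEST_NULL, next = MPI_REQUEST_NULL;
  initiate_stage1(panels[0], count, root, rowComm, cur);
  for (int k = 0; k < numStages; ++k) {
    if (k + 1 < numStages)
      initiate_stage1(panels[k + 1], count, root, rowComm, next);
    close_stage1(cur);              // wait on stage k
    // ... local work on panels[k] would go here ...
    cur = next;
  }
}
```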
Further, I want to have methods that initiate: 1) a broadcast along columns, 2) a broadcast along rows, 3) a reduction along depth. These will simply return a handle to the MPI_Request. This might motivate a full break-up of the current methods into smaller pieces.
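Those three could be as thin as the sketch below; the depth reduction uses MPI_Iallreduce as in the earlier comment, and the communicator arguments are just assumptions about how the processor grid is sliced:

```cpp
#include <mpi.h>

// Hypothetical wrappers: each starts one nonblocking collective and returns
// the MPI_Request handle, leaving all overlap decisions to the caller.
MPI_Request initiate_column_bcast(double* buf, int count, int root, MPI_Comm colComm) {
  MPI_Request req;
  MPI_Ibcast(buf, count, MPI_DOUBLE, root, colComm, &req);
  return req;
}

MPI_Request initiate_row_bcast(double* buf, int count, int root, MPI_Comm rowComm) {
  MPI_Request req;
  MPI_Ibcast(buf, count, MPI_DOUBLE, root, rowComm, &req);
  return req;
}

MPI_Request initiate_depth_reduction(double* buf, int count, MPI_Comm depthComm) {
  MPI_Request req;
  MPI_Iallreduce(MPI_IN_PLACE, buf, count, MPI_DOUBLE, MPI_SUM, depthComm, &req);
  return req;
}
```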
Here is another thing to consider: besides the three interface routines for cutting up the matrix, let the true `summa` engine deal with pointers only. This might get rid of a lot of the dumb code and might make it easier to cut into smaller chunks, like with the proposed methods above.
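A rough sketch of that split, assuming hypothetical accessors (`data()`, `rows()`, `cols()`) rather than the project's actual matrix interface:

```cpp
#include <mpi.h>

// The one "true" engine works on raw pointers only; all the cutting-up
// logic lives in the thin interface routines above it.
void summa_engine(double* A, double* B, double* C,
                  int m, int n, int k, MPI_Comm grid) {
  // ... 3D Summa steps (broadcasts, local compute, depth reduction)
  //     operating directly on the raw buffers ...
}

// Thin interface routine: slices its matrix arguments down to pointers and
// dimensions, then forwards to the engine. Matrix is a placeholder type.
template <typename Matrix>
void invoke(Matrix& A, Matrix& B, Matrix& C, MPI_Comm grid) {
  summa_engine(A.data(), B.data(), C.data(),
               C.rows(), C.cols(), A.cols(), grid);
}
```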
The current file is down to 322 lines, which is a massive improvement. I still need to incorporate nonblocking collective stop-and-start. Note that I also took out the two pointer-based interfaces necessary for `trsm`.
Closing this because it's basically already done. Any changes related to achieving overlap in cholesky inverse are a separate issue.
`matmult::summa` is almost 1000 lines due to the separate implementations of the 3D Summa algorithm for `gemm`, `trmm`, and `syrk`. I'm sure there are slight differences, but I'm also confident these can be addressed. `matmult::summa` should be parameterized on this local computation policy.
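One possible reading of "parameterized on this local computation policy", sketched with an illustrative policy type; the BLAS call and the policy/engine names are assumptions, not the project's code:

```cpp
#include <cblas.h>

// Illustrative policy: the piece that actually differs between the gemm,
// trmm, and syrk variants is the local update done each Summa step.
struct GemmPolicy {
  static void local_compute(const double* Apanel, const double* Bpanel,
                            double* Clocal, int m, int n, int k) {
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, 1.0, Apanel, m, Bpanel, k, 1.0, Clocal, m);
  }
};

// A single templated engine replaces the three near-copies; TrmmPolicy and
// SyrkPolicy would plug in their own local_compute.
template <typename LocalComputePolicy>
void summa_step(const double* Apanel, const double* Bpanel, double* Clocal,
                int m, int n, int k) {
  // ... panel broadcasts would happen here ...
  LocalComputePolicy::local_compute(Apanel, Bpanel, Clocal, m, n, k);
  // ... depth reduction would happen here ...
}
```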
Rename `_start1` to `distribute_bcast` and `_start2` to `distribute_allgather`. I want to keep `distribute_allgather` mainly because I think it has more overlap potential than the other: it has reshuffling work to do after the MPI_Allreduce that can be done step by step while waiting on the MPI_Allreduces to complete. This, combined with its 2x smaller number of bytes communicated, might yield a faster primitive. I need to compare both correctness and scalability against the non-pipelined bcast variant.
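A sketch of what "step by step while waiting" could look like if the depth reduction were split into chunks (the chunking and `reshuffle_chunk` are placeholders for whatever reshuffling `distribute_allgather` actually does):

```cpp
#include <mpi.h>
#include <vector>

// Placeholder for the post-reduction reshuffling work.
void reshuffle_chunk(double* chunk, int count) { /* ... */ }

// Start one MPI_Iallreduce per chunk, then reshuffle each chunk as soon as
// its reduction completes, instead of blocking on all of them first.
void pipelined_depth_reduce(double* buf, int chunkCount, int numChunks,
                            MPI_Comm depthComm) {
  std::vector<MPI_Request> reqs(numChunks);
  for (int i = 0; i < numChunks; ++i) {
    MPI_Iallreduce(MPI_IN_PLACE, buf + i * chunkCount, chunkCount,
                   MPI_DOUBLE, MPI_SUM, depthComm, &reqs[i]);
  }
  for (int done = 0; done < numChunks; ++done) {
    int idx;
    MPI_Waitany(numChunks, reqs.data(), &idx, MPI_STATUS_IGNORE);
    reshuffle_chunk(buf + idx * chunkCount, chunkCount);  // overlaps later waits
  }
}
```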
`BroadcastPanels`: right now it's mainly called from `_start1`, but it's also called directly from the two pointer-based invoke methods. I need to think of a way to prevent code bloat here. Just how different are the pointer methods from the regular ones? I would really like to just get rid of the pointer-based methods, but I know the `trsm` methods call them.