huttered40 closed this issue 4 years ago.
Note that there are also two other `invoke` methods that take pointers. What are these used for again? Can I also merge these?
I commented these two methods out, and apparently they are only called by `diaginvert` as a convenience. We need to think of a way for `diaginvert` to call the regular `summa::invoke` methods.
See #21 for more information
For new overlap potential in cholesky inverse, add methods `initiate_stage1`, etc., for initiating nonblocking broadcasts along the row and column, and summing along the depth via MPI_Iallreduce. Have them return an MPI_Request handle (by reference?).
What about loops for waiting on one while initiating another? Should that go simply in cholesky inverse, with summa providing a method (or methods) `close_stage1`, etc.?
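A minimal sketch of how those pieces might fit together, purely as an assumption about the shape of the interface (the names `initiate_stage1`/`close_stage1`, the row communicator, and the panel arguments are all hypothetical, not the existing summa API):

```cpp
#include <mpi.h>

// Hypothetical: start the stage-1 nonblocking broadcast and hand the
// request back by reference, as suggested above.
void initiate_stage1(double* panel, int count, int root,
                     MPI_Comm rowComm, MPI_Request& req) {
  MPI_Ibcast(panel, count, MPI_DOUBLE, root, rowComm, &req);
}

// Hypothetical: the matching close call is just a wait on that handle.
void close_stage1(MPI_Request& req) {
  MPI_Wait(&req, MPI_STATUS_IGNORE);
}

// Sketch of the caller-side loop in cholesky inverse: initiate stage k+1
// before closing stage k, so the next broadcast overlaps the wait.
void overlap_loop(double** panels, int count, int numStages,
                  int root, MPI_Comm rowComm) {
  MPI_Request cur = MPI_REQUEST_NULL, next = MPI_REQUEST_NULL;
  initiate_stage1(panels[0], count, root, rowComm, cur);
  for (int k = 0; k < numStages; ++k) {
    if (k + 1 < numStages)
      initiate_stage1(panels[k + 1], count, root, rowComm, next);
    close_stage1(cur);              // wait on stage k
    // ... local work on panels[k] would go here ...
    cur = next;
  }
}
```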
Further, I want to have methods that initiate: 1) a broadcast along columns, 2) a broadcast along rows, 3) a reduction along depth. These will simply return a handle to the MPI_Request. This might motivate a full break-up of the current methods into smaller pieces.
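Those three could be as thin as the sketch below; the depth reduction uses MPI_Iallreduce as in the earlier comment, and the communicator arguments are just assumptions about how the processor grid is sliced:

```cpp
#include <mpi.h>

// Hypothetical wrappers: each starts one nonblocking collective and returns
// the MPI_Request handle, leaving all overlap decisions to the caller.
MPI_Request initiate_column_bcast(double* buf, int count, int root, MPI_Comm colComm) {
  MPI_Request req;
  MPI_Ibcast(buf, count, MPI_DOUBLE, root, colComm, &req);
  return req;
}

MPI_Request initiate_row_bcast(double* buf, int count, int root, MPI_Comm rowComm) {
  MPI_Request req;
  MPI_Ibcast(buf, count, MPI_DOUBLE, root, rowComm, &req);
  return req;
}

MPI_Request initiate_depth_reduction(double* buf, int count, MPI_Comm depthComm) {
  MPI_Request req;
  MPI_Iallreduce(MPI_IN_PLACE, buf, count, MPI_DOUBLE, MPI_SUM, depthComm, &req);
  return req;
}
```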
Here is another thing to consider: besides the three interface routines for cutting up the matrix, let the true `summa` engine deal with pointers only. This might get rid of a lot of the dumb code and might make it easier to cut into smaller chunks, like with the proposed methods above.
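A rough sketch of that split, assuming hypothetical accessors (`data()`, `rows()`, `cols()`) rather than the project's actual matrix interface:

```cpp
#include <mpi.h>

// The one "true" engine works on raw pointers only; all the cutting-up
// logic lives in the thin interface routines above it.
void summa_engine(double* A, double* B, double* C,
                  int m, int n, int k, MPI_Comm grid) {
  // ... 3D Summa steps (broadcasts, local compute, depth reduction)
  //     operating directly on the raw buffers ...
}

// Thin interface routine: slices its matrix arguments down to pointers and
// dimensions, then forwards to the engine. Matrix is a placeholder type.
template <typename Matrix>
void invoke(Matrix& A, Matrix& B, Matrix& C, MPI_Comm grid) {
  summa_engine(A.data(), B.data(), C.data(),
               C.rows(), C.cols(), A.cols(), grid);
}
```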
The current file is down to 322 lines, which is a massive improvement. I still need to incorporate nonblocking collective stop-and-start. Note that I also took out the two pointer-based interfaces necessary for `trsm`.
Closing this because it's basically already done. Any changes related to achieving overlap in cholesky inverse are a separate issue.
`matmult::summa` is almost 1000 lines due to the separate implementations of the 3D Summa algorithm for `gemm`, `trmm`, and `syrk`. I'm sure there are slight differences, but I'm also confident these can be addressed. `matmult::summa` should be parameterized on this local computation policy.
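One possible reading of "parameterized on this local computation policy", sketched with an illustrative policy type; the BLAS call and the policy/engine names are assumptions, not the project's code:

```cpp
#include <cblas.h>

// Illustrative policy: the piece that actually differs between the gemm,
// trmm, and syrk variants is the local update done each Summa step.
struct GemmPolicy {
  static void local_compute(const double* Apanel, const double* Bpanel,
                            double* Clocal, int m, int n, int k) {
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, 1.0, Apanel, m, Bpanel, k, 1.0, Clocal, m);
  }
};

// A single templated engine replaces the three near-copies; TrmmPolicy and
// SyrkPolicy would plug in their own local_compute.
template <typename LocalComputePolicy>
void summa_step(const double* Apanel, const double* Bpanel, double* Clocal,
                int m, int n, int k) {
  // ... panel broadcasts would happen here ...
  LocalComputePolicy::local_compute(Apanel, Bpanel, Clocal, m, n, k);
  // ... depth reduction would happen here ...
}
```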
Rename `_start1` to `distribute_bcast` and `_start2` to `distribute_allgather`. I want to keep `distribute_allgather` mainly because I think it has more overlap potential than the other: it has reshuffling work to do after the MPI_Allreduce that can be done step by step while waiting on the MPI_Allreduces to complete. This, combined with its 2x smaller number of bytes communicated, might yield a faster primitive. I need to compare both correctness and scalability against the non-pipelined bcast variant.
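A sketch of what "step by step while waiting" could look like if the depth reduction were split into chunks (the chunking and `reshuffle_chunk` are placeholders for whatever reshuffling `distribute_allgather` actually does):

```cpp
#include <mpi.h>
#include <vector>

// Placeholder for the post-reduction reshuffling work.
void reshuffle_chunk(double* chunk, int count) { /* ... */ }

// Start one MPI_Iallreduce per chunk, then reshuffle each chunk as soon as
// its reduction completes, instead of blocking on all of them first.
void pipelined_depth_reduce(double* buf, int chunkCount, int numChunks,
                            MPI_Comm depthComm) {
  std::vector<MPI_Request> reqs(numChunks);
  for (int i = 0; i < numChunks; ++i) {
    MPI_Iallreduce(MPI_IN_PLACE, buf + i * chunkCount, chunkCount,
                   MPI_DOUBLE, MPI_SUM, depthComm, &reqs[i]);
  }
  for (int done = 0; done < numChunks; ++done) {
    int idx;
    MPI_Waitany(numChunks, reqs.data(), &idx, MPI_STATUS_IGNORE);
    reshuffle_chunk(buf + idx * chunkCount, chunkCount);  // overlaps later waits
  }
}
```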
`BroadcastPanels`: right now it's mainly called from `_start1`, but it's also called directly from the two pointer-based invoke methods. I need to think of a way to prevent code bloat here. Just how different are the pointer methods from the regular ones? I would really like to just get rid of the pointer-based methods, but I know the `trsm` methods call them.