huttered40 / capital

Distributed-memory implementations of novel Cholesky and QR matrix factorizations
BSD 2-Clause "Simplified" License

Parameterize matmult::summa to reduce code bloat #10

Closed huttered40 closed 4 years ago

huttered40 commented 5 years ago

matmult::summa is almost 1000 lines due to the separate implementations of the 3D Summa algorithm for gemm, trmm, and syrk. I'm sure there are slight differences, but I'm also confident these can be addressed.

matmult::summa should be parameterized on the local computation policy (gemm, trmm, or syrk) so that the three variants share a single implementation.
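A minimal sketch of what such a parameterization could look like, assuming square row-major blocks. The names `gemm_policy`, `trmm_policy`, `syrk_policy`, and `summa_local` are hypothetical and are not taken from capital. The communication phases of 3D SUMMA are only marked by comments, and the local kernels are plain loops rather than BLAS calls.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical local-computation policies (illustrative names only).
struct gemm_policy {
  // C += A * B for n x n row-major blocks.
  static void multiply(const double* A, const double* B, double* C, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
      for (std::size_t k = 0; k < n; ++k)
        for (std::size_t j = 0; j < n; ++j)
          C[i * n + j] += A[i * n + k] * B[k * n + j];
  }
};

struct trmm_policy {
  // C += A * B where A is upper triangular, so entries with k < i are skipped.
  static void multiply(const double* A, const double* B, double* C, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
      for (std::size_t k = i; k < n; ++k)
        for (std::size_t j = 0; j < n; ++j)
          C[i * n + j] += A[i * n + k] * B[k * n + j];
  }
};

struct syrk_policy {
  // C += A * A^T, updating only the upper triangle (j >= i). B is ignored.
  static void multiply(const double* A, const double*, double* C, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
      for (std::size_t j = i; j < n; ++j)
        for (std::size_t k = 0; k < n; ++k)
          C[i * n + j] += A[i * n + k] * A[j * n + k];
  }
};

// One driver shared by all policies; only the local kernel differs.
template <typename Policy>
std::vector<double> summa_local(const std::vector<double>& A,
                                const std::vector<double>& B,
                                std::size_t n) {
  std::vector<double> C(n * n, 0.0);
  // (broadcasts along rows and columns would happen here)
  Policy::multiply(A.data(), B.data(), C.data(), n);
  // (reduction along depth would happen here)
  return C;
}
```

With this shape, a caller would pick the kernel at compile time, e.g. `summa_local<syrk_policy>(A, A, n)`, and the communication code would exist only once.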

huttered40 commented 5 years ago

Note that there are also two other invoke methods that take pointers.

What are these used for again? Can I also merge these?

huttered40 commented 5 years ago

I commented these two methods out and apparently they are only called by diaginvert as a convenience. We need to think of a way for diaginvert to call the regular summa::invoke methods.

huttered40 commented 5 years ago

See #21 for more information

huttered40 commented 4 years ago

For new overlap potential in cholesky inverse, add methods initiate_stage1, etc. for initiating nonblocking broadcasts along row and column, and a nonblocking sum along depth via MPI_Iallreduce. Have them return an MPI_Request handle (by reference?).

What about loops for waiting on one while initiating another? Should that go simply in cholesky inverse, with summa giving a method(s) close_stage1, etc.?

huttered40 commented 4 years ago

Further, I want to have methods that initiate: 1) a broadcast along columns, 2) a broadcast along rows, and 3) a reduction along depth.

These will simply return a handle to the MPI_Request. This might motivate a full break-up of the current methods into smaller pieces.

Here is another thing to consider: Besides the three interface routines for cutting up the matrix, let the true summa engine deal with pointers only. This might get rid of a lot of the dumb code and might make it easier to cut into smaller chunks, like with the proposed methods above.
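A rough sketch of that split, assuming a simple row-major matrix type. `matrix`, `pack`, `engine`, and `multiply_blocks` are hypothetical names for illustration, not capital's classes or methods. Only one interface routine is shown, and the engine does a local multiply only, with no communication.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical row-major matrix type (not capital's matrix class).
struct matrix {
  std::size_t rows, cols;
  std::vector<double> data;
};

// Engine: sees only raw pointers and dimensions.
// Computes C (m x n) += A (m x k) * B (k x n), all contiguous row-major.
void engine(const double* A, const double* B, double* C,
            std::size_t m, std::size_t n, std::size_t k) {
  for (std::size_t i = 0; i < m; ++i)
    for (std::size_t p = 0; p < k; ++p)
      for (std::size_t j = 0; j < n; ++j)
        C[i * n + j] += A[i * k + p] * B[p * n + j];
}

// Helper: copy the rows x cols block starting at (r0, c0) into contiguous storage.
std::vector<double> pack(const matrix& M, std::size_t r0, std::size_t c0,
                         std::size_t rows, std::size_t cols) {
  std::vector<double> out(rows * cols);
  for (std::size_t i = 0; i < rows; ++i)
    for (std::size_t j = 0; j < cols; ++j)
      out[i * cols + j] = M.data[(r0 + i) * M.cols + (c0 + j)];
  return out;
}

// Interface routine: cuts out the leading blocks, then hands pointers to the engine.
matrix multiply_blocks(const matrix& A, const matrix& B,
                       std::size_t m, std::size_t n, std::size_t k) {
  std::vector<double> a = pack(A, 0, 0, m, k);
  std::vector<double> b = pack(B, 0, 0, k, n);
  matrix C{m, n, std::vector<double>(m * n, 0.0)};
  engine(a.data(), b.data(), C.data.data(), m, n, k);
  return C;
}
```

Keeping all matrix slicing in the interface layer means the engine has one signature, which is what would make it easier to break into smaller stages later.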

huttered40 commented 4 years ago

Current file is down to 322 lines, which is a massive improvement. I still need to incorporate nonblocking collective stop-and-start. Note that I also took out the 2 pointer-based interfaces necessary for trsm.

huttered40 commented 4 years ago

Closing this because it's basically already done. Any changes related to achieving overlap in choleskyinverse are a separate issue.