mhoemmen closed this issue 8 years ago.
This depends on #179, in that the optimal kernel might use a different layout for blocks.
Maybe this should be in KokkosKernels?
Regarding this issue, I just looked at the code. I think what needs to be done is in the following files:
Tpetra_Experimental_BlockCrsMatrix_def.hpp
Tpetra_Experimental_BlockView.hpp
Is this the right track to take? (Just making sure I am not doing something completely wrong.)
A couple of comments: (1) Yes, we should move this into Kokkos Kernels as much as possible; I think I might start a bit on that. (2) We need to get Ryan's code and see what he did, since apparently he is supposed to have a well-performing block SpMV. (3) The GEMV in this case is probably too small for a team; it would be more something for vector-level parallelization. (4) We still need team parallelization, since the algorithm uses temporary storage.
As a subtask, we will start implementing some BLAS-type operations at the team/vector level.
I believe the GEMV has to be thread aware, so that memory accesses are coalesced. Is there another way to do this?
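For concreteness, here is a minimal sketch of the kind of team/vector-level GEMV building block mentioned above, assuming column-major (LayoutLeft) blocks. The function name, template parameters, and view arguments are placeholders, not anything that exists in Tpetra or Kokkos Kernels; only the Kokkos calls (ThreadVectorRange, parallel_for, KOKKOS_INLINE_FUNCTION) are real. Giving each vector lane one output row means consecutive lanes touch consecutive entries of a column-major block, which is what produces coalesced loads on GPUs and maps to SIMD vectorization on CPUs.

```cpp
// Hypothetical team/vector-level small GEMV: y += A * x for one block.
// Assumes A is blockSize x blockSize with LayoutLeft (column-major), so the
// row index i is the stride-1 index.  Must be called from inside a team
// kernel; "member" is the team handle.
template <class TeamMember, class BlockView, class VecView>
KOKKOS_INLINE_FUNCTION
void smallGemv (const TeamMember& member,
                const BlockView& A,   // blockSize x blockSize block
                const VecView& x,     // input, length blockSize
                const VecView& y)     // output, length blockSize (accumulated)
{
  const int blockSize = static_cast<int> (A.extent (0));
  // One vector lane per output row i: lanes read A(i, j) for fixed j with
  // consecutive i, which is contiguous for LayoutLeft blocks.
  Kokkos::parallel_for (Kokkos::ThreadVectorRange (member, blockSize),
    [&] (const int i) {
      typename VecView::non_const_value_type sum = 0;
      for (int j = 0; j < blockSize; ++j) {
        sum += A(i, j) * x(j);
      }
      y(i) += sum;
    });
}
```

If the surrounding kernel uses one team per block row and a TeamThreadRange over the block entries in that row, the threads of a team would all accumulate into the same output block, so each thread needs its own temporary y (e.g. in team scratch) or an atomic update; that is the temporary-storage point in (4) above.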
I submitted a pull request to Christian's repo. It passes the unit tests on my desktop machine, and we need some discussion about the parallelization direction and the kernel interfaces. Comments are in the commit log.
> (1) Yes, we should move this into Kokkos Kernels as much as possible; I think I might start a bit on that.
The only reason it didn't start that way is that I wasn't sure the interface was right. I'm still not convinced, but at some point one has to commit.
> I believe the GEMV has to be thread aware, so that memory accesses are coalesced. Is there another way to do this?
See discussion of #180.
> (2) We need to get Ryan's code and see what he did, since apparently he is supposed to have a well-performing block SpMV.
Ryan spent most of his effort on the CUDA implementation. It would be interesting to target KNL and see how convergent the two implementations are.
Do we have any performance tests? I only see the timing results from the unit tests (TpetraCore_ExpBlockCrsMatrix_MPI_4), and I do not know how to control the number of threads there (the tests do not listen to the --kokkos-threads command-line argument). There are also some missing components in the Kokkos Serial space, and the code does not compile with it. Compared with the previous version, the timing results are almost the same.
BTW, I spoke to Christian: the GEMV has to be CUDA-thread aware (which also maps to vectorization on CPUs).
Hmmm ... it looks like this issue will live quite long ~
Why don't we create another issue for performance? You have already taken care of what is needed here.
How can I turn on the Pthread backend as the default node in Tpetra? When I disable OpenMP and enable Threads, it turns on the Serial space only.
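Not an answer to the default-node question itself, but as a workaround you can specify the Node template parameter explicitly instead of relying on the configure-time default. This is only a sketch: the wrapper-node type and header below follow the KokkosCompat naming as I remember it, and may differ between Trilinos versions.

```cpp
#include <Tpetra_Map.hpp>
#include <Tpetra_CrsMatrix.hpp>
// Wrapper node for the Kokkos::Threads (Pthread) backend; the header and type
// name follow the KokkosCompat convention and may vary by Trilinos version.
#include <KokkosCompat_ClassicNodeAPI_Wrapper.hpp>

typedef int LO;
typedef long long GO;
typedef Kokkos::Compat::KokkosThreadsWrapperNode threads_node_type;

// Explicitly request the Threads-backed node instead of Tpetra's default node.
typedef Tpetra::Map<LO, GO, threads_node_type> map_type;
typedef Tpetra::CrsMatrix<double, LO, GO, threads_node_type> crs_matrix_type;
```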
@kyungjoo-kim this is related to #123
I just found that hwloc prevents OMP_NUM_THREADS from being honored, so the previous timing results were corrupted by over-subscription of threads. Without hwloc, I also see that the unit test uses a tiny problem. I found a couple of Kokkos problems and will open issues and reference them here.
I opened the related Kokkos issues: #198 and #200. Done for now.
@kyungjoo-kim I would have been happy with a single-level loop using the host execution space ;-) Like Siva said, this is just the first step. You're welcome to work on the second paper with Ryan (got the first one accepted at an IPDPS workshop). Thanks though for thinking about this!
This wasn't merged into master, so I'm reopening. @amklinv and I are working on the merge now. We have to use the host execution space for now, because I can't assume that CUDA builds turn on the option for device lambdas. Thus, this won't actually be fixed until we rewrite the lambda into a functor.
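In case it helps the rewrite, here is a minimal sketch of what a functor version might look like. The class name, member views, and the flat block-CRS layout assumed here are hypothetical, not Tpetra's actual internals; the point is only that a named class with a KOKKOS_INLINE_FUNCTION operator() compiles for CUDA without requiring the device-lambda option.

```cpp
// Hypothetical functor replacing the per-block-row lambda.  The views below
// are placeholders for however BlockCrsMatrix stores its values, column
// indices, and row offsets.
template <class DeviceType, class Scalar, class LO>
class BlockCrsApplyFunctor {
public:
  typedef Kokkos::View<const Scalar***, DeviceType> values_type;  // numBlocks x bs x bs
  typedef Kokkos::View<const LO*, DeviceType> col_inds_type;
  typedef Kokkos::View<const size_t*, DeviceType> row_ptrs_type;
  typedef Kokkos::View<const Scalar**, DeviceType> in_vec_type;   // numBlockCols x bs
  typedef Kokkos::View<Scalar**, DeviceType> out_vec_type;        // numBlockRows x bs

  BlockCrsApplyFunctor (const values_type& vals, const col_inds_type& colInds,
                        const row_ptrs_type& rowPtrs, const in_vec_type& X,
                        const out_vec_type& Y)
    : vals_ (vals), colInds_ (colInds), rowPtrs_ (rowPtrs), X_ (X), Y_ (Y) {}

  // One block row per iteration.  Unlike a lambda, this works in CUDA builds
  // that do not enable device lambdas.
  KOKKOS_INLINE_FUNCTION
  void operator() (const LO blockRow) const {
    const int bs = static_cast<int> (vals_.extent (1));
    for (size_t k = rowPtrs_(blockRow); k < rowPtrs_(blockRow + 1); ++k) {
      const LO blockCol = colInds_(k);
      for (int i = 0; i < bs; ++i) {
        Scalar sum = 0;
        for (int j = 0; j < bs; ++j) {
          sum += vals_(k, i, j) * X_(blockCol, j);
        }
        Y_(blockRow, i) += sum;
      }
    }
  }

private:
  values_type vals_;
  col_inds_type colInds_;
  row_ptrs_type rowPtrs_;
  in_vec_type X_;
  out_vec_type Y_;
};
```

It would then be launched with something like `Kokkos::parallel_for (Kokkos::RangePolicy<execution_space, Kokkos::IndexType<LO> > (0, numLocalBlockRows), functor);` once the views are wired up to the matrix's storage.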
The merge is trickier than I had thought. The block RILUK test is failing in Ifpack2, but that may not be related to this mat-vec; I have to experiment a bit first. I may need to add a MultiVector sync for the output MultiVector Y.
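For the record, the sync I have in mind would follow the usual DualView-style pattern. The helper below is hypothetical, and the sync/modify spellings are from memory of the MultiVector interface, so they may need adjusting.

```cpp
#include <Tpetra_MultiVector.hpp>

// Hypothetical helper: after a device-side apply() writes the output
// multivector Y, mark the device copy as modified and sync it back to host
// before any host-side code (e.g. the RILUK test) reads it.
template <class MV>
void markDeviceResultAndSyncToHost (MV& Y)
{
  typedef typename MV::device_type::memory_space dev_mem_space;
  Y.template modify<dev_mem_space> ();   // device copy is now the most recent
  Y.template sync<Kokkos::HostSpace> (); // copy device -> host if needed
}
```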
The Ifpack2 issues were due to the #177 fix; this commit was fine. See #177 discussion.
Hold on, Ifpack2 RBILUK fails with CUDA now. Working out the details.
@trilinos/tpetra @trilinos/ifpack2 @crtrott @kyungjoo-kim @amklinv
Thread-parallelize the sparse matrix-vector multiply in the apply() method of Tpetra::Experimental::BlockCrsMatrix. Please interact with Ryan Eberhardt, who has an excellent CUDA implementation for column-major blocks.
It would be wise to do this in two passes. First, add a simple host execution space parallelization using a lambda. Then, implement an optimized kernel, using Ryan's as a start.
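A minimal sketch of that first pass, just to fix ideas: flat parallelism over the local block rows in the host execution space. Here applyBlockRow, numLocalBlockRows, alpha, and beta are placeholders, not actual BlockCrsMatrix methods or members, and X and Y stand for the local Kokkos views of the input and output multivectors.

```cpp
typedef Kokkos::DefaultHostExecutionSpace host_exec_space;
typedef Kokkos::RangePolicy<host_exec_space, Kokkos::IndexType<LO> > policy_type;

// Each iteration owns one output block row of Y, so no two iterations write
// the same entries and no atomics are needed.
Kokkos::parallel_for (policy_type (0, numLocalBlockRows),
  [=] (const LO blockRow) {
    // Hypothetical helper that does the small dense GEMVs for one block row,
    // e.g. by reusing the existing sequential code path.
    applyBlockRow (blockRow, alpha, X, beta, Y);
  });
```

The optimized second pass would then replace this with the team/vector-level kernel discussed above.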
This affects Ifpack2 as well as Tpetra, because for Jacobi with > 1 sweep, Ifpack2 uses sparse mat-vec.