Tpetra BCRS: Thread-parallelize sparse matrix-vector multiply

mhoemmen commented 8 years ago

@trilinos/tpetra @trilinos/ifpack2 @crtrott @kyungjoo-kim @amklinv

Thread-parallelize the sparse matrix-vector multiply in the apply() method of Tpetra::Experimental::BlockCrsMatrix. Please interact with Ryan Eberhardt, who has an excellent CUDA implementation for column-major blocks.

It would be wise to do this in two passes. First, add a simple host execution space parallelization using a lambda. Then, implement an optimized kernel, using Ryan's as a start.

This affects Ifpack2 as well as Tpetra, because for Jacobi with > 1 sweep, Ifpack2 uses sparse mat-vec.

mhoemmen commented 8 years ago

This depends on #179, in that the optimal kernel might use a different layout for blocks.

srajama1 commented 8 years ago

May be this should be KokkosKernels ?

kyungjoo-kim commented 8 years ago

Regarding to this issue, I just looked at the code. I think that what needs to be done is

Tpetra_Experimental_BlockCrsMatrix_def.hpp

apply parallel for around GEMV

Tpetra_Experimental_BlockView.hpp

apply team-level parallelization inside.

Is this the right track to go (just make sure I am not completely doing something wrong).

crtrott commented 8 years ago

Couple comments: (1) Yes we should move this into Kokkos Kernels as much as possible. I think I might start a bit on that. (2) We need to get Ryans code and see what he did, since apparently he is supposed to have a well performing Block SPMV. (3) GEMV in this case is probably to small for a team. It would be more something for the vector level parallelization. (4) We still need team parallelization since the algorithm uses temporary storage.

crtrott commented 8 years ago

As a subtask we will start doing some BLAS type operations for teams/vector level.

srajama1 commented 8 years ago

I believe the GEMV has to be thread aware, so there is some coalesced accesses. Is there another way to do this ?

kyungjoo-kim commented 8 years ago

I submit pull request to christian's repo. It passes the unit test on my desktop machine and we need some discussion about the parallelization direction and kernel interfaces. Comments are in the commit log.

mhoemmen commented 8 years ago

(1) Yes we should move this into Kokkos Kernels as much as possible. I think I might start a bit on that.

The only reason it didn't start that way is because I wasn't sure if the interface was right. Still not convinced but at some point one has to commit.

mhoemmen commented 8 years ago

I believe the GEMV has to be thread aware, so there is some coalesced accesses. Is there another way to do this ?

See discussion of #180.

mhoemmen commented 8 years ago

(2) We need to get Ryans code and see what he did, since apparently he is supposed to have a well performing Block SPMV.

Ryan spent most of his effort on the CUDA implementation. It would be interesting to target KNL and see how convergent the two implementations are.

kyungjoo-kim commented 8 years ago

Do we have any performance test ? I only see the timing results from unit tests ( TpetraCore_ExpBlockCrsMatrix_MPI_4) I do not know how to control the number of threads in the unit test (tests do not listen --kokkos-threads cmd arguments). There is some missing components in Kokkos Serial space and it does not compile with it. Comparing the previous version, timing result is almost same.

srajama1 commented 8 years ago

BTW, I spoke to Christian, GEMV has to be CUDA Threads aware (which also maps to vectorization in CPUs).

kyungjoo-kim commented 8 years ago

Hmmmmmm .... it looks like that this issue will live quite long ~

srajama1 commented 8 years ago

Why don't we create another one for performance. You have already taken care of what is needed.

kyungjoo-kim commented 8 years ago

How can I turn on Pthread backend as a default node in Tpetra ? When I disable OpenMP and enalbe Threads, it turns on Serial space only.

rppawlo commented 8 years ago

@kyungjoo-kim this is related to #123

kyungjoo-kim commented 8 years ago

I just found that hwloc prevent from listening OMP_NUM_THREADS and the previous timing result was corrupted by the over-subscription of threads. Without hwloc, I also see that the unit test uses a tiny problem. I found a couple Kokkos problems and will open issues and reference here.

kyungjoo-kim commented 8 years ago

I opened related issues on Kokkos. #198 #200 . Done for now.

mhoemmen commented 8 years ago

@kyungjoo-kim I would have been happy with a single-level loop using the host execution space ;-) Like Siva said, this is just the first step. You're welcome to work on the second paper with Ryan (got the first one accepted at an IPDPS workshop). Thanks though for thinking about this!

mhoemmen commented 8 years ago

This wasn't merged into master, so I'm reopening. @amklinv and I are working on the merge now. We have to use the host execution space for now, because I can't assume that CUDA builds turn on the option for device lambdas. Thus, this won't actually be fixed until we rewrite the lambda into a functor.

mhoemmen commented 8 years ago

Merge is trickier than I had thought. Block RILUK test is failing in Ifpack2, but it may not be related to this mat-vec. I have to experiment a bit first. I may need to add a MultiVector sync for the output MV Y.

mhoemmen commented 8 years ago

The Ifpack2 issues were due to the #177 fix; this commit was fine. See #177 discussion.

mhoemmen commented 8 years ago

Hold on, Ifpack2 RBILUK fails with CUDA now. Working out the details.

trilinos / Trilinos

Tpetra BCRS: Thread-parallelize sparse matrix-vector multiply #178