trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/

Tpetra: Isolate sparse matrix-matrix multiply local kernel #148

Closed mhoemmen closed 7 years ago

mhoemmen commented 8 years ago

@trilinos/tpetra @mndevec @srajama1 @csiefer2

In Tpetra, we need to isolate the "local" part of the sparse matrix-matrix multiply kernel from the "global" (MPI communication) part. This will facilitate thread parallelization and the use of third-party libraries' implementations of the local kernel.
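To make the proposed split concrete, here is a minimal sketch of what an isolated "local" kernel could look like, using Gustavson's row-by-row algorithm with a dense accumulator. The `Csr` struct and `localSpGEMM` name are illustrative only, not Tpetra's classes; in the real code the "global" part (the MPI Import of remote rows of B) would run before this kernel is called.

```cpp
#include <cassert>
#include <vector>

// Minimal CSR container; names are illustrative, not Tpetra's.
struct Csr {
  int numRows = 0, numCols = 0;
  std::vector<int> rowPtr, colInd;
  std::vector<double> val;
};

// Local (single-process) sparse matrix-matrix multiply C = A * B,
// using Gustavson's row-by-row algorithm with a dense accumulator.
// This is the "local kernel" the issue wants isolated from the
// MPI-communication part.
Csr localSpGEMM(const Csr& A, const Csr& B) {
  Csr C;
  C.numRows = A.numRows;
  C.numCols = B.numCols;
  C.rowPtr.assign(A.numRows + 1, 0);
  std::vector<double> acc(B.numCols, 0.0); // dense accumulator
  std::vector<char> touched(B.numCols, 0); // marker: column already seen?
  std::vector<int> cols;                   // columns touched in this row
  for (int i = 0; i < A.numRows; ++i) {
    cols.clear();
    for (int k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k) {
      const int j = A.colInd[k];
      const double a = A.val[k];
      for (int l = B.rowPtr[j]; l < B.rowPtr[j + 1]; ++l) {
        const int c = B.colInd[l];
        if (!touched[c]) { touched[c] = 1; cols.push_back(c); }
        acc[c] += a * B.val[l];
      }
    }
    for (int c : cols) {                   // flush this row of C
      C.colInd.push_back(c);
      C.val.push_back(acc[c]);
      acc[c] = 0.0;
      touched[c] = 0;
    }
    C.rowPtr[i + 1] = static_cast<int>(C.colInd.size());
  }
  return C;
}
```

A third-party library's SpGEMM could then be dropped in behind the same interface, which is the point of the isolation.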

mhoemmen commented 8 years ago

MueLu sometimes actually wants to compute the Jacobi update $C = (I - \omega D^{-1} A) B$.

csiefer2 commented 8 years ago

We need both Jacobi & SpMM. They're basically the same kernel if you're smart about it.
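As a sketch of why the two are "basically the same kernel": seed the per-row accumulator with row $i$ of $B$ (the $I \cdot B$ term) and scale the $A \cdot B$ contributions by $-\omega / d_i$; the inner loop is then identical to a plain SpMM. The `Csr` struct and `jacobiSpGEMM` name are illustrative only, not the Tpetra routines.

```cpp
#include <cassert>
#include <vector>

// Illustrative CSR layout (not Tpetra's classes).
struct Csr {
  int numRows = 0, numCols = 0;
  std::vector<int> rowPtr, colInd;
  std::vector<double> val;
};

// Fused Jacobi product C = (I - omega * D^{-1} A) * B, row by row.
// Seeding the accumulator with row i of B and scaling A's
// contributions by -omega / d_i is the only difference from a plain
// SpMM C = A * B, which is why the two can share one kernel.
Csr jacobiSpGEMM(double omega, const std::vector<double>& diag,
                 const Csr& A, const Csr& B) {
  Csr C;
  C.numRows = A.numRows;
  C.numCols = B.numCols;
  C.rowPtr.assign(A.numRows + 1, 0);
  std::vector<double> acc(B.numCols, 0.0);
  std::vector<char> touched(B.numCols, 0);
  std::vector<int> cols;
  for (int i = 0; i < A.numRows; ++i) {
    cols.clear();
    // I * B contribution: copy row i of B into the accumulator.
    for (int l = B.rowPtr[i]; l < B.rowPtr[i + 1]; ++l) {
      const int c = B.colInd[l];
      touched[c] = 1;
      cols.push_back(c);
      acc[c] = B.val[l];
    }
    // -omega D^{-1} A * B contribution: the same loop a plain SpMM uses.
    const double scale = -omega / diag[i];
    for (int k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k) {
      const int j = A.colInd[k];
      for (int l = B.rowPtr[j]; l < B.rowPtr[j + 1]; ++l) {
        const int c = B.colInd[l];
        if (!touched[c]) { touched[c] = 1; cols.push_back(c); }
        acc[c] += scale * A.val[k] * B.val[l];
      }
    }
    for (int c : cols) {
      C.colInd.push_back(c);
      C.val.push_back(acc[c]);
      acc[c] = 0.0;
      touched[c] = 0;
    }
    C.rowPtr[i + 1] = static_cast<int>(C.colInd.size());
  }
  return C;
}
```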

mhoemmen commented 8 years ago

Andrey just explained to me that muelu/src/Misc/MueLu_RAPFactory_def.hpp uses Multiply, so yes :)

mhoemmen commented 8 years ago

A*P uses mult_A_B_newmatrix with B = P, which does $C := A (B_{local} + B_{remote})$.

mhoemmen commented 8 years ago

I'm looking at mult_A_B_newmatrix now. c_status is just something used internally. There are additional data structures that map between the local indices of A and the local indices of B_local and B_remote (see above).

mhoemmen commented 8 years ago

The output matrix C of mult_A_B_newmatrix gets its column Map and Import as the union ("setUnion") of the column Maps resp. Imports of B_local and B_remote.

    // Map local column indices of B_local and B_remote to local column
    // indices of C, going through the global IDs and C's (set-union) column Map.
    // NOTE: This is not efficient and should be folded into setUnion
    Icol2Ccol.resize(Bview.importMatrix->getColMap()->getNodeNumElements());
    ArrayView<const GlobalOrdinal> Bgid = Bview.origMatrix->getColMap()->getNodeElementList();
    ArrayView<const GlobalOrdinal> Igid = Bview.importMatrix->getColMap()->getNodeElementList();
    for (size_t i = 0; i < Bview.origMatrix->getColMap()->getNodeNumElements(); i++)
      Bcol2Ccol[i] = Ccolmap->getLocalElement(Bgid[i]);
    for (size_t i = 0; i < Bview.importMatrix->getColMap()->getNodeNumElements(); i++)
      Icol2Ccol[i] = Ccolmap->getLocalElement(Igid[i]);
aprokop commented 8 years ago

I just want to document that I've tried a slightly different variant to speed things up, but was not successful. The idea was to accumulate all partial products and then merge them together.

https://gist.github.com/aprokop/05ba603a934c7a96ac80
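For one row, the "accumulate then merge" variant can be sketched as follows. This is only an illustration of the idea under my reading of the comment, not the code in the linked gist: every partial product $a_{ik} b_{kj}$ of the row is collected as a (column, value) pair, the pairs are sorted by column, and duplicates are merged by addition.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// One row of an "expand, sort, merge" SpGEMM variant: collect every
// partial product a_ik * b_kj as a (col, val) pair, sort by column,
// then merge duplicate columns by accumulation.
std::vector<std::pair<int, double>>
expandSortMergeRow(const std::vector<std::pair<int, double>>& products) {
  std::vector<std::pair<int, double>> row = products;
  std::sort(row.begin(), row.end()); // sort by column index
  std::vector<std::pair<int, double>> merged;
  for (const auto& p : row) {
    if (!merged.empty() && merged.back().first == p.first)
      merged.back().second += p.second; // same column: accumulate
    else
      merged.push_back(p);
  }
  return merged;
}
```

Compared with a dense-accumulator kernel, this trades the O(numCols) scratch array for a sort per row, which may explain why it did not pay off here.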

I would like for us to discuss what kind of improvements we may want to consider, and what is the path forward for multithreaded MMM.

@mhoemmen @mndevec @csiefer2

srajama1 commented 8 years ago

Mehmet and I have an algorithm worked out that we expect to be "performance portable". Mehmet is implementing that. We are expecting Chris & Chris to take care of the distributed memory stuff.

aprokop commented 8 years ago

During the dungeon I wrote a quick serial wrapper for the Kokkos experimental graph kernels in MMM. While I no longer have the numbers, the KK1 version ran many times slower than the existing implementation on a single thread.

srajama1 commented 8 years ago

Not the existing one; that is a preliminary implementation we started with so that we would have something in place. The new one will have two steps, symbolic and numeric. You want it faster than we can develop it. Is there some urgency on this? We were planning for late April or early May.

If someone wants to isolate the local kernel before then, that is OK. But the threaded kernel on all three platforms that we care about will take some work.
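The symbolic step of such a two-phase SpGEMM can be sketched as below: count the nonzero pattern of C = A * B without touching any values, so that C's storage can be allocated once before a separate numeric phase fills it in. This is a hypothetical illustration on raw CSR arrays, not the KokkosKernels implementation being discussed.

```cpp
#include <cassert>
#include <vector>

// Symbolic phase of a two-phase (symbolic/numeric) SpGEMM: compute
// the row-pointer array of C = A * B from the structure of A and B
// alone. A marker array records which row of C last touched each
// column, so each column is counted once per row.
std::vector<int> spgemmSymbolic(int numRowsA, int numColsB,
                                const std::vector<int>& ArowPtr,
                                const std::vector<int>& AcolInd,
                                const std::vector<int>& BrowPtr,
                                const std::vector<int>& BcolInd) {
  std::vector<int> CrowPtr(numRowsA + 1, 0);
  std::vector<int> lastSeen(numColsB, -1); // row that last touched each column
  for (int i = 0; i < numRowsA; ++i) {
    int count = 0;
    for (int k = ArowPtr[i]; k < ArowPtr[i + 1]; ++k) {
      const int j = AcolInd[k];
      for (int l = BrowPtr[j]; l < BrowPtr[j + 1]; ++l) {
        const int c = BcolInd[l];
        if (lastSeen[c] != i) { lastSeen[c] = i; ++count; }
      }
    }
    CrowPtr[i + 1] = CrowPtr[i] + count;
  }
  return CrowPtr;
}
```

The numeric phase would then re-run the same loop structure, writing column indices and values into the pre-allocated arrays; separating the two is also what makes pattern reuse cheap.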

aprokop commented 8 years ago

No particular urgency, just a general desire to see faster MMM. Is there a way I can look at the newest implementation/algorithm?

srajama1 commented 8 years ago

We can explain our current thinking in person at a whiteboard. The primary trouble is getting it to work on GPUs.

mndevec commented 8 years ago

@aprokop I don't have a complete implementation yet for matrix multiplication. I only have interfaces to MKL, cuSPARSE, and CUSP. Which one did you compare against the sequential implementation?

aprokop commented 8 years ago

I tried SPGEMM_MKL and SPGEMM_KK1. KK1 was the slow one, I believe.

mndevec commented 8 years ago

KK1 is incomplete, it does not have multiplication. It shouldn't produce any result.

aprokop commented 8 years ago

I did not check what it produced; I just checked the runtime when running an example in Tpetra.

mhoemmen commented 8 years ago

@jhux2 @csiefer2 @trilinos/muelu Please tell us which sparse matrix-matrix multiply routines matter most.

mhoemmen commented 8 years ago

@csiefer2 @jhux2 Also, is there some reason why mult_A_B_newmatrix and jacobi_A_B_newmatrix use entirely different local sparse matrix-matrix multiply algorithms?

aprokop commented 8 years ago

@mhoemmen

> Please tell us which sparse matrix-matrix multiply routines matter most.

mult_A_B_newmatrix? jacobi_A_B_newmatrix? What about *_reuse?

The newmatrix routines all matter. _reuse matters significantly less, as the MueLu reuse has not been heavily used yet.

csiefer2 commented 8 years ago

@mhoemmen: mult_A_B_newmatrix and jacobi_A_B_newmatrix use almost identical local algorithms. The only difference is whether or not you have the Jacobi part. The code is cut-and-paste because I could not think of a good way to do it all in one routine.

ambrad commented 8 years ago

@aprokop, now that RILUK supports reuse properly and efficiently, does that change the importance of reuse? Or are there other blockers to effective reuse in MueLu?

aprokop commented 7 years ago

@ambrad I have not looked at the recent RILUK changes, but if, as you say, it now properly separates symbolic() and numeric(), that definitely makes things easier. If everything on the path to AdditiveSchwarz with overlapping subdomains and RILUK correctly separates the symbolic() and numeric() phases, I believe a proper reuse implementation in MueLu can be written quickly.

One of my concerns is the overlapping-matrix part, where I think the import happens during the Setup() phase.

mhoemmen commented 7 years ago

I'm going to close this, because it kind of happened.