trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/

Performance of looping over Tpetra CrsMatrix rows #118

Closed aprokop closed 8 years ago

aprokop commented 8 years ago

@trilinos/tpetra @jhux2 @mhoemmen @crtrott

Let me first admit that I am very likely doing something wrong.

I wrote a simple driver (located at muelu/test/perf_test_kokkos) which essentially finds the number of nonzeros in a CrsMatrix by looping through rows and adding up the row lengths. It considers three scenarios, timed as the loops below.

The results were somewhat unexpected to me. I was running with a single MPI rank, OpenMP with OMP_NUM_THREADS=1, and HWLOC disabled (so that Kokkos respects the thread count). Here are some results. For Tpetra:

Loop #1: Xpetra/Tpetra   0.05980 (1)
Loop #2: Tpetra          0.05867 (1)
Loop #3: Kokkos-1        0.00274 (1)
Loop #4: Kokkos-2        0.00214 (1)

For Epetra:

Loop #1: Xpetra/Epetra   0.01933 (1)
Loop #2: Epetra          0.01385 (1)
Loop #3: Kokkos-1        0.00427 (1)
Loop #4: Kokkos-2        0.00213 (1)

So it seems to me that using the local Kokkos matrix is absolutely the way to go, as it is ~30 times faster than going through Tpetra, and ~6 times faster than going through Epetra.

I would like to know if anybody has done any performance studies like this, or what the reason could be. If I am doing something completely wrong, I would also like to know that.

mhoemmen commented 8 years ago

Hi Andrey! Thanks for doing the comparison! btw is the matrix fill complete at this point? (It must be locally indexed, since you're getting local indices.)

My main concern is that Tpetra is nearly 3x slower than Epetra for this very simple operation. My guess is that Tpetra spends its extra time in Teuchos::ArrayView construction and manipulation. Both Epetra and Tpetra do error checking by throwing exceptions, and both have a similar division of labor between the graph and the matrix (e.g., the matrix gets the row's indices by calling a method on its graph). Besides the use of ArrayView, the main difference is that Epetra tends to branch on the matrix's current state ("is storage optimized?"), while Tpetra abstracts that away behind CrsGraph::getRowInfo (returns a struct with the row's current number of entries, amount of space, and offset if applicable). It could be that filling the struct wastes time vs. just branching, but I'm sure the ArrayView manipulation plays a big part. It's possible to strip all that away and just return to extracting and passing around raw pointers (except for the underlying Kokkos Views that actually hold the data).

Epetra could be optimized too. For example, 4-argument Epetra_CrsMatrix::ExtractMyRowView asks for the number of entries in the row twice (once in Epetra_Graph::ExtractMyRowView, and once in 3-argument Epetra_CrsMatrix::ExtractMyRowView). On the other hand, the Epetra and Tpetra interfaces do error checking and look up things for you, and you'll always have to pay something for that.

A while back, @nmhamster and I debated a bit about this. He argued that the main interface we give to users should always perform reasonably well. Otherwise, we shouldn't give it to users. Do we want to spend more effort making Tpetra's interface perform well, or do we just want to skip it?

For future reference:

mhoemmen commented 8 years ago

Do we want to spend more effort making Tpetra's interface perform well, or do we just want to skip it?

What I mean specifically is whether we expect users to access the matrix's entries through Tpetra methods, or whether they should always use the Kokkos interface. What I worry about with the latter is that users will start asking for features that make it more like the former, which will reduce performance, leading again to requests for a stripped-down interface. I think what @nmhamster was trying to say is that if we give users a higher-level interface, we should commit to making it reasonably fast.

etphipp commented 8 years ago

I would suggest running these tests through VTune to see where the differences are. At the conclusion of the milestone, the performance of Epetra and Tpetra was very similar for this operation. But that was pre-Kokkos and pre-KokkosRefactor.

-Eric

aprokop commented 8 years ago

Hi Andrey! Thanks for doing the comparison! btw is the matrix fill complete at this point? (It must be locally indexed, since you're getting local indices.)

Yes, it is fill complete.

A while back, @nmhamster and I debated a bit about this. He argued that the main interface we give to users should always perform reasonably well. Otherwise, we shouldn't give it to users. Do we want to spend more effort making Tpetra's interface perform well, or do we just want to skip it?

At this point I think it would be a misguided effort to work on the older Tpetra interface. As Kokkos is our future, I think Trilinos would be in good shape if we just start skipping directly to the Kokkos objects. We just need to let people know that this could greatly benefit them for certain computational workflows.

What I mean specifically, is whether we expect users to access the matrix's entries through Tpetra methods, or whether they should always use the Kokkos interface. What I worry about with the latter, is that users will start asking for features that will make the latter more like the former, which will reduce performance, leading again to requests for a stripped-down interface. I think what @nmhamster was trying to say is that if we give users a higher-level interface, we should commit to making it reasonably fast.

I see the point. I think there is some uncertainty to it, as Tpetra's API supports both MPI and local operations. I think the local API that can be expressed in terms of Kokkos should be, and the API that requires accessing the communicator should stay in Tpetra. But I don't have as good a picture of Tpetra as you do, so take it with a grain of salt.

As for MueLu, my plan is to work on moving to the Kokkos Refactor branch asap, as we get benefits even with a single thread. The reason I even started thinking about this issue is that I tried to compare the original and Kokkos Refactor variants of MueLu, and noticed a difference in a factory that just detects Dirichlet rows in a matrix.

aprokop commented 8 years ago

I would suggest running these tests through vtune to see where the differences are. At the conclusion of the milestone we had the performance between Epetra and Tpetra very similar for this operation. But that was pre-Kokkos and pre-KokkosRefactor.

If I find time, I'll do the VTune run. You are right about the milestone, as I believe performance of, say, Gauss-Seidel was the same for Epetra/Tpetra.

mhoemmen commented 8 years ago

As Kokkos is our future, I think in Trilinos we would be in a good shape if we just start skipping directly to Kokkos objects.

If "we" means Trilinos developers, I think that's acceptable. The main issue is that the features many users want in the Tpetra interface are exactly the features that are missing or weakly implemented in the Kokkos interface. I'm thinking particularly of insert* (doesn't exist in Kokkos) and sumInto* / replace* (Kokkos' search hasn't been optimized). I don't want us to just move features from Tpetra to Kokkos without thought (see Kokkos' current sumIntoValues).

We can think of Tpetra's current interface as a vehicle for understanding how to improve fill performance for any interface. I'm not committed to it, other than for backwards compatibility. But it exists, and we can study it and learn how to improve it. Furthermore, enriching the Kokkos interface will help improve Tpetra fill performance, by stripping away complexity and letting Kokkos take charge of local data structures. (Tpetra::CrsMatrix semantics are a mess. I don't think we have ever untangled all the possible states.)

mhoemmen commented 8 years ago

... as I believe performance of, say, Gauss-Seidel was the same for Epetra/Tpetra.

Gauss-Seidel is a different case; it does not use the Tpetra interface for accessing matrix entries, but talks directly to the low-level data structures. I was never happy about putting a public Gauss-Seidel method in Tpetra, but it worked.

maherou commented 8 years ago

I think it is a good separation of concerns to let Kokkos do what it does well. Also, it reduces direct user interaction with Tpetra and reduces direct Tpetra-Kokkos interactions.

mhoemmen commented 8 years ago

I think it is a good separation of concerns to let Kokkos do what it does well.

I agree. It will take me 2-3 months of completely uninterrupted time to unravel Tpetra's semantics so I don't break backwards compatibility while pushing functionality into Kokkos, but I still agree.

Also, it reduced direct user interaction with Tpetra and reduces Tpetra-Kokkos direct interactions.

Could you comment further on these points?

maherou commented 8 years ago

If Kokkos is used directly, then Tpetra is no longer in the middle between the user and Kokkos, so users make fewer Tpetra calls and Tpetra makes fewer Kokkos calls.

Mike

nmhamster commented 8 years ago

@maherou, Mike, for many years we (SNL) have advocated strongly for abstraction in the software stack to maintain flexibility, and I think this has been to our advantage. If we directly blend Kokkos and Tpetra, are we losing that ability in the future? In our discussions with users during the codesign sessions, most times this has come up we have seen users express a preference for Tpetra interfaces over raw structures. One argument has been that not everyone buys into Kokkos, especially outside of Sandia, for a number of reasons; by having clear interfaces that are abstracted out, they can continue to use their alternatives inside their applications. There may be some relatively small performance overheads to this approach (small as long as we get most of the inlining correct), but longer term the abstraction may provide greater flexibility.

srajama1 commented 8 years ago

Just a note of caution: not a lot of Tpetra functionality needs to be pushed down to the Kokkos level. Fixing this performance regression could be an immediate need.

In the long term, I would argue the petra abstractions are too monolithic for the regime of massive node-level parallelism (e.g., FillComplete(), all-or-nothing atomics, poor support for asynchronous operations, tasking, etc.). This needs not a refactor but a complete rethinking of the design.

crtrott commented 8 years ago

One comment here about expected slowdowns of Tpetra vs. direct usage of Kokkos. Generally the Tpetra interface is intended to handle a distributed-memory object. I believe that means it will always have a much higher overhead, since in general it requires more error checks and other safeguards (like a richer state set) than a purely local object. Those things cost. The question in my mind is whether Tpetra objects should have "call through" functions to the underlying local Kokkos objects which omit some of the safeguards, or whether it's better to say: if you are doing a purely local operation, get the local objects and use them to perform it.

nmhamster commented 8 years ago

It seems a good API approach to design for calls through into local objects without requiring explicit access to the Kokkos objects. Good abstraction has in many ways given us the ability to really use products like Kokkos. Bypassing it to avoid performance slowdowns seems to me the wrong long-term path without more analysis of why the code is slower, how we can minimize the overhead, or how we can redesign the API to maintain the abstraction. This may not be achievable in the end, but I would think the aggressive inlining we see in many compilers should mean we get down to raw calls quite quickly where the API design allows for it (we certainly see this in many Kokkos compiles already).

tawiesn commented 8 years ago

Tpetra and Xpetra allow access to the underlying local Kokkos objects (e.g., the Kokkos::CrsMatrix through the getLocalMatrix routine in Tpetra/Xpetra); Andrey is using them for his experiments. In my opinion, from a user's point of view, accessing the matrix/vector data through the local Kokkos objects is not very complicated. It's different if we're talking about matrix construction: one can locally fill the Kokkos::CrsMatrix objects (using parallel_for, etc.) and then provide them to the constructor that generates the corresponding Tpetra/Xpetra::CrsMatrix object. Doing this right is extremely complicated, as you need expert knowledge of how Tpetra organizes the global and local indices. For example, the user has to take care of which entries have to be communicated through MPI and put them in the right order after the local matrix entries. This is the exact opposite of what the user wants to do (especially when used to insertLocal and sumInto-like routines!).

Long story short: I think one has to distinguish between two use cases. 1) Accessing the underlying data and changing entries (without changing the communication pattern) is quite easy. One can use the already existing "call through" functions (e.g., getLocalMatrix or getHostLocalView). Of course, one can try to speed up the existing access routines (the ones using ArrayViews), but I would rather warn people off them and move them to Kokkos (we have to keep the old routines for backwards compatibility, though). The code looks different, of course (one has to port it), but the logic stays the same. It's up to users whether they want to do that (and gain performance) or stay with the old-style slow code. It might help and motivate users to migrate to Kokkos! 2) Creating new Tpetra matrices (setting up MPI communication) from local Kokkos objects can be done by experts. I am not sure whether the API could be made more user friendly (somehow one has to define the column map...). In its current state, most users will be overwhelmed.

mhoemmen commented 8 years ago

btw I really appreciate that we're having a good discussion about this. Sometimes I feel like we don't have enough of these sorts of discussions. Also, there's nothing personal about this for me. When I first started working on Tpetra, I prioritized correctness and backwards compatibility, taking lessons from Mike's TUG and other talks on the subject. This meant retaining some perhaps unfortunate design decisions, like forcing Tpetra's users to deal with the Teuchos Memory Management classes when filling a matrix. Epetra hides MPI for good reasons; Tpetra should have done the same with Teuchos stuff, where possible.

mhoemmen commented 8 years ago

The question in my mind is whether Tpetra objects should have "call through" functions to the underlying local Kokkos objects which omit some of the safe guards, or if its better to say: if you are doing a purely local operation, get the local objects to perform them.

We don't actually have to resolve this now. If we enrich the Kokkos interface with exactly what it needs for local operations, then it will be easy for Tpetra to invoke that interface. Stripping unnecessary complexity out of Tpetra's innards will make Tpetra's interface faster too. It will also help us figure out whether Tpetra has necessary complexity that will always make filling into Tpetra a lot slower than filling into Kokkos.

Remember that some of Tpetra's internal complexity comes from the Petra Object Model. For example, the POM requires that multiple matrices be able to share the same graph. Both Epetra and Tpetra reflect this in their implementations. Christian and I spent a lot of effort preserving semantics when we did the Kokkos refactor of Tpetra. Think about how much trouble it was for apps and other Trilinos packages when we changed MultiVector from copy to view semantics to make it act more like Kokkos. Now imagine we change a bunch of other things too. (Remember the idiomatic way to test whether an Epetra matrix is empty on a process -- ask if it is neither locally nor globally indexed? This is inextricably tied to lazy allocation, which we won't be able to do if we want thread-parallel fill. Yet, I'm almost sure somebody will complain if we break this invariant in Tpetra. I broke a lot of tests when I tried getting rid of lazy allocation; see Issue #119.)

I don't mean to say that we should keep Tpetra's interface as it is. What I mean to do is advocate caution.

mhoemmen commented 8 years ago

I think the main issue is virtual method call overhead. Epetra_CrsMatrix::NumMyEntries is NOT virtual. Epetra_CrsMatrix::NumMyRowEntries IS virtual, inherited from Epetra_RowMatrix. Tpetra::CrsMatrix::getNumEntriesInLocalRow is also virtual, inherited from Tpetra::RowMatrix. Here are some results (run time in seconds, over 2000 trials, with 10000 rows in the matrix and 10 entries per row) with Clang 3.7 on MacOS X on a laptop, release build:

Epetra ExtractMyRowView (virtual)          0.3318  (1)
Epetra NumMyEntries (NOT virtual)          0.02076 (1)
Epetra NumMyRowEntries (virtual)           0.1363  (1)
Kokkos parallel (Threads?)                 0.01808 (1)
Kokkos sequential                          0.00918 (1)
Tpetra getLocalRowView (virtual)           0.8482  (1)
Tpetra getNumEntriesInLocalRow (virtual)   0.265   (1)

When I replace the implementation of Tpetra::CrsMatrix::getNumEntriesInLocalRow with `return 10;`, the getNumEntriesInLocalRow timing becomes 0.09456. That actually beats Epetra's NumMyRowEntries. Thus, the issue we can't fix is virtual method call overhead. I think we can fix Tpetra's performance beyond that, for example by getting rid of the getRowInfo work that happens inside (which does a lot more than the method really needs).

mhoemmen commented 8 years ago

I'm moving this to the backlog. We've noted some possible improvements in Tpetra, but please note that a big part of the performance issue for methods that don't do much work is virtual method call overhead.

mhoemmen commented 8 years ago

I'm closing this issue, because we need more specific use cases to drive it. KokkosSparse::CrsMatrix already has a low-overhead row access interface.