kokkos / kokkos-kernels

Kokkos C++ Performance Portability Programming Ecosystem: Math Kernels - Provides BLAS, Sparse BLAS and Graph Kernels
Other
311 stars 98 forks source link

Enhancement: avoid computing/storing zero entries during SPGEMM #1023

Open jhux2 opened 3 years ago

jhux2 commented 3 years ago

In CPU versions of the sparse matrix matrix product A*B in Tpetra and Epetra, there is logic to detect zeros entries in A and avoid computing with them. The GPU and OpenMP implementations in KokkosKernels don't have this short-circuit logic. I attempted to put the logic into the computation phase, but this resulted in runtime errors. Brian K. suggested that similar logic might have to also be put in the symbolic phase. The difficulty that I ran into is that the symbolic interface doesn't currently need the matrix values, and I'm not sure I understand the code structure well enough to introduce this new parameter.

Backstory: It's not uncommon for MueLu to generate matrices with zero entries during coarsening. (This is done in order to avoid having to create a new CrsGraph, which is expensive.) As a result, an AMG hierarchy that is setup on CPU can have a much lower operator complexity than the same hierarchy setup on GPU. (Operator complexity is a measure of the nonzeros in the multigrid hierarchy, lower is better.)

@brian-kelley @csiefer2 @lucbv @srajama1

srajama1 commented 3 years ago

I want to mention that the CPU logic is just a big hack. It uses values in the symbolic phase. (i) This is against good software design. This will confuse almost every one in the community who assume symbolic works on the graph and not the values (ii) this will basically make the reuse of symbolic useless, as the symbolic need to be called again and again every iteration of non-linear solve as the values of A will change.

One correct way to do this is compress the C as we detect the values of zero, however this will require creating a new CrsGraph in the numeric.

Note that this issue is not about avoiding compute (which is quite easy to do in numeric) but creating the correct CrsGraph in the first place.

I have two suggestions (i) One could do the same logic in MueLu where values are checked before CrsGraph creation, so a new CrsGraph does not have to be created as mentioned above (ii) can we find an alternative algorithm to the one that introduces so many zeros.

I am suggesting (i) because this is a MueLu specific use, and it is better to have it in MueLu. I am suggesting (ii) because this approach whether it is in MueLu or Tpetra/Kokkos Kernels avoids symbolic reuse which will increase setup costs. It is better long term to find an approach that doesn't increase setup costs.

jhux2 commented 3 years ago

@srajama1 I agree that passing values in during symbolic is counter-intuitive and confusing.

I'm not sure I understand suggestion (i) above. What we do now is create a second "filtered" matrix Af that has the same underlying CrsGraph as the application's discretization matrix A. If we keep all connections during coarsening, then Af=A. If we use strength-of-connection detection, then if entry a_ij in A is determined to be weak, then af_ij is set to zero. Otherwise, af_ij = a_ij. So we're using a CrsGraph that the application already created (or that was created by an RAP operation) -- there's no way to avoid creating that graph or short-circuiting during its creation.

srajama1 commented 3 years ago

@jhux2 What I am suggesting is in the same spirit as the spgemm changes. You do not use the graph based on input matrix (or from previous RAP), but wait till you decide on strength of connection. Once you determine the strength of connections, then you create a new filtered CrsGraph and insert values of the filtered matrix in another pass of input matrix. This would require two passes of A, but would work on CPUs and GPUs. The filtered matrix will have the "correct" graph in this case.

jhux2 commented 3 years ago

@srajama1 Thanks for the explanation. That would require one more CrsGraph construction than what we have now.

srajama1 commented 3 years ago

@jhux2 That is correct, but lot less memory than we have now (where we don't drop any entries because of KK symbolic), right? This is also cleaner and it is just pointers and column indices, not values.

lucbv commented 3 years ago

I also suggested an approach where a mask is provided with the matrix, it's cleaner than relying on zeros of the matrix but it adds additional data and requires us to modify the implementation of spgemm. On the other hand it is pretty flexible and only require an addition view that would be equivalent to the column indices.

jhux2 commented 3 years ago

@lucbv Introducing a mask would still require logic changes in SPGEMM to account for the nonzeros, would it not?

@srajama1 wrote:

That is correct, but lot less memory than we have now (where we don't drop any entries because of KK symbolic), right? This is also cleaner and it is just pointers and column indices, not values.

I don't know what the overall net benefit would be.

brian-kelley commented 3 years ago

@jhux2 My 2c is that it would be simpler and faster to just reconstruct the graph after dropping. SpGEMM is very memory intensive and may read out each row of B many times (once each time it appears as a column entry of A). So doing some kind of mask structure, or just reading directly from values would effectively double the number of accesses it does.

But rebuilding the local matrix without explicit zeros is simple and fast - a reduce to count the nonzeros, and a scan to fill them into the new matrix. We already have the maps, and we don't care about sorting/merging, so we could use expertStaticFillComplete which is supposed to be fast.

The only downside I can think of is that this approach increases the memory high water mark (in the worst case doubling it, if no entries were dropped). Is running out of memory something that could ever happen in practice? (we're about to do an SpGEMM, so we should have at least enough memory for C, but that could be less than both A and B).

lucbv commented 3 years ago

@jhux2 yes that why I stated:

it ... requires us to modify the implementation of spgemm

@brian-kelley the main computation needed with the filtered A is the calculation of import and export objects that would probably change since the column space would change for A. There rest is indeed computationally cheap. Regarding the memory foot print, we hope that it would not be an issue but on the finest level that would still be a significant memory increase, assuming we generally do not drop more than say 20% of entries then would still need to allocate 80% of A.

jhux2 commented 3 years ago

@lucbv said:

the main computation needed with the filtered A is the calculation of import and export objects that would probably change since the column space would change for A. There rest is indeed computationally cheap.

@lucbv Thanks. That's exactly what concerns me, not so much the memory high water mark. @brian-kelley But maybe there's a way to build the import/export object cheaply by leveraging those from the original matrix.

srajama1 commented 3 years ago

What @brian-kelley is proposing and what I proposed is in the similar spirit. Just use the memory for better design and long term maintenance of the software.

Thinking more about it, I don't know how this is too much extra memory. A and A_f get stored any way. What I am saying is the store the graph of A_f, which is the same order as the values array for A_f that you store + a pointer array.

What @brian-kelley is saying is compute the result and add a helper to drop entries and compress matrix. We could give this helper (input is a CRS matrix, o/p is a new CRS matrix with zeroes dropped) if that is useful. You could call it from MueLu or Tpetra.

The second one just increases the high watermark for the memory.

jhux2 commented 3 years ago

@srajama1 @brian-kelley @lucbv Thanks for the good discussion. Thinking it over, I now agree with Brian that the import (export) object would still be valid because the column indices of the matrix with zeros removed are a subset of the column indices of the original matrix. expertStaticFillComplete recomputes the column map, so there's some communication, but the really heavy operation of creating an import (export) object is avoided.