
MueLu: Proposal: use special form of the tentative matrix for smoothed prolongator construction #1743

aprokop opened this issue 7 years ago

aprokop commented 7 years ago

@trilinos/muelu @jhux2

Summary

The tentative factory produces a matrix of a special form. This special form can be used to speed up the matrix-matrix multiply. For example, when the nullspace is a constant vector, multiplying by the tentative matrix is equivalent to averaging columns of the matrix A. In addition, the tentative matrix can be stored in its "multivector" form, with its columns compacted into a multivector, since the aggregates do not overlap.
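
To make the column-averaging remark concrete, here is a tiny serial illustration (dense storage, made-up names, a single constant nullspace vector; just to fix the idea, not MueLu code):

```cpp
#include <cmath>
#include <vector>

// A is an n x n matrix stored dense row-major; "aggregate" lists the columns of A
// that belong to a given aggregate j (hypothetical layout, illustration only).
std::vector<double> apply_A_times_Ptent_column(const std::vector<double>& A, int n,
                                               const std::vector<int>& aggregate)
{
  // With a constant nullspace, the QR step puts 1/sqrt(|aggregate|) in the rows
  // of Ptent's column j, so column j of A*Ptent is just a scaled sum (an average
  // up to scaling) of the columns of A that belong to the aggregate.
  const double scale = 1.0 / std::sqrt(static_cast<double>(aggregate.size()));
  std::vector<double> col(n, 0.0);
  for (int i = 0; i < n; ++i)
    for (int k : aggregate)
      col[i] += scale * A[i * n + k];
  return col;
}
```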

Description

Let us start with the situation where we do construct the QR factorization.

In this case, let us store the constructed Q factors in a single multivector. The multivector is based on the same map that the current Ptent matrix is on, i.e. the row map of A. The multiplication of A by P is then similar to a matrix-vector multiplication, with the main difference that instead of summing products into a single value, we put the product pairs into bins corresponding to aggregates. To do that, we need to be able to translate local column ids to global aggregate ids (ids in the coarseMap). This may require an import of an integer vector.
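
A rough sketch of the matvec-like kernel I have in mind (serial CRS, a single nullspace vector, made-up names; this is just to fix the idea, not an implementation):

```cpp
#include <vector>

struct CrsLocal {                 // local CRS arrays of A
  std::vector<int>    rowptr, colind;
  std::vector<double> values;
};

// q:          the Q multivector entries, indexed by local column id of A
//             (already imported onto A's column map)
// agg_of_col: local column id of A -> local aggregate (coarse column) id
// num_aggs:   number of local aggregates
// Returns dense rows of A*Ptent; compression into CRS would follow.
std::vector<std::vector<double>>
multiply_A_Ptent(const CrsLocal& A, const std::vector<double>& q,
                 const std::vector<int>& agg_of_col, int num_aggs)
{
  const int numRows = static_cast<int>(A.rowptr.size()) - 1;
  std::vector<std::vector<double>> AP(numRows, std::vector<double>(num_aggs, 0.0));
  for (int i = 0; i < numRows; ++i)
    for (int jj = A.rowptr[i]; jj < A.rowptr[i + 1]; ++jj) {
      const int k = A.colind[jj];
      AP[i][agg_of_col[k]] += A.values[jj] * q[k];  // bin the product pair by aggregate
    }
  return AP;
}
```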

Q&A

What changes if we do not want to do QR decomposition?

In this case, the TentativePFactory will set the "P" multivector to be the nullspace.

What is the expected performance improvement?

Ideally, we would skip the global tentative matrix construction and instead just construct a local multivector and then do an import (and, if the domain map of A is the same as the row map, we already have the Import object for it). In the SaPFactory, the hope is that the matrix-vector-multiply-like procedure is only about 2-3 times as expensive as a regular matvec and is significantly cheaper than the full matrix-matrix multiplication. We should be able to do it using just local indices and parallelize over threads. There will be an additional cost for compression, as we do not know the number of nonzeros per row in the final matrix.
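
A minimal sketch of the import step under these assumptions (the function name and setup are made up; it reuses the importer A already owns, from its domain map to its column map, to bring the locally constructed Q multivector onto A's column map so the kernel above can use purely local indices):

```cpp
#include <Teuchos_RCP.hpp>
#include <Tpetra_CrsMatrix.hpp>
#include <Tpetra_MultiVector.hpp>
#include <Tpetra_Import.hpp>

using crs_matrix_type = Tpetra::CrsMatrix<>;
using SC = crs_matrix_type::scalar_type;
using LO = crs_matrix_type::local_ordinal_type;
using GO = crs_matrix_type::global_ordinal_type;
using NO = crs_matrix_type::node_type;
using MV = Tpetra::MultiVector<SC, LO, GO, NO>;

// qLocal holds the Q factors column-wise on A's row map (the map Ptent's rows
// would live on).  The result lives on A's column map.
Teuchos::RCP<MV> ghostQ(const crs_matrix_type& A, const MV& qLocal)
{
  auto importer = A.getCrsGraph()->getImporter();        // domain map -> column map
  auto qGhosted = Teuchos::rcp(new MV(A.getColMap(), qLocal.getNumVectors()));
  if (importer.is_null()) {
    // No off-process columns: the column map coincides with the domain (= row) map,
    // so a deep copy is enough.
    Tpetra::deep_copy(*qGhosted, qLocal);
  } else {
    qGhosted->doImport(qLocal, *importer, Tpetra::INSERT);
  }
  return qGhosted;
}
```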

How do we optimally perform the matrix-vector-like procedure?

This is an open question. There are some similarities with the sortAndMerge procedure in Tpetra, so it may be possible to learn from that.

aprokop commented 7 years ago

I have a first version implemented. However, it has not gained performance so far; it actually lost about 15%. This is a completely unoptimized version. The good news is that the multiplication kernel is isolated and so can be optimized. The other good news is that the global communication is limited to 2 Import operations and a couple of map constructions so far.

I noticed that Kokkos Kernels' spgemm uses KokkosKernels::HashMapAccumulator. I think this may be useful to me. However, there is no documentation for it. @mndevec Could you please describe what it does and how it is used in the matrix-matrix multiplication? My current assumption is that it allows you to throw (index, value) pairs at it, and it merges them to produce the rows of matrix C. I could be wrong, though.

mndevec commented 7 years ago

@aprokop, yes, unfortunately it lacks documentation, as I have been adding specific optimized functions for my specific needs as I go along. I use it to accumulate (index, value) pairs, as you mentioned.

It is a hashmap accumulator for which you provide 4 arrays in the constructor:

- Begins: sized by the number of hash values; points to the index of the first pair with the given hash value.
- Nexts: sized by the hashmap accumulator; this is the linked-list array, each entry pointing to the next pair having the same hash value.
- Ids: the actual ids, sized by the hashmap accumulator.
- Values: the value of each pair, sized by the hashmap accumulator.
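
Roughly, ignoring the Kokkos and vector-lane details, the layout behaves like this plain C++ sketch (illustration only, with made-up names and sizes; this is not the actual KokkosKernels code):

```cpp
#include <vector>

struct HashAccumulatorSketch {
  std::vector<int>    begins;  // size = hash_size; head of the chain per hash value (-1 = empty)
  std::vector<int>    nexts;   // size = max_entries; next entry with the same hash value
  std::vector<int>    ids;     // size = max_entries; stored (column) index of each pair
  std::vector<double> values;  // size = max_entries; accumulated value of each pair
  int used = 0;                // number of pairs stored so far

  HashAccumulatorSketch(int hash_size, int max_entries)
    : begins(hash_size, -1), nexts(max_entries, -1),
      ids(max_entries), values(max_entries) {}

  // Insert (id, value); if id is already present in its chain, accumulate with "+".
  // Assumes enough capacity (used < max_entries).
  void sum_insert(int id, double value) {
    const int hash = id % static_cast<int>(begins.size());
    for (int e = begins[hash]; e != -1; e = nexts[e]) {
      if (ids[e] == id) { values[e] += value; return; }  // merge into existing pair
    }
    const int e = used++;                                // append a new pair
    ids[e] = id; values[e] = value;
    nexts[e] = begins[hash];                             // link it into the chain
    begins[hash] = e;
  }
};
```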

Some functions are called by a single vector lane, while others are called by a thread with multiple vector lanes. It should be easy to add one for the team level, but I have not needed it yet, so I do not have it; I have been adding functions as I need them. I have functions that perform "+", "bitwise or", and "bitwise and" for accumulation. When I have free time, I am going to change it so that it takes an operator provided by the user, so that it is portable.

I was saving the interface for my internal use. If you think you need this, and if you cannot use Kokkos::UnorderedHashMap, we can talk more and work on polishing KokkosKernels::HashMapAccumulator, and integrate the functionality provided by Kokkos::UnorderedHashMap as well.

The differences between KokkosKernels::HashMapAccumulator and Kokkos::UnorderedHashMap are as follows:


srajama1 commented 7 years ago

@mndevec: I believe a lot more people will benefit from a vector-, thread-, or team-level accumulator. This is a primary differentiator. Let us plan on allowing Andrey to use it first at the thread level. The team-level optimization can come later.

github-actions[bot] commented 3 years ago

This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity. If you would like to keep this issue open please add a comment and remove the MARKED_FOR_CLOSURE label. If this issue should be kept open even with no activity beyond the time limits you can add the label DO_NOT_AUTOCLOSE.

github-actions[bot] commented 3 years ago

This issue was closed due to inactivity for 395 days.