trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/

MueLu: Proposal: merge CoalesceDrop and FilteredA factories #1676

Closed aprokop closed 3 years ago

aprokop commented 7 years ago

@trilinos/muelu @jhux2 @tawiesn

Summary

Right now both factories do very similar things. Each one has a loop that goes through the matrix and filters some entries. The proposal combines those loops into one, thus speeding up execution.

Description

The current CoalesceDropFactory is responsible for a single thing: constructing an (amalgamated) filtered graph that is later used in the aggregation. For the filtered scenario (when the drop tolerance is not 0) it has a loop that goes through rows one by one and for each row entry determines whether to drop it (based on the original matrix or the distance Laplacian). It simultaneously constructs the compressed graph (LWGraph).

The construction of the filtered matrix in the FilteredAFactory has two variants: 1) reusing the graph of the original matrix and zeroing out entries; and 2) constructing a brand new matrix with the compressed graph. In both cases, the looping is done through rows and uses a filter array that helps to determine which matrix values to drop/zero out.

The loop in the FilteredAFactory is remarkably similar to the loop in the CoalesceDropFactory. In fact, the only difference is that the latter constructs only rows and columns, while the former constructs values. The work is duplicated, and it is actually even more expensive to go through the matrix a second time, as we have to determine again which entries to filter.
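For concreteness, the per-entry decision that both loops currently re-evaluate is a test of roughly this form (a minimal sketch of the classical criterion against the diagonal; the exact formula and types in MueLu may differ):

#include <cmath>

// Illustrative classical dropping test: a_ij is dropped when
// |a_ij|^2 <= eps^2 * |a_ii * a_jj|. Both the CoalesceDropFactory loop and
// the FilteredAFactory loop have to evaluate an equivalent check per entry.
bool dropEntry(double aij, double aii, double ajj, double eps) {
  return std::abs(aij) * std::abs(aij) <= eps * eps * std::abs(aii * ajj);
}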

Proposed solution

In the proposed solution, the CoalesceDropFactory (or its renamed version) will construct both the LWGraph that is used in aggregation and, if desired, the filtered matrix. The FilteredAFactory goes away. The filtering loop in the factory will construct rows, columns, and values. For the block variant it will also construct coalesced rows and columns.
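As an illustration only (plain CRS arrays instead of the actual Xpetra/Kokkos containers; all names here are hypothetical), the merged loop would look roughly like this: one pass over the matrix that fills the row/column structure used for the LWGraph and, at the same time, the values of the filtered matrix:

#include <cmath>
#include <vector>

// Hypothetical single-pass filter: builds the dropped graph (for aggregation)
// and the filtered matrix values at the same time, instead of two separate loops.
struct FilteredCrs {
  std::vector<int>    rowptr, colind;   // structure used to build the LWGraph
  std::vector<double> values;           // values of the filtered matrix
};

FilteredCrs filterOnce(const std::vector<int>&    rowptr,
                       const std::vector<int>&    colind,
                       const std::vector<double>& values,
                       double eps) {
  const int numRows = static_cast<int>(rowptr.size()) - 1;
  FilteredCrs out;
  out.rowptr.assign(numRows + 1, 0);

  // Grab diagonal entries once (classical dropping compares against them).
  std::vector<double> diag(numRows, 0.0);
  for (int i = 0; i < numRows; ++i)
    for (int k = rowptr[i]; k < rowptr[i + 1]; ++k)
      if (colind[k] == i) diag[i] = values[k];

  for (int i = 0; i < numRows; ++i) {
    for (int k = rowptr[i]; k < rowptr[i + 1]; ++k) {
      const int    j    = colind[k];
      const double aij  = values[k];
      const bool   keep = (i == j) ||
          std::abs(aij) * std::abs(aij) > eps * eps * std::abs(diag[i] * diag[j]);
      if (keep) {
        out.colind.push_back(j);    // goes into the graph / LWGraph
        out.values.push_back(aij);  // goes into the filtered matrix
      }
      // Dropped entries simply never enter either structure.
    }
    out.rowptr[i + 1] = static_cast<int>(out.colind.size());
  }
  return out;
}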

Benefits

Looping through the level matrix is done once, potentially achieving a significant speedup, especially when reusing the matrix graph.

This would benefit applications that use MueLu with filtering.

Q&A

Are there issues with block systems?

So far, I don't see any issues. The block systems are treated by first filtering a block row, and then coalescing it. Constructing the filtered matrix is independent of the coalescing, though they both depend on the filtering.

What about a special non-lightweight graph branch in CoalesceDropFactory?

Tricky question. I don't have a good answer as I don't know what that does and what is special about it. @tawiesn ?

Does it benefit Kokkos version?

It should benefit the Kokkos version in the same way. In fact, it could be merged with another idea where we do not even construct the compressed graph for LWGraph_kokkos but rather provide a wrapper around the graph of the original matrix. This would eliminate the 2nd loop in the current CoalesceDropFactory_kokkos.
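A very rough sketch of that wrapper idea (illustrative only; the real LWGraph_kokkos interface is different and Kokkos::View based): instead of compressing into a new graph, keep the original matrix's row structure plus a per-entry keep/drop mask, and answer neighbor queries on the fly:

#include <utility>
#include <vector>

// Illustrative wrapper: exposes the original matrix graph as the aggregation
// graph without building a compressed copy. A per-entry mask records which
// entries were dropped; neighbor queries skip masked entries on the fly.
// (All names are made up for this sketch.)
class GraphWrapper {
public:
  GraphWrapper(const std::vector<int>& rowptr,
               const std::vector<int>& colind,
               std::vector<char>       keepMask)
    : rowptr_(rowptr), colind_(colind), keep_(std::move(keepMask)) {}

  // Visit the retained neighbors of row i without materializing a new graph.
  template <class Visitor>
  void forEachNeighbor(int i, Visitor&& visit) const {
    for (int k = rowptr_[i]; k < rowptr_[i + 1]; ++k)
      if (keep_[k]) visit(colind_[k]);
  }

private:
  const std::vector<int>& rowptr_;
  const std::vector<int>& colind_;
  std::vector<char>       keep_;
};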

What are the steps to implement this?

I will start implementing it in the kokkos-refactor branch of MueLu for the scalar case. This can be done independently from the default branch by modifying the dependency tree in the ParameterListInterpreter. If the results demonstrate the feasibility and speedup of this approach, we could discuss backporting it to the non-kokkos version.

tawiesn commented 7 years ago

What about a special non-lightweight graph branch in CoalesceDropFactory?

The main difference to the lightweight graph version is that it uses AmalgamationFactory routines to translate between nodes and dofs and that it can deal with arbitrary maps (using global ids). It is used only for special applications with non-standard GIDs, and I'm not planning on porting it to Kokkos. Most of the multiphysics applications I'm working on now use Thyra-based GID ordering and are therefore compatible with the lightweight version. The background idea was/is that a user could have a special AmalgamationFactory and all information about mapping Node IDs <-> DOF IDs would be handled by the AmalgamationFactory only. However, we never really used the full potential of that approach.

aprokop commented 7 years ago

@tawiesn Thanks for the explanation! Are there current users of that branch? Do folks in TUM still use it?

tawiesn commented 7 years ago

@aprokop I like the idea of merging the thresholding and CoalesceAndDropFactory.

I think you hard-code the standard thresholding mechanism in the CoalesceDropFactory, right? Do you think it is possible to have a more flexible way to define the dropping strategy? I know of external applications (e.g., multigrid for UQ with ensemble scalar types) which would need more sophisticated, ensemble-specific dropping mechanisms. Well, they could still have a separate filtering factory producing a filtered "A" which is used as input for the CoalesceDropFactory (without filtering then), but maybe we can do better than this. Any ideas?

tawiesn commented 7 years ago

@aprokop Yes, there are current users of the old branch in the US (external project with industry). The people in Munich might also use it (but with an old version of Trilinos; not sure they will ever update). I don't think I would remove it; I would just "fade it out". I know that for the external project we don't need Kokkos, as they use Windows (which still cannot compile Kokkos anyway)... As I said: I would focus on the Kokkos branch and make sure we gain the desired performance there.

aprokop commented 7 years ago

As I said: I would focus on the Kokkos branch and make sure we gain the desired performance there.

I would be fine with that. But if we demonstrate significant gains there, it could also benefit regular users (not using Kokkos). Well, maybe @jhux2 will port it then :)

aprokop commented 7 years ago

I think you hard-code the standard thresholding mechanism in the CoalesceDropFactory, right? Do you think it is possible to have a more flexible way to define the dropping strategy?

For now, yes, hardcoding. But this is really orthogonal to the discussion. It does not matter how you drop entries as long as you can do that independently for each entry. We can probably think of a mechanism to extend it by making the dropping a kokkos function, or something like that.

tawiesn commented 7 years ago

For now, yes, hardcoding. But this is really orthogonal to the discussion. It does not matter how you drop entries as long as you can do that independently for each entry. We can probably think of a mechanism to extend it by making the dropping a kokkos function, or something like that.

Ok. I'm fine with hardcoding it. Can be extended any time later.

aprokop commented 7 years ago

@tawiesn Thinking further, we can actually make use of it now. If we make the loop a function/functor and template it on some Filter which supports

bool operator()(LocalOrdinal i, LocalOrdinal j, Scalar value, MagnitudeType eps);

we would be able to unify the classical and distance Laplacian variants. I think this would work. Just need to make sure we don't lose performance.
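Something like this minimal sketch (plain int/double instead of LocalOrdinal/Scalar/MagnitudeType, and hypothetical names): the loop is templated on a Filter that implements the operator() above, so the classical and distance Laplacian criteria become two interchangeable functors sharing one loop:

#include <cmath>
#include <vector>

// Illustrative dropping functor matching the proposed interface
//   bool operator()(LocalOrdinal i, LocalOrdinal j, Scalar value, MagnitudeType eps)
// returning true when the entry should be dropped.
struct ClassicalFilter {
  const std::vector<double>& diag;  // diagonal entries a_ii, one per row
  bool operator()(int i, int j, double aij, double eps) const {
    return i != j &&
           std::abs(aij) * std::abs(aij) <= eps * eps * std::abs(diag[i] * diag[j]);
  }
};

// The single filtering loop, templated on the dropping strategy; a user could
// plug in an application-specific functor (e.g. an ensemble dropping rule).
template <class Filter>
std::vector<char> markDropped(const std::vector<int>&    rowptr,
                              const std::vector<int>&    colind,
                              const std::vector<double>& values,
                              double eps, const Filter& drop) {
  std::vector<char> dropped(values.size(), 0);
  const int numRows = static_cast<int>(rowptr.size()) - 1;
  for (int i = 0; i < numRows; ++i)
    for (int k = rowptr[i]; k < rowptr[i + 1]; ++k)
      dropped[k] = drop(i, colind[k], values[k], eps) ? 1 : 0;
  return dropped;
}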

tawiesn commented 7 years ago

@aprokop What are the parameters of the functor? Row, column, matrix entry and a drop tolerance? Returns true if the entry should be kept, otherwise false? The user can provide an application-specific functor. If they do not, we would use the default functor (or the distLap functor?)

aprokop commented 7 years ago

What are the parameters of the functor? Row, column, matrix entry and a drop tolerance?

Yes!

Returns true if the entry should be kept, otherwise false?

I was thinking the opposite: true if dropped, false if not.

The user can provide an application-specific functor. If they do not, we would use the default functor (or the distLap functor?)

Yes. The question would be: what's the best way to inject user's C++ code into MueLu.

aprokop commented 7 years ago

I've implemented (but not yet fully tested) the first version of the new (merged) CoalesceDropFactory and ran the driver on a single processor with OMP_NUM_THREADS=1

with the following parameters:

$ ./MueLu_Driver.exe --matrixType=Laplace3D --nx=200 --ny=200 --nz=200 --node=openmp --xml=scaling_kokkos.xml --solver=none

using the following input deck:

<ParameterList name="MueLu">
  <!-- ===========  GENERAL ================ -->
    <Parameter        name="verbosity"                            type="string"   value="high"/>
    <Parameter        name="print initial parameters"             type="bool"     value="true"/>
    <Parameter        name="use kokkos refactor"                  type="bool"     value="true"/>
    <Parameter        name="synchronize factory timers"           type="bool"     value="true"/>
    <Parameter        name="coarse: max size"                     type="int"      value="1000"/>
    <Parameter        name="multigrid algorithm"                  type="string"   value="sa"/>
    <Parameter        name="transpose: use implicit"              type="bool"     value="true"/>

    <!-- start of default values for general options (can be omitted) -->
    <Parameter        name="max levels"                           type="int"      value="10"/>
    <Parameter        name="number of equations"                  type="int"      value="1"/>
    <Parameter        name="sa: use filtered matrix"              type="bool"     value="true"/>
    <!-- end of default values -->

  <!-- ===========  AGGREGATION  =========== -->
    <Parameter        name="aggregation: type"                    type="string"   value="uncoupled"/>
    <Parameter        name="aggregation: drop scheme"             type="string"   value="classical"/>
    <!-- Uncomment the next line to enable dropping of weak connections, which can help AMG convergence
         for anisotropic problems.  The exact value is problem dependent. -->
    <Parameter        name="aggregation: drop tol"                type="double"   value="0.02"/>
    <!-- <Parameter        name="filtered matrix: reuse graph"         type="bool"     value="false"/> -->

  <!-- ===========  SMOOTHING  =========== -->
    <Parameter        name="smoother: type"                       type="string"   value="CHEBYSHEV"/>
    <ParameterList    name="smoother: params">
      <Parameter      name="chebyshev: degree"                    type="int"      value="2"/>>
      <Parameter      name="chebyshev: ratio eigenvalue"          type="double"   value="7"/>
      <Parameter      name="chebyshev: min eigenvalue"            type="double"   value="1.0"/>
      <Parameter      name="chebyshev: zero starting solution"    type="bool"     value="true"/>
    </ParameterList>
</ParameterList>

The result is attached. The cost of the new CoalesceDropFactory has slightly increased, but the full cost of the FilteredAFactory is gone.

[figure: new_vs_new]

This is now a single loop instead of three, plus an additional (slightly smarter) compression routine in the new CoalesceDropFactory. It also supports constructing a new graph, but with these parameters that is more expensive, as tol=0.02 does not actually drop anything.

The code is available here.

aprokop commented 7 years ago

And here is how the new CoalesceDropFactory compares against the non-Kokkos (standard) branch.

[figure: old_vs_new]

tawiesn commented 7 years ago

@aprokop Nice graphs. It's hard to see how much time we save with your changes. Seems that the simulation is about 2 seconds faster than before? Can you give numbers on how much time we save relative to the overall solver time? For the non-kokkos version: have you ported your changes to the non-kokkos version or do you just compare the kokkos run with OMP_NUM_THREADS=1 with the standard non-kokkos version?

tawiesn commented 7 years ago

@jhux2 What's the status of the matrix-matrix multiplication? @aprokop Did you use the most recent changes there? Smoother setup (Chebyshev) as well as RAP product and prolongator smoothing (i.e. MM multiplication) are still the most expensive parts. And if I interpret @aprokop's graphs correctly, the kokkos version is slower than the non-kokkos version.

tawiesn commented 7 years ago

@aprokop Any idea what's wrong with the CoordinatesTransferFactory in the Kokkos version? The timings seem to be off quite a bit...

aprokop commented 7 years ago

Nice graphs. It's hard to see how much time we save with your changes. Seems that the simulation is about 2 seconds faster than before?

Sounds about right. The main thing is that it is no longer 2x as expensive as serial.

Can you give numbers how much time we save relative to the overall solver time?

Problem dependent. This one saves 2 seconds out of 35. Does not sound great... But the hope is that with multiple threads it will scale better.

For the non-kokkos version: have you ported your changes to the non-kokkos version or do you just compare the kokkos run with OMP_NUM_THREADS=1 with the standard non-kokkos version?

I have not ported any changes to non-kokkos. It is exactly what you described: the kokkos run with OMP_NUM_THREADS=1 compared against the standard non-kokkos version.

@aprokop Did you use the most recent changes there? Smoother setup (Chebyshev) as well as RAP product and prolongator smoothing (i.e. MM multiplication) are still the most expensive parts. And if I interpret @aprokop's graphs correctly, the kokkos version is slower than the non-kokkos version.

Yes, this is the tip of develop.

@aprokop Any idea what's wrong with the CoordinatesTransferFactory in the Kokkos version? The timings seem to be off quite a bit...

No idea at the moment; a cursory look did not reveal anything outstanding, and it starts to disappear with multiple threads. My current assumption is that Kokkos::View allocation is comparatively expensive.

aprokop commented 7 years ago

I'm closing this for now, as there is no compelling reason to spend time backporting this to the non-Kokkos branch.

jhux2 commented 7 years ago

Actually, I believe we may need to backport this to the non-Kokkos version of MueLu.

github-actions[bot] commented 3 years ago

This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity. If you would like to keep this issue open please add a comment and remove the MARKED_FOR_CLOSURE label. If this issue should be kept open even with no activity beyond the time limits you can add the label DO_NOT_AUTOCLOSE.

github-actions[bot] commented 3 years ago

This issue was closed due to inactivity for 395 days.