
MueLu: Drastic slowdown in preconditioner computation #12175

Open wppowers opened 1 year ago

wppowers commented 1 year ago

I am working on upgrading from version 12.10.1 of Trilinos to version 14. I have noticed that in certain instances, the performance is significantly worse in version 14. I have narrowed this regression down to the preconditioner computation (both in the initial creation and in subsequent re-computations). I have attached some small sample programs and inputs which can be used to reproduce the issue. I am seeing that version 14 is roughly 35X slower in computing the preconditioner.
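
For context, the time in question is spent in MueLu's multigrid setup, reached through the Tpetra adapter; below is a minimal sketch (not the attached driver, and the aliases and parameter contents are placeholders) of the kind of call being timed:

#include <MueLu_CreateTpetraPreconditioner.hpp>
#include <Teuchos_ParameterList.hpp>
#include <Tpetra_CrsMatrix.hpp>

using SC = Tpetra::CrsMatrix<>::scalar_type;
using LO = Tpetra::CrsMatrix<>::local_ordinal_type;
using GO = Tpetra::CrsMatrix<>::global_ordinal_type;
using NO = Tpetra::CrsMatrix<>::node_type;

// Build (or rebuild) a MueLu preconditioner for a Tpetra matrix.
// Each call runs the full multigrid setup, which is the phase that slowed down.
Teuchos::RCP<MueLu::TpetraOperator<SC, LO, GO, NO>>
computePreconditioner(const Teuchos::RCP<Tpetra::CrsMatrix<SC, LO, GO, NO>>& A,
                      Teuchos::ParameterList& mueluParams)
{
  Teuchos::RCP<Tpetra::Operator<SC, LO, GO, NO>> opA = A;
  return MueLu::CreateTpetraPreconditioner(opA, mueluParams);
}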

I am assuming this is a performance regression, but I have also included MueLu logs from runs of the sample programs, one linked against version 12.10.1 and one linked against version 14, in case this reveals any settings that may be resulting in the poor performance I am seeing with version 14.

Any help in resolving this issue would be greatly appreciated.

If any additional information is needed, please let me know.

@jhux2 @csiefer2

exampleFiles.tar.gz

github-actions[bot] commented 1 year ago

Automatic mention of the @trilinos/muelu team

jhux2 commented 1 year ago

@trilinos/muelu

cgcgcg commented 1 year ago

Looks like something quite weird about the graph coloring:

Algo "Graph Coloring"
  algorithm: serial
  num colors: 36927

wppowers commented 1 year ago

I have been looking into this some more and found some additional information that may be helpful. If I run this test case with MPI using 16 processes and 1 thread, 14.0 is only 2x slower than 12.10 (the logs from my initial report were from runs using 16 threads and a single process). Additionally, the largest number of colors reported drops into the teens.

Additionally, if I set the MueLu parameter "use kokkos refactor" to false, I get nearly identical performance with 16 threads (0.09s with 14.0 vs. 0.05s with 12.10).
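
For reference, that toggle is the one-line change below (MueLuList stands in for whatever Teuchos::ParameterList the driver hands to MueLu):

// Fall back to the non-Kokkos MueLu code path.
MueLuList->set("use kokkos refactor", false);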

cgcgcg commented 1 year ago

@wppowers Could you give some more details on the problem? The matrix looks quite non-symmetric. Could you also get us the timing log for your runs?

brian-kelley commented 1 year ago

Thanks for pointing that out @cgcgcg . Distance-2 coloring behavior is undefined on matrices that aren't structurally symmetric (except for entries corresponding to ghosted columns).

You can test for this using the KokkosKernels D2 coloring driver (if examples are enabled) in packages/kokkos-kernels/perf_test/graph/ like this (it's the --verbose flag that checks for symmetry):

[bmkelle]$ ./graph_color_d2 --cuda 0 --symmetric_d2 --verbose --amtx maxMatrixTest-A-ts0.mtx 
Sizeof(kk_lno_t) : 4
Sizeof(size_type): 4
Num verts: 88043
Num edges: 728436

Distance-2 Graph is nonsymmetric (INVALID INPUT)

cgcgcg commented 1 year ago

@brian-kelley What coloring algo supports non-sym? The version 12 results seem to suggest that we used to have an algo that can handle that.

brian-kelley commented 1 year ago

None of them really support that; if you pass in a nonsymmetric matrix it might work or it might not, and that has always been true.

I'm kind of confused where the number of colors is coming from in version 12.10 though. It predates the kokkos-kernels package, so serial D2 coloring lived in Tpetra, but MueLu didn't call it anywhere. UncoupledAggregationFactory_kokkos existed but its phase 1 algo was a copy-paste of the non-Kokkos version.

cgcgcg commented 1 year ago

Ok, I guess that makes sense. So we should probably add an option to symmetrize the graph.
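
For illustration, such an option could symmetrize the pattern by adding the transpose before the graph is handed to the coloring. A sketch, assuming the Tpetra::MatrixMatrix::add overload used below (in MueLu proper this would act on the dropped/filtered graph rather than the user's matrix):

#include <TpetraExt_MatrixMatrix.hpp>
#include <Teuchos_ScalarTraits.hpp>
#include <Tpetra_CrsMatrix.hpp>

using SC         = Tpetra::CrsMatrix<>::scalar_type;
using crs_matrix = Tpetra::CrsMatrix<>;

// Return B = A + A^T, whose sparsity pattern is structurally symmetric.
// Only the pattern matters for distance-2 coloring; the values are ignored.
Teuchos::RCP<crs_matrix> symmetrizePattern(const crs_matrix& A)
{
  const SC one = Teuchos::ScalarTraits<SC>::one();
  return Tpetra::MatrixMatrix::add(one, false, A,   // 1.0 * A
                                   one, true,  A);  // 1.0 * A^T
}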

wppowers commented 1 year ago

To clarify my previous comment regarding the reduction in the number of colors reported: I was referring to what happens with version 14.0 when I precondition the example matrix using 16 processes instead of 16 threads. I have attached additional logs from the following scenarios:

To answer the question about where the matrix came from: it was generated from a transient advection-diffusion heat transfer problem.

logs.tar.gz

github-actions[bot] commented 2 months ago

This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity. If you would like to keep this issue open please add a comment and/or remove the MARKED_FOR_CLOSURE label. If this issue should be kept open even with no activity beyond the time limits you can add the label DO_NOT_AUTOCLOSE. If it is ok for this issue to be closed, feel free to go ahead and close it. Please do not add any comments or change any labels or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.

wpp-tai commented 2 months ago

Some additional information: I performed the same procedure on a symmetric matrix and am seeing a similar performance regression. I have two questions:

cgcgcg commented 2 months ago

@wpp-tai Would you mind attaching log files of what you are seeing? I just had another look at the logs from last year that @wppowers posted and part of the issue might be that the coarsening is stalling.

Oh, @wpp-tai == @wppowers ?

cgcgcg commented 2 months ago

Assuming the driver is still the same, can you add the line:

MueLuList->set( "coarse: max size", 2000);

This should avoid the stagnation issue. We might also want to enable rebalancing:

MueLuList->set( "rebalancing: enable", true);

In order to better diagnose what actually got slower, it would be helpful to print timers, see e.g. https://github.com/trilinos/Trilinos/blob/b40a3bc0b292e87a835cb711e1848409d9205128/packages/muelu/test/scaling/Driver.cpp#L368-L376 and https://github.com/trilinos/Trilinos/blob/b40a3bc0b292e87a835cb711e1848409d9205128/packages/muelu/test/scaling/Driver.cpp#L553-L587
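
Putting those together, roughly (a sketch; the exact placement in the driver may differ):

#include <Teuchos_TimeMonitor.hpp>
#include <iostream>

// Stop coarsening once a level has at most 2000 rows; avoids the stagnation
// where the hierarchy keeps adding levels that barely shrink.
MueLuList->set("coarse: max size", 2000);
// Optionally rebalance coarse-level work across MPI ranks.
MueLuList->set("repartition: enable", true);

// ... preconditioner setup and solves ...

// Print the Teuchos timer summary so setup phases can be compared between builds.
Teuchos::TimeMonitor::summarize(std::cout);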

wpp-tai commented 1 month ago

Thanks for the ideas. I have some encouraging results, and a few more questions.

I grabbed the version 16.0 release of Trilinos to test with. At least for my standalone program, the performance is now better compared to version 12.10.1... which is great news! This is true even without adding the line MueLuList->set("use kokkos refactor", false );.

As a more thorough test, I linked version 16.0 with our full application. I am currently running into a crash in Trilinos at runtime, which is a bit unexpected since our code for executing the Trilinos preconditioner computation and solve is nearly identical to the driver program. I'm not sure if this is related, but in the MueLu verbose output I see this line for the driver program when linked against Trilinos 16.0.0, and for our application when linked against Trilinos 12.10.1: Setup Smoother (MueLu::Ifpack2Smoother{type = RELAXATION}). But when our application is linked against Trilinos 16.0.0 I instead see: Setup Smoother (MueLu::Amesos2Smoother{type = Klu}). How does that get determined? My smoother settings are the same in all cases.

cgcgcg commented 1 month ago

We need to set up a "smoother" on each level of the multigrid hierarchy. With your settings you should be getting an Ifpack2 smoother on all levels but the coarsest one. On the coarsest one you are getting a direct solver (Klu or Klu2) instead. Now, this should not have changed between versions. Could you post a fresh set of log files for what you are seeing?
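
For reference, the parameters that control this look roughly like the following (a sketch with example values, not your exact settings):

#include <Teuchos_ParameterList.hpp>

Teuchos::ParameterList mueluParams;

// Smoother for every level except the coarsest: an Ifpack2 relaxation sweep.
mueluParams.set("smoother: type", "RELAXATION");
mueluParams.sublist("smoother: params").set("relaxation: type", "Symmetric Gauss-Seidel");

// Coarsest level: a direct solve (e.g. KLU2 through Amesos2).
mueluParams.set("coarse: type", "KLU2");

// A level with at most this many rows counts as "coarse"; a problem smaller
// than this never builds a hierarchy and goes straight to the direct solver.
mueluParams.set("coarse: max size", 2000);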

wpp-tai commented 1 month ago

Sure. The output for the driver program is much larger since it runs successfully. The full application output ends right before the runtime error. Both programs are linked against Trilinos 16.0.0. full-application.txt driver-prog.txt

cgcgcg commented 1 month ago

Maybe I misunderstood before, but where is the crash? The problem in the "full-application" log is small enough that we immediately just set up a direct solver, not a full multigrid preconditioner. (That's the effect of MueLuList->set( "coarse: max size", 2000);)

jhux2 commented 1 month ago

What happens if you set MueLuList->set( "coarse: max size", 1);?

wpp-tai commented 1 month ago

That change results in the expected smoother output, but I still see the crash. The crash is silent unless I am running in GDB, where I can see the stack trace. I suspect the issue is related to some changes I had to make because the GlobalOrdinal is now of type 'long long' rather than 'int'. I'm hoping that once I resolve that issue, I'll be able to run the solve.
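
For reference, a sketch of pulling the ordinal types from Tpetra's defaults instead of hard-coding int (illustrative aliases, not our actual code):

#include <Tpetra_CrsMatrix.hpp>
#include <Tpetra_Map.hpp>
#include <Tpetra_MultiVector.hpp>

// Track whatever GlobalOrdinal the Trilinos build was configured with
// (int in older builds, long long by default in recent ones).
using SC = Tpetra::CrsMatrix<>::scalar_type;
using LO = Tpetra::Map<>::local_ordinal_type;
using GO = Tpetra::Map<>::global_ordinal_type;
using NO = Tpetra::Map<>::node_type;

using map_type    = Tpetra::Map<LO, GO, NO>;
using matrix_type = Tpetra::CrsMatrix<SC, LO, GO, NO>;
using vector_type = Tpetra::MultiVector<SC, LO, GO, NO>;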

wpp-tai commented 1 month ago

Attached is the output using the timers as suggested. I also have 'extreme' verbose output turned on since that provides additional timing information. A few things worth mentioning:

Thanks for taking a look at this output. Let me know if any other information would be helpful.

version-16.txt version12.txt

cgcgcg commented 1 month ago

Could you try running with a Symmetric Gauss-Seidel smoother and attach the logs from that run?

wpp-tai commented 1 month ago

Sure, I have logs when using Symmetric Gauss-Seidel attached. It looks like version 16 performs better than version 12 in this scenario.

version12-sym-gs.txt version-16-sym-gs.txt

cgcgcg commented 1 month ago

I'm somewhat confused by these logs. It seems that the number of solves performed is different for each of them?

wpp-tai commented 1 month ago

Yes. Our application solves until a tolerance is reached. It seems like using different versions of Trilinos and different smoothers impacts convergence and thus the number of solves. This is something that we have experienced before.

cgcgcg commented 1 month ago

Orthogonal question: does the linear system change between solves? It looks like the matrix structure is identical. If the values don't change either, there is no need to recompute the preconditioner.
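
Related: if only the values change, MueLu can refresh an existing hierarchy instead of rebuilding it from scratch. A sketch, assuming the Tpetra adapter's ReuseTpetraPreconditioner together with the "reuse: type" parameter set at creation time (e.g. "RP"):

#include <MueLu_CreateTpetraPreconditioner.hpp>
#include <Tpetra_CrsMatrix.hpp>

using SC = Tpetra::CrsMatrix<>::scalar_type;
using LO = Tpetra::CrsMatrix<>::local_ordinal_type;
using GO = Tpetra::CrsMatrix<>::global_ordinal_type;
using NO = Tpetra::CrsMatrix<>::node_type;

// Refresh an existing MueLu preconditioner after the matrix values changed
// but the sparsity pattern did not. How much of the setup is skipped is
// governed by the "reuse: type" option the hierarchy was created with.
void refreshPreconditioner(const Teuchos::RCP<Tpetra::CrsMatrix<SC, LO, GO, NO>>& A,
                           MueLu::TpetraOperator<SC, LO, GO, NO>& prec)
{
  MueLu::ReuseTpetraPreconditioner(A, prec);
}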

wpp-tai commented 1 month ago

The scenario that I am running to get these numbers eventually diverges unless we recompute the preconditioner with a high level of frequency. Without getting into too much detail, we provide some level of control over the frequency at which the preconditioner is recomputed. But there are scenarios such as this one where conditions change at an unpredictable point. And since we cannot predict when that will happen, the preconditioner is just recomputed prior to each solve.

To make the timing numbers easier to compare, I generated the timing logs again. But this time, I changed the settings to ensure that each run completes exactly 100 solves and preconditioner computations.

version12-mt-sym-gs.txt version12-sym-gs.txt version-16-mt-sym-gs.txt version-16-sym-gs.txt

cgcgcg commented 1 month ago

MTGS smoothing slowed down by a factor of 2x-3x. It looks like some additional options were added, but as far as I can tell the default behavior should not have changed. @brian-kelley do you have some ideas what might have happened?
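
For anyone reproducing this: the MTGS smoother is selected through Ifpack2's relaxation parameters, roughly as below (a sketch with example values, not necessarily the settings used in these logs):

#include <Teuchos_ParameterList.hpp>

Teuchos::ParameterList mueluParams;
mueluParams.set("smoother: type", "RELAXATION");

// Multithreaded (KokkosKernels-based) symmetric Gauss-Seidel in Ifpack2.
Teuchos::ParameterList& smooParams = mueluParams.sublist("smoother: params");
smooParams.set("relaxation: type", "MT Symmetric Gauss-Seidel");
smooParams.set("relaxation: sweeps", 2);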

wpp-tai commented 1 month ago

Last year there seemed to be some concern that the matrix I was having issues with is non-symmetric. I did some additional testing with a symmetric matrix, and am seeing a similar slowdown. I have logs attached from solving that matrix. If there is anything else I can do to help determine the source of the slowdown, please let me know.

version-12-sym-gs-symmetricMatrix.txt version-12-mt-sym-gs-symmetricMatrix.txt version-16-sym-gs-symmetricMatrix.txt version-16-mt-sym-gs-symmetricMatrix.txt