wppowers opened this issue 1 year ago
Automatic mention of the @trilinos/muelu team
@trilinos/muelu
Looks like something quite weird about the graph coloring:
Algo "Graph Coloring"
algorithm: serial
num colors: 36927
I have been looking into this some more and I found some additional information that may be helpful. If I run this test case with MPI using 16 processes and 1 thread, 14.0 is only 2x slower than 12.10 (the logs from my initial report were from runs using 16 threads and a single process). Additionally, the largest number of colors reported drops into the teens.
Additionally, if I set the MueLu parameter "use kokkos refactor" to false, then I get nearly identical performance with 16 threads (0.09s with 14.0 vs 0.05s with 12.10).
@wppowers Could you give some more details on the problem? The matrix looks quite non-symmetric. Could you also get us the timing log for your runs?
Thanks for pointing that out @cgcgcg . Distance-2 coloring behavior is undefined on matrices that aren't structurally symmetric (except for entries corresponding to ghosted columns).
You can test for this using the KokkosKernels D2 coloring driver (if examples are enabled) in packages/kokkos-kernels/perf_test/graph/, like this (it's the --verbose flag that checks for symmetry):
[bmkelle]$ ./graph_color_d2 --cuda 0 --symmetric_d2 --verbose --amtx maxMatrixTest-A-ts0.mtx
Sizeof(kk_lno_t) : 4
Sizeof(size_type): 4
Num verts: 88043
Num edges: 728436
Distance-2 Graph is nonsymmetric (INVALID INPUT)
@brian-kelley What coloring algo supports non-sym? The version 12 results seem to suggest that we used to have an algo that can handle that.
None of them really support that; if you pass in a nonsymmetric matrix it might work or it might not, and that was always true.
I'm kind of confused where the number of colors is coming from in version 12.10 though. It predates the kokkos-kernels package, so serial D2 coloring lived in Tpetra, but MueLu didn't call it anywhere. UncoupledAggregationFactory_kokkos existed, but its phase 1 algorithm was a copy-paste of the non-Kokkos version.
Ok, I guess that makes sense. So we should probably add an option to symmetrize the graph.
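To illustrate what symmetrizing the graph would mean (this is a standalone sketch, not MueLu's API or any planned implementation): the coloring would run on the union of the pattern of A and A^T, so that the input is structurally symmetric. The function name and the `std::set`-based pattern representation below are purely illustrative.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical sketch: given a square sparsity pattern (row i -> set of
// column indices), return the symmetrized pattern, i.e. the union of the
// pattern of A and A^T.
std::vector<std::set<int>> symmetrizePattern(const std::vector<std::set<int>>& rows) {
  std::vector<std::set<int>> sym = rows;           // start from A's pattern
  for (std::size_t i = 0; i < rows.size(); ++i)
    for (int j : rows[i])
      sym[j].insert(static_cast<int>(i));          // add the transposed entry (j, i)
  return sym;
}
```

With the symmetrized pattern, a distance-2 coloring would see a valid (structurally symmetric) graph even when the original matrix is not symmetric.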
To clarify my previous comment regarding the reduction in the number of colors reported. I was referring to what happens with version 14.0 when I precondition the example matrix using 16 processes instead of 16 threads. I have attached additional logs from the following scenarios:
To answer the question about the problem that the matrix resulted from, the matrix was generated from a transient advection-diffusion heat transfer problem.
This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity.
If you would like to keep this issue open please add a comment and/or remove the MARKED_FOR_CLOSURE label.
If this issue should be kept open even with no activity beyond the time limits you can add the label DO_NOT_AUTOCLOSE.
If it is ok for this issue to be closed, feel free to go ahead and close it. Please do not add any comments or change any labels or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.
Some additional information. I performed the same procedure on a symmetric matrix, and am seeing a similar performance regression. I have two questions:
@wpp-tai Would you mind attaching log files of what you are seeing? I just had another look at the logs from last year that @wppowers posted and part of the issue might be that the coarsening is stalling.
Oh, @wpp-tai == @wppowers ?
Assuming the driver is still the same, can you add the line `MueLuList->set("coarse: max size", 2000);`? This should avoid the stagnation issue. We might also want to enable rebalancing: `MueLuList->set("rebalancing: enable", true);`
In order to better diagnose what actually got slower, it would be helpful to print timers, see e.g. https://github.com/trilinos/Trilinos/blob/b40a3bc0b292e87a835cb711e1848409d9205128/packages/muelu/test/scaling/Driver.cpp#L368-L376 and https://github.com/trilinos/Trilinos/blob/b40a3bc0b292e87a835cb711e1848409d9205128/packages/muelu/test/scaling/Driver.cpp#L553-L587
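As a sketch, printing the timers can be done with `Teuchos::TimeMonitor`, similar to what the linked Driver.cpp does (this assumes the driver already builds its MueLu parameters in a `Teuchos::ParameterList` named `MueLuList`, as in the snippets quoted in this thread; it is an illustration, not a drop-in patch):

```cpp
#include <Teuchos_ParameterList.hpp>
#include <Teuchos_TimeMonitor.hpp>

// Cap the coarse-level size so coarsening doesn't stall on a tiny
// coarse problem, as suggested above.
MueLuList->set("coarse: max size", 2000);

// ... build the preconditioner and run the solve ...

// Afterwards, print the accumulated Teuchos timers to see which
// phase of setup/solve actually got slower between versions.
Teuchos::TimeMonitor::summarize();
```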
Thanks for the ideas. I have some encouraging results, and a few more questions.
I grabbed the version 16.0 release of Trilinos to test with. At least for my standalone program, the performance is now better compared to version 12.10.1, which is great news! This is true even without adding the line `MueLuList->set("use kokkos refactor", false);`.
As a more thorough test, I linked version 16.0 with our full application. I am currently running into a crash in Trilinos at runtime, which is a bit unexpected since our code for executing the Trilinos preconditioner computation and solve is nearly identical to the driver program.
I'm not sure if this is related, but in the MueLu verbose output I see this line for the driver program when linked against Trilinos 16.0.0, and for our application when linked against Trilinos 12.10.1: `Setup Smoother (MueLu::Ifpack2Smoother{type = RELAXATION})`. But when linking our application against Trilinos 16.0.0 I instead see: `Setup Smoother (MueLu::Amesos2Smoother{type = Klu})`. How does that get determined? My smoother settings are the same in all cases.
We need to set up a "smoother" on each level of the multigrid hierarchy. With your settings you should be getting an Ifpack2 smoother on all levels but the coarsest one. On the coarsest one you are getting a direct solver (Klu or Klu2) instead. Now, this should not have changed between versions. Could you post a fresh set of log files for what you are seeing?
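For illustration, the level smoothers and the coarse solver are controlled by separate parameters in MueLu's parameter list (a minimal sketch using the simple-parameter interface; the exact values should of course match your actual settings):

```cpp
#include <Teuchos_ParameterList.hpp>

Teuchos::ParameterList mueluParams;
// Levels 0..N-1 get the relaxation smoother...
mueluParams.set("smoother: type", "RELAXATION");
// ...while the coarsest level defaults to a direct solver unless
// overridden, e.g. to use relaxation there as well:
mueluParams.set("coarse: type", "RELAXATION");
```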
Sure. The output for the driver program is much larger since it runs successfully. The full application output ends right before the runtime error. Both programs are linked against Trilinos 16.0.0. full-application.txt driver-prog.txt
Maybe I misunderstood before, but where is the crash? The problem in the "full-application" log is small enough that we immediately just set up a direct solver, not a full multigrid preconditioner. (That's the effect of `MueLuList->set("coarse: max size", 2000);`.)
What happens if you set `MueLuList->set("coarse: max size", 1);`?
That change results in the expected smoother output, but I still see the crash. The crash is silent unless I am running in GDB, where I can see the stack trace. I suspect that the issue is related to some changes I had to make related to the GlobalOrdinal being of type 'long long' now as opposed to 'int'. I'm hoping that once I resolve that issue, I'll be able to run the solve.
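One way to avoid hard-coding the ordinal width when porting is to pull the types from Tpetra's configured defaults instead of assuming `int` (a sketch; it assumes the default Tpetra template configuration):

```cpp
#include <Tpetra_Map.hpp>

// Use whatever ordinal types this Trilinos build was configured with,
// rather than assuming 'int':
using map_type = Tpetra::Map<>;
using LO = map_type::local_ordinal_type;
using GO = map_type::global_ordinal_type;  // may be 'long long', not 'int'
```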
Attached is the output using the timers as suggested. I also have 'extreme' verbose output turned on since that provides additional timing information. A few things worth mentioning:

- I still needed `MueLuList->set("use kokkos refactor", false);`, or else the version 16 performance was REALLY bad. At this point, I am mostly trying to understand the roughly 3x slowdown when the kokkos refactor is off. If desired, I can also produce a log with that option turned on.
- `MueLuList->set("coarse: max size", 2000);` did result in a small improvement, but the relative difference in performance between versions 12 and 16 was unchanged.
- I tried `MueLuList->set("rebalancing: enable", true);`, but at runtime I get a long list of what I assume are valid parameters, and then Trilinos exits as if that setting is not recognized.

Thanks for taking a look at this output. Let me know if any other information would be helpful.
- Could you run with `Symmetric Gauss-Seidel` instead of `MT Symmetric Gauss-Seidel` to get a baseline?
- It's not `rebalancing: enable`; it should be `repartition: enable`. But that will only have an impact for runs on more than 1 rank.

Sure, I have logs when using Symmetric Gauss-Seidel attached. It looks like version 16 performs better than version 12 in this scenario.
I'm somewhat confused by these logs. It seems that the number of solves performed is different for each of them?
Yes. Our application solves until a tolerance is reached. It seems like using different versions of Trilinos and different smoothers impacts convergence and thus number of solves. This is something that we have experienced before.
Orthogonal question: does the linear system change between solves? It looks like the matrix structure is identical. If the values don't change either, there is no need to recompute the preconditioner.
The scenario that I am running to get these numbers eventually diverges unless we recompute the preconditioner with a high level of frequency. Without getting into too much detail, we provide some level of control over the frequency at which the preconditioner is recomputed. But there are scenarios such as this one where conditions change at an unpredictable point. And since we cannot predict when that will happen, the preconditioner is just recomputed prior to each solve.
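As an aside, for the solves where only the matrix values change while the structure stays fixed, MueLu's Tpetra adapter can refresh an existing hierarchy instead of rebuilding it from scratch. A hedged sketch (assuming `A` is your `Tpetra` operator and `mueluParams` your parameter list; check the exact signatures in MueLu_CreateTpetraPreconditioner.hpp for your version):

```cpp
#include <MueLu_CreateTpetraPreconditioner.hpp>

// First setup: build the full multigrid hierarchy.
auto prec = MueLu::CreateTpetraPreconditioner(A, mueluParams);

// Later, when only the matrix values changed (same sparsity structure),
// reuse the existing hierarchy rather than recomputing everything:
MueLu::ReuseTpetraPreconditioner(A, *prec);
```

Whether this helps depends on how much of the setup is value-dependent, but it may be cheaper than a full recompute before every solve.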
To make the timing numbers easier to compare, I generated the timing logs again. But this time, I changed the settings to ensure that each run completes exactly 100 solves and preconditioner computations.
version12-mt-sym-gs.txt version12-sym-gs.txt version-16-mt-sym-gs.txt version-16-sym-gs.txt
MTGS smoothing slowed down by a factor of 2x-3x. It looks like some additional options were added, but as far as I can tell the default behavior should not have changed. @brian-kelley do you have some ideas what might have happened?
version-12-sym-gs-symmetricMatrix.txt version-12-mt-sym-gs-symmetricMatrix.txt version-16-sym-gs-symmetricMatrix.txt version-16-mt-sym-gs-symmetricMatrix.txt
Last year there seemed to be some concern that the matrix I was having issues with is non-symmetric. I did some additional testing with a symmetric matrix, and am seeing a similar slowdown. I have logs attached from solving that matrix. If there is anything else I can do to help determine the source of the slowdown, please let me know.
I am working on upgrading from version 12.10.1 of Trilinos to version 14. I have noticed that in certain instances, the performance is significantly worse in version 14. I have narrowed this regression down to the preconditioner computation (both in the initial creation and in subsequent re-computations). I have attached some small sample programs and inputs which can be used to reproduce the issue. I am seeing that version 14 is roughly 35X slower in computing the preconditioner.
I am assuming this is a performance regression, but I have also included MueLu logs from runs of the sample programs, one linked against version 12.10.1 and one linked against version 14, in case this reveals any settings that may be resulting in the poor performance I am seeing with version 14.
Any help in resolving this issue would be greatly appreciated.
If any additional information is needed, please let me know.
@jhux2 @csiefer2
exampleFiles.tar.gz