@jewatkins Thank you for reporting the issue. Can you tell me the difference between the "muk" tests and other tests (which don't appear to have this regression)? It would help to know what parts of Tpetra are being exercised; what are the preconditioner settings? We have added checks for needed synchronization in some routines that access DualViews; in one case, we saw that using these routines in tight loops degraded performance (e.g., #9241). We can check the preconditioner for similar usage, or look for other problems in Tpetra.
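The pattern we're worried about looks roughly like the following; this is a minimal sketch with a hypothetical routine name (`touch_on_device`), assuming a plain Kokkos::DualView, not the actual Tpetra internals:

```cpp
#include <Kokkos_Core.hpp>
#include <Kokkos_DualView.hpp>

using dual_t = Kokkos::DualView<double*>;

// Hypothetical routine guarding access to a DualView: every call checks
// whether the device copy is stale and, if so, performs a host->device
// deep_copy (with its associated fence).
void touch_on_device(dual_t v) {
  if (v.need_sync_device()) v.sync_device();
  auto d = v.view_device();
  // ... small amount of device work on d ...
}

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    dual_t v("v", 1000);
    // The problematic usage: a tight loop in which the per-call sync
    // check (and any copy/fence it triggers) dwarfs the useful work.
    for (int i = 0; i < 100000; ++i) touch_on_device(v);
  }
  Kokkos::finalize();
  return 0;
}
```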
@trilinos/tpetra @trilinos/muelu
@jewatkins Could you post some more detailed timers from MueLu and Belos?
The "muk" tests use the kokkos path in muelu while the other tests "fea" are just finite element assembly (no solve). There is a small regression in the assembly:
which I've narrowed down to something related to how we apply our Dirichlet BC:
Phalanx: Evaluator 286: [
Preconditioner settings are here: https://github.com/ikalash/ali-perf-tests/blob/master/perf_tests/green-1-10km/input_albany_Enthalpy_MueLuKokkos_Tetrahedron.yaml#L227
I'll look for Tpetra calls in loops.
Unfortunately, I don't have the MueLu timers for these runs (I keep forgetting to turn them on). This is what I have:
Before:
| | | NOX Total Preconditioner Construction: 18.8447 - 20.9274% [25] {min=18.827, max=18.8737, std dev=0.0138322}
| | | | MueLu: ParameterListInterpreter (ParameterList): 0.0489341 - 0.25967% [25] {min=0.0463495, max=0.0512412, std dev=0.00195169}
| | | | Ifpack2::Relaxation::initialize: 0.396602 - 2.10458% [75] {min=0.382286, max=0.425869, std dev=0.0159126}
| | | | Ifpack2::Relaxation::compute: 0.47467 - 2.51885% [75] {min=0.442856, max=0.549974, std dev=0.0427481}
| | | | Utilities_kokkos::GetMatrixDiagonalInverse: 0.00501845 - 0.0266305% [25] {min=0.00423258, max=0.00565916, std dev=0.00053107}
| | | | Remainder: 17.9195 - 95.0903%
| | | NOX Total Linear Solve: 62.3684 - 69.2612% [25] {min=62.3427, max=62.3867, std dev=0.0123821}
| | | | Stratimikos: BelosLOWS: 62.3683 - 99.9998% [25] {min=62.3426, max=62.3866, std dev=0.0123841}
| | | | | Belos: Operation Op*x: 0.031208 - 0.0500382% [25] {min=0.0145054, max=0.0506949, std dev=0.0128977}
| | | | | Belos::MVT::MvAddMv: 0.00100305 - 0.00160828% [25] {min=0.000953609, max=0.00104134, std dev=3.91312e-05}
| | | | | Belos: BlockGmresSolMgr total solve time: 62.1199 - 99.6017% [25] {min=62.1055, max=62.1347, std dev=0.0132035}
| | | | | | Belos: ICGS[2]: Orthogonalization: 14.9014 - 23.9882% [4647] {min=13.2244, max=16.6988, std dev=1.10671}
| | | | | | | Belos: ICGS[2]: Ortho (Norm): 0.507179 - 3.40356% [4647] {min=0.38541, max=0.564136, std dev=0.0576391}
| | | | | | | | Belos::MVT::MvDot: 0.501827 - 98.9448% [4647] {min=0.379972, max=0.558914, std dev=0.0576883}
| | | | | | | | Remainder: 0.00535174 - 1.0552%
| | | | | | | Belos::MVT::MvScale: 0.139604 - 0.936851% [4647] {min=0.134938, max=0.148437, std dev=0.00422436}
| | | | | | | Belos: ICGS[2]: Ortho (Inner Product): 9.216 - 61.8464% [9244] {min=7.59734, max=10.9913, std dev=1.10174}
| | | | | | | | Belos::MVT::MvTransMv: 9.20311 - 99.8602% [9244] {min=7.58426, max=10.9782, std dev=1.10181}
| | | | | | | | Remainder: 0.0128867 - 0.13983%
| | | | | | | Belos: ICGS[2]: Ortho (Update): 5.00399 - 33.5806% [9244] {min=4.93607, max=5.18445, std dev=0.0910986}
| | | | | | | | Belos::MVT::MvTimesMatAddMv: 4.99255 - 99.7715% [9244] {min=4.92473, max=5.17285, std dev=0.0910147}
| | | | | | | | Remainder: 0.0114333 - 0.228484%
| | | | | | | Remainder: 0.0346597 - 0.232593%
| | | | | | Belos::MVT::MvAddMv: 0.0235671 - 0.0379381% [76] {min=0.0208356, max=0.0314318, std dev=0.00460369}
| | | | | | Belos::MVT::MvNorm: 0.0156675 - 0.0252214% [38] {min=0.00431255, max=0.0235406, std dev=0.00719162}
| | | | | | Belos: Operation Prec*x: 39.9978 - 64.3881% [4647] {min=38.9702, max=40.9678, std dev=0.605597}
| | | | | | | Ifpack2::Relaxation::apply: 28.1191 - 70.3015% [23235] {min=27.675, max=28.9219, std dev=0.537843}
| | | | | | | Remainder: 11.8788 - 29.6985%
| | | | | | Belos: Operation Op*x: 5.27984 - 8.49945% [4635] {min=4.50163, max=6.05698, std dev=0.554454}
| | | | | | Belos::MVT::MvTimesMatAddMv: 0.0450522 - 0.0725247% [25] {min=0.0428648, max=0.0514716, std dev=0.00330187}
| | | | | | Remainder: 1.85649 - 2.98855%
| | | | | Remainder: 0.216228 - 0.346696%
| | | | Remainder: 9.63243e-05 - 0.000154444%
| | | Remainder: 0.26288 - 0.291932%
After:
| | | NOX Total Preconditioner Construction: 29.9794 - 26.3475% [26] {min=29.9747, max=29.9835, std dev=0.0033793}
| | | | MueLu: ParameterListInterpreter (ParameterList): 0.0505766 - 0.168705% [26] {min=0.0482862, max=0.0530588, std dev=0.00236452}
| | | | Ifpack2::Relaxation::initialize: 0.499058 - 1.66467% [78] {min=0.491649, max=0.50641, std dev=0.00554256}
| | | | Ifpack2::Relaxation::compute: 0.197572 - 0.659026% [78] {min=0.191227, max=0.214715, std dev=0.00762514}
| | | | Utilities_kokkos::GetMatrixDiagonalInverse: 0.00520455 - 0.0173604% [26] {min=0.00477527, max=0.00571655, std dev=0.000372889}
| | | | Remainder: 29.227 - 97.4902%
| | | NOX Total Linear Solve: 71.1287 - 62.5118% [26] {min=71.1223, max=71.1448, std dev=0.00721082}
| | | | Stratimikos: BelosLOWS: 71.1286 - 99.9999% [26] {min=71.1222, max=71.1447, std dev=0.00721279}
| | | | | Belos: Operation Op*x: 0.0349336 - 0.0491133% [26] {min=0.0252733, max=0.0431341, std dev=0.00574483}
| | | | | Belos::MVT::MvAddMv: 0.00108912 - 0.00153119% [26] {min=0.00105318, max=0.00111496, std dev=2.09648e-05}
| | | | | Belos: BlockGmresSolMgr total solve time: 70.8739 - 99.6418% [26] {min=70.869, max=70.8795, std dev=0.0034408}
| | | | | | Belos: ICGS[2]: Orthogonalization: 14.826 - 20.9189% [4704] {min=13.4882, max=16.8103, std dev=1.05639}
| | | | | | | Belos: ICGS[2]: Ortho (Norm): 0.507524 - 3.42319% [4704] {min=0.425644, max=0.564937, std dev=0.0464142}
| | | | | | | | Belos::MVT::MvDot: 0.502425 - 98.9953% [4704] {min=0.420236, max=0.559946, std dev=0.0465581}
| | | | | | | | Remainder: 0.00509909 - 1.0047%
| | | | | | | Belos::MVT::MvScale: 0.139984 - 0.944174% [4704] {min=0.137723, max=0.146567, std dev=0.00291626}
| | | | | | | Belos: ICGS[2]: Ortho (Inner Product): 9.32726 - 62.9113% [9356] {min=7.87049, max=11.2943, std dev=1.08414}
| | | | | | | | Belos::MVT::MvTransMv: 9.31434 - 99.8614% [9356] {min=7.85725, max=11.2813, std dev=1.08421}
| | | | | | | | Remainder: 0.0129241 - 0.138562%
| | | | | | | Belos: ICGS[2]: Ortho (Update): 4.81648 - 32.4866% [9356] {min=4.73572, max=5.00937, std dev=0.0923146}
| | | | | | | | Belos::MVT::MvTimesMatAddMv: 4.80513 - 99.7644% [9356] {min=4.72442, max=4.99764, std dev=0.0921193}
| | | | | | | | Remainder: 0.0113498 - 0.235646%
| | | | | | | Remainder: 0.0347882 - 0.234643%
| | | | | | Belos::MVT::MvAddMv: 0.02343 - 0.0330587% [82] {min=0.0222952, max=0.0249771, std dev=0.000904079}
| | | | | | Belos::MVT::MvNorm: 0.0117902 - 0.0166355% [41] {min=0.00654316, max=0.0179576, std dev=0.00375614}
| | | | | | Belos: Operation Prec*x: 47.0646 - 66.4062% [4704] {min=45.8164, max=47.7714, std dev=0.700975}
| | | | | | | Ifpack2::Relaxation::apply: 29.9235 - 63.5796% [23520] {min=29.2512, max=30.8931, std dev=0.580336}
| | | | | | | Remainder: 17.1411 - 36.4204%
| | | | | | Belos: Operation Op*x: 7.08849 - 10.0016% [4693] {min=6.36331, max=7.81548, std dev=0.476847}
| | | | | | Belos::MVT::MvTimesMatAddMv: 0.044292 - 0.0624941% [26] {min=0.0434073, max=0.0456651, std dev=0.00070921}
| | | | | | Remainder: 1.81521 - 2.56118%
| | | | | Remainder: 0.218755 - 0.307548%
| | | | Remainder: 0.000100433 - 0.000141199%
| | | Remainder: 0.271469 - 0.238581%
@jewatkins @jhux2 Should we meet to discuss? Can we get more detailed timing? Thanks.
Sorry, I was OOO last week. I produced more timers, but I don't think they're useful without detailed timings from before the merge; I'll try to produce those. One thing I noticed from the latest run is that Op*x seems to be similar to before:
Belos: Operation Op*x: 5.54366 - 8.14925% [5099] {min=5.3545, max=5.65473, std dev=0.120893}
So, based on the above timings, it's probably safe to focus on the MueLu/Ifpack2 ops during preconditioner construction and on Belos: Operation Prec*x.
I'm open to meeting to discuss further.
I attached the output between the two Trilinos commits (238862e (old) and 1e57a8f (new)), same Albany version on one case: output_old.txt output_new.txt
The largest performance regression is in "MueLu: SemiCoarsenPFactory", which is currently running on host. I'm probably not too worried about that, since we have https://github.com/trilinos/Trilinos/pull/9107 which should fix it.
The next biggest difference might be "Ifpack2::Relaxation::apply:". I'm currently using "Two-stage Gauss-Seidel".
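For context on that smoother: two-stage Gauss-Seidel replaces the sequential triangular solve of classic Gauss-Seidel with a few inner Jacobi sweeps, which is what makes it viable on GPUs. A sketch of the standard formulation (not necessarily the exact Kokkos-Kernels implementation):

```latex
% Splitting A = L + D + U (strictly lower / diagonal / strictly upper).
% Classic Gauss-Seidel needs a sequential triangular solve each sweep:
%   (D + L) x^{k+1} = b - U x^k
% Two-stage GS approximates that solve with s inner Jacobi sweeps:
\begin{align*}
  r^{k}     &= b - U x^{k}, \qquad y^{(0)} = D^{-1} r^{k} \\
  y^{(j+1)} &= D^{-1}\bigl(r^{k} - L\, y^{(j)}\bigr), \quad j = 0, \dots, s-1 \\
  x^{k+1}   &= y^{(s)}
\end{align*}
```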
@jewatkins Thanks. @iyamazaki Do you know if there have been any changes in the two-stage GS algorithm that might account for the timing differences?
Thank you @jewatkins & @jhux2. I can take a look. Can you remind me how I can build Albany to run this test?
I just updated my Trilinos build, and I, too, am seeing MAJOR performance degradation in the examples I have been testing. My total solve time for one example increased from 22 seconds to 338 seconds. Again, this mysterious overhead seems to show in the preconditioner construction:
| | | NOX Total Preconditioner Construction: 322.437 - 95.971% [19]
| | | | MueLu: ParameterListInterpreter (ParameterList): 0.0336662 - 0.0104412% [19]
| | | | Ifpack2::Relaxation::initialize: 1.08852 - 0.337593% [57]
| | | | Ifpack2::Relaxation::compute: 0.143643 - 0.0445491% [57]
| | | | Utilities_kokkos::GetMatrixDiagonalInverse: 0.00389289 - 0.00120733% [19]
| | | | Remainder: 321.167 - 99.6062%
A few observations: the slowdown does not appear to be in Ifpack2::Relaxation::apply (or at least, not in the portion that is captured by the timer); that timer only runs 10 seconds in my build.
@jennloe Can you also post the verbose timing log? It looked like the MueLu preconditioner setup was fine, but something in NOX, Stratimikos, or the MueLu Stratimikos adapter takes forever.
Here are the verbose timings, along with the input parameters I was using.
input_ORIGINAL.txt
bugTestOrigB7_27.txt
> I just updated my Trilinos build, and I, too, am seeing MAJOR performance degradation in the examples I have been testing.
@jennloe Do you happen to know the SHAs before/after your update? This is Trilinos develop, correct?
@jennloe It looks like you're using an old Albany (3/12). There have been some changes in Albany/Trilinos that might not be consistent and that could cause issues. It's probably better to pull the latest Trilinos/Albany (or pull a Trilinos/Albany pair that corresponds to a similar date, depending on what you're looking at).
That being said, I took a look at our performance tests again and it does look like there has been some more performance regression (https://ikalash.github.io/ali/weaver_nightly_data/Ali_PerfTestsWeaver_07_27_2021.html), though not of the order you're seeing. It happened sometime between 7/19 and 7/20, and it looks like there was some work in Ifpack2/MueLu during that time.
This is Trilinos develop. I now notice that the slowdown was in my previous Trilinos build, too. (I just didn't notice it at the time because I was only concerned with my linear solver timing.) The problem definitely existed by July 14. It looks like I was still having good runs with Trilinos commit db6073c; the bad ones existed by the time I pulled 6f9eb6c. (Kudos to the Albany person who made that print automatically.)
@jewatkins I'll pull an Albany update and see what difference that makes.
@jennloe I added that in for exactly this reason. :)
I think you'll need a new input file too (there's been changes to boundary condition inputs). Let me prep that for you.
@jewatkins Okay. Yes, running with the same old inputs gave the same crummy timings, even with an updated Albany build.
Hmm, I'm surprised the old input file worked. I'll send you the new ones in a bit and you can try that.
@jewatkins Just figured out that the Albany didn't finish rebuilding for some reason, so I was still running the old code. Trying again...
The Tpetra team looked over your timer logs and it seems that whatever's happening is in either NOX proper (unlikely), or one of NOX's callbacks into Albany which is not being timed (more likely). I would highly suggest finding those callbacks and timing them.
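One lightweight way to do that is to wrap each callback with a Teuchos timer so it shows up in the same TimeMonitor summary as the NOX/Belos timers; a sketch, with a hypothetical function name standing in for the Albany callback:

```cpp
#include "Teuchos_TimeMonitor.hpp"

// Hypothetical Albany-side callback. The macro creates the timer once and
// starts/stops it on scope entry/exit, so the accumulated time appears in
// the usual Teuchos::TimeMonitor::summarize() report.
void evalModelImpl(/* in/out arguments elided */) {
  TEUCHOS_FUNC_TIME_MONITOR("Albany: evalModelImpl");
  // ... residual / Jacobian / preconditioner evaluation ...
}
```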
@GeoffDanielson and @tasmith4 will be looking into slowdowns seen in other tests in insert*Values() and fillComplete() on CUDA. This might be what you're seeing.
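For reference, the calls in question are the standard Tpetra assembly path, roughly the pattern below; this is an illustrative sketch (map, sizes, and entries are made up), not Albany's actual assembly code:

```cpp
#include <Tpetra_Core.hpp>
#include <Tpetra_CrsMatrix.hpp>

// Sketch of the insertGlobalValues/fillComplete pattern whose CUDA cost is
// being investigated. Assumes this rank owns at least one row.
void assemble(const Teuchos::RCP<const Tpetra::Map<>>& rowMap) {
  Tpetra::CrsMatrix<double> A(rowMap, 3 /* max entries per row */);
  using GO = Tpetra::CrsMatrix<double>::global_ordinal_type;
  const GO gblRow     = rowMap->getGlobalElement(0);
  const GO cols[]     = {gblRow};
  const double vals[] = {1.0};
  A.insertGlobalValues(gblRow, 1, vals, cols);  // repeated per element/row
  A.fillComplete();  // builds the final (device) data structures
}
```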
@csiefer2 We changed the preconditioner to Ifpack2, and the issue went away. Maybe some of my MueLu adapter changes were bad?
@jennloe Try these new inputs (on weaver):
/home/projects/albany/forJennifer/do-cmake-weaver-trilinos
/home/projects/albany/forJennifer/input_albany_Velocity_MueLuKokkos_Wedge.yaml
/home/projects/albany/forJennifer/mesh
From the Albany side, there have been a number of changes to the input file and maybe the mesh. On the Trilinos side, we had to turn on Tpetra_ENABLE_DEPRECATED_CODE and Xpetra_ENABLE_DEPRECATED_CODE. Also, there may have been some changes in Piro which might have changed behavior if Trilinos_ENABLE_ROL isn't on; I don't think that was the issue in this case.
For anyone interested in recreating the performance test results, I placed all the files for green-3-20km_ent_muk_wdg_np8 here (weaver):
/home/projects/albany/forPerfReg
weaver_modules.sh
do-cmake-weaver-trilinos and nvcc_wrapper_volta
do-cmake-albany-cuda-sfad6
runMeshDecomp.sh
submitPopMesh.sh
submitMPIGPU2.sh
Some of these files will need to be edited to point to trilinos/albany builds.
I'll turn on the MueLu timer dumps; maybe that will help. I've been hesitant because it does affect performance, but I think it's probably better moving forward.
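Concretely, by "timer dumps" I mean something like the following: raise MueLu's verbosity so its internal timers are active, and print the Teuchos timer summary at the end of the run (a rough sketch, not the exact Albany plumbing):

```cpp
#include "Teuchos_ParameterList.hpp"
#include "Teuchos_TimeMonitor.hpp"
#include <iostream>

// Enable MueLu's internal timers via its verbosity setting; "extreme" is
// the value used for the runs reported later in this thread.
void configureMueLu(Teuchos::ParameterList& mueluParams) {
  mueluParams.set("verbosity", "extreme");
}

// Dump every registered Teuchos timer (min/max/mean across MPI ranks).
void reportTimers() {
  Teuchos::TimeMonitor::summarize(std::cout);
}
```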
I found the problem. (Well, at least, I found MY problem. Probably not related to the original issue after all.) I had been writing out the matrix to file during the preconditioner construction. Except I totally forgot I had added that. Oops. That was definitely the source of the overhead.
Thanks for the input file updates! @jewatkins
Great, glad we were able to update your input files at least
Using the above setups from @jewatkins, I managed to reproduce the performance regression on Weaver.
For example:
Before (f01ad53):
MueLu: Hierarchy: Solve (level=0): 2.53608 - 21.7527% [3378] {min=2.38239, max=2.69378, std dev=0.101202}
| MueLu: Hierarchy: Solve : smoothing (level=0): 1.65991 - 65.4518% [3378] {min=1.57593, max=1.72553, std dev=0.0511428}
| | Ifpack2::Relaxation::apply: 1.57767 - 95.0456% [3378] {min=1.50262, max=1.63475, std dev=0.044526}
| | Remainder: 0.0822388 - 4.95441%
| MueLu: Hierarchy: Solve : residual calculation (level=0): 0.600572 - 23.6811% [1689] {min=0.550407, max=0.683911, std dev=0.0505804}
| MueLu: Hierarchy: Solve : restriction (level=0): 0.0993075 - 3.91578% [1689] {min=0.0901948, max=0.116675, std dev=0.00907973}
| MueLu: Hierarchy: Solve : prolongation (level=0): 0.129944 - 5.12382% [1689] {min=0.117676, max=0.149753, std dev=0.011351}
| Remainder: 0.0463468 - 1.8275%
After (f75748f):
MueLu: Hierarchy: Solve (level=0): 4.06912 - 30.8842% [3134] {min=3.90674, max=4.17618, std dev=0.0816196}
| MueLu: Hierarchy: Solve : smoothing (level=0): 2.29888 - 56.4958% [3134] {min=2.07543, max=2.40426, std dev=0.105199}
| | Ifpack2::Relaxation::apply: 2.22787 - 96.9109% [3134] {min=2.00352, max=2.33395, std dev=0.105859}
| | Remainder: 0.0710146 - 3.08909%
| MueLu: Hierarchy: Solve : residual calculation (level=0): 0.565375 - 13.8943% [1567] {min=0.498803, max=0.628931, std dev=0.0433468}
| MueLu: Hierarchy: Solve : restriction (level=0): 0.507241 - 12.4656% [1567] {min=0.49339, max=0.531523, std dev=0.0143033}
| MueLu: Hierarchy: Solve : prolongation (level=0): 0.656381 - 16.1308% [1567] {min=0.602185, max=0.7063, std dev=0.0346509}
| Remainder: 0.0412429 - 1.01356%
I am focusing on the Ifpack2 time and am still looking, but the time difference seems to come from the Kokkos-Kernels two-stage GS.
@jhux2 also pointed out that the number of calls differs between the runs (3378 vs. 3134 applies), so after the UVM removal it is called fewer times and is still slower. Even the nonlinear iteration counts are different.
@kyungjoo-kim tagging myself.
@jewatkins @iyamazaki @jhux2 @kyungjoo-kim has agreed to take a look from the Tpetra side. But the notes above leave us a bit confused about current status and where we can add value to the investigation. The old.txt and new.txt files above have the same runtime, and nearly the same preconditioning construction time (with new.txt being faster). Can you point to the discrepancies you are seeing, and provide instructions for running the problematic test? Thanks!
Hi, @kddevin. I was looking at Ifpack2::apply and the restriction/prolongation times in the comment above (both of them are mainly SpMV, but I am not sure if the issue is related to SpMV). In the newer version, they are called fewer times and are slower.
Thanks all for looking into this. Long story short: the reproducer Ichi has is a smaller problem than the originally posted problem, but it shows the same issues. The numbers Ichi posted do not show the original "faster" runtimes; we were averaging ~6.4s/13s before. Nevertheless, Ichi did find a performance regression. I'm going to try to recover the original times to see whether we can find other regressions.
After running multiple tests, I was able to recover the original times. I think the main issue is that the number of iterations changes between runs, even using the same Trilinos/Albany. If I fix Albany to 4612319 (the commit which saw the change in the plots above), change Trilinos, and run multiple samples, I think I can see the Trilinos performance regression (data.zip attached):
trilinos-661e06e times:
Preconditioner Construction: 7.87011, 7.25354, 7.40195
Linear Solve: 14.7691, 13.7277, 13.9867
trilinos-f01ad53 times:
Preconditioner Construction: 7.21695, 7.50475, 8.0839
Linear Solve: 13.6094, 14.2857, 15.2503
trilinos-f75748f times:
Preconditioner Construction: 8.55142, 8.25217, 9.23111
Linear Solve: 17.5545, 17.1627, 18.6639
These times are larger than what's shown in the plot in my previous comment because I set verbosity to extreme, but the location of the "jump" between f01ad53 and f75748f is identical. I've updated the files in /home/projects/albany/forPerfReg to be similar to what I used for these runs.
There might be too much noise in these simulations to identify any other issues besides Ifpack2::Relaxation::apply, as Ichi mentioned above. Maybe someone can look at the verbose output I attached to see if anything else stands out. FYI, output2 is associated with trilinos-f01ad53, while output3 is associated with trilinos-f75748f.
@jewatkins Are you still seeing performance differences?
@jhux2 I've made a few changes to these tests to improve performance, including refactoring the semicoarsening (#9107) and tuning the smoothers (https://github.com/sandialabs/ali-perf-tests/commit/8ee16b5fd6b003f78ac9d55096b0150c70282f14), so it's difficult to see whether this regression still exists without some digging. It still existed before I made those changes, but I'm not sure it makes sense to try to reproduce and fix it at this point.
I'm not seeing any regression in preconditioner construction or the linear solver on weaver for the larger problem since I introduced the two changes I mentioned above, so I'll go ahead and close this for now.
Question
@trilinos/tpetra @kddevin
Albany Land Ice had some pretty substantial performance regressions on GPU after https://github.com/trilinos/Trilinos/commit/f75748fe7beb1dc9c935d0c7037cbbd5b83642bb
The following is just an example of preconditioner construction (muelu) which regressed by almost 2x:
Interactive plot is here: https://ikalash.github.io/ali/weaver_nightly_data/Ali_PerfTestsWeaver_06_11_2021.html
There are a few other places that use Tpetra that also saw regressions. Has anyone else reported anything? Any ideas what could have caused them? Is anyone else looking at it? @mcarlson801 and I wanted to get a bit more info on the changes.