@jewatkins Thank you for reporting the issue. Can you tell me the difference between the "muk" tests and other tests (which don't appear to have this regression)? It would help to know what parts of Tpetra are being exercised; what are the preconditioner settings? We have added checks for needed synchronization in some routines that access DualViews; in one case, we saw that using these routines in tight loops degraded performance (e.g., #9241). We can check the preconditioner for similar usage, or look for other problems in Tpetra.
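The pattern we're worried about looks roughly like the following; this is a minimal sketch with a hypothetical routine name (`touch_on_device`), assuming a plain Kokkos::DualView, not the actual Tpetra internals:

```cpp
#include <Kokkos_Core.hpp>
#include <Kokkos_DualView.hpp>

using dual_t = Kokkos::DualView<double*>;

// Hypothetical routine guarding access to a DualView: every call checks
// whether the device copy is stale and, if so, performs a host->device
// deep_copy (with its associated fence).
void touch_on_device(dual_t v) {
  if (v.need_sync_device()) v.sync_device();
  auto d = v.view_device();
  // ... small amount of device work on d ...
}

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    dual_t v("v", 1000);
    // The problematic usage: a tight loop in which the per-call sync
    // check (and any copy/fence it triggers) dwarfs the useful work.
    for (int i = 0; i < 100000; ++i) touch_on_device(v);
  }
  Kokkos::finalize();
  return 0;
}
```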
@trilinos/tpetra @trilinos/muelu
@jewatkins Could you post some more detailed timers from MueLu and Belos?
The "muk" tests use the kokkos path in muelu while the other tests "fea" are just finite element assembly (no solve). There is a small regression in the assembly:
which I've narrowed down to something related to how we apply our Dirichlet BC:
Phalanx: Evaluator 286: [
Preconditioner settings are here: https://github.com/ikalash/ali-perf-tests/blob/master/perf_tests/green-1-10km/input_albany_Enthalpy_MueLuKokkos_Tetrahedron.yaml#L227
I'll look for Tpetra calls in loops.
Unfortunately, I don't have the MueLu timers for these runs (I keep forgetting to turn them on). This is what I have:
Before:
| | | NOX Total Preconditioner Construction: 18.8447 - 20.9274% [25] {min=18.827, max=18.8737, std dev=0.0138322}
| | | | MueLu: ParameterListInterpreter (ParameterList): 0.0489341 - 0.25967% [25] {min=0.0463495, max=0.0512412, std dev=0.00195169}
| | | | Ifpack2::Relaxation::initialize: 0.396602 - 2.10458% [75] {min=0.382286, max=0.425869, std dev=0.0159126}
| | | | Ifpack2::Relaxation::compute: 0.47467 - 2.51885% [75] {min=0.442856, max=0.549974, std dev=0.0427481}
| | | | Utilities_kokkos::GetMatrixDiagonalInverse: 0.00501845 - 0.0266305% [25] {min=0.00423258, max=0.00565916, std dev=0.00053107}
| | | | Remainder: 17.9195 - 95.0903%
| | | NOX Total Linear Solve: 62.3684 - 69.2612% [25] {min=62.3427, max=62.3867, std dev=0.0123821}
| | | | Stratimikos: BelosLOWS: 62.3683 - 99.9998% [25] {min=62.3426, max=62.3866, std dev=0.0123841}
| | | | | Belos: Operation Op*x: 0.031208 - 0.0500382% [25] {min=0.0145054, max=0.0506949, std dev=0.0128977}
| | | | | Belos::MVT::MvAddMv: 0.00100305 - 0.00160828% [25] {min=0.000953609, max=0.00104134, std dev=3.91312e-05}
| | | | | Belos: BlockGmresSolMgr total solve time: 62.1199 - 99.6017% [25] {min=62.1055, max=62.1347, std dev=0.0132035}
| | | | | | Belos: ICGS[2]: Orthogonalization: 14.9014 - 23.9882% [4647] {min=13.2244, max=16.6988, std dev=1.10671}
| | | | | | | Belos: ICGS[2]: Ortho (Norm): 0.507179 - 3.40356% [4647] {min=0.38541, max=0.564136, std dev=0.0576391}
| | | | | | | | Belos::MVT::MvDot: 0.501827 - 98.9448% [4647] {min=0.379972, max=0.558914, std dev=0.0576883}
| | | | | | | | Remainder: 0.00535174 - 1.0552%
| | | | | | | Belos::MVT::MvScale: 0.139604 - 0.936851% [4647] {min=0.134938, max=0.148437, std dev=0.00422436}
| | | | | | | Belos: ICGS[2]: Ortho (Inner Product): 9.216 - 61.8464% [9244] {min=7.59734, max=10.9913, std dev=1.10174}
| | | | | | | | Belos::MVT::MvTransMv: 9.20311 - 99.8602% [9244] {min=7.58426, max=10.9782, std dev=1.10181}
| | | | | | | | Remainder: 0.0128867 - 0.13983%
| | | | | | | Belos: ICGS[2]: Ortho (Update): 5.00399 - 33.5806% [9244] {min=4.93607, max=5.18445, std dev=0.0910986}
| | | | | | | | Belos::MVT::MvTimesMatAddMv: 4.99255 - 99.7715% [9244] {min=4.92473, max=5.17285, std dev=0.0910147}
| | | | | | | | Remainder: 0.0114333 - 0.228484%
| | | | | | | Remainder: 0.0346597 - 0.232593%
| | | | | | Belos::MVT::MvAddMv: 0.0235671 - 0.0379381% [76] {min=0.0208356, max=0.0314318, std dev=0.00460369}
| | | | | | Belos::MVT::MvNorm: 0.0156675 - 0.0252214% [38] {min=0.00431255, max=0.0235406, std dev=0.00719162}
| | | | | | Belos: Operation Prec*x: 39.9978 - 64.3881% [4647] {min=38.9702, max=40.9678, std dev=0.605597}
| | | | | | | Ifpack2::Relaxation::apply: 28.1191 - 70.3015% [23235] {min=27.675, max=28.9219, std dev=0.537843}
| | | | | | | Remainder: 11.8788 - 29.6985%
| | | | | | Belos: Operation Op*x: 5.27984 - 8.49945% [4635] {min=4.50163, max=6.05698, std dev=0.554454}
| | | | | | Belos::MVT::MvTimesMatAddMv: 0.0450522 - 0.0725247% [25] {min=0.0428648, max=0.0514716, std dev=0.00330187}
| | | | | | Remainder: 1.85649 - 2.98855%
| | | | | Remainder: 0.216228 - 0.346696%
| | | | Remainder: 9.63243e-05 - 0.000154444%
| | | Remainder: 0.26288 - 0.291932%
After:
| | | NOX Total Preconditioner Construction: 29.9794 - 26.3475% [26] {min=29.9747, max=29.9835, std dev=0.0033793}
| | | | MueLu: ParameterListInterpreter (ParameterList): 0.0505766 - 0.168705% [26] {min=0.0482862, max=0.0530588, std dev=0.00236452}
| | | | Ifpack2::Relaxation::initialize: 0.499058 - 1.66467% [78] {min=0.491649, max=0.50641, std dev=0.00554256}
| | | | Ifpack2::Relaxation::compute: 0.197572 - 0.659026% [78] {min=0.191227, max=0.214715, std dev=0.00762514}
| | | | Utilities_kokkos::GetMatrixDiagonalInverse: 0.00520455 - 0.0173604% [26] {min=0.00477527, max=0.00571655, std dev=0.000372889}
| | | | Remainder: 29.227 - 97.4902%
| | | NOX Total Linear Solve: 71.1287 - 62.5118% [26] {min=71.1223, max=71.1448, std dev=0.00721082}
| | | | Stratimikos: BelosLOWS: 71.1286 - 99.9999% [26] {min=71.1222, max=71.1447, std dev=0.00721279}
| | | | | Belos: Operation Op*x: 0.0349336 - 0.0491133% [26] {min=0.0252733, max=0.0431341, std dev=0.00574483}
| | | | | Belos::MVT::MvAddMv: 0.00108912 - 0.00153119% [26] {min=0.00105318, max=0.00111496, std dev=2.09648e-05}
| | | | | Belos: BlockGmresSolMgr total solve time: 70.8739 - 99.6418% [26] {min=70.869, max=70.8795, std dev=0.0034408}
| | | | | | Belos: ICGS[2]: Orthogonalization: 14.826 - 20.9189% [4704] {min=13.4882, max=16.8103, std dev=1.05639}
| | | | | | | Belos: ICGS[2]: Ortho (Norm): 0.507524 - 3.42319% [4704] {min=0.425644, max=0.564937, std dev=0.0464142}
| | | | | | | | Belos::MVT::MvDot: 0.502425 - 98.9953% [4704] {min=0.420236, max=0.559946, std dev=0.0465581}
| | | | | | | | Remainder: 0.00509909 - 1.0047%
| | | | | | | Belos::MVT::MvScale: 0.139984 - 0.944174% [4704] {min=0.137723, max=0.146567, std dev=0.00291626}
| | | | | | | Belos: ICGS[2]: Ortho (Inner Product): 9.32726 - 62.9113% [9356] {min=7.87049, max=11.2943, std dev=1.08414}
| | | | | | | | Belos::MVT::MvTransMv: 9.31434 - 99.8614% [9356] {min=7.85725, max=11.2813, std dev=1.08421}
| | | | | | | | Remainder: 0.0129241 - 0.138562%
| | | | | | | Belos: ICGS[2]: Ortho (Update): 4.81648 - 32.4866% [9356] {min=4.73572, max=5.00937, std dev=0.0923146}
| | | | | | | | Belos::MVT::MvTimesMatAddMv: 4.80513 - 99.7644% [9356] {min=4.72442, max=4.99764, std dev=0.0921193}
| | | | | | | | Remainder: 0.0113498 - 0.235646%
| | | | | | | Remainder: 0.0347882 - 0.234643%
| | | | | | Belos::MVT::MvAddMv: 0.02343 - 0.0330587% [82] {min=0.0222952, max=0.0249771, std dev=0.000904079}
| | | | | | Belos::MVT::MvNorm: 0.0117902 - 0.0166355% [41] {min=0.00654316, max=0.0179576, std dev=0.00375614}
| | | | | | Belos: Operation Prec*x: 47.0646 - 66.4062% [4704] {min=45.8164, max=47.7714, std dev=0.700975}
| | | | | | | Ifpack2::Relaxation::apply: 29.9235 - 63.5796% [23520] {min=29.2512, max=30.8931, std dev=0.580336}
| | | | | | | Remainder: 17.1411 - 36.4204%
| | | | | | Belos: Operation Op*x: 7.08849 - 10.0016% [4693] {min=6.36331, max=7.81548, std dev=0.476847}
| | | | | | Belos::MVT::MvTimesMatAddMv: 0.044292 - 0.0624941% [26] {min=0.0434073, max=0.0456651, std dev=0.00070921}
| | | | | | Remainder: 1.81521 - 2.56118%
| | | | | Remainder: 0.218755 - 0.307548%
| | | | Remainder: 0.000100433 - 0.000141199%
| | | Remainder: 0.271469 - 0.238581%
@jewatkins @jhux2 Should we meet to discuss? Can we get more detailed timing? Thanks.
Sorry, I was OOO last week. I produced more timers, but I don't think they're useful without detailed timings from before the merge; I'll try to produce those. One thing I noticed from the latest run is that Op*x seems to be similar to before:
Belos: Operation Op*x: 5.54366 - 8.14925% [5099] {min=5.3545, max=5.65473, std dev=0.120893}
So, based on the above timings, it's probably safe to focus on the MueLu/Ifpack2 ops during preconditioner construction and on Belos: Operation Prec*x.
I'm open to meeting to discuss further.
I attached the output between the two Trilinos commits (238862e (old) and 1e57a8f (new)), same Albany version on one case: output_old.txt output_new.txt
The largest performance regression is in "MueLu: SemiCoarsenPFactory", which is currently running on host. I'm probably not too worried about that, since we have https://github.com/trilinos/Trilinos/pull/9107 which should fix it.
The next biggest difference might be "Ifpack2::Relaxation::apply:". I'm currently using "Two-stage Gauss-Seidel".
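For context on that smoother: two-stage Gauss-Seidel replaces the sequential triangular solve of classic Gauss-Seidel with a few inner Jacobi sweeps, which is what makes it viable on GPUs. A sketch of the standard formulation (not necessarily the exact Kokkos-Kernels implementation):

```latex
% Splitting A = L + D + U (strictly lower / diagonal / strictly upper).
% Classic Gauss-Seidel needs a sequential triangular solve each sweep:
%   (D + L) x^{k+1} = b - U x^k
% Two-stage GS approximates that solve with s inner Jacobi sweeps:
\begin{align*}
  r^{k}     &= b - U x^{k}, \qquad y^{(0)} = D^{-1} r^{k} \\
  y^{(j+1)} &= D^{-1}\bigl(r^{k} - L\, y^{(j)}\bigr), \quad j = 0, \dots, s-1 \\
  x^{k+1}   &= y^{(s)}
\end{align*}
```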
@jewatkins Thanks. @iyamazaki Do you know if there have been any changes in the two-stage GS algorithm that might account for the timing differences?
Thank you @jewatkins & @jhux2. I can take a look. Can you remind me how I can build Albany to run this test?
I just updated my Trilinos build, and I, too, am seeing MAJOR performance degradation in the examples I have been testing. My total solve time for one example increased from 22 seconds to 338 seconds. Again, this mysterious overhead seems to show in the preconditioner construction:
| | | NOX Total Preconditioner Construction: 322.437 - 95.971% [19]
| | | | MueLu: ParameterListInterpreter (ParameterList): 0.0336662 - 0.0104412% [19]
| | | | Ifpack2::Relaxation::initialize: 1.08852 - 0.337593% [57]
| | | | Ifpack2::Relaxation::compute: 0.143643 - 0.0445491% [57]
| | | | Utilities_kokkos::GetMatrixDiagonalInverse: 0.00389289 - 0.00120733% [19]
| | | | Remainder: 321.167 - 99.6062%
A few observations: the slowdown does not appear to be in Ifpack2::Relaxation::apply (or at least, not in the portion that is captured by the timer); that timer only runs 10 seconds in my build.
@jennloe Can you also post the verbose timing log? It looked like the MueLu preconditioner setup was fine, but something in NOX, Stratimikos, or the MueLu Stratimikos adapter takes forever.
Here are the verbose timings, along with the input parameters I was using.
input_ORIGINAL.txt
bugTestOrigB7_27.txt
> I just updated my Trilinos build, and I, too, am seeing MAJOR performance degradation in the examples I have been testing.
@jennloe Do you happen to know the SHAs before/after your update? This is Trilinos develop, correct?
@jennloe It looks like you're using an old Albany (3/12). There have been some changes in Albany/Trilinos that might not be consistent and that could cause issues. It's probably better to pull the latest Trilinos/Albany (or pull a Trilinos/Albany pair that corresponds to a similar date, depending on what you're looking at).
That being said, I took a look at our performance tests again and it does look like there has been some more performance regression (https://ikalash.github.io/ali/weaver_nightly_data/Ali_PerfTestsWeaver_07_27_2021.html), though not of the order you're seeing. It happened sometime between 7/19 and 7/20, and it looks like there was some work in Ifpack2/MueLu during that time.
This is Trilinos develop. I now notice that the slowdown was in my previous Trilinos build, too. (I just didn't notice it at the time because I was only concerned with my linear solver timing.) The problem definitely existed by July 14. It looks like I was still having good runs with Trilinos commit db6073c; the bad ones existed by the time I pulled 6f9eb6c. (Kudos to the Albany person who made that print automatically.)
@jewatkins I'll pull an Albany update and see what difference that makes.
@jennloe I added that in for exactly this reason. :)
I think you'll need a new input file too (there's been changes to boundary condition inputs). Let me prep that for you.
@jewatkins Okay. Yes, running with the same old inputs gave the same crummy timings, even with an updated Albany build.
Hmm, I'm surprised the old input file worked. I'll send you the new ones in a bit and you can try that.
@jewatkins Just figured out that the Albany didn't finish rebuilding for some reason, so I was still running the old code. Trying again...
The Tpetra team looked over your timer logs and it seems that whatever's happening is in either NOX proper (unlikely), or one of NOX's callbacks into Albany which is not being timed (more likely). I would highly suggest finding those callbacks and timing them.
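One lightweight way to do that is to wrap each callback with a Teuchos timer so it shows up in the same TimeMonitor summary as the NOX/Belos timers; a sketch, with a hypothetical function name standing in for the Albany callback:

```cpp
#include "Teuchos_TimeMonitor.hpp"

// Hypothetical Albany-side callback. The macro creates the timer once and
// starts/stops it on scope entry/exit, so the accumulated time appears in
// the usual Teuchos::TimeMonitor::summarize() report.
void evalModelImpl(/* in/out arguments elided */) {
  TEUCHOS_FUNC_TIME_MONITOR("Albany: evalModelImpl");
  // ... residual / Jacobian / preconditioner evaluation ...
}
```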
@GeoffDanielson and @tasmith4 will be looking into slowdowns seen in other tests in insert*Values() and fillComplete() on CUDA. This might be what you're seeing.
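For reference, the calls in question are the standard Tpetra assembly path, roughly the pattern below; this is an illustrative sketch (map, sizes, and entries are made up), not Albany's actual assembly code:

```cpp
#include <Tpetra_Core.hpp>
#include <Tpetra_CrsMatrix.hpp>

// Sketch of the insertGlobalValues/fillComplete pattern whose CUDA cost is
// being investigated. Assumes this rank owns at least one row.
void assemble(const Teuchos::RCP<const Tpetra::Map<>>& rowMap) {
  Tpetra::CrsMatrix<double> A(rowMap, 3 /* max entries per row */);
  using GO = Tpetra::CrsMatrix<double>::global_ordinal_type;
  const GO gblRow     = rowMap->getGlobalElement(0);
  const GO cols[]     = {gblRow};
  const double vals[] = {1.0};
  A.insertGlobalValues(gblRow, 1, vals, cols);  // repeated per element/row
  A.fillComplete();  // builds the final (device) data structures
}
```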
@csiefer2 We changed the preconditioner to Ifpack2, and the issue went away. Maybe some of my MueLu adapter changes were bad?
@jennloe Try these new inputs (on weaver):
/home/projects/albany/forJennifer/do-cmake-weaver-trilinos
/home/projects/albany/forJennifer/input_albany_Velocity_MueLuKokkos_Wedge.yaml
/home/projects/albany/forJennifer/mesh
From the Albany side, there have been a number of changes to the input file and maybe the mesh. On the Trilinos side, we had to turn on Tpetra_ENABLE_DEPRECATED_CODE and Xpetra_ENABLE_DEPRECATED_CODE. Also, there may have been some changes in Piro which might have changed behavior if Trilinos_ENABLE_ROL isn't on; I don't think that was the issue in this case.
For anyone interested in recreating the performance test results, I placed all the files for green-3-20km_ent_muk_wdg_np8 here (weaver):
/home/projects/albany/forPerfReg
weaver_modules.sh
do-cmake-weaver-trilinos and nvcc_wrapper_volta
do-cmake-albany-cuda-sfad6
runMeshDecomp.sh
submitPopMesh.sh
submitMPIGPU2.sh
Some of these files will need to be edited to point to trilinos/albany builds.
I'll turn on the MueLu timer dumps; maybe that will help. I've been hesitant because it does affect performance, but I think it's probably better moving forward.
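Concretely, by "timer dumps" I mean something like the following: raise MueLu's verbosity so its internal timers are active, and print the Teuchos timer summary at the end of the run (a rough sketch, not the exact Albany plumbing):

```cpp
#include "Teuchos_ParameterList.hpp"
#include "Teuchos_TimeMonitor.hpp"
#include <iostream>

// Enable MueLu's internal timers via its verbosity setting; "extreme" is
// the value used for the runs reported later in this thread.
void configureMueLu(Teuchos::ParameterList& mueluParams) {
  mueluParams.set("verbosity", "extreme");
}

// Dump every registered Teuchos timer (min/max/mean across MPI ranks).
void reportTimers() {
  Teuchos::TimeMonitor::summarize(std::cout);
}
```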
I found the problem. (Well, at least, I found MY problem. Probably not related to the original issue after all.) I had been writing out the matrix to file during the preconditioner construction. Except I totally forgot I had added that. Oops. That was definitely the source of the overhead.
Thanks for the input file updates! @jewatkins
Great, glad we were able to update your input files at least
Using the above setups from @jewatkins, I managed to reproduce the performance regression on Weaver.
For example:
Before (f01ad53):
MueLu: Hierarchy: Solve (level=0): 2.53608 - 21.7527% [3378] {min=2.38239, max=2.69378, std dev=0.101202}
| MueLu: Hierarchy: Solve : smoothing (level=0): 1.65991 - 65.4518% [3378] {min=1.57593, max=1.72553, std dev=0.0511428}
| | Ifpack2::Relaxation::apply: 1.57767 - 95.0456% [3378] {min=1.50262, max=1.63475, std dev=0.044526}
| | Remainder: 0.0822388 - 4.95441%
| MueLu: Hierarchy: Solve : residual calculation (level=0): 0.600572 - 23.6811% [1689] {min=0.550407, max=0.683911, std dev=0.0505804}
| MueLu: Hierarchy: Solve : restriction (level=0): 0.0993075 - 3.91578% [1689] {min=0.0901948, max=0.116675, std dev=0.00907973}
| MueLu: Hierarchy: Solve : prolongation (level=0): 0.129944 - 5.12382% [1689] {min=0.117676, max=0.149753, std dev=0.011351}
| Remainder: 0.0463468 - 1.8275%
After (f75748f):
MueLu: Hierarchy: Solve (level=0): 4.06912 - 30.8842% [3134] {min=3.90674, max=4.17618, std dev=0.0816196}
| MueLu: Hierarchy: Solve : smoothing (level=0): 2.29888 - 56.4958% [3134] {min=2.07543, max=2.40426, std dev=0.105199}
| | Ifpack2::Relaxation::apply: 2.22787 - 96.9109% [3134] {min=2.00352, max=2.33395, std dev=0.105859}
| | Remainder: 0.0710146 - 3.08909%
| MueLu: Hierarchy: Solve : residual calculation (level=0): 0.565375 - 13.8943% [1567] {min=0.498803, max=0.628931, std dev=0.0433468}
| MueLu: Hierarchy: Solve : restriction (level=0): 0.507241 - 12.4656% [1567] {min=0.49339, max=0.531523, std dev=0.0143033}
| MueLu: Hierarchy: Solve : prolongation (level=0): 0.656381 - 16.1308% [1567] {min=0.602185, max=0.7063, std dev=0.0346509}
| Remainder: 0.0412429 - 1.01356%
I am focusing on the Ifpack2 time and am still looking, but the time difference seems to come from the Kokkos-Kernels two-stage GS.
@jhux2 also pointed out that the number of calls differs between the runs (3378 vs. 3134 applies), so after the UVM removal it is called fewer times and is still slower. Even the nonlinear iteration counts are different.
@kyungjoo-kim tagging myself.
@jewatkins @iyamazaki @jhux2 @kyungjoo-kim has agreed to take a look from the Tpetra side. But the notes above leave us a bit confused about current status and where we can add value to the investigation. The old.txt and new.txt files above have the same runtime, and nearly the same preconditioning construction time (with new.txt being faster). Can you point to the discrepancies you are seeing, and provide instructions for running the problematic test? Thanks!
Hi, @kddevin. I was looking at Ifpack2::apply and the restriction/prolongation times in the comment above (both of them are mainly SpMV, but I am not sure if the issue is related to SpMV). In the newer version, they are called fewer times and are slower.
Thanks all for looking into this. Long story short: the reproducer Ichi has is a smaller problem than the originally posted problem, but it shows the same issues. The numbers Ichi posted do not show the original "faster" runtimes; we were averaging ~6.4s/13s before. Nevertheless, Ichi did find a performance regression. I'm going to try to recover the original times to see whether we can find other regressions.
After running multiple tests, I was able to recover the original times. I think the main issue is that the number of iterations changes between runs, even using the same Trilinos/Albany. If I fix Albany to 4612319 (the commit which saw the change in the plots above), change Trilinos, and run multiple samples, I think I can see the Trilinos performance regression (data.zip attached):
trilinos-661e06e times:
Preconditioner Construction: 7.87011, 7.25354, 7.40195
Linear Solve: 14.7691, 13.7277, 13.9867
trilinos-f01ad53 times:
Preconditioner Construction: 7.21695, 7.50475, 8.0839
Linear Solve: 13.6094, 14.2857, 15.2503
trilinos-f75748f times:
Preconditioner Construction: 8.55142, 8.25217, 9.23111
Linear Solve: 17.5545, 17.1627, 18.6639
These times are larger than what's shown in the plot in my previous comment because I set verbosity to extreme, but the location of the "jump" between f01ad53 and f75748f is identical. I've updated the files in /home/projects/albany/forPerfReg to be similar to what I used for these runs.
There might be too much noise in these simulations to identify any other issues besides Ifpack2::Relaxation::apply, as Ichi mentioned above. Maybe someone can look at the verbose output I attached to see if anything else stands out. FYI, output2 is associated with trilinos-f01ad53, while output3 is associated with trilinos-f75748f.
@jewatkins Are you still seeing performance differences?
@jhux2 I've made a few changes to these tests to improve performance, including refactoring the semicoarsening (#9107) and tuning the smoothers (https://github.com/sandialabs/ali-perf-tests/commit/8ee16b5fd6b003f78ac9d55096b0150c70282f14), so it's difficult to see whether this regression still exists without some digging. It still existed before I made those changes, but I'm not sure it makes sense to try to reproduce and fix it at this point.
I'm not seeing any regression in preconditioner construction or the linear solver on weaver for the larger problem since I introduced the two changes I mentioned above, so I'll go ahead and close this for now.
Question
@trilinos/tpetra @kddevin
Albany Land Ice had some pretty substantial performance regressions on GPU after https://github.com/trilinos/Trilinos/commit/f75748fe7beb1dc9c935d0c7037cbbd5b83642bb
The following is just an example of preconditioner construction (muelu) which regressed by almost 2x:
Interactive plot is here: https://ikalash.github.io/ali/weaver_nightly_data/Ali_PerfTestsWeaver_06_11_2021.html
There are a few other places that use Tpetra that also saw regressions. Has anyone else reported anything? Any ideas what could have caused them? Is anyone else looking at it? @mcarlson801 and I wanted to get a bit more info on the changes.