sandialabs / Albany

Sandia National Laboratories' Albany multiphysics code

Significant slowdown in Albany finite element assembly when running with Epetra #519

Closed by ikalash 4 years ago

ikalash commented 5 years ago

I was doing some large performance runs for some Land Ice Antarctic problems, and they revealed a significant degradation in the Albany finite element assembly when running with Epetra: it is more than 2x slower than with Tpetra. The problem seems to show up only for large enough cases (4km, 2km, 1km AIS, but not 8km AIS). Here is some sample data:

|                        | ML      | MueLu   |
|------------------------|---------|---------|
| Alb. Total Time        | 3196.6s | 1577.5s |
| NOX Lin Solve Time     | 353.7s  | 299.8s  |
| Alb. Total Fill Time   | 2676.7s | 1116.8s |
| Number Nonlinear Iters | 46      | 46      |
| Number Linear Iters    | 607     | 455     |

More can be found in the attached slides. Recently @brian-kelley ran the same cases and got comparable timings for FEA with Epetra and Tpetra (see attached pdf). He was running an Albany from Jan. 23 (b5f3930), prior to a lot of the refactors we did using Thyra, with a Trilinos from the end of July (1f15cb62ca), so I'm pretty sure the issue is with Albany.

The nightlies do not show a significant change but the problems run there are small. Here's some data from the CDash site:

- AIS MueLu test today: http://cdash.sandia.gov/CDash-2-3-0/testDetails.php?test=4261731&build=80588
- AIS MueLu test 1/21: http://cdash.sandia.gov/CDash-2-3-0/testDetails.php?test=4893660&build=88781
- AIS ML test today: http://cdash.sandia.gov/CDash-2-3-0/testDetails.php?test=4893659&build=88781
- AIS ML test 1/21: http://cdash.sandia.gov/CDash-2-3-0/testDetails.php?test=4261730&build=80588

@bartgol, @mperego, @jewatkins: any idea what might have happened? I don't really see how the Thyra refactor could have messed things up like this, but maybe I'm wrong. @jewatkins, could any of the memoization changes you made have had this effect (e.g. memoization no longer happening for Epetra)?

Another question is whether we still care about Epetra for ALI. I'm still testing things, but it looks like MueLu is faster than or comparable to ML. If we are going with Tpetra moving forward, we may not need to worry about this.

Attachments: Line Smoothing Report.pdf, ais_ml_vs_muelu_sept_2019.pdf

jewatkins commented 5 years ago

Is this with Kokkos Serial running a single rank per core?

Memoization isn't on by default so I don't think it's a factor in this case (unless you both added it to your input files for all runs).

If you both have the output from your runs, we can compare the Teuchos timers to see whether the issue is in the global assembly of the Thyra vectors (i.e. Gather/Scatter) or in something in Epetra. I don't think it could be any of the evaluation kernels, since both Epetra and Tpetra run the same kernels with MDFields, but it might be worth checking that too.

bartgol commented 5 years ago

I have no idea what the issue may be. But I agree with @jewatkins that we should look at the individual timers, to see if the problem is "in Albany" or "in Trilinos".

ikalash commented 5 years ago

Yes, this is Kokkos Serial, and I believe it's run with 1 rank per core. No, I don't have memoization on in any of the runs.

I'm attaching a couple of sample outputs for ML for the 2km run, now vs. Jan. 2019. A lot of the evaluator times went up, in particular gather/scatter:

- Scatter Velocity Residual: now 7.12, Jan. 2019: 2.29
- Gather Solution: now 29.01, Jan. 2019: 15.05

I guess the question is whether this was caused by changes in Albany or in Trilinos. Allegedly the Jan. 2019 cases were run with a Trilinos from July, but I'm actually a bit skeptical, because when I tried recreating that build I got compilation errors (I did not make the older runs, @brian-kelley did, but they were all done on Skybridge with the same input file).

ml_256procs_aug2019.txt ml_256procs_jan2019.txt

jewatkins commented 5 years ago

It definitely has something to do with the global assembly of the Jacobian:

```
|   |   |   |   |   Phalanx: Evaluator 148: [<Jacobian>] Scatter Velocity Residual<Jacobian>: 43.3483 - 5.69594% [254533] {min=42.2678, max=43.9299, std dev=0.209977}
```

vs.

```
|   |   |   |   |   Phalanx: Evaluator 134: [<Jacobian>] Scatter Velocity Residual<Jacobian>: 1440.43 - 62.8891% [254533] {min=1265.49, max=1523.31, std dev=53.9058}
```

Looking at the history, I don't see anything that I added that could cause the issue.

ikalash commented 5 years ago

Seems I need to do bisection to figure out what happened. Please stay tuned...

ikalash commented 5 years ago

@jewatkins : following up on our discussion at the Albany meeting.

You see a degradation in the global assembly of the Scatter Velocity Residual<Jacobian> routine in the nightly AIS test with ML, but it's <2x slower:

- Current: Scatter Velocity Residual: 4.14516
- Jan. 2019: Scatter Velocity Residual: 2.26908

Here are the data for the coarser AIS problems I'm running on skybridge:

- 8km + 5 layers, 16 procs: Current: 55.7928, Jan. 2019: 8.98766
- 4km + 10 layers, 64 procs: Current: 267.132, Jan. 2019: 19.6344

So the larger the problem, the worse the relative slowdown, but you can already see the slowdown on the coarser problems. This should make it easier to ID when the problem started.

ikalash commented 5 years ago

@mperego : @jewatkins mentioned that Matt H. is doing some MALI production runs right now. How old is the Albany in that build? Depending on how old it is, this issue may be slowing down those runs.

jewatkins commented 5 years ago

So the current nightly AIS Scatter Residual is faster? It looks like the 8km case is significant enough to see where the issue is.

ikalash commented 5 years ago

That was a typo - I swapped the dates for the AIS ML test. Current test is almost 2x slower. It's fixed now.

jewatkins commented 5 years ago

Okay, so we may be able to find the change using CDash.

ikalash commented 5 years ago

Yes... but I honestly think bisection will be easier. I'm working on that now.

ikalash commented 5 years ago

This is where performance tracking tools like what we're working on would have been very handy, if we had performance testing in Albany...

jewatkins commented 5 years ago

Very true. It might be worth throwing something together for at least the prospect AIS case.

ikalash commented 5 years ago

Absolutely!!!

ikalash commented 5 years ago

Here is more info on when this started. From my Linux machine camobap:

- sha 5fb606fc6cbef74e287ffb1f2b40dac827c0433e: Scatter Velocity Residual: 9.67123
- sha 05afbec18583a44379ae8d930c3a47d4d29880db: Scatter Velocity Residual: 23.2174

These are consecutive commits according to git log. However, there is a complication, as I believe the second commit was on the unpetrify_branch (which existed at the time) whereas the first was on master. They became consecutive commits after the merge of unpetrify_branch into master (I think).

Both builds with the timings reported above use the same Trilinos, so the problem is definitely in Albany. Unfortunately it seems unpetrify_branch has been deleted, so I'm not sure how best to pinpoint what in that branch broke things without manually going through the commits and guessing which ones were on unpetrify_branch and could have caused the slowdown. @bartgol: is unpetrify_branch truly gone, or do you have it somewhere, like in a tag or a fork?

bartgol commented 4 years ago

@ikalash all the history of the branch is there, inside master. There's no need for the branch itself (a branch is just a name attached to a commit, after all).

It is possible the unpetrify branch is indeed responsible for the slowdown. For instance, every call to the local assembly now performs a dyn cast on the Thyra object to retrieve the concrete [E|T]petra one.

Is the slowdown in a build that does not use the local-matrix approach? If so, it may be due to the repeated dyn casts (there's one per workset, cell, and node, so it can add up), and I speculate that the slowdown would disappear (or shrink remarkably) if one turned on the Kokkos local-matrix approach (even with serial Kokkos): in that approach we get the local Kokkos data only once per workset (and I think we should be able to do it just once per evaluation, if we started using pre/post evaluate correctly).
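To make the cost model concrete, here is a minimal, self-contained sketch of the pattern described above. `ThyraOp`, `EpetraOp`, and `sumIntoDiag` are toy stand-ins, not the real Albany/Trilinos types, and the real code goes through the Albany/Thyra utilities rather than `std::dynamic_pointer_cast`:

```cpp
#include <memory>
#include <vector>

// Toy stand-ins for the Thyra base class and the concrete Epetra-backed
// operator; placeholders for Thyra::LinearOpBase and Albany's wrappers.
struct ThyraOp { virtual ~ThyraOp() = default; };
struct EpetraOp : ThyraOp {
  std::vector<double> diag = std::vector<double>(1000, 0.0);
  void sumIntoDiag(int row, double val) { diag[row] += val; }
};

// Costly pattern: the dynamic cast is paid once per cell (and per node in
// the real code), so its cost grows with the mesh size.
void assemble_cast_per_cell(const std::shared_ptr<ThyraOp>& op, int numCells) {
  for (int cell = 0; cell < numCells; ++cell) {
    auto jac = std::dynamic_pointer_cast<EpetraOp>(op);  // repeated dyn cast
    jac->sumIntoDiag(cell % 1000, 1.0);
  }
}

// Cheaper pattern: hoist the cast (and any local-data extraction) out of the
// loop, so it is paid once per workset, or once per evaluation.
void assemble_cast_hoisted(const std::shared_ptr<ThyraOp>& op, int numCells) {
  auto jac = std::dynamic_pointer_cast<EpetraOp>(op);    // single dyn cast
  for (int cell = 0; cell < numCells; ++cell) {
    jac->sumIntoDiag(cell % 1000, 1.0);
  }
}
```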

jewatkins commented 4 years ago

Since the slowdown is in Epetra, it might have something to do with this chunk of code while obtaining the local matrix: https://github.com/SNLComputation/Albany/blob/f3963982a1df4895bca90e2b0f6c458c2ecafdb5/src/utility/Albany_ThyraUtils.cpp#L1204

Before, we would just do something like:

```cpp
Tpetra_CrsMatrix::local_matrix_type JacT_kokkos = JacT->getLocalMatrix();
```

Maybe we can throw some timers in there to find out?
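For reference, a small self-contained sketch of how such a timer could be added with Teuchos::TimeMonitor; `suspectedHotSpot()` and the timer label are placeholders, not the actual Albany code:

```cpp
#include <iostream>
#include "Teuchos_RCP.hpp"
#include "Teuchos_TimeMonitor.hpp"

// Placeholder for the local-matrix extraction in Albany_ThyraUtils.
void suspectedHotSpot() { /* e.g. the device-data / local-matrix call */ }

void scatterJacobianSketch() {
  static Teuchos::RCP<Teuchos::Time> timer =
      Teuchos::TimeMonitor::getNewTimer("Albany: get local Jacobian data");
  Teuchos::TimeMonitor tm(*timer);  // starts here, stops when tm goes out of scope
  suspectedHotSpot();
}

int main() {
  for (int i = 0; i < 10; ++i) scatterJacobianSketch();
  // Prints statistics for every registered timer alongside the usual output.
  Teuchos::TimeMonitor::summarize(std::cout);
  return 0;
}
```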

bartgol commented 4 years ago

Another thing we could do is put all the local data extraction in a pre-evaluate call; there's no need to do it for every workset. Or we could try running with 1 workset and see how noticeable it is...
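A minimal sketch of that idea, using a toy class in place of the real Phalanx evaluator; `LocalMatrixView`, `preEvaluate`, and `evaluateFields` are placeholders mirroring the pre/post evaluate hooks mentioned above:

```cpp
#include <memory>

// Placeholder for the Kokkos views of the matrix's local CRS data.
struct LocalMatrixView {};

class ScatterJacobianSketch {
public:
  // Pay the Thyra-to-concrete cast and local-data extraction once per
  // (Jacobian) evaluation, before any workset is processed.
  void preEvaluate(/* Thyra operator */) {
    cachedLocalJac_ = std::make_shared<LocalMatrixView>();
  }

  // Per-workset work reuses the cached local data instead of re-extracting it.
  void evaluateFields(/* workset */) {
    LocalMatrixView& jac = *cachedLocalJac_;
    (void)jac;  // ... sum the workset's local contributions into jac ...
  }

private:
  std::shared_ptr<LocalMatrixView> cachedLocalJac_;
};
```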

ikalash commented 4 years ago

Here is some additional data from when I change the workset size (in a fairly new build):

antarctica_8_quad + 5 layers:

- 1 workset: Scatter Velocity Residual: 13.7958
- ws size = 50K: Scatter Velocity Residual: 13.815
- ws size = 100K: Scatter Velocity Residual: 13.8002
- ws size = 10K (default): Scatter Velocity Residual: 25.6713

That is an important symptom, I think. The other one is that I've pinpointed the slowdown to between these two commits:

```
commit a63d0afc9a0a8a5c5b67c630c1c5cf08fa98a4a0 (HEAD)
Author: Luca Bertagna <lbertag@sandia.gov>
Date:   Thu Mar 7 17:50:18 2019 -0700
```

```
commit 57dcb84bc8d46a77a1891d85634933fc41da9dd8 (HEAD)
Author: Luca Bertagna <lbertag@sandia.gov>
Date:   Thu Mar 21 08:37:23 2019 -0600

    Fixing a bug in ThyraCrsMatrixFactory causing a hang.
```

Maybe we should all look through the commits between these two to try to figure out which is the culprit. @bartgol, you might have some ideas, having written the relevant refactor code.

bartgol commented 4 years ago

I'll inspect those commits and see what I can find. It is interesting to note that the assembly time is roughly constant until the workset size gets down to somewhere between 10K and 50K. Just for completeness, what architecture was this run on? Do you have the memory (L1+L2+global) specs for it? It would be interesting to see if memory also plays a role in the perf bump at 10K.

ikalash commented 4 years ago

@bartgol : if you could look at the commits, that would be great.

The numbers I reported above were from a run on cee-compute017, which has a TON of memory (1TB of RAM, to be exact). Other specs can be found here: https://snl-wiki.sandia.gov/display/CEE9/SRN+High+Performance+Compute+Servers

On the other machines where I ran (my Linux workstation, skybridge), I ran out of memory when running with a workset size above 50K. I saw similar trends on skybridge when I ran from an interactive job allocation with a workset size small enough to fit in memory:

- Default ws size (10K): Scatter Velocity Residual: 55.7928
- ws size = 50K: Scatter Velocity Residual: 8.43335

Looks like skybridge has 64GB of RAM per node: https://onestop.sandia.gov/platforms/sky_bridge

It's interesting that the default ws size case is about 2x slower than on the CEE machine, which I don't have an explanation for right now.

ikalash commented 4 years ago

@bartgol : Also, let me know if you want me to point you to my input files. You could probably see the issue on one of the test cases in the repo, but it may not be as pronounced.

bartgol commented 4 years ago

That's ok, @mperego is also experiencing slow Epetra results, so I think this is across the board.

Another thing I realized is that Epetra uses 32-bit integers, which means that a reinterpret_cast from Albany's GO to Epetra_GO does not work, so we need to hard-copy the GIDs. Not sure if this is a source of slowdown, but it's worth keeping in mind.
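A self-contained illustration of that point, with assumed typedefs (Albany's GO taken as a 64-bit integer, and a 32-bit Epetra build indexing with plain int); the helper name `toEpetraGIDs` is made up:

```cpp
#include <cstddef>
#include <vector>

using GO        = long long;  // Albany / Tpetra global ordinal (64-bit, assumed)
using Epetra_GO = int;        // 32-bit Epetra global id (assumed build)

// Because sizeof(GO) != sizeof(Epetra_GO), an array of GOs cannot simply be
// reinterpreted as an array of Epetra GIDs; every call that passes GIDs to
// Epetra has to allocate a temporary and copy, entry by entry.
std::vector<Epetra_GO> toEpetraGIDs(const std::vector<GO>& gids) {
  std::vector<Epetra_GO> egids(gids.size());
  for (std::size_t i = 0; i < gids.size(); ++i) {
    egids[i] = static_cast<Epetra_GO>(gids[i]);  // narrowing copy
  }
  return egids;
}
```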

ikalash commented 4 years ago

Yes, I was wondering if the production land ice runs Matt H. is doing might be affected... I think they likely are. I'm not sure if those are the slow Epetra results @mperego is experiencing, or if that's something else.

mperego commented 4 years ago

Things seem to be worse for the case where the mesh coordinates are FAD types. Anyway, I suspect that Matt H.'s runs are affected as well.

ikalash commented 4 years ago

@bartgol : per our discussion yesterday, I put in timers around the following calls in PHAL_ScatterResidual_Def.hpp: https://github.com/SNLComputation/Albany/blob/master/src/evaluators/scatter/PHAL_ScatterResidual_Def.hpp#L558-L561 and compared the printout to the Scatter Jacobian timer (https://github.com/SNLComputation/Albany/blob/master/src/evaluators/scatter/PHAL_ScatterResidual_Def.hpp#L610-L616). Here is the output (on 1 proc):

```
getNonconstDeviceData resid time = 0 0
getNonconstDeviceData Jac time   = 4 4721
Scatter Jacobian time            = 4 4997
```

(In each line, the first number is seconds and the second is milliseconds.)

So it seems @jewatkins's theory is correct: the Epetra getNonconstDeviceData call is what is killing us. I can repeat the exercise for Tpetra after my meetings this morning, just to see how things look there.

bartgol commented 4 years ago

Whoa! I was hoping it was that (I had no other ideas), but I am still surprised it costs so much.

Anyhow, I will work on a fix.

You can try to see what the timings are for tpetra, but I am pretty confident you should not see any big number, since Tpetra already has the local matrix built.

ikalash commented 4 years ago

Here are the same numbers with Tpetra:

```
getNonconstDeviceData resid time = 0 0
getNonconstDeviceData Jac time   = 0 3
Scatter Jacobian time            = 0 222
```

So, I think it's safe to say the Epetra version of getNonconstDeviceData is the culprit. Thanks for looking at this @bartgol . I'm happy to test your fix when it's ready.

ikalash commented 4 years ago

It seems, just from CDash, that your fix made some difference, @bartgol. Here is a set of times for the Scatter Velocity Residual operation for the FO_AIS_ML problem:

- 1/21/19: 2.26908 (http://cdash.sandia.gov/CDash-2-3-0/testDetails.php?test=4261730&build=80588)
- 9/7/19: 4.14516 (http://cdash.sandia.gov/CDash-2-3-0/testDetails.php?test=4893659&build=88781)
- 9/25/19: 2.00783 (http://cdash.sandia.gov/CDash-2-3-0/testDetails.php?test=4943869&build=89236)

I need to run my longer tests to verify things some more. Will post back soon.

ikalash commented 4 years ago

I've verified that the issue has been resolved. Here is some data from a 4km AIS run with 10 vertical layers for the Albany Total Fill Time:

- Epetra: 472.8s
- Tpetra: 477.9s

Thanks again @bartgol for fixing this!

bartgol commented 4 years ago

Great!