underworldcode / underworld2

underworld2: A parallel, particle-in-cell, finite element code for Geodynamics.
http://www.underworldcode.org/

slower pressure iteration convergence rate for more cores #385

Closed tingyang2004 closed 5 years ago

tingyang2004 commented 5 years ago

Hi all,

I am running a slab subduction model with a resolution of 1210x202 elements. A periodic boundary condition is applied on the side boundaries (a snapshot of the model evolution is shown below). When I use 24 cores, i.e. around 101x101 elements per core, it takes around 50 min to finish the first 30 steps. Since I plan to run the model for over 10 thousand steps, I tried increasing the number of cores to speed it up. However, with 48 cores, i.e. around 76x68 elements per core, it takes around 46 min to finish the first 30 steps, less than 10% faster than 24 cores. With 96 cores, around 51x51 elements per core, it even takes around 80 min to finish the first 30 steps. I am wondering what is causing the slower convergence rate when increasing the number of cores to 96. Is it communication between different nodes? Do you have any suggestions for speeding up the model? Part of the Stokes solver output is shown below; you can see that the pressure iteration gets much slower when I use more cores:

24 Cores: 24Cores

48 Cores: 48Cores

96 Cores: 96Cores

Thanks a lot and let me know if you need more information. Ting

julesghub commented 5 years ago

Yes, the communication between nodes is the problem. The model is over-decomposed, and the communication overhead is significant relative to the compute (per-CPU) work. That's why the time goes up the more CPUs you use.

Unfortunately it is difficult to give you a good "rule of thumb" for balancing per-CPU work against parallel decomposition. It depends on the model, the solver setup and the hardware platform. For a 1210x202 model I would guess around 4-12 procs would be good.
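As a rough illustration, here is a quick back-of-the-envelope sketch (plain Python, nothing Underworld-specific) of the per-core element counts behind the runs above; the process-grid splits are the ones the reported local sizes suggest, not necessarily what PETSc actually chose:

# Approximate elements per core for a 1210x202 mesh at the core counts tried above.
total_elements = 1210 * 202

# (cores, assumed process grid): splits consistent with the reported local sizes.
for cores, (px, py) in [(24, (12, 2)), (48, (16, 3)), (96, (24, 4))]:
    local_x = 1210 / px
    local_y = 202 / py
    print(f"{cores:3d} cores: ~{local_x:.0f}x{local_y:.0f} elements per core "
          f"(~{total_elements // cores} elements/core on average)")

The smaller the local patch, the larger the ratio of halo-exchange communication to per-core compute, which is the overhead described above.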

tingyang2004 commented 5 years ago

Thanks, Julian. Hmm, that is strange. In the rifting models I have tried before (also with very complex, non-linear rheology), 20x20 elements per CPU can be much faster than 50x50 elements. I am wondering why 50x50 elements per CPU is much slower than 100x100 elements per CPU in this case.

rcarluccio commented 5 years ago

What usually works for me for a subduction model is the following:

32^3 = 32768 cells —> 1-2 (prescribed) to 8 CPUs.

1210*210 / 32768 —> 8 to 16 CPUs.

Increasing the number of cores does not necessarily increase performance; of course, it also depends on your model complexity and so on.


tingyang2004 commented 5 years ago

Thanks, Roberta,

I have reduced the model size to 605x129 elements and investigated two core counts with the same script. With 24 cores, i.e. 51x65 elements per core, it takes around 9 min to finish the first 10 steps. With 12 cores, i.e. 101x65 elements per core, it takes around 16 min to finish the first 10 steps. One possible explanation is that, as the total number of elements increases, the preferred number of elements per core also increases.

By the way, can you really run 32^3 elements on one CPU? I thought it would take hours to run even one step.

jmansour commented 5 years ago

Ting, I'm not sure what solver setup you're using, but MG will generally be optimal when you choose resolutions that are powers of 2, and you should avoid primes. So instead of 605x129, you should perhaps use 512x128.

You might also consider using MUMPS.
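For reference, selecting the inner solve in an Underworld2 script usually comes down to a single call on the solver object. The sketch below is only illustrative (the mesh size, boundary conditions and unit viscosity/buoyancy are placeholders, not the subduction model discussed here):

import underworld as uw

# Illustrative 2D setup; element counts are powers of two, which suits the MG solver.
mesh = uw.mesh.FeMesh_Cartesian(elementRes=(512, 128),
                                minCoord=(0., 0.), maxCoord=(4., 1.))
velocityField = uw.mesh.MeshVariable(mesh, nodeDofCount=2)
pressureField = uw.mesh.MeshVariable(mesh.subMesh, nodeDofCount=1)

# Free-slip walls: the wall-normal velocity component is fixed on each pair of walls.
iWalls = mesh.specialSets["MinI_VertexSet"] + mesh.specialSets["MaxI_VertexSet"]
jWalls = mesh.specialSets["MinJ_VertexSet"] + mesh.specialSets["MaxJ_VertexSet"]
freeSlip = uw.conditions.DirichletCondition(variable=velocityField,
                                            indexSetsPerDof=(iWalls, jWalls))

stokes = uw.systems.Stokes(velocityField=velocityField,
                           pressureField=pressureField,
                           conditions=[freeSlip],
                           fn_viscosity=1.0,
                           fn_bodyforce=(0.0, -1.0))
solver = uw.systems.Solver(stokes)

# Inner (velocity) solve: "mumps" for the sparse direct solver, "mg" for multigrid.
solver.set_inner_method("mumps")
solver.solve()

Running the same script once with "mumps" and once with "mg" makes the timing comparison straightforward.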


tingyang2004 commented 5 years ago

Thanks John. MUMPS is used for the Stokes solver, so the power-of-2 rule is not needed. I guess MG is still much less efficient than MUMPS in 2D models? I am not sure what you mean by primes.

jmansour commented 5 years ago

Prime numbers, which are the worst for MG. In any case, I believe you are correct that it shouldn't matter for MUMPS.

I would suggest, then, that you also try MG for comparison.

tingyang2004 commented 5 years ago

Thanks, I'll have another try and see whether MG is more efficient than MUMPS in this case.

tingyang2004 commented 5 years ago

Well, my recent tests on 3D subduction models (264x96x64 elements) do suggest that 32x32x32 elements per core gives the best parallel efficiency. On the other hand, for 2D subduction models, 50x50 to 100x100 elements per core gives the best parallel efficiency, depending on the mesh resolution. I will close this since my doubts have now been clarified.

rcarluccio commented 5 years ago

Approximately what is your total BSSCR linear solve time per time step, and how many non-linear iterations do you usually need for a 264x96x64-element model?


tingyang2004 commented 5 years ago

It takes around 15 non-linear iterations to converge for the 0th step, and 1-2 non-linear iterations for the following steps, similar to the 2D cases. But each non-linear iteration is pretty slow: it takes around 10 min with 32x32x32 elements per core, so around 20 min per time step. The first step of the model is attached below, showing the rheology used (diffusion, dislocation, yielding). It would be very helpful if the convergence rate could be accelerated a bit.

SlabSubduction3D2AStep0

arijitlaik commented 5 years ago

Hi @tingyang2004, thanks a lot for the guidance on timings and proc counts. Could you let us know what non-linear solver tolerance you are using?

tingyang2004 commented 5 years ago

solver.options.scr.ksp_rtol=1.0e-6 for the linear iteration and solver.solve(nonLinearIterate=True, nonLinearTolerance=0.01) for the non-linear iteration, if that helps. @arijitlaik
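In script form, those two settings sit together roughly as in the sketch below (assuming solver is the uw.systems.Solver object wrapping the Stokes system):

# Relative tolerance for the Schur-complement (pressure) Krylov solve.
solver.options.scr.ksp_rtol = 1.0e-6

# Iterate on the non-linear rheology until the non-linear residual drops below 1%.
solver.solve(nonLinearIterate=True, nonLinearTolerance=0.01)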

arijitlaik commented 5 years ago

thanks :)

jmansour commented 5 years ago

@tingyang2004 are these results for MUMPS? Did you end up trying MG for comparison?

tingyang2004 commented 5 years ago

All 3D models use MG. One interesting finding is that 33x32x32 elements per CPU can be several times faster than 32x32x32 elements per CPU, maybe due to an interaction between the complex rheology and the mesh resolution.

jmansour commented 5 years ago

Really? That is indeed 'interesting'. 🤨

tingyang2004 commented 5 years ago

Forget about the words below. There was an issue with the system, and the convergence speed does increase with an increasing number of cores on a different system, although I did not check whether 32^3 per core is significantly slower than 33x32x32 per core.

Another interesting thing is that, for a 256x96x64 model, it takes around 40 min per step with 48 cores, i.e. 32^3 elements per core. When I double the number of cores to 96, or even to 192, the convergence rate gets slower. However, when I double the number of cores again to 384, it takes around 4 min per step, around 10 times faster than 48 cores. There seem to be a lot of bizarre things here.

jmansour commented 5 years ago

That is very strange @tingyang2004. Might be worth also checking how repeatable those timings are.

julesghub commented 5 years ago

What machine are you running on? Are the hardware architecture details available?

tingyang2004 commented 5 years ago

Below is the first paragraph from the user's manual. @julesghub @jmansour I will check the performance again later today.

“TaiYi” is a supercomputer based on Intel Xeon Gold processors from the Skylake generation. It is a Lenovo system composed of SD530 Compute Racks and an Intel Omni-Path high-performance network interconnect, running Red Hat Enterprise Linux Server as the operating system. Its current Linpack Rmax performance is 1.67 Petaflops.

rcarluccio commented 5 years ago

Hi Ting,

Would you mind sharing a screenshot of your velocity field at the first step? I'm just curious to see what it looks like. I'll start running my 3D models again next week and then we can compare results. What is the size of the model domain (coords)?


tingyang2004 commented 5 years ago

The velocity field shown above is for the first (0th) time step, although I did not scale it with the velocity magnitude, so only the direction information is meaningful. The domain size is 1500 km deep x 3000 km long x 1500 km half-wide.

tingyang2004 commented 5 years ago

And I am happy to compare the results with you. @rcarluccio

rcarluccio commented 5 years ago

Did you send another screenshot? I can only visualise this: https://user-images.githubusercontent.com/26615840/59741771-84ae1500-929e-11e9-8ec8-6617b0fdbdd2.png. Thanks for the info, I'll reproduce something similar.


tingyang2004 commented 5 years ago

No, this is what I meant.

tingyang2004 commented 5 years ago

Note that I have skipped the subduction initiation period, as always.

tingyang2004 commented 5 years ago

Regarding the issue that 33x32x32 elements per CPU is one time faster than 32x32x32 elements per CPU: I have checked the log file today. It looks like the pressure solve is one time slower in the 32^3 case.

33x32x32 model:

Non linear solver - iteration 14
Linear solver (XQ96L7RB__system-execute)

  [1] SROpGenerator_SimpleCoarserLevel: time = 1.83635e-02
  Setting schur_pc to "gkgdiag"

SCR Solver Summary:

  Multigrid setup:        = 0.242 secs
  RHS V Solve:            = 6.193 secs / 18 its
  Pressure Solve:         = 212.4 secs / 48 its
  Final V Solve:          = 5.804 secs / 20 its

  Total BSSCR Linear solve time: 236.770290 seconds

Linear solver (XQ96L7RB__system-execute), solution time 2.369147e+02 (secs)
In func SystemLinearEquations_NonLinearExecute: Iteration 14 of 500 - Residual 0.0080322 - Tolerance = 0.01
Non linear solver - Residual 8.03220671e-03; Tolerance 1.0000e-02 - Converged - 4.615428e+03 (secs)

In func SystemLinearEquations_NonLinearExecute: Converged after 14 iterations.

Pressure iterations:  48
Velocity iterations:  18 (presolve)
Velocity iterations: 755 (pressure solve)
Velocity iterations:  20 (backsolve)
Velocity iterations: 793 (total solve)

SCR RHS  setup time: 1.1543e+01
SCR RHS  solve time: 6.1928e+00
Pressure setup time: 6.2304e-03
Pressure solve time: 2.1236e+02
Velocity setup time: 9.5367e-07 (backsolve)
Velocity solve time: 5.8037e+00 (backsolve)
Total solve time   : 2.3677e+02

Velocity solution min/max: 0.0000e+00/0.0000e+00
Pressure solution min/max: 0.0000e+00/0.0000e+00

('hgn 12: ', datetime.datetime(2019, 6, 19, 19, 17, 31, 464663))
step =      0; time = 0.00000e+00; time_Myr = 0.00000e+00

32x32x32 model:

Non linear solver - iteration 14
Linear solver (DJ25GQCN__system-execute)

BSSCR -- Block Stokes Schur Compliment Reduction Solver
AUGMENTED LAGRANGIAN K2 METHOD - Penalty = 0.000000

SROpGenerator_SimpleFinestLevel: time = 2.39016e-01
  [4] SROpGenerator_SimpleCoarserLevel: time = 4.37264e-02
  [3] SROpGenerator_SimpleCoarserLevel: time = 1.78812e-02
  [2] SROpGenerator_SimpleCoarserLevel: time = 1.48602e-02
  [1] SROpGenerator_SimpleCoarserLevel: time = 1.41797e-02
  Setting schur_pc to "gkgdiag"

SCR Solver Summary:

  Multigrid setup:        = 0.3395 secs
  RHS V Solve:            = 17.6 secs / 20 its
  Pressure Solve:         = 640.1 secs / 47 its
  Final V Solve:          = 16.97 secs / 20 its

  Total BSSCR Linear solve time: 679.924037 seconds

Linear solver (DJ25GQCN__system-execute), solution time 6.800786e+02 (secs)
In func SystemLinearEquations_NonLinearExecute: Iteration 14 of 500 - Residual 0.0083099 - Tolerance = 0.01
Non linear solver - Residual 8.30994138e-03; Tolerance 1.0000e-02 - Converged - 9.481746e+03 (secs)

In func SystemLinearEquations_NonLinearExecute: Converged after 14 iterations.

Pressure iterations:  47
Velocity iterations:  20 (presolve)
Velocity iterations: 747 (pressure solve)
Velocity iterations:  20 (backsolve)
Velocity iterations: 787 (total solve)

SCR RHS  setup time: 4.3717e+00
SCR RHS  solve time: 1.7601e+01
Pressure setup time: 6.0389e-03
Pressure solve time: 6.4010e+02
Velocity setup time: 7.1526e-07 (backsolve)
Velocity solve time: 1.6970e+01 (backsolve)
Total solve time   : 6.7992e+02

Velocity solution min/max: 0.0000e+00/0.0000e+00
Pressure solution min/max: 0.0000e+00/0.0000e+00

('hgn 12: ', datetime.datetime(2019, 6, 19, 15, 59, 3, 87266))
step =      0; time = 0.00000e+00; time_Myr = 0.00000e+00

I have attached the model viscosity and velocity solutions at the 0th step for these two models below. They show very similar results, except at the trench, where the low viscosity in the 32^3 case is a bit more blurry.

33x32x32 case: SlabSubduction3D2A2Step0

32x32x32 case: SlabSubduction3D2CStep0

I will attribute this convergence rate difference to coincidence for now, but will keep an eye on it.

Best regards, Ting

lmoresi commented 5 years ago

I don’t understand those numbers… can you clarify “one time faster / slower”?

L

jmansour commented 5 years ago

@lmoresi, I think he means twice as fast for the 33x32x32 local-resolution run (although it's closer to three times as fast). ~Same global resolution~. So:

33x32x32:

  Multigrid setup:        = 0.242 secs
  RHS V Solve:            = 6.193 secs / 18 its
  Pressure Solve:         = 212.4 secs / 48 its
  Final V Solve:          = 5.804 secs / 20 its

  Total BSSCR Linear solve time: 236.770290 seconds

32x32x32:

  Multigrid setup:        = 0.3395 secs
  RHS V Solve:            = 17.6 secs / 20 its
  Pressure Solve:         = 640.1 secs / 47 its
  Final V Solve:          = 16.97 secs / 20 its

  Total BSSCR Linear solve time: 679.924037 seconds

@tingyang2004, did you re-run the 32x32x32 model? If not, can you, so that we can see whether these timings are repeatable? They're certainly very peculiar.

tingyang2004 commented 5 years ago

Louis and John,

I think the 32x32x32-per-core model is slower than the 33x32x32 one in this specific case, so I guess it is just a coincidence.

The 32x32x32-per-core model (256x96x64 elements in total on 48 cores) takes around 20 min per linear iteration, or around 40 min per step. The 33x32x32-per-core model (264x96x64 elements in total on 48 cores) takes around 10 min per linear iteration, or around 20 min per step. These results seem repeatable on one server.
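For clarity, the two decompositions work out as follows (plain arithmetic; the 8x3x2 process grid is simply the split that reproduces the per-core sizes quoted above):

# Per-core element counts implied by the two 3D runs on 48 cores (8 x 3 x 2 grid assumed).
for nx, ny, nz in [(264, 96, 64), (256, 96, 64)]:
    local = (nx // 8, ny // 3, nz // 2)
    print(f"{nx}x{ny}x{nz}: {nx * ny * nz} elements in total, "
          f"{local[0]}x{local[1]}x{local[2]} per core ({nx * ny * nz // 48} elements/core)")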

tingyang2004 commented 5 years ago

Log files for these two models are attached in case you need more information. input2A2.out_bk.txt input2C.out_bk.txt
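In case it helps with the comparison, here is a small sketch for pulling the solve times out of the two logs attached above; it simply collects the "Total BSSCR Linear solve time" lines from each file:

import re

# Summarise the "Total BSSCR Linear solve time" entries from each attached log
# so the two runs can be compared solve by solve.
for fname in ["input2A2.out_bk.txt", "input2C.out_bk.txt"]:
    with open(fname) as fh:
        times = [float(m.group(1)) for m in
                 re.finditer(r"Total BSSCR Linear solve time:\s*([\d.]+)", fh.read())]
    if times:
        print(f"{fname}: {len(times)} solves, mean {sum(times)/len(times):.1f} s, "
              f"max {max(times):.1f} s")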

jmansour commented 5 years ago

Ok, sorry, it's a different global resolution (of course). Still very strange.

jmansour commented 5 years ago

I'll close this ticket as I think it's run its course. Feel free to reopen.