Closed — tingyang2004 closed this issue 5 years ago.
Yes the communication between nodes is the problem. The model is over decomposed and the communication overhead is significant relative to the compute (individual CPU) work. That's why time goes up the more CPUs you use.
Unfortunately it is difficult to give you a good "rule of thumb" for CPU count vs parallel decomposition. It depends on the model, solver setup, and hardware platform. For a 1210x202 model I would guess around 4-12 procs would be good.
Thanks, Julian. Hmmm, that is strange. In rifting models I have tried before (also with very complex and non-linear rheology), 20x20 elements per CPU can be much faster than 50x50 elements. I am wondering why 50x50 elements per CPU is much slower than 100x100 elements per CPU in this case.
What usually works for me for a subduction model is the following:
32^3 = 32768 cells —> 1-2 (prescribed) to 8 CPUs; 2 CPUs.
1210*210 / 32768 —> 8 to 16 CPUs.
Increasing the number of cores does not necessarily increase performance; it also depends on your model complexity and so on.
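A quick check of the cells-per-core heuristic above (~32^3 cells per core) reproduces the suggested range; this is just illustrative arithmetic, not an Underworld recommendation:

```python
# Roberta's heuristic: roughly 32**3 = 32768 cells per core.
# For the 1210*210 figure quoted above (the model itself is 1210x202),
# that works out to just under 8 cores, hence the "8 to 16 cpus" range.

cells_per_core = 32 ** 3
total_cells = 1210 * 210
cores_suggested = total_cells / cells_per_core
print(cells_per_core, round(cores_suggested, 2))  # ~7.75 cores
```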
Thanks, Roberta,
I have reduced the model size to 605x129 elements. Two core counts were investigated for the same script. For 24 cores, that is, 51x65 elements per core, it takes around 9 min to finish the first 10 steps. For 12 cores, that is, 101x65 elements per core, it takes around 16 min to finish the first 10 steps. One possible explanation is that as the total number of elements increases, the preferred number of elements per core also increases.
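The two runs above (12 cores: ~16 min, 24 cores: ~9 min for the first 10 steps) imply a reasonable scaling; a small sketch of the speedup and parallel-efficiency arithmetic:

```python
# Speedup and parallel efficiency implied by the two runs above
# (12 cores: ~16 min, 24 cores: ~9 min for the first 10 steps).

def speedup(t_slow, t_fast):
    """How many times faster the second run is."""
    return t_slow / t_fast

def parallel_efficiency(t_base, cores_base, t_new, cores_new):
    """Speedup divided by the ideal speedup from the extra cores."""
    return speedup(t_base, t_new) * cores_base / cores_new

s = speedup(16.0, 9.0)                      # ~1.78x on twice the cores
e = parallel_efficiency(16.0, 12, 9.0, 24)  # ~0.89 efficiency
print(f"speedup {s:.2f}x, efficiency {e:.2f}")
```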
By the way, can you really run 32**3 elements on one CPU? I thought it would take hours to run even one step.
Ting, I'm not sure what solver setup you're using, but MG will generally be optimal when you choose resolutions that are powers of 2, and you should avoid primes. So instead of 605x129, you should perhaps use 512x128.
You might also consider using MUMPS.
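The powers-of-2 suggestion can be made concrete: geometric multigrid coarsens by repeatedly halving the element counts, so a dimension's useful coarsening depth is roughly the number of factors of 2 it contains. A simplified sketch (the actual Underworld/PETSc coarsening logic is more flexible than this):

```python
# How many times an element count can be halved cleanly -- a rough proxy
# for how many MG coarse levels that dimension supports.

def coarsening_levels(n):
    """Number of times n can be halved to an integer."""
    levels = 0
    while n % 2 == 0:
        n //= 2
        levels += 1
    return levels

print(coarsening_levels(512))  # halves cleanly 9 times
print(coarsening_levels(605))  # odd: cannot be halved at all
print(coarsening_levels(264))  # 264 = 8 * 33, so 3 halvings
```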
Thanks John. MUMPS is used for the Stokes solver, so the power-of-2 rule is not needed. I guess MG is still much less efficient than MUMPS in 2D models? Not sure what you mean by prime.
Prime numbers, which are the worst for MG. In any case, I believe you are correct that it shouldn't matter for MUMPS.
I would suggest then you do also try MG for comparison.
Thanks, I'll have another try and see whether MG is more efficient than MUMPS for this case.
Well, my recent tests for 3D subduction models (264x96x64 elements) do suggest that 32x32x32 elements per core gives the best parallel efficiency. On the other hand, for 2D subduction models, 50x50 to 100x100 elements per core gives the best parallel efficiency, depending on the mesh resolution. Will close this since my doubts have been clarified now.
What is, approximately, your total BSSCR linear solve time per time step, and how many non-linear iterations do you usually need for a 264x96x64-element model?
It takes around 15 non-linear iterations to converge for the 0th step, and 1-2 non-linear iterations for the following steps, similar to the 2D cases. But each non-linear iteration is pretty slow: around 10 min with 32x32x32 elements per core, so around 20 min for each time step. The first step of the model is attached below, showing the rheology (diffusion, dislocation, yielding) used. It would be very helpful if the convergence could be accelerated a bit.
Hi @tingyang2004, thanks a lot for the timings and proc counts. Could you let us know what non-linear solver tolerance you are using?
`solver.options.scr.ksp_rtol=1.0e-6` for the linear iteration and `solver.solve(nonLinearIterate=True, nonLinearTolerance=0.01)` for the non-linear iteration, if that helps. @arijitlaik
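For context, those two options slot into an Underworld2 solver setup roughly as follows. This is a sketch, not Ting's actual script: the Stokes system construction and the field/function names (`velocityField`, `viscosityFn`, etc.) are placeholders, and the `set_inner_method("mumps")` call is inferred from the MUMPS discussion above.

```python
import underworld as uw

# ... mesh, fields, BCs and rheology functions defined elsewhere ...
stokes = uw.systems.Stokes(velocityField=velocityField,
                           pressureField=pressureField,
                           conditions=[velBC],
                           fn_viscosity=viscosityFn,
                           fn_bodyforce=buoyancyFn)
solver = uw.systems.Solver(stokes)

solver.set_inner_method("mumps")      # direct inner solve ("mg" for multigrid)
solver.options.scr.ksp_rtol = 1.0e-6  # linear (Schur complement) tolerance
solver.solve(nonLinearIterate=True,
             nonLinearTolerance=0.01) # outer non-linear tolerance
```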
thanks :)
@tingyang2004 are these results for MUMPS? Did you end up trying MG for comparison?
All 3D models use MG. One interesting finding is that 33x32x32 elements per CPU can be several times faster than 32x32x32 elements per CPU, maybe due to the interaction between the complex rheology and the mesh resolution.
Really? That is indeed 'interesting'. 🤨
Forget about the words below. There was an issue with the system; the convergence speed does increase with an increasing number of cores on a different system, although I did not check whether 32^3 per core is significantly slower than 33x32x32 per core.
Another interesting thing is that, for a 256x96x64 model, it takes around 40 min per step with 48 cores, i.e., 32^3 elements per core. When I double the number of cores to 96, or even to 192, the convergence gets slower. However, when I double the number of cores again to 384, it takes around 4 min per step, around 10 times faster than 48 cores. There seem to be a lot of bizarre things there.
That is very strange @tingyang2004. Might be worth also checking how repeatable those timings are.
What machine are you running on? Are the hardware architecture details available?
Below is the first paragraph from the User's Manual. @julesghub @jmansour I will check the performance again later today.
“TaiYi” is a supercomputer based on Intel Xeon Gold processors from the Skylake generation. It is a Lenovo system composed of SD530 Compute Racks and an Intel Omni-Path high-performance network interconnect, running Red Hat Enterprise Linux Server as the operating system. Its current Linpack Rmax performance is 1.67 Petaflops.
Hi Ting,
Would you mind sharing a screenshot of your velocity field at the first step? I'm just curious to see how it looks. I'll start running my 3D models again next week and then we can compare results. What is the size of the model domain (coords)?
The velocity field shown above is for the first (0th) time step, although I did not scale it with velocity magnitude, so only the direction information is meaningful. The domain size is 1500 km deep x 3000 km long x 1500 km half-wide.
And I am happy to compare the results with you. @rcarluccio
Did you send another screenshot? I can only visualise this: https://user-images.githubusercontent.com/26615840/59741771-84ae1500-929e-11e9-8ec8-6617b0fdbdd2.png. Thanks for the info, I'll reproduce something similar.
No, this is what I meant.
Note that I have skipped the subduction initiation period, as always.
Regarding the issue that 33x32x32 elements per CPU is one time faster than 32x32x32 elements per CPU: I have checked the log file today. It looks like the pressure solve is one time slower in the 32^3 case.
33x32x32 model:
Non linear solver - iteration 14
Linear solver (XQ96L7RB__system-execute)
[1] SROpGenerator_SimpleCoarserLevel: time = 1.83635e-02
Setting schur_pc to "gkgdiag"
SCR Solver Summary:
Multigrid setup: = 0.242 secs
RHS V Solve: = 6.193 secs / 18 its
Pressure Solve: = 212.4 secs / 48 its
Final V Solve: = 5.804 secs / 20 its
Total BSSCR Linear solve time: 236.770290 seconds
Linear solver (XQ96L7RB__system-execute), solution time 2.369147e+02 (secs)
In func SystemLinearEquations_NonLinearExecute: Iteration 14 of 500 - Residual 0.0080322 - Tolerance = 0.01
Non linear solver - Residual 8.03220671e-03; Tolerance 1.0000e-02 - Converged - 4.615428e+03 (secs)
In func SystemLinearEquations_NonLinearExecute: Converged after 14 iterations.
Pressure iterations: 48
Velocity iterations: 18 (presolve)
Velocity iterations: 755 (pressure solve)
Velocity iterations: 20 (backsolve)
Velocity iterations: 793 (total solve)
SCR RHS setup time: 1.1543e+01
SCR RHS solve time: 6.1928e+00
Pressure setup time: 6.2304e-03
Pressure solve time: 2.1236e+02
Velocity setup time: 9.5367e-07 (backsolve)
Velocity solve time: 5.8037e+00 (backsolve)
Total solve time : 2.3677e+02
Velocity solution min/max: 0.0000e+00/0.0000e+00
Pressure solution min/max: 0.0000e+00/0.0000e+00
('hgn 12: ', datetime.datetime(2019, 6, 19, 19, 17, 31, 464663))
step = 0; time = 0.00000e+00; time_Myr = 0.00000e+00
32x32x32 model:
Non linear solver - iteration 14
Linear solver (DJ25GQCN__system-execute)
BSSCR -- Block Stokes Schur Compliment Reduction Solver
AUGMENTED LAGRANGIAN K2 METHOD - Penalty = 0.000000
SROpGenerator_SimpleFinestLevel: time = 2.39016e-01
[4] SROpGenerator_SimpleCoarserLevel: time = 4.37264e-02
[3] SROpGenerator_SimpleCoarserLevel: time = 1.78812e-02
[2] SROpGenerator_SimpleCoarserLevel: time = 1.48602e-02
[1] SROpGenerator_SimpleCoarserLevel: time = 1.41797e-02
Setting schur_pc to "gkgdiag"
SCR Solver Summary:
Multigrid setup: = 0.3395 secs
RHS V Solve: = 17.6 secs / 20 its
Pressure Solve: = 640.1 secs / 47 its
Final V Solve: = 16.97 secs / 20 its
Total BSSCR Linear solve time: 679.924037 seconds
Linear solver (DJ25GQCN__system-execute), solution time 6.800786e+02 (secs)
In func SystemLinearEquations_NonLinearExecute: Iteration 14 of 500 - Residual 0.0083099 - Tolerance = 0.01
Non linear solver - Residual 8.30994138e-03; Tolerance 1.0000e-02 - Converged - 9.481746e+03 (secs)
In func SystemLinearEquations_NonLinearExecute: Converged after 14 iterations.
Pressure iterations: 47
Velocity iterations: 20 (presolve)
Velocity iterations: 747 (pressure solve)
Velocity iterations: 20 (backsolve)
Velocity iterations: 787 (total solve)
SCR RHS setup time: 4.3717e+00
SCR RHS solve time: 1.7601e+01
Pressure setup time: 6.0389e-03
Pressure solve time: 6.4010e+02
Velocity setup time: 7.1526e-07 (backsolve)
Velocity solve time: 1.6970e+01 (backsolve)
Total solve time : 6.7992e+02
Velocity solution min/max: 0.0000e+00/0.0000e+00
Pressure solution min/max: 0.0000e+00/0.0000e+00
('hgn 12: ', datetime.datetime(2019, 6, 19, 15, 59, 3, 87266))
step = 0; time = 0.00000e+00; time_Myr = 0.00000e+00
I have attached the model viscosity and velocity solutions at the 0th step for these two models below. They show very close results, except at the trench, where the low viscosity in the 32^3 case is a bit more blurry. 33x32x32 case:
32x32x32 case:
I will attribute this convergence rate difference to coincidence by now, but will keep an eye on it.
Best regards, Ting
I don’t understand those numbers … can you clarify “one time faster / slower”
L
@lmoresi, I think he means twice as fast for 33x32x32 local resolution run (although it's closer to three times as fast). ~Same global resolution~. So
33x32x32:
Multigrid setup: = 0.242 secs
RHS V Solve: = 6.193 secs / 18 its
Pressure Solve: = 212.4 secs / 48 its
Final V Solve: = 5.804 secs / 20 its
Total BSSCR Linear solve time: 236.770290 seconds
32x32x32:
Multigrid setup: = 0.3395 secs
RHS V Solve: = 17.6 secs / 20 its
Pressure Solve: = 640.1 secs / 47 its
Final V Solve: = 16.97 secs / 20 its
Total BSSCR Linear solve time: 679.924037 seconds
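Reading the two summaries side by side, the iteration counts are nearly identical, so the gap is in per-iteration cost rather than convergence of the outer solve. A quick sanity check on the numbers quoted above:

```python
# Per-iteration pressure-solve cost from the two BSSCR summaries above:
# (pressure solve seconds, pressure iterations) for each local resolution.

runs = {
    "33x32x32": (212.4, 48),
    "32x32x32": (640.1, 47),
}

per_it = {name: t / its for name, (t, its) in runs.items()}
ratio = per_it["32x32x32"] / per_it["33x32x32"]

# ~4.4 s/it vs ~13.6 s/it: the 32^3 run is ~3x slower per iteration,
# with essentially the same iteration counts (48 vs 47).
print({k: round(v, 2) for k, v in per_it.items()}, round(ratio, 2))
```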
@tingyang2004 did you re-run the 32x32x32 case? If not, can you, so that we can see whether these timings are repeatable? They're certainly very peculiar.
Louis and John,
I think 32x32x32 per core model is slower than 33x32x32 in this specific case, so I guess it is just a coincidence.
The 32x32x32-per-core model (256x96x64 elements in total on 48 cores) takes around 20 min per linear iteration, or around 40 min per step. The 33x32x32-per-core model (264x96x64 elements in total on 48 cores) takes around 10 min per linear iteration, or around 20 min per step. These results seem repeatable on one server.
Log files for these two models are attached in case you need more information. input2A2.out_bk.txt input2C.out_bk.txt
Ok, sorry, it's different global resolution (of course). Still very strange.
I'll close this ticket as I think it's run its course. Feel free to reopen.
Hi all,
I am running a slab subduction model with a model resolution of 1210x202. Periodic boundary condition is applied for the side boundaries (a snapshot of model evolution is as below). When I use 24 cores, that is, around 101x101 elements in each core, it cost me around 50 min to finish the first 30 steps. Since I am planning to run the model for over 10 thousand steps, I tried to increase the core number to speed up the model. However, when I use 48 cores, that is, around 76x68 elements in each core, it cost me around 46 min to finish the first 30 steps, less than 10% faster than 24 cores. When I use 96 cores, around 51x51 elements in each core, it cost me even around 80 min to finish the first 30 steps. I am wondering what is causing the slower convergence rate increasing the number of cores to 96? Is it communication between different nodes? Any suggestion I can speed up the model? A part of the stokes solver information is demonstrated below, you can see that pressure iteration gets much slower when I use more cores: 24 Cores:
48 Cores:
96 Cores:
Thanks a lot and let me know if you need more information. Ting
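The per-core figures quoted in the question can be reproduced with simple Cartesian-decomposition arithmetic. A sketch (the actual Underworld/PETSc decomposition may choose different process grids):

```python
import math

def tile(nx, ny, px, py):
    """Approximate per-core tile for an nx-by-ny mesh on a px-by-py process grid."""
    return math.ceil(nx / px), math.ceil(ny / py)

# 1210x202 mesh from the question:
print(tile(1210, 202, 12, 2))   # 24 cores -> (101, 101)
print(tile(1210, 202, 16, 3))   # 48 cores -> (76, 68)
print(tile(1210, 202, 24, 4))   # 96 cores -> (51, 51)
```

As the tiles shrink, each core's halo (the boundary data exchanged with neighbours) grows relative to its interior work, which is consistent with the communication-overhead explanation given in the replies.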