hypre-space / hypre

Parallel solvers for sparse linear systems featuring multigrid methods.
https://www.llnl.gov/casc/hypre/

Elasticity suite on HIP #601

Open jedbrown opened 2 years ago

jedbrown commented 2 years ago

I'm trying to tune a BoomerAMG configuration (via PETSc) to run on HIP. I'm using hypre-2.24 and rocm-5.0.2. I believe I'm matching the MFEM elasticity configuration, as explained in this paper. I have two issues: the iteration counts and condition numbers are higher than expected, and the setup time is huge compared to the default configuration.

Default options

Just HYPRE_BoomerAMGSetNumFunctions to 3 and HYPRE_BoomerAMGSetInterpVectors to the rigid body modes.
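
For reference, this is roughly what those two calls look like in hypre's C API; a minimal sketch, assuming precond is an already-created BoomerAMG solver and rbm holds the six rigid body modes as HYPRE_ParVectors built elsewhere:

#include "HYPRE_parcsr_ls.h"

/* Sketch: attach the elasticity near-nullspace information to BoomerAMG.
   precond and rbm are assumed to come from the surrounding application. */
void set_default_elasticity_options(HYPRE_Solver precond, HYPRE_ParVector *rbm)
{
   HYPRE_BoomerAMGSetNumFunctions(precond, 3);       /* 3 unknowns per node for 3D elasticity */
   HYPRE_BoomerAMGSetInterpVectors(precond, 6, rbm); /* 6 rigid body modes */
}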

    0 SNES Function norm 6.871762231729e-03
    Linear solve converged due to CONVERGED_RTOL iterations 18
Iteratively computed extreme singular values: max 0.99667 min 0.0366636 max/min 27.1842
    1 SNES Function norm 3.782412646902e-02
    Linear solve converged due to CONVERGED_RTOL iterations 15
Iteratively computed extreme singular values: max 0.996443 min 0.0460206 max/min 21.6521
    2 SNES Function norm 1.195210490733e-03
    Linear solve converged due to CONVERGED_RTOL iterations 26
Iteratively computed extreme singular values: max 0.999151 min 0.0190504 max/min 52.4478
    3 SNES Function norm 1.368932216139e-04
    Linear solve converged due to CONVERGED_RTOL iterations 22
Iteratively computed extreme singular values: max 0.998763 min 0.0132859 max/min 75.1744
    4 SNES Function norm 1.270933979359e-07
    Linear solve converged due to CONVERGED_RTOL iterations 29
Iteratively computed extreme singular values: max 0.998951 min 0.01209 max/min 82.6262
    5 SNES Function norm 1.072381496388e-10
    Linear solve converged due to CONVERGED_RTOL iterations 34
Iteratively computed extreme singular values: max 0.999746 min 0.0120261 max/min 83.1315
    6 SNES Function norm 1.071094244724e-13
  Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 6

Iteration counts above are a bit higher than with GAMG. Condition numbers for GAMG stay in the range 17 to 28 throughout this solve, but GAMG's setup time is enough higher that Hypre has a slight edge, depending on node and job launch configuration (though with less robustness, as seen in the CG condition number estimates).

Elasticity suite

(This is the coarse problem of a matrix-free p-MG. The p-MG is very reliable. I can show the reduced problem, but this is what I care about, and it's qualitatively the same for the linear discretization.)

-mg_coarse_pc_hypre_boomeramg_nodal_coarsen 4
-mg_coarse_pc_hypre_boomeramg_nodal_coarsen_diag 1
-mg_coarse_pc_hypre_boomeramg_relax_type_coarse l1scaled-SOR/Jacobi
-mg_coarse_pc_hypre_boomeramg_vec_interp_variant 2
-mg_coarse_pc_hypre_boomeramg_vec_interp_qmax 4
-mg_coarse_pc_hypre_boomeramg_vec_interp_smooth
-mg_coarse_pc_hypre_boomeramg_interp_refine 1
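
For context, a sketch of the direct hypre setters these PETSc options should map to, under my reading of the option names; precond is an assumed BoomerAMG solver handle, and the value 8 for the l1scaled-SOR/Jacobi coarse relaxation is my guess from hypre's relax-type table:

#include "HYPRE_parcsr_ls.h"

/* Sketch only: approximate hypre equivalents of the options above. */
void set_elasticity_suite(HYPRE_Solver precond)
{
   HYPRE_BoomerAMGSetNodal(precond, 4);               /* nodal_coarsen 4 */
   HYPRE_BoomerAMGSetNodalDiag(precond, 1);           /* nodal_coarsen_diag 1 */
   HYPRE_BoomerAMGSetCycleRelaxType(precond, 8, 3);   /* relax_type_coarse (value assumed) */
   HYPRE_BoomerAMGSetInterpVecVariant(precond, 2);    /* vec_interp_variant 2 (GM variant 2) */
   HYPRE_BoomerAMGSetInterpVecQMax(precond, 4);       /* vec_interp_qmax 4 */
   HYPRE_BoomerAMGSetSmoothInterpVectors(precond, 1); /* vec_interp_smooth */
   HYPRE_BoomerAMGSetInterpRefine(precond, 1);        /* interp_refine 1 */
}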

Setup time is over 100x longer than seen above (I ran a much smaller problem because I'm impatient), and iteration counts and condition numbers are much higher.

    0 SNES Function norm 2.290654724037e-03
    Linear solve converged due to CONVERGED_RTOL iterations 56
Iteratively computed extreme singular values: max 0.999729 min 0.00227529 max/min 439.385
    1 SNES Function norm 1.016780701028e-02
    Linear solve converged due to CONVERGED_RTOL iterations 34
Iteratively computed extreme singular values: max 0.999588 min 0.00495738 max/min 201.636
    2 SNES Function norm 6.241305825148e-04
    Linear solve converged due to CONVERGED_RTOL iterations 53
Iteratively computed extreme singular values: max 0.999841 min 0.00128697 max/min 776.898
    3 SNES Function norm 6.511700153462e-05
    Linear solve converged due to CONVERGED_RTOL iterations 48
Iteratively computed extreme singular values: max 0.99977 min 0.000992 max/min 1007.83
    4 SNES Function norm 8.152412445314e-08
    Linear solve converged due to CONVERGED_RTOL iterations 63
Iteratively computed extreme singular values: max 0.999734 min 0.000912558 max/min 1095.53
    5 SNES Function norm 8.470094594359e-11
    Linear solve converged due to CONVERGED_RTOL iterations 67
Iteratively computed extreme singular values: max 0.999908 min 0.00196439 max/min 509.018
    6 SNES Function norm 8.912298307564e-14
  Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 6

Question

  1. Is this elasticity suite expected to run alright on GPU?
  2. Is there a different configuration I should try that would run on GPU and might be better than the default for elasticity?

If it helps, my test models are scaled up variants of this sort of structure. [image]

jedbrown commented 2 years ago

Also, with the elasticity suite, I sometimes have HYPRE_BoomerAMGSetup return error code 12 = 0b1100. I'm not sure how to translate this.

#define HYPRE_ERROR_GENERIC         1   /* generic error */
#define HYPRE_ERROR_MEMORY          2   /* unable to allocate memory */
#define HYPRE_ERROR_ARG             4   /* argument error */
/* bits 4-8 are reserved for the index of the argument error */
#define HYPRE_ERROR_CONV          256   /* method did not converge as expected */
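
A minimal sketch of decoding such a return value, assuming the bit layout in those defines (under that reading, 12 would be HYPRE_ERROR_ARG with argument index 1); HYPRE_CheckError tests individual error bits:

#include <stdio.h>
#include "HYPRE_utilities.h"

/* Sketch: report which error bits are set in a hypre return code. */
void decode_hypre_error(HYPRE_Int ierr)
{
   if (HYPRE_CheckError(ierr, HYPRE_ERROR_GENERIC)) printf("generic error\n");
   if (HYPRE_CheckError(ierr, HYPRE_ERROR_MEMORY))  printf("memory allocation failed\n");
   if (HYPRE_CheckError(ierr, HYPRE_ERROR_CONV))    printf("did not converge as expected\n");
   if (HYPRE_CheckError(ierr, HYPRE_ERROR_ARG))     /* argument index assumed in bits 4-8 */
      printf("argument error, argument index %d\n", (int)((ierr >> 3) & 0x1f));
}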

Most recently, this was for a problem with less than 100k dofs after two successful Newton solves.

   0 SNES Function norm 3.435982086055e-03
    Linear solve converged due to CONVERGED_RTOL iterations 62
Iteratively computed extreme singular values: max 0.99973 min 0.00244952 max/min 408.133
    1 SNES Function norm 1.619585499176e-02
    Linear solve converged due to CONVERGED_RTOL iterations 30
Iteratively computed extreme singular values: max 0.999418 min 0.0104634 max/min 95.5159
    2 SNES Function norm 7.427067557071e-04
[0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[0]PETSC ERROR: Error in external library
[0]PETSC ERROR: Error in jac->setup(): error code 12

liruipeng commented 2 years ago

@jedbrown @tzanio We don't have this elasticity solver ported to GPU yet.

jedbrown commented 2 years ago

Okay, thanks. Should the math still work out to give small condition numbers?

Are there any options you'd suggest to improve elasticity convergence while running on GPU?

victorapm commented 2 years ago

Hi Jed,

I'm not sure about your first question. Regarding the second, have you tried the unknown-based approach of BoomerAMG? In this approach, interpolation is computed only within the same variable types. That should work at least with CUDA.

In MFEM, you can find how to follow this approach at https://github.com/mfem/mfem/blob/a8b5004ee9c81ae6ee730ae1fc8a3a78625022b1/linalg/hypre.cpp#L4609-L4650.

I'm interested in reproducing your test suite on elasticity. Can you give me directions on how to do that with PETSc?

Thank you! Victor

PS: I just realized your default options are probably what I suggested with the unknown-based approach. I'm still interested in trying out your problem, if you can give me the directions... Thanks!

PS2: Could you check the value of the strength threshold being used? Hypre's default is 0.25; however, in my experience, a higher value such as 0.5 might lead to better convergence for this kind of problem.
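
(For reference, in the hypre C API the threshold is set with HYPRE_BoomerAMGSetStrongThreshold; a one-line sketch, with precond an assumed BoomerAMG solver handle:)

HYPRE_BoomerAMGSetStrongThreshold(precond, 0.5); /* hypre's default is 0.25 */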

liruipeng commented 2 years ago

These are the BoomerAMG parameters used by MFEM for elasticity problems, https://github.com/mfem/mfem/blob/a8b5004ee9c81ae6ee730ae1fc8a3a78625022b1/linalg/hypre.cpp#L4730

OK. Never mind. You already use these parameters. I assume you use unified memory with GPUs, which is the only way I can see this working, so it's not surprising that the setup is so long. I am not sure about the condition number difference. If the same algorithm were running on GPUs, it should be the same. Is anything in this solver set differently for GPUs, such as the coarsening, smoother, etc., for performance reasons? @tzanio

jedbrown commented 2 years ago

@victorapm Yeah, those options are about what I'm doing, but with the addition of the interp vectors.

@liruipeng Yes, that's the elasticity suite I linked up top, but it's giving worse condition numbers and convergence rates than the standard options (never mind that the setup doesn't run on GPU). That seems ... unexpected.

@victorapm To run this, you'll need PETSc and libceed main. You can get both by configuring PETSc with --download-libceed --download-libceed-commit=origin/main. Configure PETSc and libCEED with HIP or CUDA if you want, and add -ceed /gpu/hip or -ceed /gpu/cuda below. Drop the -mg_coarse_pc_type hypre below for the default (PETSc's GAMG, which is smoothed aggregation).

$ git clone https://gitlab.com/micromorph/ratel
$ make build/ex02-quasistatic-elasticity
$ mpiexec -n 4 build/ex02-quasistatic-elasticity -order 2 -dm_plex_shape schwarz_p -dm_plex_tps_thickness .2 -dm_plex_tps_extent 3,3,3 -dm_plex_tps_layers 2 -dm_plex_tps_refine 2 -material fs-current-nh -E 1 -nu .3 -bc_clamp 1 -bc_traction 2 -bc_traction_2 .02,0,0 -ts_dt 1 -ts_adapt_monitor -snes_monitor -ksp_converged_reason -dm_view -ksp_rtol 1e-3 -snes_converged_reason -ksp_view_singularvalues -log_view -mg_coarse_pc_type hypre -mg_coarse_pc_hypre_boomeramg_strong_threshold 0.5
[...]
    0 SNES Function norm 3.435982086055e-03
    Linear solve converged due to CONVERGED_RTOL iterations 22
Iteratively computed extreme singular values: max 0.998342 min 0.0241254 max/min 41.3815
    1 SNES Function norm 1.620015186090e-02
    Linear solve converged due to CONVERGED_RTOL iterations 16
Iteratively computed extreme singular values: max 0.997232 min 0.0380382 max/min 26.2166
    2 SNES Function norm 7.432727594336e-04
    Linear solve converged due to CONVERGED_RTOL iterations 30
Iteratively computed extreme singular values: max 0.999412 min 0.00856109 max/min 116.739
    3 SNES Function norm 9.017243007464e-05
    Linear solve converged due to CONVERGED_RTOL iterations 18
Iteratively computed extreme singular values: max 0.997949 min 0.0291622 max/min 34.2207
    4 SNES Function norm 1.040703764240e-07
    Linear solve converged due to CONVERGED_RTOL iterations 34
Iteratively computed extreme singular values: max 0.999439 min 0.00845893 max/min 118.152
    5 SNES Function norm 9.317309190373e-11
    Linear solve converged due to CONVERGED_RTOL iterations 33
Iteratively computed extreme singular values: max 0.999476 min 0.00910391 max/min 109.785
    6 SNES Function norm 7.437332441628e-14
  Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 6
[...]

This process will get a bit simpler after the upcoming releases.

liruipeng commented 2 years ago

@jedbrown It is unexpected... by "standard options" you meant the MFEM elasticity on CPUs, right? We need to take a closer look at what's actually running on GPUs (I guess only the solve phase) and with what parameters. I will coordinate with @victorapm. Thank you for reporting this.

jedbrown commented 2 years ago

"standard options" = "Default options" up top :point_up: . Adding the elasticity suite of options (my PETSc command line there is equivalent to whtat MFEM does in code) is causing more iterations and larger condition numbers. I'd expect it to give more reliable small condition numbers, even if much higher setup costs.

ulrikeyang commented 2 years ago

I noticed that you use a strength threshold of 0.5. We usually use a larger one, like 0.7 or so, for elasticity. You could try to increase it to see if that improves things.

jedbrown commented 2 years ago

I'm seeing smaller thresholds converging somewhat better.

ulrikeyang commented 2 years ago

Interesting. 0.25 is our default for HMIS or PMIS coarsening for diffusion problems.

victorapm commented 2 years ago

Thanks for the instructions, @jedbrown!

I'm able to run your test case, but the number of iterations that I'm getting is different than yours. Could you double-check if my command-line options are correct? petsc-default.log

Also, I noticed PETSc uses different defaults than hypre for BoomerAMG, i.e., it uses Falgout-CLJP coarsening, a symmetric GS smoother, and classical modified interpolation with no truncation. I'll try it out with our defaults and let you know.

Thanks!

jedbrown commented 2 years ago

Interesting, I see the convergence is quite different when run on CPU versus configured with HIP, because different algorithms are requested. I need a different build of Hypre when I give it input on the device. Comparing these two log files (small problem, run sequentially) shows how the algorithms differ.

ratel-hip.log ratel-cpu.log


victorapm commented 2 years ago

@jedbrown, I see that PETSc changes its default options for BoomerAMG when using the device, see lines 1950-1962. Alternatively, you could match the options used for GPUs on the CPU side as well.

I tested your problem with our default CPU options for BoomerAMG: hypre-default.log

It turns out that the PETSc CPU defaults work better:

Linear solve converged due to CONVERGED_RTOL iterations 10
Linear solve converged due to CONVERGED_RTOL iterations 8
Linear solve converged due to CONVERGED_RTOL iterations 15
Linear solve converged due to CONVERGED_RTOL iterations 13
Linear solve converged due to CONVERGED_RTOL iterations 15
Linear solve converged due to CONVERGED_RTOL iterations 17

vs hypre defaults:

Linear solve converged due to CONVERGED_RTOL iterations 16
Linear solve converged due to CONVERGED_RTOL iterations 11
Linear solve converged due to CONVERGED_RTOL iterations 17
Linear solve converged due to CONVERGED_RTOL iterations 17
Linear solve converged due to CONVERGED_RTOL iterations 24
Linear solve converged due to CONVERGED_RTOL iterations 26

I'll see if I can find a more suitable configuration...

jedbrown commented 2 years ago

Awesome, thanks for checking it out. And please let us know if a different GPU configuration is recommended. That configuration was selected by @stefanozampini based on what worked on GPU at the time.

victorapm commented 2 years ago

Jed, this is the closest GPU-friendly configuration, in terms of convergence, to the best CPU-friendly configuration of BoomerAMG for your problem:

-mg_coarse_pc_hypre_boomeramg_coarsen_type pmis
-mg_coarse_pc_hypre_boomeramg_interp_type ext+i
-mg_coarse_pc_hypre_boomeramg_no_CF
-mg_coarse_pc_hypre_boomeramg_P_max 6
-mg_coarse_pc_hypre_boomeramg_print_statistics 1
-mg_coarse_pc_hypre_boomeramg_relax_type_down Chebyshev
-mg_coarse_pc_hypre_boomeramg_relax_type_up Chebyshev
-mg_coarse_pc_hypre_boomeramg_strong_threshold 0.5
-mg_coarse_pc_type hypre

Here is the log: hypre-Pmax6-PMIS-th05-Cheb.log

And iteration counts:

Linear solve converged due to CONVERGED_RTOL iterations 13
Linear solve converged due to CONVERGED_RTOL iterations 10
Linear solve converged due to CONVERGED_RTOL iterations 15
Linear solve converged due to CONVERGED_RTOL iterations 17
Linear solve converged due to CONVERGED_RTOL iterations 19
Linear solve converged due to CONVERGED_RTOL iterations 22

The Chebyshev smoother is particularly important for improving convergence with respect to the GPU defaults set in PETSc (L1-Jacobi). Chebyshev works better than L1-Jacobi in general (right, @liruipeng?). So I would suggest updating the PETSc GPU defaults to that method (it is option 16).
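
For anyone applying this outside PETSc, a sketch of the corresponding hypre calls; precond is an assumed BoomerAMG solver handle and the Chebyshev order shown is only illustrative:

HYPRE_BoomerAMGSetCycleRelaxType(precond, 16, 1); /* Chebyshev on the down cycle */
HYPRE_BoomerAMGSetCycleRelaxType(precond, 16, 2); /* Chebyshev on the up cycle */
HYPRE_BoomerAMGSetChebyOrder(precond, 2);         /* polynomial order (illustrative) */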

liruipeng commented 2 years ago

@victorapm Convergence-wise Chebyshev is better than L1-Jacobi in general but maybe not in terms of time-to-solution. So we still keep Jacobi as the default. For elasticity it makes sense to have a stronger smoother such as Chebyshev. I am not so sure about the overall efficiency though.

victorapm commented 2 years ago

That makes sense, @liruipeng! Thanks! Perhaps I was not very careful in making the suggestion to change the defaults... please disregard that, @jedbrown. Still, as Ruipeng said, I believe it makes sense to use it for elasticity problems...

liruipeng commented 2 years ago

@victorapm @ulrikeyang OK. Thanks! We still need to figure out why our specialized elasticity solver in BoomerAMG (which I often refer to as the BKY solver) gives worse condition numbers.

jedbrown commented 2 years ago

Awesome, thanks so much. We'll use this suite in the scaling studies we plan over the next week (using both Hypre and GAMG).

victorapm commented 2 years ago

Sounds good, @jedbrown. Let us know the results you get. Thanks!