Open jedbrown opened 2 years ago
Also, with the elasticity suite, I sometimes have HYPRE_BoomerAMGSetup
return error code 12 = 0b1100. I'm not sure how to translate this.
#define HYPRE_ERROR_GENERIC 1 /* generic error */
#define HYPRE_ERROR_MEMORY 2 /* unable to allocate memory */
#define HYPRE_ERROR_ARG 4 /* argument error */
/* bits 4-8 are reserved for the index of the argument error */
#define HYPRE_ERROR_CONV 256 /* method did not converge as expected */
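My best guess, assuming the bit layout in those comments (argument index in bits 4-8), is that 12 = 0b1100 decodes to HYPRE_ERROR_ARG for argument 1. A minimal sketch of that decoding:

#include <stdio.h>
#include "HYPRE_utilities.h"  /* defines the HYPRE_ERROR_* masks quoted above */

/* Sketch: decode a combined hypre error code against the masks above,
   assuming bits 4-8 hold the index of the offending argument. */
static void decode_hypre_error(int ierr)
{
  if (ierr & HYPRE_ERROR_GENERIC) printf("generic error\n");
  if (ierr & HYPRE_ERROR_MEMORY)  printf("unable to allocate memory\n");
  if (ierr & HYPRE_ERROR_CONV)    printf("method did not converge as expected\n");
  if (ierr & HYPRE_ERROR_ARG)
    printf("argument error in argument %d\n", (ierr >> 3) & 0x1f);
  /* hypre also provides HYPRE_DescribeError(ierr, msg) for a text description. */
}

int main(void)
{
  decode_hypre_error(12); /* under the layout above: argument error in argument 1 */
  return 0;
}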
Most recently, this was for a problem with less than 100k dofs after two successful Newton solves.
0 SNES Function norm 3.435982086055e-03
Linear solve converged due to CONVERGED_RTOL iterations 62
Iteratively computed extreme singular values: max 0.99973 min 0.00244952 max/min 408.133
1 SNES Function norm 1.619585499176e-02
Linear solve converged due to CONVERGED_RTOL iterations 30
Iteratively computed extreme singular values: max 0.999418 min 0.0104634 max/min 95.5159
2 SNES Function norm 7.427067557071e-04
[0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[0]PETSC ERROR: Error in external library
[0]PETSC ERROR: Error in jac->setup(): error code 12
@jedbrown @tzanio We don't have this elasticity solver ported to GPU yet.
Okay, thanks. Should the math still work to have small condition numbers?
Are there any options you'd suggest to improve elasticity convergence while running on GPU?
Hi Jed,
I'm not sure about your first question. Regarding the second, have you tried the unknown-based approach of BoomerAMG? In this approach, interpolation is computed only within the same variable types. That should work at least with CUDA.
In MFEM, you can find how to follow this approach at https://github.com/mfem/mfem/blob/a8b5004ee9c81ae6ee730ae1fc8a3a78625022b1/linalg/hypre.cpp#L4609-L4650.
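At the hypre level, the unknown-based setup essentially amounts to telling BoomerAMG how many variables live at each node. A minimal sketch, assuming the three displacement components are interleaved per node:

#include "HYPRE_parcsr_ls.h"

/* Sketch: unknown-based BoomerAMG setup for 3-D elasticity, assuming the
   three displacement components are interleaved per node. */
HYPRE_Solver CreateUnknownBasedAMG(void)
{
  HYPRE_Solver amg;
  HYPRE_BoomerAMGCreate(&amg);
  /* Interpolation is then built only within each of the 3 variable types. */
  HYPRE_BoomerAMGSetNumFunctions(amg, 3);
  /* For a non-interleaved ordering, pass an explicit dof-to-function map
     instead: HYPRE_BoomerAMGSetDofFunc(amg, dof_func); */
  return amg;
}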
I'm interested in reproducing your test suite on elasticity. Can you give me directions on how to do that with PETSc?
Thank you! Victor
PS: I just realized your default options are probably what I suggested with the unknown-based approach. I'm still interested in trying out your problem, if you can give me the directions... Thanks!
PS2: Could you check the value of the strength threshold being used? Hypre's default is 0.25; however, in my experience, a higher value such as 0.5 might lead to better convergence for this kind of problem.
These are the BoomerAMG parameters used by MFEM for elasticity problems: https://github.com/mfem/mfem/blob/a8b5004ee9c81ae6ee730ae1fc8a3a78625022b1/linalg/hypre.cpp#L4730
OK, never mind. You already use these parameters. I assume you use unified memory on GPUs, which is the only way I can see this working, so it's not surprising that the setup is so long. I am not sure about the condition number difference; if the same algorithm were running on the GPUs, it should be the same. Is anything in this solver set differently for GPUs, such as the coarsening or smoother, for performance reasons? @tzanio
@victorapm Yeah, those options are about what I'm doing, but with the addition of the interp vectors.
@liruipeng Yes, that's the elasticity suite I linked up top, but it's giving worse condition numbers and convergence rates than the standard options (never mind that the setup doesn't run on the GPU). That seems ... unexpected.
@victorapm To run this, you'll need PETSc and libCEED main. You can get both by configuring PETSc with --download-libceed --download-libceed-commit=origin/main. Configure PETSc and libCEED with HIP or CUDA if you want, and add -ceed /gpu/hip or -ceed /gpu/cuda below. Drop the -mg_coarse_pc_type hypre below for the default (PETSc's GAMG, which is smoothed aggregation).
$ git clone https://gitlab.com/micromorph/ratel
$ make build/ex02-quasistatic-elasticity
$ mpiexec -n 4 build/ex02-quasistatic-elasticity -order 2 -dm_plex_shape schwarz_p -dm_plex_tps_thickness .2 -dm_plex_tps_extent 3,3,3 -dm_plex_tps_layers 2 -dm_plex_tps_refine 2 -material fs-current-nh -E 1 -nu .3 -bc_clamp 1 -bc_traction 2 -bc_traction_2 .02,0,0 -ts_dt 1 -ts_adapt_monitor -snes_monitor -ksp_converged_reason -dm_view -ksp_rtol 1e-3 -snes_converged_reason -ksp_view_singularvalues -log_view -mg_coarse_pc_type hypre -mg_coarse_pc_hypre_boomeramg_strong_threshold 0.5
[...]
0 SNES Function norm 3.435982086055e-03
Linear solve converged due to CONVERGED_RTOL iterations 22
Iteratively computed extreme singular values: max 0.998342 min 0.0241254 max/min 41.3815
1 SNES Function norm 1.620015186090e-02
Linear solve converged due to CONVERGED_RTOL iterations 16
Iteratively computed extreme singular values: max 0.997232 min 0.0380382 max/min 26.2166
2 SNES Function norm 7.432727594336e-04
Linear solve converged due to CONVERGED_RTOL iterations 30
Iteratively computed extreme singular values: max 0.999412 min 0.00856109 max/min 116.739
3 SNES Function norm 9.017243007464e-05
Linear solve converged due to CONVERGED_RTOL iterations 18
Iteratively computed extreme singular values: max 0.997949 min 0.0291622 max/min 34.2207
4 SNES Function norm 1.040703764240e-07
Linear solve converged due to CONVERGED_RTOL iterations 34
Iteratively computed extreme singular values: max 0.999439 min 0.00845893 max/min 118.152
5 SNES Function norm 9.317309190373e-11
Linear solve converged due to CONVERGED_RTOL iterations 33
Iteratively computed extreme singular values: max 0.999476 min 0.00910391 max/min 109.785
6 SNES Function norm 7.437332441628e-14
Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 6
[...]
This process will get a bit simpler after the upcoming releases.
@jedbrown It is unexpected... by "standard options" you meant the MFEM elasticity on CPUs, right? We need to take a closer look at what's actually running on GPUs (I guess only the solve phase) and with what parameters. I will coordinate with @victorapm. Thank you for reporting this.
"standard options" = "Default options" up top :point_up: . Adding the elasticity suite of options (my PETSc command line there is equivalent to whtat MFEM does in code) is causing more iterations and larger condition numbers. I'd expect it to give more reliable small condition numbers, even if much higher setup costs.
I noticed that you use a strength threshold of 0.5. We usually use a larger one, like 0.7 or so, for elasticity. You could try to increase it to see if that improves things.
I'm seeing smaller thresholds converging somewhat better.
-mg_coarse_pc_hypre_boomeramg_strong_threshold 0.25
0 SNES Function norm 3.435982086055e-03
Linear solve converged due to CONVERGED_RTOL iterations 19
Iteratively computed extreme singular values: max 0.997492 min 0.0354589 max/min 28.1309
1 SNES Function norm 1.619768019712e-02
Linear solve converged due to CONVERGED_RTOL iterations 15
Iteratively computed extreme singular values: max 0.99662 min 0.0449147 max/min 22.1892
2 SNES Function norm 7.429167511440e-04
Linear solve converged due to CONVERGED_RTOL iterations 24
Iteratively computed extreme singular values: max 0.999101 min 0.0250349 max/min 39.9083
3 SNES Function norm 9.096363872434e-05
Linear solve converged due to CONVERGED_RTOL iterations 18
Iteratively computed extreme singular values: max 0.997949 min 0.0297238 max/min 33.5741
4 SNES Function norm 9.978339384488e-08
Linear solve converged due to CONVERGED_RTOL iterations 32
Iteratively computed extreme singular values: max 0.999323 min 0.0105146 max/min 95.0418
5 SNES Function norm 9.615909073729e-11
Linear solve converged due to CONVERGED_RTOL iterations 29
Iteratively computed extreme singular values: max 0.999311 min 0.0106429 max/min 93.8943
6 SNES Function norm 8.476371750653e-14
Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 6
-mg_coarse_pc_hypre_boomeramg_strong_threshold 0.75
0 SNES Function norm 3.435982086055e-03
Linear solve converged due to CONVERGED_RTOL iterations 33
Iteratively computed extreme singular values: max 0.99933 min 0.0103877 max/min 96.2031
1 SNES Function norm 1.619697199026e-02
Linear solve converged due to CONVERGED_RTOL iterations 20
Iteratively computed extreme singular values: max 0.998244 min 0.0244837 max/min 40.7718
2 SNES Function norm 7.431033570042e-04
Linear solve converged due to CONVERGED_RTOL iterations 36
Iteratively computed extreme singular values: max 0.999609 min 0.00636051 max/min 157.159
3 SNES Function norm 9.051610840358e-05
Linear solve converged due to CONVERGED_RTOL iterations 29
Iteratively computed extreme singular values: max 0.999338 min 0.00640619 max/min 155.996
4 SNES Function norm 9.367396133857e-08
Linear solve converged due to CONVERGED_RTOL iterations 45
Iteratively computed extreme singular values: max 0.99985 min 0.0046344 max/min 215.745
5 SNES Function norm 9.310160280617e-11
Linear solve converged due to CONVERGED_RTOL iterations 50
Iteratively computed extreme singular values: max 0.999734 min 0.00477714 max/min 209.275
6 SNES Function norm 9.337840233173e-14
Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 6
Interesting. 0.25 is our default for HMIS or PMIS coarsening for diffusion problems.
Thanks for the instructions, @jedbrown!
I'm able to run your test case, but the number of iterations that I'm getting is different than yours. Could you double-check if my command-line options are correct? petsc-default.log
Also, I noticed that PETSc uses different defaults than hypre for BoomerAMG: Falgout-CLJP coarsening, a symmetric GS smoother, and classical modified interpolation with no truncation. I'll try our defaults and let you know the answer.
Thanks!
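PS: For concreteness, those PETSc CPU defaults correspond roughly to the hypre calls below; the numeric codes follow the usual BoomerAMG enumerations and should be double-checked against the hypre version in use.

#include "HYPRE_parcsr_ls.h"

/* Rough hypre-level equivalent of the PETSc CPU defaults described above. */
void SetPetscLikeCpuDefaults(HYPRE_Solver amg)
{
  HYPRE_BoomerAMGSetCoarsenType(amg, 6);   /* Falgout-CLJP coarsening */
  HYPRE_BoomerAMGSetRelaxType(amg, 6);     /* hybrid symmetric Gauss-Seidel */
  HYPRE_BoomerAMGSetInterpType(amg, 0);    /* classical modified interpolation */
  HYPRE_BoomerAMGSetTruncFactor(amg, 0.0); /* no truncation of interpolation */
  HYPRE_BoomerAMGSetPMaxElmts(amg, 0);     /* no cap on entries per row of P */
}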
Interesting, I see the convergence is quite different when run on CPU versus configured with HIP, because different algorithms are requested. (I need a different build of Hypre when I give it input on the device.) Comparing these two log files (small problem, run sequentially) shows how the algorithms differ.
@jedbrown, I see that PETSc changes its default options for BoomerAMG when using the device, see lines 1950-1962. Alternatively, you could match the options used for GPUs on the CPU side as well.
I tested your problem with our default CPU options for BoomerAMG: hypre-default.log
It turns out that the PETSc CPU defaults work better:
Linear solve converged due to CONVERGED_RTOL iterations 10
Linear solve converged due to CONVERGED_RTOL iterations 8
Linear solve converged due to CONVERGED_RTOL iterations 15
Linear solve converged due to CONVERGED_RTOL iterations 13
Linear solve converged due to CONVERGED_RTOL iterations 15
Linear solve converged due to CONVERGED_RTOL iterations 17
vs hypre defaults:
Linear solve converged due to CONVERGED_RTOL iterations 16
Linear solve converged due to CONVERGED_RTOL iterations 11
Linear solve converged due to CONVERGED_RTOL iterations 17
Linear solve converged due to CONVERGED_RTOL iterations 17
Linear solve converged due to CONVERGED_RTOL iterations 24
Linear solve converged due to CONVERGED_RTOL iterations 26
I'll see if I can find a more suitable configuration...
Awesome, thanks for checking it out. And please let us know if a different GPU configuration is recommended. That configuration was selected by @stefanozampini based on what worked on GPU at the time.
Jed, this is the closest GPU-friendly configuration, in terms of convergence, to the best CPU-friendly configuration of BoomerAMG for your problem:
-mg_coarse_pc_hypre_boomeramg_coarsen_type pmis
-mg_coarse_pc_hypre_boomeramg_interp_type ext+i
-mg_coarse_pc_hypre_boomeramg_no_CF
-mg_coarse_pc_hypre_boomeramg_P_max 6
-mg_coarse_pc_hypre_boomeramg_print_statistics 1
-mg_coarse_pc_hypre_boomeramg_relax_type_down Chebyshev
-mg_coarse_pc_hypre_boomeramg_relax_type_up Chebyshev
-mg_coarse_pc_hypre_boomeramg_strong_threshold 0.5
-mg_coarse_pc_type hypre
Here is the log: hypre-Pmax6-PMIS-th05-Cheb.log
And iteration counts:
Linear solve converged due to CONVERGED_RTOL iterations 13
Linear solve converged due to CONVERGED_RTOL iterations 10
Linear solve converged due to CONVERGED_RTOL iterations 15
Linear solve converged due to CONVERGED_RTOL iterations 17
Linear solve converged due to CONVERGED_RTOL iterations 19
Linear solve converged due to CONVERGED_RTOL iterations 22
The Chebyshev smoother is particularly important for improving convergence with respect to the GPU defaults set in PETSc (L1-Jacobi). Chebyshev works better than L1-Jacobi in general (right, @liruipeng?), so I would suggest updating the PETSc GPU defaults to that method (it is option 16).
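In case it helps outside of PETSc, here is a rough sketch of the same configuration as direct hypre calls; the numeric codes are the usual BoomerAMG enumerations (PMIS coarsening is 8, ext+i interpolation is 6, Chebyshev is 16) and should be double-checked against your hypre version.

#include "HYPRE_parcsr_ls.h"

/* Sketch of the GPU-friendly configuration above as direct BoomerAMG calls. */
void SetGpuFriendlyElasticityOptions(HYPRE_Solver amg)
{
  HYPRE_BoomerAMGSetCoarsenType(amg, 8);        /* PMIS coarsening */
  HYPRE_BoomerAMGSetInterpType(amg, 6);         /* extended+i interpolation */
  HYPRE_BoomerAMGSetPMaxElmts(amg, 6);          /* P_max 6 */
  HYPRE_BoomerAMGSetStrongThreshold(amg, 0.5);  /* strength threshold 0.5 */
  HYPRE_BoomerAMGSetRelaxOrder(amg, 0);         /* no C/F relaxation ordering */
  HYPRE_BoomerAMGSetCycleRelaxType(amg, 16, 1); /* Chebyshev on the down cycle */
  HYPRE_BoomerAMGSetCycleRelaxType(amg, 16, 2); /* Chebyshev on the up cycle */
}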
@victorapm Convergence-wise Chebyshev is better than L1-Jacobi in general but maybe not in terms of time-to-solution. So we still keep Jacobi as the default. For elasticity it makes sense to have a stronger smoother such as Chebyshev. I am not so sure about the overall efficiency though.
That makes sense, @liruipeng, thanks! Perhaps I was not careful enough when suggesting a change to the defaults... please disregard that, @jedbrown. Still, as Ruipeng said, I believe it makes sense to use it for elasticity problems...
@victorapm @ulrikeyang OK. Thanks! We still need to figure out why our specialized elasticity solver in BoomerAMG (which I often referred to as the BKY solver) gives worse condition numbers.
Awesome, thanks so much. We'll use this suite in the scaling studies we plan over the next week (using both Hypre and GAMG).
Sounds good, @jedbrown. Let us know the results you get. Thanks!
I'm trying to tune a BoomerAMG configuration (via PETSc) to run on HIP. I'm using hypre-2.24 and rocm-5.0.2. I believe I'm matching the MFEM elasticity configuration, as explained in this paper. I have two issues: the iteration counts and condition numbers are higher than expected and the setup time is huge compared to the default configuration.
Default options
Just HYPRE_BoomerAMGSetNumFunctions to 3 and HYPRE_BoomerAMGSetInterpVectors to the rigid body modes.

Iteration counts above are a bit higher than GAMG. Condition numbers for GAMG stay in the range 17 to 28 throughout this solve, but GAMG setup time is enough higher that Hypre has a slight edge depending on node and job launch configuration (but less robustness, as seen in the CG condition number estimates).
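In hypre terms, that two-call setup looks roughly like the following sketch; rbm[] is assumed to hold the six rigid body modes as HYPRE_ParVector objects, and an interpolation-vector variant presumably also has to be selected.

#include "HYPRE_parcsr_ls.h"

/* Sketch of the "Default options" configuration: 3 unknowns per node plus
   the rigid body modes as interpolation vectors. */
void SetDefaultElasticityOptions(HYPRE_Solver amg, HYPRE_ParVector rbm[6])
{
  HYPRE_BoomerAMGSetNumFunctions(amg, 3);
  HYPRE_BoomerAMGSetInterpVectors(amg, 6, rbm);
  /* A GM/LN interpolation variant usually also has to be chosen, e.g.
     HYPRE_BoomerAMGSetInterpVecVariant(amg, variant); */
}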
Elasticity suite
(This is the coarse problem of a matrix-free p-MG. The p-MG is very reliable. I can show the reduced, but this is what I care about and it's qualitatively the same for the linear discretization.)
Setup time is over 100x longer than seen above (I ran a much smaller problem because I'm impatient), and iteration counts and condition numbers are much higher.
Question
If it helps, my test models are scaled up variants of this sort of structure.