hypre-space / hypre

Parallel solvers for sparse linear systems featuring multigrid methods.
https://www.llnl.gov/casc/hypre/

AMG performance scaling problems #691

Closed lahwaacz closed 2 years ago

lahwaacz commented 2 years ago

Hi! I'm integrating Hypre into my PDE solver based on TNL and although it works, I'm observing quite poor performance scaling with Hypre's AMG solver. To demonstrate this, I also ran a benchmark using the ex5 example (modified to include timers around the HYPRE_ParCSRPCGSetup and HYPRE_ParCSRPCGSolve calls):

| MPI processes | Compute time | Compute speedup | Compute efficiency | Setup time | Setup speedup | Setup efficiency | Solve time | Solve speedup | Solve efficiency |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 17.74 | 1.00 | 1.00 | 7.65 | 1.00 | 1.00 | 10.09 | 1.00 | 1.00 |
| 2 | 13.46 | 1.32 | 0.66 | 7.25 | 1.06 | 0.53 | 6.21 | 1.62 | 0.81 |
| 4 | 8.62 | 2.06 | 0.51 | 5.06 | 1.51 | 0.38 | 3.56 | 2.84 | 0.71 |
| 6 | 5.85 | 3.03 | 0.51 | 3.41 | 2.25 | 0.37 | 2.45 | 4.13 | 0.69 |
| 8 | 4.81 | 3.69 | 0.46 | 2.33 | 3.28 | 0.41 | 2.48 | 4.07 | 0.51 |
| 12 | 3.69 | 4.81 | 0.40 | 1.79 | 4.28 | 0.36 | 1.90 | 5.30 | 0.44 |
| 24 | 2.01 | 8.83 | 0.37 | 0.88 | 8.69 | 0.36 | 1.13 | 8.93 | 0.37 |

The example was run with -solver 1 -n 3000, i.e. AMG-PCG with 9e6 unknowns. The system is a dual-socket node with 2x Intel Xeon Gold 6146 (12 cores, 3.2-4.2 GHz, 24.75 MiB L3 cache, 6x 21 GB/s memory throughput). The MPI processes are mapped by core, i.e. the rows with 1-12 processes use only one CPU and only the last row with 24 processes uses both CPUs.

The "Compute" column is a sum of "Hypre setup" (HYPRE_ParCSRPCGSetup) and "Hypre solve" (HYPRE_ParCSRPCGSolve), i.e. only the solution is measured (not the problem setup). You can see from the table that the setup does not scale at all as the efficiency drops to about 50% going from 1 to 2 processes and the solve function also has rather poor efficiency (given the 6-channel memory on the system, I was expecting to get at least 80% for 6 processes). Is this expected? Any ideas how to investigate this and improve it?

lahwaacz commented 2 years ago

One suspicious thing I noticed is the number of iterations:

| MPI processes | Iterations |
|---|---|
| 1 | 6 |
| 2 | 7 |
| 4 | 8 |
| 6 | 8 |
| 8 | 9 |
| 12 | 9 |
| 24 | 10 |

So for 24 processes there are almost twice as many iterations as for 1 process. What needs to be changed in ex5 to avoid this? Also, this does not explain the scaling problems with the AMG setup.

ulrikeyang commented 2 years ago

Those numbers look OK to me. Generally, going from 1 process to 2 generates a lot of communication overhead; after that things get better. Also, regarding the convergence: note that the smoother changes when you increase the number of processes, since it is basically an inexact block Gauss-Seidel with the block size decreasing as you increase the number of processes. So you can end up with different convergence.

rfalgout commented 2 years ago

I agree with @ulrikeyang . Also, note that efficiency can be a red herring; I can make any solver more efficient by adding lots of extra local computation and ultimately making it slower. These efficiencies don't look out of the ballpark to me. Note too that for this example problem, a very simple parallel decomposition is used just to make the code simple to follow. This decomposition doesn't have the best surface-to-volume ratio, so it will be less efficient than the best distribution. Also, hypre targets huge problems on massively parallel architectures (billions of cores on the current generation of machines), so it is more important to look at weak scaling. There is lots more to say about this topic, but the bottom line is that what you are seeing does not look unreasonable.

Let us know if you have other questions. Thanks!

lahwaacz commented 2 years ago

Hi, thanks for your responses. I see that reproducing with ex5 was not a good idea, as the problems I'm actually dealing with are very different. I'm trying to use Hypre for linear systems arising from MHFEM discretizations of a parabolic PDE on a triangular or tetrahedral mesh. You can find details on the scheme in this paper. Since @rfalgout mentioned weak scaling: this is actually quite problematic to do for the whole PDE problem, because refining the mesh means that the time step also needs to be decreased, and that is the main cause of the increase in the total number of iterations. On the other hand, decreasing the time step also means that the previous state should be "closer" to the next state, so it is quite a good initial guess for iterative solvers. It's funny that this makes even simple Jacobi-preconditioned BiCGstab quite competitive.

So far I'm using AMG-preconditioned BiCGstab and I'm setting these options for the preconditioner (some are just defaults):

```c
HYPRE_BoomerAMGSetCoarsenType( precond, 10 );       /* 10 = HMIS coarsening */
HYPRE_BoomerAMGSetRelaxType( precond, 8 );          /* 8 = l1-scaled hybrid symmetric Gauss-Seidel smoother */
HYPRE_BoomerAMGSetNumSweeps( precond, 1 );          /* one smoothing sweep per level */
HYPRE_BoomerAMGSetStrongThreshold( precond, 0.5 );  /* strength-of-connection threshold */
HYPRE_BoomerAMGSetInterpType( precond, 6 );         /* 6 = extended+i interpolation */
HYPRE_BoomerAMGSetPMaxElmts( precond, 4 );          /* max nonzeros per row of interpolation */
```

I also set HYPRE_BoomerAMGSetNumFunctions( precond, 2 ), since I'm solving a system of 2 PDEs (and the dofs are mapped correctly).

My current 3D benchmark problem has about 7.8M unknowns, and with Jacobi-preconditioned BiCGstab the linear systems converge after 530 iterations on average. With BoomerAMG the iteration count drops to about 16, but the computational time on one node (24 cores) improved by only a factor of 1.5, so I was expecting more.

Do you have some advice on which options I should try to play with? I can reuse the preconditioner for several time steps, so the setup cost is not important. I think a good strategy would be to try to reduce the number of iterations further, or to reduce the cost per iteration (perhaps at the cost of more iterations).
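For illustration, the reuse pattern could look roughly like this (a sketch assuming hypre's ParCSR BiCGSTAB wrapper is used; `assemble_system`, `rebuild_every`, `nsteps`, and the solver/matrix/vector names are hypothetical):

```c
/* Sketch: reuse the AMG hierarchy built in Setup for several time steps and only
   rebuild it occasionally; the previous state in par_x serves as the initial guess. */
for (int step = 0; step < nsteps; step++)
{
   assemble_system(step, &parcsr_A, &par_b);   /* hypothetical application routine */

   if (step % rebuild_every == 0)
   {
      /* rebuild the BoomerAMG hierarchy for the current matrix */
      HYPRE_ParCSRBiCGSTABSetup(solver, parcsr_A, par_b, par_x);
   }

   HYPRE_ParCSRBiCGSTABSolve(solver, parcsr_A, par_b, par_x);
}
```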

As my problem size is probably nowhere near the target for Hypre, do you have some numbers to compare against? How many unknowns do you usually keep per node? Also, at which point can I expect GPU acceleration to have an effect (and multi-GPU to be faster than a single GPU)?

victorapm commented 2 years ago

> With BoomerAMG the iteration count drops to about 16, but the computational time on one node (24 cores) improved by only a factor of 1.5, so I was expecting more

Can you reduce the problem size to about 100k unknowns and solve it on a single core? What would be the speedup of BoomerAMG-BiCGstab with respect to Jacobi-BiCGstab?

> Do you have some advice on which options I should try to play with?

You could try turning on aggressive coarsening with HYPRE_BoomerAMGSetAggNumLevels(amg_precond, 1);, which reduces the cost of applying BoomerAMG to a vector. If you use HYPRE_BoomerAMGSetPrintLevel(amg_precond, 1);, it will give you some information about the multigrid hierarchy. Reducing the grid/operator complexities will result in faster AMG during the solve phase, but in general that can degrade convergence, so there is a trade-off.
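Added to the preconditioner setup shown earlier in this thread (where the object is called `precond`), that would be a minimal sketch like:

```c
/* Sketch of the additions suggested above. */
HYPRE_BoomerAMGSetAggNumLevels( precond, 1 );  /* aggressive coarsening on the finest level */
HYPRE_BoomerAMGSetPrintLevel( precond, 1 );    /* print setup info, incl. grid/operator complexities */
```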

> How many unknowns do you usually keep per node?

That depends a lot on the problem (stencil size, grid partitioning...) and machine architecture (number of physical cores, memory bandwidth...). In general, 100k unknowns/core is fine. When using GPUs, something closer to 1M unknowns per GPU is better to achieve higher throughput.

lahwaacz commented 2 years ago

Hi @victorapm, thanks for the response! I've been playing with the options in the meantime and got to the point where BoomerAMG-BiCGstab is about 4.9x faster than Jacobi-BiCGstab for the aforementioned benchmark problem (7.8M unknowns, 24 cores on 1 node). The HYPRE_BoomerAMGSetAggNumLevels( precond, 1 ); option indeed helped quite a bit in terms of compute time. I also found that HYPRE_BoomerAMGSetStrongThreshold( precond, 0.25 ); works better for this problem even in 3D, and HYPRE_BoomerAMGSetTruncFactor( precond, 0.3 ); also seems to bring some improvement (but only with HMIS coarsening on the host; with PMIS coarsening on the GPU it may actually destroy convergence).
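Collected together, the tuned host-side (HMIS) settings described above would presumably look like this (a sketch; only the options mentioned in this thread are shown):

```c
/* Sketch of the tuned BoomerAMG preconditioner options described above (host / HMIS). */
HYPRE_BoomerAMGSetCoarsenType( precond, 10 );        /* HMIS coarsening */
HYPRE_BoomerAMGSetAggNumLevels( precond, 1 );        /* aggressive coarsening on the finest level */
HYPRE_BoomerAMGSetStrongThreshold( precond, 0.25 );  /* worked better than 0.5 here, even in 3D */
HYPRE_BoomerAMGSetTruncFactor( precond, 0.3 );       /* interpolation truncation factor */
```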

> Can you reduce the problem size to about 100k unknowns and solve it on a single core? What would be the speedup of BoomerAMG-BiCGstab with respect to Jacobi-BiCGstab?

I have even smaller problems. For about 120k unknowns, BoomerAMG-BiCGstab is about 1.9x faster than Jacobi-BiCGstab (the latter converges after about 160 iterations; BoomerAMG needs about 11). But it's a discretization on a much coarser mesh, so the matrix properties and iteration counts are different from the large problem.

So I think the current performance is satisfactory. Compared to Jacobi, BoomerAMG has a much better convergence rate, and that's ultimately the goal. I think my questions were answered and this issue can be closed, though I would welcome any further suggestions on where to look for performance benefits. Thanks!