hypre-space / hypre

Parallel solvers for sparse linear systems featuring multigrid methods.
https://www.llnl.gov/casc/hypre/

Question: best way to achieve OpenMP parallelization #564

Open mfolusiak opened 2 years ago

mfolusiak commented 2 years ago

I am considering HYPRE as a replacement for the in-house linear solver we have now in our commercial CFD code, FLACS-CFD. The reason for the change is that we would like to implement AMR in the solver, and SSTRUCT seems perfect for that purpose. Our software targets desktop PCs and single nodes on clusters, so we have never needed to implement MPI; OpenMP parallelization is sufficient at that scale. I noticed that the user manual states that most of the solvers lack OpenMP parallelization, and indeed I haven't found any OMP pragmas in the implementation of my solver of interest, BiCGSTAB, or its preconditioners. I would therefore like to ask for your advice.

rfalgout commented 2 years ago

Hi @mfolusiak . We do support OpenMP parallelism in hypre, and I think that's available for most of the solvers. Can you point to where we say this in the manual:

https://hypre.readthedocs.io/en/latest/

Can you tell us more about the in-house solver you use? Thanks!

mfolusiak commented 2 years ago

Hi @rfalgout , I found it on the intro page: https://github.com/hypre-space/hypre/blob/master/src/docs/usr-manual/ch-intro.rst#installing-hypre

Configuration of hypre with threads requires an implementation of OpenMP. Currently, only a subset of hypre is threaded.

Compiling with HYPRE_WITH_OPENMP=ON didn't seem to have any effect on performance in my initial tests, so I assumed my solver is not parallelized and asked this question. In the meantime, I found today that the parallelization is apparently realized through an abstraction called Kokkos. Are any additional libraries or configuration needed for it to work? I think this is worth mentioning in the manual; the OpenMP parallelization is a great asset.
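For reference, the build step mentioned above can be sketched as a CMake configure (HYPRE_WITH_OPENMP is the flag named in this thread; the paths, build type, and thread count below are placeholders for your environment):

```shell
# Configure and build hypre with OpenMP enabled (out-of-source build).
cd hypre/src
cmake -B build -DHYPRE_WITH_OPENMP=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build

# Thread count at run time is controlled by the standard OpenMP variable:
export OMP_NUM_THREADS=16
```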

The in-house linear solver we are using now is a block-structured BiCGSTAB with an ILU preconditioner. We use a 7-diagonal format to specify the A array. Thank you for your help!

rfalgout commented 2 years ago

Hi @mfolusiak . BiCGSTAB should work fine with OpenMP (Kokkos isn't required to use OpenMP in hypre). ILU may be a problem. Maybe @liruipeng or @oseikuffuor1 can comment on that.

sgthomas-github commented 2 years ago

Thanks @rfalgout - one follow-up question: is this supported on Windows as well? I ask because I noted the stable manual states that OpenMP standard >= 4.5 is required, whereas we compile our application with MS Visual Studio 2019, and I believe MSVC is not quite there yet; it isn't entirely clear, but it looks like it supports OpenMP standard 3.0, or actually even only 2.0.

https://devblogs.microsoft.com/cppblog/improved-openmp-support-for-cpp-in-visual-studio/

I read elsewhere that some options include building with Clang (which now ships with MS VS) or building with MinGW. Suppose we do the latter (build/install HYPRE with MinGW): can we still use the OpenMP directives in our application and build it with VS (given it is using the older standard)? Will that work? I'd appreciate any comment on that, and in general on the recommended option, if any, for Windows.

rfalgout commented 2 years ago

Hi @sgthomas-github . OpenMP standard 3.0 will work fine for CPU code. The 4.5 version is one approach for running on GPUs (Struct interface only), but not the recommended way to run on them anyway. Hope this helps!

sgthomas-github commented 2 years ago

Thanks @rfalgout et al. As an update, I tested the latest hypre-2.24.0, configured/compiled with and without OpenMP enabled respectively (our application is C++ built with MSVC on VS 2019). Both installs work OK and match expected results on small controlled models, but they fail on a complex, extremely large (~53MM cells), extremely heterogeneous model (stiffness matrix diagonal entries contrast as high as 1.0e13). Both reach the maximum of 1000 iterations; the OpenMP install also reports NaNs and extremely large residuals, while the serial install simply maxes out the iterations and never gets below the tolerance of 1.0e-08 (although it at least remains close, slightly higher). However, the older version I had, 2.11.1 serial (without OpenMP), runs OK and converges in about 25 iterations on the complex model.

What could cause the change in behavior from the older version? I have been using BoomerAMG as the solver all along, but I noted that some of its defaults have changed since the prior version (e.g. parallel coarsening strategy, interpolation, and relaxation order). I am not sure how critical my choices were (6 for parallel coarsening, relax type set to 3, nothing set for relax order or interpolation - so those should default, I guess - num sweeps set to 1, max levels set to 20) or whether they are contributing to the failure to converge; I am going to test next with the recommended defaults for 3-d. It is also unclear what specific options, if any, I should use when testing with OpenMP - or are they the same as documented? Finally, are there any solver/preconditioner recommendations beyond BoomerAMG for extremely large, extremely heterogeneous elliptic (3-d diffusion) models? Thank you.

rfalgout commented 2 years ago

Hi @sgthomas-github . I'm pretty sure the default BoomerAMG parameters have indeed changed since 2.11.1, so that's definitely the first thing you should try to match. The OpenMP behavior will be different, but hopefully there are parameter choices that can also be made to work well for your problems. Once you've done more testing, send us the output generated by setting the print level to 1 or higher (call HYPRE_BoomerAMGSetPrintLevel()). The output will have information like this:

BoomerAMG SETUP PARAMETERS:

 Max levels = 25
 Num levels = 5

 Strength Threshold = 0.250000
 Interpolation Truncation Factor = 0.000000
 Maximum Row Sum Threshold for Dependency Weakening = 1.000000

 Coarsening Type = HMIS
 measures are determined locally

 No global partition option chosen.

 Interpolation = extended+i interpolation

Operator Matrix Information:

             nonzero            entries/row          row sums
lev    rows  entries sparse   min  max     avg      min         max
======================================================================
  0    1000     6400  0.006     4    7     6.4   0.000e+00   3.000e+00
  1     500     7248  0.029     7   17    14.5   0.000e+00   4.000e+00
  2      99     3003  0.306    15   43    30.3   1.041e-02   5.319e+00
  3      14      188  0.959    11   14    13.4   5.274e+00   1.007e+01
  4       4       16  1.000     4    4     4.0   7.597e+00   9.196e+00

Interpolation Matrix Information:
                    entries/row        min        max            row sums
lev  rows x cols  min  max  avgW     weight      weight       min         max
================================================================================
  0  1000 x 500     1    4   4.0   1.667e-01   2.500e-01   5.000e-01   1.000e+00
  1   500 x 99      1    4   4.0   1.301e-02   3.547e-01   2.164e-01   1.000e+00
  2    99 x 14      1    4   4.0   1.247e-03   3.928e-01   2.865e-02   1.000e+00
  3    14 x 4       1    4   3.6  -6.320e-02   6.629e-02  -6.121e-02   1.000e+00

     Complexity:    grid = 1.617000
                operator = 2.633594
                memory = 3.350625

And similar information for the solver parameters. This will help us to figure out how best to help you. Thanks!
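The print-level call named above is a single API call. A hedged sketch of where it fits; `HYPRE_BoomerAMGCreate` is the standard hypre call for constructing the solver object, and the rest of the setup is schematic:

```c
#include "HYPRE.h"
#include "HYPRE_parcsr_ls.h"

/* Sketch: create a BoomerAMG solver object and turn on diagnostic output.
   Per the reply above, a print level of 1 or higher produces the
   SETUP PARAMETERS tables shown in this thread. */
void create_amg_with_logging(HYPRE_Solver *solver)
{
   HYPRE_BoomerAMGCreate(solver);
   HYPRE_BoomerAMGSetPrintLevel(*solver, 1);
}
```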

sgthomas-github commented 2 years ago

Thanks @rfalgout et al, update fyi -

After I used the recommended defaults for 3-d elliptic (diffusion) problems, the BoomerAMG solver is now converging. Specifically, I used method 10 (HMIS) for the coarsening type, method 6 (extended+i) for the interpolation type, and a truncation factor of 5 (perhaps it could be reduced to 4?) via the HYPRE_BoomerAMGSetTruncFactor API - though I don't know why it reports as 0. For smoothing I used 13 for the down cycle, 14 for the up cycle, and 9 for the coarsest level, all via the HYPRE_BoomerAMGSetCycleRelaxType API, and a strength threshold of 0.5 (I was using 0.25 before). It wasn't clear to me how/whether to set smoothing on the fine grid - I only set it on the down cycle, up cycle, and coarsest level. Here is a copy-paste of the AMG solver parameters output with my settings for 3 different solves:

I-dir solve:

BoomerAMG SETUP PARAMETERS:

 Max levels = 20
 Num levels = 14

 Strength Threshold = 0.500000
 Interpolation Truncation Factor = 0.000000
 Maximum Row Sum Threshold for Dependency Weakening = 0.900000

 Coarsening Type = HMIS
 measures are determined locally

 No global partition option chosen.

 Interpolation = extended+i interpolation

Operator Matrix Information:

              nonzero             entries/row          row sums
lev     rows   entries sparse   min  max     avg      min         max
========================================================================
  0 52585034 367001350  0.000     3   16     7.0  -3.789e-03   2.366e-12
  1 26055591 381547697  0.000     3   48    14.6  -6.679e-03   2.453e-12
  2 12620819 205386377  0.000     3   88    16.3  -1.030e-02   4.098e-12
  3  5789306 173898730  0.000     3  189    30.0  -2.039e-02   1.219e-05
  4  2406096 123201954  0.000     4  313    51.2  -6.031e-02   8.485e-07
  5   664071  57692083  0.000     5  407    86.9  -1.572e-01   2.372e-04
  6   159206  19741416  0.001     5  463   124.0  -4.319e-01   4.925e-02
  7    36098   5190622  0.004    11  518   143.8  -6.580e-01   4.754e-02
  8     7671    910105  0.015    17  380   118.6  -1.049e+00   1.970e-01
  9     1507    114827  0.051    19  213    76.2  -1.090e+00   1.570e+00
 10      331     17379  0.159    13  112    52.5  -1.963e+03   1.188e+03
 11       78      2622  0.431    10   50    33.6  -1.388e+00   4.481e-03
 12       24       420  0.729    11   23    17.5  -1.573e+00   2.681e-06
 13        6        34  0.944     5    6     5.7  -1.757e+00  -7.207e-02

Interpolation Matrix Information:
                          entries/row        min        max            row sums
lev     rows x cols     min  max  avgW     weight      weight       min         max
======================================================================================
  0 52585034 x 26055591   1    4   2.0   2.763e-02   1.000e+00   8.121e-01   1.000e+00
  1 26055591 x 12620819   1    4   2.0   2.129e-02   1.000e+00   4.826e-01   1.000e+00
  2 12620819 x 5789306    1    4   3.3  -5.985e-01   1.017e+00   3.801e-01   1.000e+00
  3  5789306 x 2406096    1    4   3.1  -5.164e+00   6.708e+00   8.982e-02   1.001e+00
  4  2406096 x 664071     1    4   3.7  -9.342e+01   1.018e+02   4.476e-02   1.000e+00
  5   664071 x 159206     1    4   3.6  -2.540e+02   2.067e+02  -7.027e-02   1.028e+00
  6   159206 x 36098      0    4   3.3  -6.946e+03   3.366e+03  -4.769e-01   1.459e+00
  7    36098 x 7671       0    4   3.1  -1.159e+03   1.264e+03  -2.067e+00   2.616e+00
  8     7671 x 1507       0    4   2.6  -6.005e+01   1.144e+02  -1.383e-01   1.292e+00
  9     1507 x 331        1    4   2.5  -5.542e+02   9.167e+02   7.942e-02   7.183e+00
 10      331 x 78         0    4   2.3  -8.998e-01   1.049e+00  -4.381e-01   1.049e+00
 11       78 x 24         1    4   2.5   4.275e-02   1.000e+00   2.690e-01   1.003e+00
 12       24 x 6          1    3   1.9   1.276e-01   1.000e+00   4.619e-01   1.000e+00

     Complexity:    grid = 1.907878
                operator = 3.636787
                memory = 4.098985

BoomerAMG SOLVER PARAMETERS:

  Maximum number of cycles:         1000
  Stopping Tolerance:               1.000000e-08
  Cycle type (1 = V, 2 = W, etc.):  1

  Relaxation Parameters:
   Visiting Grid:                     down   up  coarse
            Number of sweeps:            1    1     1
   Type 0=Jac, 3=hGS, 6=hSGS, 9=GE:     13   14     9
   Point types, partial sweeps (1=C, -1=F):
                  Pre-CG relaxation (down):   0
                   Post-CG relaxation (up):   0
                             Coarsest grid:   0

J-dir solve:

BoomerAMG SETUP PARAMETERS:

 Max levels = 20
 Num levels = 14

 Strength Threshold = 0.500000
 Interpolation Truncation Factor = 0.000000
 Maximum Row Sum Threshold for Dependency Weakening = 0.900000

 Coarsening Type = HMIS
 measures are determined locally

 No global partition option chosen.

 Interpolation = extended+i interpolation

Operator Matrix Information:

              nonzero             entries/row          row sums
lev     rows   entries sparse   min  max     avg      min         max
========================================================================
  0 52585034 367001350  0.000     3   16     7.0  -3.791e-03   2.366e-12
  1 26055591 381547697  0.000     3   48    14.6  -6.658e-03   2.453e-12
  2 12620743 205369991  0.000     3   88    16.3  -1.085e-02   4.098e-12
  3  5788682 173976184  0.000     3  189    30.1  -1.857e-02   2.057e-05
  4  2401772 122503076  0.000     4  313    51.0  -1.935e-02   5.924e-09
  5   664250  58033856  0.000     5  410    87.4  -3.459e-02   4.735e-04
  6   160808  19990486  0.001     5  465   124.3  -1.339e+00   5.667e-01
  7    36649   5330397  0.004    12  471   145.4  -3.959e-01   1.706e-01
  8     7825    947915  0.015    15  390   121.1  -2.720e-01   9.136e-02
  9     1566    121970  0.050    13  203    77.9  -7.766e+00   2.506e+00
 10      324     16592  0.158    12  111    51.2  -3.334e+02   2.209e+02
 11       77      2475  0.417     6   58    32.1  -3.979e+00   3.089e-01
 12       25       407  0.651     5   22    16.3  -9.450e-01  -2.946e-02
 13        6        36  1.000     6    6     6.0  -1.776e+00  -5.459e-01

Interpolation Matrix Information:
                          entries/row        min        max            row sums
lev     rows x cols     min  max  avgW     weight      weight       min         max
======================================================================================
  0 52585034 x 26055591   1    4   2.0   2.763e-02   1.000e+00   8.121e-01   1.000e+00
  1 26055591 x 12620743   1    4   2.0   2.129e-02   1.000e+00   4.860e-01   1.000e+00
  2 12620743 x 5788682    1    4   3.3  -5.985e-01   1.017e+00   3.800e-01   1.000e+00
  3  5788682 x 2401772    1    4   3.1  -1.377e+00   1.890e+00   7.853e-02   1.000e+00
  4  2401772 x 664250     1    4   3.7  -9.342e+01   1.018e+02   4.200e-02   1.000e+00
  5   664250 x 160808     1    4   3.6  -2.375e+02   1.979e+02  -2.010e-01   1.039e+00
  6   160808 x 36649      0    4   3.4  -4.545e+02   2.663e+02  -1.412e+00   1.271e+00
  7    36649 x 7825       0    4   3.1  -1.277e+02   1.508e+02  -5.011e-01   2.128e+00
  8     7825 x 1566       0    4   2.7  -6.585e+01   5.626e+01  -4.549e+00   2.906e+00
  9     1566 x 324        0    4   2.5  -3.428e+01   5.248e+01   0.000e+00   9.914e+00
 10      324 x 77         0    4   2.2   2.997e-02   1.500e+00   0.000e+00   1.500e+00
 11       77 x 25         1    4   2.5   3.998e-02   9.814e-01   1.537e-01   1.008e+00
 12       25 x 6          1    3   1.9   1.000e-01   9.409e-01   3.100e-01   1.000e+00

     Complexity:    grid = 1.907831
                operator = 3.637159
                memory = 4.099335

BoomerAMG SOLVER PARAMETERS:

  Maximum number of cycles:         1000
  Stopping Tolerance:               1.000000e-08
  Cycle type (1 = V, 2 = W, etc.):  1

  Relaxation Parameters:
   Visiting Grid:                     down   up  coarse
            Number of sweeps:            1    1     1
   Type 0=Jac, 3=hGS, 6=hSGS, 9=GE:     13   14     9
   Point types, partial sweeps (1=C, -1=F):
                  Pre-CG relaxation (down):   0
                   Post-CG relaxation (up):   0
                             Coarsest grid:   0

K-dir solve:

BoomerAMG SETUP PARAMETERS:

 Max levels = 20
 Num levels = 14

 Strength Threshold = 0.500000
 Interpolation Truncation Factor = 0.000000
 Maximum Row Sum Threshold for Dependency Weakening = 0.900000

 Coarsening Type = HMIS
 measures are determined locally

 No global partition option chosen.

 Interpolation = extended+i interpolation

Operator Matrix Information:

              nonzero             entries/row          row sums
lev     rows   entries sparse   min  max     avg      min         max
========================================================================
  0 52585034 367001350  0.000     3   16     7.0  -2.885e+02   2.366e-12
  1 26055591 381547697  0.000     3   48    14.6  -3.388e+02   2.453e-12
  2 12619201 203290507  0.000     3   88    16.1  -2.963e+02   4.098e-12
  3  5768656 172486924  0.000     3  189    29.9  -3.246e+02   1.946e-05
  4  2401754 122255830  0.000     4  313    50.9  -1.630e+02   5.434e-04
  5   668852  57440306  0.000     5  422    85.9  -1.201e+02   2.037e-01
  6   162105  19646451  0.001     6  489   121.2  -5.266e+01   2.527e+00
  7    36493   5102049  0.004     9  526   139.8  -9.439e+01   4.897e+01
  8     7616    842690  0.015    11  324   110.6  -2.985e+04   6.710e+03
  9     1458    101876  0.048    16  173    69.9  -7.759e+04   5.186e+01
 10      309     13349  0.140     8   86    43.2  -4.424e+00   3.433e-01
 11       80      2140  0.334     8   51    26.8  -2.236e+00  -7.240e-02
 12       22       280  0.579     7   17    12.7  -2.306e+00  -6.305e-01
 13        8        48  0.750     4    8     6.0  -2.568e+00  -1.386e+00

Interpolation Matrix Information:
                          entries/row        min        max            row sums
lev     rows x cols     min  max  avgW     weight      weight       min         max
======================================================================================
  0 52585034 x 26055591   1    4   2.0   2.763e-02   1.000e+00   3.332e-01   1.000e+00
  1 26055591 x 12619201   1    4   2.0   2.129e-02   1.000e+00   1.542e-01   1.000e+00
  2 12619201 x 5768656    1    4   3.3  -6.033e-01   1.017e+00   7.873e-02   1.000e+00
  3  5768656 x 2401754    0    4   3.1  -5.164e+00   6.708e+00   0.000e+00   1.003e+00
  4  2401754 x 668852     0    4   3.7  -4.201e+01   3.780e+01   0.000e+00   1.393e+00
  5   668852 x 162105     0    4   3.6  -6.293e+02   1.831e+03  -1.491e+00   6.027e+00
  6   162105 x 36493      0    4   3.3  -6.248e+02   2.604e+02  -4.586e+01   6.053e+00
  7    36493 x 7616       0    4   3.0  -4.818e+02   1.221e+03  -4.547e+00   5.583e+02
  8     7616 x 1458       0    4   2.6  -6.254e+01   6.791e+01  -1.454e+00   6.791e+01
  9     1458 x 309        0    4   2.3  -2.727e+00   1.390e+00  -1.818e+00   1.446e+00
 10      309 x 80         0    4   2.2  -2.398e-01   9.062e-01  -1.310e-01   1.000e+00
 11       80 x 22         0    4   1.6   2.241e-02   6.748e-01   0.000e+00   1.000e+00
 12       22 x 8          0    3   1.3   4.289e-02   3.874e-01   0.000e+00   1.000e+00

     Complexity:    grid = 1.907523
                operator = 3.623233
                memory = 4.085052

BoomerAMG SOLVER PARAMETERS:

  Maximum number of cycles:         1000
  Stopping Tolerance:               1.000000e-08
  Cycle type (1 = V, 2 = W, etc.):  1

  Relaxation Parameters:
   Visiting Grid:                     down   up  coarse
            Number of sweeps:            1    1     1
   Type 0=Jac, 3=hGS, 6=hSGS, 9=GE:     13   14     9
   Point types, partial sweeps (1=C, -1=F):
                  Pre-CG relaxation (down):   0
                   Post-CG relaxation (up):   0
                             Coarsest grid:   0

However, I noted it takes 61, 79, and 54 iterations respectively for the three solves above, whereas 2.11.1 with my earlier settings (as mentioned in my previous post) takes about 25 iterations on average and is slightly faster in total time. From reviewing the logs above, is there anything that stands out that I am still not doing or missing, which could potentially reduce the number of iterations or solver time further (the inputs to both are exactly identical)?

Next I plan to test with OpenMP using the above settings.

ulrikeyang commented 2 years ago

The trunc factor should be a number between 0 and 1, so 5 would be set to 0. I think you actually wanted to set PMaxElmts, which sets the maximum number of nonzeros per row in the interpolation matrix. There, 4 or 5 is a good number, but 4 is the default, and it clearly was used in your run. A strength threshold of 0.25 should be fine for use with HMIS. The slowdown in iterations is concerning, but I am not sure what you set before, and also how long it actually took. The previous default settings generally lead to larger complexities with faster convergence but slower iteration times. It looks like you have some nasty interpolation weights. You could try setting InterpType to 17 or 18, which gives a newer, possibly better formulation of the interpolation operator, and see whether that improves convergence.

sgthomas-github commented 2 years ago

Thanks @ulrikeyang et al for that correction. Also, on closer examination, the 2.11.1 install, although it took fewer iterations for each solve, actually took longer in solver time for the I- and J-direction solves, and slightly shorter for the K-direction solves. The solver output for a typical solve (e.g. I-dir) in the 2.11.1 test looks as below:

BoomerAMG SETUP PARAMETERS:

 Max levels = 20
 Num levels = 20

 Strength Threshold = 0.250000
 Interpolation Truncation Factor = 0.000000
 Maximum Row Sum Threshold for Dependency Weakening = 0.900000

 Coarsening Type = Falgout-CLJP
 measures are determined locally

 Interpolation = modified classical interpolation

Operator Matrix Information:

            nonzero         entries per row        row sums
lev   rows  entries  sparse  min  max   avg       min         max
===================================================================
 0 52585034 367001350  0.000     3   16   7.0  -3.789e-03   2.366e-12
 1 26319841 380127341  0.000     3   46  14.4  -6.679e-03   2.714e-12
 2 13447013 581127639  0.000     3   81  43.2  -1.040e-02   2.665e-12
 3 5636584 419268200  0.000     4  212  74.4  -1.725e-02   4.347e-12
 4 2562220 375204122  0.000     5  469  146.4  -3.162e-02   4.533e-12
 5 1215332 326926178  0.000     6 1050  269.0  -6.512e-02   8.842e-05
 6  571708 252852046  0.001     7 1616  442.3  -8.126e-02   1.264e-07
 7  256760 165228956  0.003     9 2314  643.5  -1.261e-01   5.165e-12
 8  111827 90826635  0.007    16 2744  812.2  -1.682e-01   5.541e-12
 9   47253 40087457  0.018    15 2532  848.4  -3.317e-01   6.910e-12
10   19247 14491551  0.039    15 2189  752.9  -5.992e-01   8.990e-12
11    7925  5110295  0.081    30 1959  644.8  -9.279e-01   9.599e-12
12    3587  2258269  0.176    76 1851  629.6  -8.114e-01   1.004e-11
13    1766  1028498  0.330    52 1176  582.4  -7.884e-01   5.714e-12
14     892   386634  0.486    30  681  433.4  -4.166e-01   5.291e-05
15     393    99055  0.641    29  347  252.0  -4.724e-01   0.000e+00
16     169    22801  0.798    27  168  134.9  -9.272e-01   0.000e+00
17      60     3372  0.937    38   60  56.2  -1.028e+00   0.000e+00
18      20      400  1.000    20   20  20.0  -1.104e+00   0.000e+00
19       6       36  1.000     6    6   6.0  -4.855e-01   0.000e+00

Interpolation Matrix Information:
                 entries/row    min     max         row sums
lev  rows cols    min max     weight   weight     min       max
=================================================================
 0 52585034 x 26319841   1  11   4.460e-02 1.000e+00 8.430e-01 1.000e+00
 1 26319841 x 13447013   1  12   3.337e-02 1.000e+00 6.404e-01 1.000e+00
 2 13447013 x 5636584   1  16   2.127e-02 1.000e+00 4.350e-01 1.000e+00
 3 5636584 x 2562220   1  26   7.079e-03 1.000e+00 1.037e-01 1.000e+00
 4 2562220 x 1215332   1  35   4.025e-03 1.000e+00 5.754e-02 1.000e+00
 5 1215332 x 571708   1  38   3.113e-03 1.000e+00 6.017e-02 1.010e+00
 6 571708 x 256760   0  46   4.535e-03 1.000e+00 0.000e+00 1.000e+00
 7 256760 x 111827   0  46   3.612e-03 1.000e+00 0.000e+00 1.000e+00
 8 111827 x 47253   0  47   4.794e-03 1.000e+00 0.000e+00 1.000e+00
 9 47253 x 19247   0  45   4.772e-03 1.000e+00 0.000e+00 1.000e+00
10 19247 x 7925    0  43   5.168e-03 1.000e+00 0.000e+00 1.000e+00
11  7925 x 3587    0  29   4.716e-03 1.000e+00 0.000e+00 1.000e+00
12  3587 x 1766    0  29   1.077e-02 1.000e+00 0.000e+00 1.000e+00
13  1766 x 892     1  22   1.889e-02 1.000e+00 1.080e-01 1.000e+00
14   892 x 393     1  13   3.265e-02 1.000e+00 5.206e-01 1.000e+00
15   393 x 169     1  14   2.649e-02 1.000e+00 1.151e-01 1.000e+00
16   169 x 60      1   7   2.438e-02 1.000e+00 9.336e-02 1.000e+00
17    60 x 20      1   4   7.697e-02 1.000e+00 4.691e-01 1.000e+00
18    20 x 6       1   3   2.029e-01 9.997e-01 2.029e-01 1.000e+00

     Complexity:    grid = 1.954694
                operator = 8.234441
                memory = 8.836133

BoomerAMG SOLVER PARAMETERS:

  Maximum number of cycles:         1000
  Stopping Tolerance:               1.000000e-08
  Cycle type (1 = V, 2 = W, etc.):  1

  Relaxation Parameters:
   Visiting Grid:                     down   up  coarse
            Number of sweeps:            1    1     1
   Type 0=Jac, 3=hGS, 6=hSGS, 9=GE:      3    3     9
   Point types, partial sweeps (1=C, -1=F):
                  Pre-CG relaxation (down):   1  -1
                   Post-CG relaxation (up):  -1   1
                             Coarsest grid:   0

Anyway, I'll ignore that for now, since 2.24.0 is faster 2 out of 3 times. However, I did one more test on 2.24.0, restoring the strength threshold to 0.25 (the default) and removing the incorrect truncation factor, to see if that recovers the older version's trends more closely. I see a mixed trend (e.g. the I-dir solve took 81 iterations compared to 61 before and was slower by ~500s, but the J-dir solve took fewer iterations at 40 and was faster by ~500s), so it's a mixed bag; it's hard to tell why the old settings work on 2.11.1 and not on the latest, whereas the new ones work on 2.24.0. I guess I'll keep the strength threshold at 0.5, since that is the recommendation for 3-d, before trying your suggestion of interpolation types 17 or 18.

Update - interpolation type 17 gave a slightly improved result over choice 6.

sgthomas-github commented 2 years ago

@rfalgout @ulrikeyang et al, an update - with the recommended settings for BoomerAMG and with OpenMP enabled, the solver converged OK for all the solves above. In general I noted a ~2.5x speedup for the I/J-dir solves and a ~5x speedup for the K-dir solves with 16 threads. Not sure if that is along expected lines or not. I am planning to try with GPU enabled next; I noted that some solver options are not supported on GPU. One question: can we generally expect greater speedup from the GPU vs. multithreaded CPU (OpenMP)? Thank you.

liruipeng commented 2 years ago

GPU acceleration depends on the parameters of AMG (one should use only GPU-enabled algorithms, see https://github.com/hypre-space/hypre/wiki/GPUs) and on the problem size per GPU (in general, the larger the better, as long as it fits in memory).

sgthomas-github commented 2 years ago

Thanks @liruipeng et al. When I try to configure with the GPU, I see the following CMake configuration warning on Windows 10:

-- Looking for a CUDA compiler
-- Looking for a CUDA compiler - C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.6/bin/nvcc.exe
-- Looking for a CUDA host compiler - C:/Program Files (x86)/Microsoft Visual Studio/2019/Professional/VC/Tools/MSVC/14.29.30133/bin/Hostx64/x64/cl.exe
CMake Warning at C:/ProgramData/Miniconda3/Lib/site-packages/cmake/data/share/cmake-3.22/Modules/CMakeDetermineCUDACompiler.cmake:15 (message):
  Visual Studio does not support specifying CUDAHOSTCXX or
  CMAKE_CUDA_HOST_COMPILER.  Using the C++ compiler provided by Visual
  Studio.

Following that it reports that it detected the CUDA compiler and the CUDA toolkit:

-- The CUDA compiler identification is NVIDIA 11.6.124
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.6/bin/nvcc.exe - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Enabled support for CUDA.
-- Using CUDA architecture: 70
-- Found CUDA: C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.6 (found version "11.6")
-- Found CUDAToolkit: C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.6/include (found version "11.6.124")
-- Configuring done
-- Generating done
-- Build files have been written to: E:/devl/hypre/hypre-2.24.0/src/cuda_build

Can the above warning be ignored? Can I still launch the solution with VS and build/install HYPRE as usual, or do I have to additionally modify some settings of the HYPRE project in the VS solution?
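For reference, a hedged sketch of the configure step that produced the log above (HYPRE_WITH_CUDA is the hypre CMake option; the architecture value 70 matches the log, and the paths and build directory are placeholders for this setup):

```shell
# Configure a CUDA-enabled hypre build; nvcc and the host compiler are
# discovered by CMake as shown in the log above.
cd hypre-2.24.0/src
cmake -B cuda_build -DHYPRE_WITH_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=70
cmake --build cuda_build --config Release
```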


EDIT: I tried building from the VS solution, and it seems to work as intended: it builds the relevant files of the project with the CUDA compiler. However, I did see errors complaining about some CUDA preprocessor directives in one of the files:

2>E:\devl\hypre\hypre-2.24.0\src\seq_mv\csr_matvec_device.c(187): error : "#" not expected here
2>
2>E:\devl\hypre\hypre-2.24.0\src\seq_mv\csr_matvec_device.c(187): error : expected an expression
2>
2>E:\devl\hypre\hypre-2.24.0\src\seq_mv\csr_matvec_device.c(187): error : too many arguments in function call
2>
2>14 errors detected in the compilation of "E:/devl/hypre/hypre-2.24.0/src/seq_mv/csr_matvec_device.c".

Similarly for the next call:

1>E:\devl\hypre\hypre-2.24.0\src\seq_mv\csr_matvec_device.c(183): error : "#" not expected here
1>
1>E:\devl\hypre\hypre-2.24.0\src\seq_mv\csr_matvec_device.c(183): error : expected an expression
1>
1>E:\devl\hypre\hypre-2.24.0\src\seq_mv\csr_matvec_device.c(183): error : too many arguments in function call
1>
1>Done building project "HYPRE.vcxproj".
1>7 errors detected in the compilation of "E:/devl/hypre/hypre-2.24.0/src/seq_mv/csr_matvec_device.c".

After deleting/commenting the inactive lines of the directive in question, it seems to compile OK.

Then I tried building the test project ij and ran into this:

1>ij.c
1>E:\devl\hypre\hypre-2.24.0\src\test\ij.c(952): fatal error C1061: compiler limit: blocks nested too deeply

Got past those by commenting some of the conditional checks, but running gives this error:

Running with these driver parameters:
  solver ID    = 0

CUDA ERROR (code = 35, CUDA driver version is insufficient for CUDA runtime version) at E:\devl\hypre\hypre-2.24.0\src\utilities\general.c:194
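As the message itself says, error 35 (cudaErrorInsufficientDriver) means the installed NVIDIA driver is older than what the CUDA 11.6 runtime requires, so updating the driver should resolve it. A minimal diagnostic sketch using the real CUDA runtime calls (requires the CUDA toolkit to build):

```c
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
   int driver = 0, runtime = 0;
   cudaDriverGetVersion(&driver);    /* highest CUDA version the driver supports */
   cudaRuntimeGetVersion(&runtime);  /* CUDA version the application was built against */
   printf("driver %d, runtime %d%s\n", driver, runtime,
          driver < runtime ? " -> update the NVIDIA driver" : "");
   return 0;
}
```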