hypre-space / hypre

Parallel solvers for sparse linear systems featuring multigrid methods.
https://www.llnl.gov/casc/hypre/

PCG behavior #261

Open ptsuji opened 3 years ago

ptsuji commented 3 years ago

Hi Hypre team,

I'm trying to update to Hypre 2.20.0 (from version 2.15.1). Some of the tests I'm trying to run with PCG fail when going to this new version. What I see from the solver output is that PCG stops after the first 3 iterations:

    Iters       ||r||_2     conv.rate  ||r||_2/||b||_2
    -----    ------------   ---------  ------------
          1    6.703413e+00    0.497382    4.973819e-01
          2    7.900646e+00    1.178600    5.862145e-01
          3    1.345706e+01    1.703286    9.984911e-01

    HYPRE_LSC::launchSolver ERROR : in PCG solve.

In the older version of Hypre, PCG continues even if the residual increases during the first few iterations (the first 3 iterations are the same):

    Iters       ||r||_2     conv.rate  ||r||_2/||b||_2
    -----    ------------   ---------  ------------
          1    6.703413e+00    0.497382    4.973819e-01
          2    7.900646e+00    1.178600    5.862145e-01
          3    1.345706e+01    1.703286    9.984911e-01
          4    3.918493e+00    0.291185    2.907455e-01
          5    7.879527e+00    2.010857    5.846476e-01
          6    2.896081e+00    0.367545    2.148843e-01
          7    2.171334e+00    0.749749    1.611093e-01
          8    3.586211e+00    1.651616    2.660908e-01
          9    1.664137e+00    0.464038    1.234762e-01
         10    1.442316e+00    0.866705    1.070174e-01
         11    1.516017e+00    1.051099    1.124859e-01
         12    8.968874e-01    0.591608    6.654752e-02
         13    6.587533e-01    0.734488    4.887838e-02
         14    6.157143e-01    0.934666    4.568496e-02
         15    6.119204e-01    0.993838    4.540346e-02
         16    5.541675e-01    0.905620    4.111829e-02
         17    5.332409e-01    0.962238    3.956557e-02
         18    5.307207e-01    0.995274    3.937858e-02
         19    5.109586e-01    0.962764    3.791226e-02
         20    5.104054e-01    0.998917    3.787121e-02
         21    5.022269e-01    0.983977    3.726439e-02
         22    4.919717e-01    0.979580    3.650346e-02
         23    4.905653e-01    0.997141    3.639912e-02
         24    4.902818e-01    0.999422    3.637808e-02
         25    4.869365e-01    0.993177    3.612986e-02
         26    4.858042e-01    0.997675    3.604585e-02
         27    4.850035e-01    0.998352    3.598644e-02
         28    4.779962e-01    0.985552    3.546651e-02
         29    4.771695e-01    0.998270    3.540517e-02
         30    4.840823e-01    1.014487    3.591809e-02
         31    4.838078e-01    0.999433    3.589772e-02
         32    4.495752e-01    0.929243    3.335772e-02
         33    4.219088e-01    0.938461    3.130492e-02
         34    4.168943e-01    0.988115    3.093285e-02
         35    4.138626e-01    0.992728    3.070790e-02
         36    4.116622e-01    0.994683    3.054464e-02
         37    3.969393e-01    0.964236    2.945222e-02
         38    3.283768e-01    0.827272    2.436500e-02
         39    2.643302e-01    0.804960    1.961285e-02
         40    2.900579e-01    1.097332    2.152181e-02
         41    2.002251e-01    0.690294    1.485636e-02
         42    1.333312e-01    0.665907    9.892952e-03
         43    7.048849e-02    0.528672    5.230127e-03
         44    2.662837e-02    0.377769    1.975780e-03
         45    7.941937e-03    0.298251    5.892783e-04
         46    2.979308e-03    0.375136    2.210596e-04
         47    1.880780e-03    0.631281    1.395507e-04
         48    1.954712e-03    1.039309    1.450363e-04
         49    1.797398e-03    0.919521    1.333639e-04
         50    6.136403e-04    0.341405    4.553108e-05
         51    2.588146e-04    0.421769    1.920361e-05
         52    9.700821e-05    0.374817    7.197845e-06
         53    3.557732e-05    0.366745    2.639777e-06
         54    1.506340e-05    0.423399    1.117679e-06
         55    9.694517e-06    0.643581    7.193167e-07
         56    5.754037e-06    0.593535    4.269398e-07
         57    7.657231e-06    1.330758    5.681536e-07
         58    1.049804e-05    1.370996    7.789365e-07
         59    6.813330e-06    0.649010    5.055375e-07
         60    5.542970e-06    0.813548    4.112790e-07
         61    5.292748e-06    0.954858    3.927130e-07
         62    2.456764e-06    0.464175    1.822877e-07
         63    2.964145e-06    1.206524    2.199345e-07
         64    1.593294e-06    0.537522    1.182197e-07
         65    8.480699e-07    0.532275    6.292535e-08
         66    1.179061e-06    1.390288    8.748435e-08
         67    7.109129e-07    0.602948    5.274854e-08
         68    7.920460e-07    1.114125    5.876847e-08
         69    5.242589e-07    0.661905    3.889912e-08
         70    6.222640e-07    1.186940    4.617094e-08
         71    3.547269e-07    0.570059    2.632013e-08
         72    1.713761e-07    0.483121    1.271582e-08
         73    1.785431e-07    1.041820    1.324760e-08
         74    1.749397e-07    0.979818    1.298023e-08
         75    9.659824e-08    0.552180    7.167426e-09

How do we recover the old behavior of PCG? Is there a flag/input that we need to set?

Thanks,

Paul

rfalgout commented 3 years ago

Hi @ptsuji. Would you mind adding --with-print-errors to your configure line for hypre, recompiling, rerunning this case, and letting us know whether an error message is printed (and what the message says)? Thanks!

ptsuji commented 3 years ago

Hi @rfalgout,

After building with --with-print-errors in the configuration, the output looks like this:

    Iters       ||r||_2     conv.rate  ||r||_2/||b||_2
    -----    ------------   ---------  ------------
          1    6.703413e+00    0.497382    4.973819e-01
          2    7.900646e+00    1.178600    5.862145e-01
          3    1.345706e+01    1.703286    9.984911e-01

    hypre error in file "pcg.c", line 682, error code = 256 - Subnormal gamma value in PCG
    hypre error in file "pcg.c", line 682, error code = 256 - Subnormal gamma value in PCG
    hypre error in file "pcg.c", line 682, error code = 256 - Subnormal gamma value in PCG
    hypre error in file "pcg.c", line 682, error code = 256 - Subnormal gamma value in PCG
    hypre error in file "pcg.c", line 682, error code = 256 - Subnormal gamma value in PCG
    hypre error in file "pcg.c", line 682, error code = 256 - Subnormal gamma value in PCG
    hypre error in file "pcg.c", line 682, error code = 256 - Subnormal gamma value in PCG
    hypre error in file "pcg.c", line 682, error code = 256 - Subnormal gamma value in PCG
    hypre error in file "pcg.c", line 682, error code = 256 - Subnormal gamma value in PCG
    hypre error in file "pcg.c", line 682, error code = 256 - Subnormal gamma value in PCG
    hypre error in file "pcg.c", line 682, error code = 256 - Subnormal gamma value in PCG
    hypre error in file "pcg.c", line 682, error code = 256 - Subnormal gamma value in PCG
    HYPRE_LSC::launchSolver ERROR : in PCG solve.

liruipeng commented 3 years ago

@ptsuji Is it possible that you can provide us this particular matrix? Thanks!

ptsuji commented 3 years ago

@liruipeng @rfalgout the matrix is on the RZ. Who can I give it to there?

liruipeng commented 3 years ago

> @liruipeng @rfalgout the matrix is on the RZ. Who can I give it to there?

@ptsuji If this can be moved to CZ, you can give it to me. My username is li50. Thanks!

ptsuji commented 3 years ago

@liruipeng I gave you the matrix on quartz. This file is what is written out by the FEI (8211 x 8211 matrix with 200263 nonzeros). Thanks for looking at this!

liruipeng commented 3 years ago

@ptsuji Got the matrix. Thanks! Which preconditioner do you use with PCG? Do you have the preconditioner parameters?

ptsuji commented 3 years ago

@liruipeng here are some of the parameters/output:

    paramStrings[0] = "amgPmax 4"
    paramStrings[1] = "amgInterpType 6"
    paramStrings[2] = "amgRelaxType hybridsym"
    paramStrings[3] = "amgCoarsenType hmis"
    paramStrings[4] = "amgAggLevels 1"
    paramStrings[5] = "solver cg"
    paramStrings[6] = "preconditioner boomeramg"
    paramStrings[7] = "outputLevel 2"
    paramStrings[8] = "hypre"
    paramStrings[9] = "amgStrongThreshold 3.000000e-01"
    paramStrings[10] = "maxIterations 500"
    paramStrings[11] = "tolerance 1.000000e-08"


ptsuji commented 3 years ago

@liruipeng also, I should mention that I saw this error on 12 processors. On 1 processor, the ordering from the FEI interface is different, and things actually converge.

liruipeng commented 3 years ago

> @liruipeng also, I should mention that I saw this error on 12 processors. On 1 processor, the ordering from the FEI interface is different, and things actually converge.

Hi @ptsuji, thank you for the information. To reproduce the problem, it would help to know how the system was distributed. If you have access to the HYPRE_IJ objects that you pass to PCG, can you add the following lines to your code before the PCG solve

    HYPRE_IJMatrixPrint(ij_A, "IJ.out.A");
    HYPRE_IJVectorPrint(ij_b, "IJ.out.b");
    HYPRE_IJVectorPrint(ij_x, "IJ.out.x0");

to print the matrix, the right-hand-side vector, and the initial guess, and then give me the saved files? With 12 processes, each object should be saved in 12 pieces. Also, for verification, can you please copy the AMG hierarchy from hypre's output, like the following example?

BoomerAMG SETUP PARAMETERS:

 Max levels = 25
 Num levels = 5

 Strength Threshold = 0.250000
 Interpolation Truncation Factor = 0.000000
 Maximum Row Sum Threshold for Dependency Weakening = 1.000000

 Coarsening Type = HMIS 
 measures are determined locally

 No global partition option chosen.

 Interpolation = extended+i interpolation

Operator Matrix Information:

             nonzero            entries/row          row sums
lev    rows  entries sparse   min  max     avg      min         max
======================================================================
  0    1000     6400  0.006     4    7     6.4   0.000e+00   3.000e+00
  1     500     7248  0.029     7   17    14.5   0.000e+00   4.000e+00
  2      99     3003  0.306    15   43    30.3   1.041e-02   5.319e+00
  3      14      188  0.959    11   14    13.4   5.274e+00   1.007e+01
  4       4       16  1.000     4    4     4.0   7.597e+00   9.196e+00

Interpolation Matrix Information:
                    entries/row        min        max            row sums
lev  rows x cols  min  max  avgW     weight      weight       min         max
================================================================================
  0  1000 x 500     1    4   4.0   1.667e-01   2.500e-01   5.000e-01   1.000e+00
  1   500 x 99      1    4   4.0   1.301e-02   3.547e-01   2.164e-01   1.000e+00
  2    99 x 14      1    4   4.0   1.247e-03   3.928e-01   2.865e-02   1.000e+00
  3    14 x 4       1    4   3.6  -6.320e-02   6.629e-02  -6.121e-02   1.000e+00

     Complexity:    grid = 1.617000
                operator = 2.633594
                memory = 3.350625

Thanks!