Belos: BiCGSTAB is broken

mhoemmen commented 6 years ago

@trilinos/belos @vbrunini @jclause

We noticed that Belos' BiCGSTAB implementation was failing to converge for a small (32 x 32) unpreconditioned nonsymmetric (but not badly so) linear system. The linear system has real eigenvalues and RCOND about 1.0e9. Both Belos' GMRES and AztecOO's BiCGSTAB handle it just fine.

I wrote my own Tpetra-only BiCGSTAB implementation based on the "Templates" book, in particular the Matlab code here. It converged just fine, in a number of iterations comparable to Belos' GMRES.

Motivation and Context

Many users of our application specify BiCGSTAB as a default solver in their input decks. Its memory use does not depend on the number of iterations / restart length, but it can handle nonsymmetric linear systems. This makes it a good compromise between CG and GMRES.

We would like BiCGSTAB to work to achieve our goal of replacing use of AztecOO with the Tpetra-based solver stack.

Possible Solution

As a temporary fix, use the framework in belos/tpetra/src/solvers to inject my Tpetra-only solver into Belos::SolverFactory.
Fix Belos::BiCGStab{Iter,SolMgr}. (It's not clear which class is causing the problem.) It may be easier just to use the known working Tpetra-only implementation as a guide to rewrite the existing implementation.

hkthorn commented 6 years ago

@mhoemmen Can you provide the small linear system that illustrated this issue? I've been looking at the BiCGStab implementation and I want to reproduce what you are seeing.

mhoemmen commented 6 years ago

@hkthorn I'll send it to you. I was out all last week in all-day meetings.

amklinv commented 6 years ago

That's unfortunate. I'm curious to find out what's causing the issue.

mhoemmen commented 6 years ago

I added some code to a branch to help with debugging this issue.

Receive the bicgstab-A.mm and bicgstab-b.mm files I'll send you.
Check out the "Debug-3787" branch on my Trilinos fork.
Enable Belos and Ifpack2, and enable Ifpack2 examples.
Build Trilinos.
Change to the packages/ifpack2/example/ subdirectory of the build directory, and run the following (with 1 MPI process):
```
./Ifpack2_RelaxationWithEquilibration.exe --matrixFilename=bicgstab-A.mm --rhsFilename=bicgstab-b.mm --convergenceTolerances=1.0e-8 --maxIters=25 --restartLengths=25 --no-equilibrate --solverTypes=BICGSTAB --preconditionerTypes=NONE --custom-bicgstab
```
The code will first print results for a hand-rolled BiCGSTAB; it converges. Then, it will print results for Belos' BiCGSTAB; it does not converge.

hkthorn commented 6 years ago

Hi Mark,

I have cloned your repository. On line 112 of ifpack2/example/RelaxationWithEquilibration.cpp, in the BiCGStab solver, there is:

p.update (-omega, v, 0.0); // p = p - omega*v

But that would just make p = omega*v, since beta is 0.0. Unless I don’t understand the axpy interface of Tpetra, I’m suspicious of the direction vector update.

Thanks, Heidi

From: Mark Hoemmen notifications@github.com Reply-To: trilinos/Trilinos reply@reply.github.com Date: Monday, November 12, 2018 at 3:46 PM To: trilinos/Trilinos Trilinos@noreply.github.com Cc: Heidi Thornquist hkthorn@sandia.gov, Mention mention@noreply.github.com Subject: [EXTERNAL] Re: [trilinos/Trilinos] Belos: BiCGSTAB is broken (#3787)

I added some code to a branch to help with debugging this issue.

Receive the bicgstab-A.mm and bicgstab-b.mm files I'll send you.
Check out the "Debug-3787" branch on my Trilinos forkhttps://github.com/mhoemmen/Trilinos/tree/Debug-3787.
Enable Belos and Ifpack2, and enable Ifpack2 examples.
Build Trilinos.
Change to the packages/ifpack2/example/ subdirectory of the build directory, and run the following (with 1 MPI process):

./Ifpack2_RelaxationWithEquilibration.exe --matrixFilename=bicgstab-A.mm --rhsFilename=bicgstab-b.mm --convergenceTolerances=1.0e-8 --maxIters=25 --restartLengths=25 --no-equilibrate --solverTypes=BICGSTAB --preconditionerTypes=NONE --custom-bicgstab

The code will first print results for a hand-rolled BiCGSTAB; it converges. Then, it will print results for Belos' BiCGSTAB; it does not converge.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/trilinos/Trilinos/issues/3787#issuecomment-438055515, or mute the threadhttps://github.com/notifications/unsubscribe-auth/APLX1kwNbIsx3n8a-8pn4lRMBd9-mwrAks5uufpGgaJpZM4YFZru.

mhoemmen commented 6 years ago

@hkthorn wrote:

But that would just make p = omega*v, since beta is 0.0. Unless I don’t understand the axpy interface of Tpetra, I’m suspicious of the direction vector update.

Interesting.... What's going on with Belos' BiCGSTAB, though? AztecOO's BiCGSTAB seems fine. I'll try the problem with this change to the hand-rolled BiCGSTAB.

hkthorn commented 6 years ago

The choice of rhat has a lot to do with the convergence of BiCGStab. AztecOO allows a random vector to be used as rhat, but Belos only allows the use of rhat=r.

What is the default for AztecOO?

Heidi

From: Mark Hoemmen notifications@github.com Reply-To: trilinos/Trilinos reply@reply.github.com Date: Tuesday, November 13, 2018 at 10:05 AM To: trilinos/Trilinos Trilinos@noreply.github.com Cc: Heidi Thornquist hkthorn@sandia.gov, Mention mention@noreply.github.com Subject: [EXTERNAL] Re: [trilinos/Trilinos] Belos: BiCGSTAB is broken (#3787)

@hkthornhttps://github.com/hkthorn wrote:

But that would just make p = omega*v, since beta is 0.0. Unless I don’t understand the axpy interface of Tpetra, I’m suspicious of the direction vector update.

Interesting.... What's going on with Belos' BiCGSTAB, though? AztecOO's BiCGSTAB seems fine. I'll try the problem with this change to the hand-rolled BiCGSTAB.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/trilinos/Trilinos/issues/3787#issuecomment-438350867, or mute the threadhttps://github.com/notifications/unsubscribe-auth/APLX1mLAIde9F3cnQNbCWC7t6e2Duzq6ks5uuvuugaJpZM4YFZru.

mhoemmen commented 6 years ago

@hkthorn wrote:

But that would just make p = omega*v, since beta is 0.0. Unless I don’t understand the axpy interface of Tpetra, I’m suspicious of the direction vector update.

Data point: When I change beta from 0.0 to 1.0, the hand-rolled BiCGSTAB implementation blows up (tolerance on exit is 1.1e26).

mhoemmen commented 6 years ago

The hand-rolled BiCGSTAB implementation is weird in multiple ways. Note the "oddly enough" comment below the p update. Assigning r to p actually makes p alias r; it's a shallow copy, not a deep copy. If I change that into a deep copy and fix the p update, the solver doesn't converge.

In any case, we should fix Belos, not some hand-rolled solver, but I thought this was interesting.

mhoemmen commented 6 years ago

AztecOO uses the residual vector as rhat by default. Fun fact: AztecOO's documentation misspells "tilde" ;-)

If you're using AztecOO as a smoother (options[AZ_precond] == AZ_smoother is true), it does some recursive thing and invokes TFQMR with a random rhat.

mhoemmen commented 6 years ago

@hkthorn I just pushed a commit to the Debug-3787 branch that adds a reimplementation of AztecOO's BiCGSTAB to the Ifpack2 example. It also fails to converge, with (suspiciously) the same convergence result.

mhoemmen commented 6 years ago

I'm actually worried that there might be something wrong with the Tpetra operation z = alpha*x + beta*y + gamma*z, so I'm rewriting the hand-rolled version not to do that, in order to test.

mhoemmen commented 6 years ago

@hkthorn My hand-rolled exact duplicate of Hank van der Vorst's unpreconditioned BiCGSTAB (from his 1992 paper) gets exactly the same residual norm as Belos: 0.640084. My hand-rolled AztecOO imitation gets almost the same result: 0.632222. This is really quite weird.

mhoemmen commented 6 years ago

I just pushed a new commit to Debug-3787 that includes the exact duplicate of Hank van der Vorst's unpreconditioned BiCGSTAB. I'm on a roll now -- I might try implementing TFQMR from the paper, just to try it out. (Belos' TFQMR doesn't converge on this problem either, though GMRES does fine.)

mhoemmen commented 6 years ago

I wrote three different independent implementations of BiCGSTAB. Belos' BiCGSTAB behaves exactly like a version I took directly from a standard textbook ("Templates for the Solution of Linear Systems," 2nd ed.), when I ran it on this standard tricky nonsymmetric test problem. Thus, I'm tempted to call this "not a bug." BiCGSTAB doesn't always live up to the last four letters of its name; hence the various generalizations. I would like to experiment a bit more on whether I can get better results in the "textbook BiCGSTAB implementation" by using more accurate summation.

mhoemmen commented 6 years ago

I did a quick version of textbook BiCGSTAB that uses an extended-precision accumulator for dot products. It gets only slightly different results than Belos for the same challenging problem mentioned in my previous comment.

mhoemmen commented 5 years ago

I ran Matlab's BiCGSTAB with this linear system, using the following commands. Note the extra iteration: maxiter = 1 means "no iterations" in Matlab, while it means "one iteration" in Belos.

A = mmread('bicgstab-A.mm');
b = mmread('bicgstab-b.mm');
format longe
x = bicgstab(A, b, 1.0e-8, 26);

Matlab said this:

bicgstab stopped at iteration 26 without converging to the desired tolerance 1e-08
because the maximum number of iterations was reached.
The iterate returned (number 26) has relative residual 0.63.

x =

                         0
                         0
     1.018706342509145e-06
     3.106246998694506e-07
                         0
                         0
    -1.489907781725699e-06
    -2.755578586906677e-06
    -3.904344514393062e+01
    -3.904344513469398e+01
                         0
                         0
    -3.904344513345153e+01
    -3.904344514269527e+01
                         0
                         0
    -4.176390301377935e-04
    -4.175912689984014e-04
    -1.477324719305449e+01
    -1.477324716474708e+01
    -4.175909896110095e-04
    -4.176388101393951e-04
    -1.477324717518767e+01
    -1.477324718885252e+01
    -7.833883829556076e-07
     1.845734890680782e-06
    -7.850694917863062e-06
     5.363160481747733e-06
     1.516653930257561e-06
    -1.264329923322757e-06
     6.040718355994066e-06
    -7.164515732579283e-06

Belos reports a relative residual error of 0.640, and gives this as the solution:

0.00000000000000000e+00
0.00000000000000000e+00
3.42321892659527140e-06
2.79296788427004739e-06
0.00000000000000000e+00
0.00000000000000000e+00
-3.86052786452141738e-06
-5.66685238017876838e-06
-3.90434113782005028e+01
-3.90434113712837885e+01
0.00000000000000000e+00
0.00000000000000000e+00
-3.90434113705111372e+01
-3.90434113774316245e+01
0.00000000000000000e+00
0.00000000000000000e+00
-1.92476861384790609e-04
-1.92291643665078762e-04
-1.47732014193323113e+01
-1.47732013637051107e+01
-1.92290718391142977e-04
-1.92476005575872329e-04
-1.47732013710402530e+01
-1.47732014120244113e+01
2.38919243632378751e-05
1.73033226130662394e-05
-2.40939038569352856e-05
2.37453390938814468e-05
-2.30741626286499365e-05
-1.66146772546574630e-05
2.35872421503866493e-05
-2.42429205887435850e-05

I implemented BiCGSTAB by hand, using the original paper as a guide, and got a relative residual error of 0.639, with the following solution:

0.00000000000000000e+00
0.00000000000000000e+00
2.98449450385562106e-06
2.30053748780245696e-06
0.00000000000000000e+00
0.00000000000000000e+00
-3.42383935628271692e-06
-5.16840817187819236e-06
-3.90432952249595147e+01
-3.90432951947169897e+01
0.00000000000000000e+00
0.00000000000000000e+00
-3.90432951925015246e+01
-3.90432952227435663e+01
0.00000000000000000e+00
0.00000000000000000e+00
-1.92221283847275479e-04
-1.91960689248206847e-04
-1.47730437918483091e+01
-1.47730436303738095e+01
-1.92012252376978369e-04
-1.92272916839584862e-04
-1.47730436149719342e+01
-1.47730437618057664e+01
1.75264020822610523e-05
1.21120626017265406e-05
-2.92505326334088982e-05
1.52001639247208350e-05
-1.67088058363647455e-05
-1.14242200897990885e-05
2.88818631465462959e-05
-1.55597482456305498e-05

I wrote a version of BiCGSTAB based on AztecOO's implementation of BiCGSTAB, and got a relative residual error of 0.686, with the following solution:

0.00000000000000000e+00
0.00000000000000000e+00
5.26604106805142545e-06
4.75466569450008301e-06
0.00000000000000000e+00
0.00000000000000000e+00
-5.68690523734606036e-06
-7.80649125952679973e-06
-3.90433587737340915e+01
-3.90433587609749537e+01
0.00000000000000000e+00
0.00000000000000000e+00
-3.90433587554064019e+01
-3.90433587681659020e+01
0.00000000000000000e+00
0.00000000000000000e+00
-1.04268126294236623e-04
-1.04016822009294564e-04
-1.47731300134012180e+01
-1.47731299285751376e+01
-1.03854951213494566e-04
-1.04106329622213268e-04
-1.47731299716323576e+01
-1.47731300418162323e+01
4.77675044858555275e-05
3.56533311286728134e-05
-1.25315220181633497e-05
5.43775533170671842e-05
-4.69161304260747215e-05
-3.49211505781280981e-05
1.23332564578117354e-05
-5.45665656211069032e-05

Weirdly, the other version of BiCGSTAB that I wrote, that is almost certainly incorrect, managed to solve the problem in 20 iterations with a relative residual error of 6.30e-9, and got a different solution than any of these methods.

mhoemmen commented 5 years ago

AztecOO's Gauss-Seidel implementation reorders the linear system by default. I tried running Belos + Ifpack2 with the reordering option. This requires the following parameter settings:

  precType = "SCHWARZ";
  precParams->set ("schwarz: subdomain solver name", "RELAXATION");
  {
    ParameterList& relaxParams =
      precParams->sublist ("schwarz: subdomain solver parameters", false);
    relaxParams.set ("relaxation: type", "Symmetric Gauss-Seidel");
  }
  precParams->set ("schwarz: overlap level", int (0));
  precParams->set ("schwarz: use reordering", true);

However, this did not help. I got a relative residual error of 124, and the following solution:

0.00000000000000000e+00
0.00000000000000000e+00
1.78451073609599884e-10
-5.50751482580689631e-10
0.00000000000000000e+00
0.00000000000000000e+00
-2.55356299135419606e-14
-1.10506410560070033e-15
-7.46289448017023460e-08
7.86869445157378777e-08
0.00000000000000000e+00
0.00000000000000000e+00
-7.92209706723667750e-08
6.54848673065089315e-07
0.00000000000000000e+00
0.00000000000000000e+00
-7.96266786658743748e-17
2.00537267640978910e-16
-8.05600102999251663e-15
1.61591269201176099e-12
3.41217204913931490e-16
-9.83831779173299537e-17
5.38076565942402496e-10
-1.79886791684680247e-07
9.38363397143668342e-11
-1.37798014022924932e-10
-6.40007715052251236e-02
4.16964483507910946e-06
4.64299075368219110e-06
-3.12078708361021944e-08
7.39596448310157028e-06
-2.43685662199579915e-04

mhoemmen commented 5 years ago

My very last option is to try AztecOO's BiCGSTAB with Symmetric Gauss-Seidel preconditioning. I will work on this now.

mhoemmen commented 5 years ago

AztecOO's BiCGSTAB with no scaling and no preconditioner did not converge either, and got nearly the same residual norm.

        *******************************************************
        ***** Problem: Epetra::CrsMatrix
        ***** Preconditioned BICGSTAB solution
        ***** No preconditioning
        ***** No scaling
        *******************************************************

                iter:    0           residual = 1.000000e+00
                iter:    1           residual = 2.116599e+00
                iter:    2           residual = 7.737321e-01
                iter:    3           residual = 7.376869e-01
                iter:    4           residual = 7.386195e-01
                iter:    5           residual = 8.283596e-01
                iter:    6           residual = 7.585078e-01
                iter:    7           residual = 7.520084e-01
                iter:    8           residual = 6.167340e-01
                iter:    9           residual = 6.682432e+00
                iter:   10           residual = 7.471754e-01
                iter:   11           residual = 5.883619e-01
                iter:   12           residual = 8.319245e-01
                iter:   13           residual = 6.173222e-01
                iter:   14           residual = 8.792509e-01
                iter:   15           residual = 8.765567e-01
                iter:   16           residual = 1.132873e+00
                iter:   17           residual = 9.353814e-01
                iter:   18           residual = 9.196676e-01
                iter:   19           residual = 8.960372e-01
                iter:   20           residual = 8.857277e-01
                iter:   21           residual = 8.795586e-01
                iter:   22           residual = 8.312507e-01
                iter:   23           residual = 8.033678e-01
                iter:   24           residual = 7.429115e-01
                iter:   25           residual = 6.641268e-01

    ***************************************************************

    Warning: maximum number of iterations exceeded
    without convergence

    Solver:         bicgstab
    number of iterations:   25

    Recursive residual =  3.5175e-01

    Calculated Norms                Requested Norm
    --------------------------------------------    --------------

    ||r||_2 / ||r0||_2:     6.641268e-01    1.000000e-08

    ***************************************************************

        Solution time: 0.003654 (sec.)
        total iterations: 25
Solver:
  Solver type: AztecOO BiCGSTAB
  Preconditioner type: DEFAULT
  Convergence tolerance: 1e-08
  Maximum number of iterations: 25
Results:
  Converged: 0
  Number of iterations: 25
  Achieved tolerance: 0.664127

mhoemmen commented 5 years ago

I'm calling this not a bug.

mhoemmen commented 5 years ago

@hkthorn FYI, the very first "hand-rolled BiCGSTAB" that I wrote, that appeared to converge just fine, gives a bogus actual solution.

Solver:
  Solver type: Custom BiCGSTAB
  Preconditioner type: NONE
  Convergence tolerance: 1e-08
  Maximum number of iterations: 25
Results:
  Converged: 1
  Number of iterations: 20
  Achieved tolerance: 6.29728e-09
  Explicit ||R||_2 / ||B||_2: 3.11918

Subsequent hand-rolled implementations of BiCGSTAB that I wrote behave more reasonably.

amklinv commented 5 years ago

Of course it's not a bug. It's a feature!

trilinos / Trilinos

Belos: BiCGSTAB is broken #3787

Motivation and Context

Possible Solution