Closed mhoemmen closed 5 years ago
@mhoemmen Can you provide the small linear system that illustrated this issue? I've been looking at the BiCGStab implementation and I want to reproduce what you are seeing.
@hkthorn I'll send it to you. I was out all last week in all-day meetings.
That's unfortunate. I'm curious to find out what's causing the issue.
I added some code to a branch to help with debugging this issue.
Go to the packages/ifpack2/example/ subdirectory of the build directory, and run the following (with 1 MPI process):
./Ifpack2_RelaxationWithEquilibration.exe --matrixFilename=bicgstab-A.mm --rhsFilename=bicgstab-b.mm --convergenceTolerances=1.0e-8 --maxIters=25 --restartLengths=25 --no-equilibrate --solverTypes=BICGSTAB --preconditionerTypes=NONE --custom-bicgstab
The code will first print results for a hand-rolled BiCGSTAB; it converges. Then, it will print results for Belos' BiCGSTAB; it does not converge.
Hi Mark,
I have cloned your repository. On line 112 of ifpack2/example/RelaxationWithEquilibration.cpp, in the BiCGStab solver, there is:
p.update (-omega, v, 0.0); // p = p - omega*v
But that would just make p = omega*v, since beta is 0.0. Unless I don’t understand the axpy interface of Tpetra, I’m suspicious of the direction vector update.
Thanks, Heidi
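For readers following along: Tpetra documents the two-scalar form of update as this = beta*this + alpha*A. Below is a minimal sketch of that semantics using plain std::vector as a stand-in. The free function update is hypothetical and is not the Tpetra API; skipping the old values when beta is zero is an assumption modeled on BLAS-style conventions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for y.update(alpha, x, beta): y = beta*y + alpha*x.
// Assumption: when beta == 0 the old contents of y are ignored entirely.
void update(std::vector<double>& y, double alpha,
            const std::vector<double>& x, double beta) {
  assert(x.size() == y.size());
  for (std::size_t i = 0; i < y.size(); ++i) {
    if (beta == 0.0) {
      y[i] = alpha * x[i];              // old y[i] is overwritten
    } else {
      y[i] = beta * y[i] + alpha * x[i];
    }
  }
}
```

With alpha = -omega and beta = 0.0 this overwrites p with -omega*v; beta = 1.0 is what gives p - omega*v.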
@hkthorn wrote:
But that would just make p = omega*v, since beta is 0.0. Unless I don’t understand the axpy interface of Tpetra, I’m suspicious of the direction vector update.
Interesting.... What's going on with Belos' BiCGSTAB, though? AztecOO's BiCGSTAB seems fine. I'll try the problem with this change to the hand-rolled BiCGSTAB.
The choice of rhat has a lot to do with the convergence of BiCGStab. AztecOO allows a random vector to be used as rhat, but Belos only allows the use of rhat=r.
What is the default for AztecOO?
Heidi
@hkthorn wrote:
But that would just make p = omega*v, since beta is 0.0. Unless I don’t understand the axpy interface of Tpetra, I’m suspicious of the direction vector update.
Data point: When I change beta from 0.0 to 1.0, the hand-rolled BiCGSTAB implementation blows up (tolerance on exit is 1.1e26).
The hand-rolled BiCGSTAB implementation is weird in multiple ways. Note the "oddly enough" comment below the p update. Assigning r to p actually makes p alias r; it's a shallow copy, not a deep copy. If I change that into a deep copy and fix the p update, the solver doesn't converge.
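The aliasing effect described above can be illustrated without Tpetra. This is a sketch only: the Vec handle type and the deepCopy and aliases helpers are hypothetical, and a std::shared_ptr is used here merely to mimic a vector type whose assignment shares storage.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical handle with view (shallow-copy) semantics:
// copying the handle shares the underlying storage.
using Vec = std::shared_ptr<std::vector<double>>;

// Deep copy: allocate fresh storage and copy the values into it.
Vec deepCopy(const Vec& x) {
  assert(x != nullptr);
  return std::make_shared<std::vector<double>>(*x);
}

// True if both handles refer to the same storage.
bool aliases(const Vec& a, const Vec& b) {
  return a.get() == b.get();
}
```

After `Vec p = r;` every write through p is also a write to r, so an update such as p = r + beta*(p - omega*v) silently changes the residual as well. With `Vec p = deepCopy(r);` the two vectors evolve independently.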
In any case, we should fix Belos, not some hand-rolled solver, but I thought this was interesting.
AztecOO uses the residual vector as rhat by default. Fun fact: AztecOO's documentation misspells "tilde" ;-)
If you're using AztecOO as a smoother (options[AZ_precond] == AZ_smoother is true), it does some recursive thing and invokes TFQMR with a random rhat.
@hkthorn I just pushed a commit to the Debug-3787 branch that adds a reimplementation of AztecOO's BiCGSTAB to the Ifpack2 example. It also fails to converge, with (suspiciously) the same convergence result.
I'm actually worried that there might be something wrong with the Tpetra operation z = alpha*x + beta*y + gamma*z, so I'm rewriting the hand-rolled version not to do that, in order to test.
@hkthorn My hand-rolled exact duplicate of Henk van der Vorst's unpreconditioned BiCGSTAB (from his 1992 paper) gets exactly the same residual norm as Belos: 0.640084. My hand-rolled AztecOO imitation gets almost the same result: 0.632222. This is really quite weird.
I just pushed a new commit to Debug-3787 that includes the exact duplicate of Henk van der Vorst's unpreconditioned BiCGSTAB. I'm on a roll now -- I might try implementing TFQMR from the paper, just to try it out. (Belos' TFQMR doesn't converge on this problem either, though GMRES does fine.)
I wrote three different independent implementations of BiCGSTAB. Belos' BiCGSTAB behaves exactly like a version I took directly from a standard textbook ("Templates for the Solution of Linear Systems," 2nd ed.), when I ran it on this standard tricky nonsymmetric test problem. Thus, I'm tempted to call this "not a bug." BiCGSTAB doesn't always live up to the last four letters of its name; hence the various generalizations. I would like to experiment a bit more on whether I can get better results in the "textbook BiCGSTAB implementation" by using more accurate summation.
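For reference, here is a minimal sketch of unpreconditioned BiCGSTAB with rhat = r0, following the standard textbook recurrences. It is not the Belos code or any of the implementations on the Debug-3787 branch. The dense Mat type, the Result struct, and the function names are assumptions made for illustration, and the breakdown handling is deliberately simplistic.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;  // dense rows; stand-in for a sparse operator

double dot(const Vec& x, const Vec& y) {
  assert(x.size() == y.size());
  double sum = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) sum += x[i] * y[i];
  return sum;
}

Vec apply(const Mat& A, const Vec& x) {
  Vec y(A.size());
  for (std::size_t i = 0; i < A.size(); ++i) y[i] = dot(A[i], x);
  return y;
}

struct Result {
  Vec x;
  int iters;
  double relRes;  // recursively updated ||r||_2 / ||b||_2, not b - A*x
  bool converged;
};

// Solve A x = b from x0 = 0; returns early on breakdown (a zero denominator).
Result bicgstab(const Mat& A, const Vec& b, double tol, int maxIters) {
  const std::size_t n = b.size();
  Vec x(n, 0.0), r = b, p(n, 0.0), v(n, 0.0), s(n, 0.0);
  const Vec rhat = r;  // shadow residual, fixed at r0
  const double normB = std::sqrt(dot(b, b));
  if (normB == 0.0) return {x, 0, 0.0, true};
  double rho = 1.0, alpha = 1.0, omega = 1.0, relRes = 1.0;
  for (int k = 1; k <= maxIters; ++k) {
    const double rhoNew = dot(rhat, r);
    if (rhoNew == 0.0) return {x, k - 1, relRes, false};
    const double beta = (rhoNew / rho) * (alpha / omega);
    for (std::size_t i = 0; i < n; ++i) {
      p[i] = r[i] + beta * (p[i] - omega * v[i]);
    }
    v = apply(A, p);
    const double rhatV = dot(rhat, v);
    if (rhatV == 0.0) return {x, k - 1, relRes, false};
    alpha = rhoNew / rhatV;
    for (std::size_t i = 0; i < n; ++i) {
      s[i] = r[i] - alpha * v[i];
      x[i] += alpha * p[i];  // half step
    }
    relRes = std::sqrt(dot(s, s)) / normB;
    if (relRes <= tol) return {x, k, relRes, true};
    const Vec t = apply(A, s);
    const double tt = dot(t, t);
    if (tt == 0.0) return {x, k, relRes, false};
    omega = dot(t, s) / tt;
    for (std::size_t i = 0; i < n; ++i) {
      x[i] += omega * s[i];  // stabilizing step
      r[i] = s[i] - omega * t[i];
    }
    relRes = std::sqrt(dot(r, r)) / normB;
    if (relRes <= tol) return {x, k, relRes, true};
    if (omega == 0.0) return {x, k, relRes, false};
    rho = rhoNew;
  }
  return {x, maxIters, relRes, false};
}
```

On a small well-conditioned system such as A = [4 1; 2 3], b = [1; 2], a call like bicgstab(A, b, 1.0e-10, 25) should reach x close to [0.1; 0.6]. That says nothing about the 32 x 32 problem discussed in this thread.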
I did a quick version of textbook BiCGSTAB that uses an extended-precision accumulator for dot products. It gets only slightly different results than Belos for the same challenging problem mentioned in my previous comment.
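One way such an accumulator could look is sketched below. This is an assumption about the general idea, not the code actually used for the experiment: dotExtended is a hypothetical name, and long double only gives extra precision on platforms where it is wider than double.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Dot product of two double vectors, with products and the running sum
// held in long double and rounded to double only once, at the end.
double dotExtended(const std::vector<double>& x,
                   const std::vector<double>& y) {
  assert(x.size() == y.size());
  long double sum = 0.0L;
  for (std::size_t i = 0; i < x.size(); ++i) {
    sum += static_cast<long double>(x[i]) * static_cast<long double>(y[i]);
  }
  return static_cast<double>(sum);
}
```

Swapping this in for the plain dot product changes only the rounding of the inner products that feed rho, alpha, and omega; the recurrences themselves stay the same.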
I ran Matlab's BiCGSTAB with this linear system, using the following commands. Note the extra iteration: maxiter = 1 means "no iterations" in Matlab, while it means "one iteration" in Belos.
A = mmread('bicgstab-A.mm');
b = mmread('bicgstab-b.mm');
format longe
x = bicgstab(A, b, 1.0e-8, 26);
Matlab said this:
bicgstab stopped at iteration 26 without converging to the desired tolerance 1e-08
because the maximum number of iterations was reached.
The iterate returned (number 26) has relative residual 0.63.
x =
0
0
1.018706342509145e-06
3.106246998694506e-07
0
0
-1.489907781725699e-06
-2.755578586906677e-06
-3.904344514393062e+01
-3.904344513469398e+01
0
0
-3.904344513345153e+01
-3.904344514269527e+01
0
0
-4.176390301377935e-04
-4.175912689984014e-04
-1.477324719305449e+01
-1.477324716474708e+01
-4.175909896110095e-04
-4.176388101393951e-04
-1.477324717518767e+01
-1.477324718885252e+01
-7.833883829556076e-07
1.845734890680782e-06
-7.850694917863062e-06
5.363160481747733e-06
1.516653930257561e-06
-1.264329923322757e-06
6.040718355994066e-06
-7.164515732579283e-06
Belos reports a relative residual error of 0.640, and gives this as the solution:
0.00000000000000000e+00
0.00000000000000000e+00
3.42321892659527140e-06
2.79296788427004739e-06
0.00000000000000000e+00
0.00000000000000000e+00
-3.86052786452141738e-06
-5.66685238017876838e-06
-3.90434113782005028e+01
-3.90434113712837885e+01
0.00000000000000000e+00
0.00000000000000000e+00
-3.90434113705111372e+01
-3.90434113774316245e+01
0.00000000000000000e+00
0.00000000000000000e+00
-1.92476861384790609e-04
-1.92291643665078762e-04
-1.47732014193323113e+01
-1.47732013637051107e+01
-1.92290718391142977e-04
-1.92476005575872329e-04
-1.47732013710402530e+01
-1.47732014120244113e+01
2.38919243632378751e-05
1.73033226130662394e-05
-2.40939038569352856e-05
2.37453390938814468e-05
-2.30741626286499365e-05
-1.66146772546574630e-05
2.35872421503866493e-05
-2.42429205887435850e-05
I implemented BiCGSTAB by hand, using the original paper as a guide, and got a relative residual error of 0.639, with the following solution:
0.00000000000000000e+00
0.00000000000000000e+00
2.98449450385562106e-06
2.30053748780245696e-06
0.00000000000000000e+00
0.00000000000000000e+00
-3.42383935628271692e-06
-5.16840817187819236e-06
-3.90432952249595147e+01
-3.90432951947169897e+01
0.00000000000000000e+00
0.00000000000000000e+00
-3.90432951925015246e+01
-3.90432952227435663e+01
0.00000000000000000e+00
0.00000000000000000e+00
-1.92221283847275479e-04
-1.91960689248206847e-04
-1.47730437918483091e+01
-1.47730436303738095e+01
-1.92012252376978369e-04
-1.92272916839584862e-04
-1.47730436149719342e+01
-1.47730437618057664e+01
1.75264020822610523e-05
1.21120626017265406e-05
-2.92505326334088982e-05
1.52001639247208350e-05
-1.67088058363647455e-05
-1.14242200897990885e-05
2.88818631465462959e-05
-1.55597482456305498e-05
I wrote a version of BiCGSTAB based on AztecOO's implementation of BiCGSTAB, and got a relative residual error of 0.686, with the following solution:
0.00000000000000000e+00
0.00000000000000000e+00
5.26604106805142545e-06
4.75466569450008301e-06
0.00000000000000000e+00
0.00000000000000000e+00
-5.68690523734606036e-06
-7.80649125952679973e-06
-3.90433587737340915e+01
-3.90433587609749537e+01
0.00000000000000000e+00
0.00000000000000000e+00
-3.90433587554064019e+01
-3.90433587681659020e+01
0.00000000000000000e+00
0.00000000000000000e+00
-1.04268126294236623e-04
-1.04016822009294564e-04
-1.47731300134012180e+01
-1.47731299285751376e+01
-1.03854951213494566e-04
-1.04106329622213268e-04
-1.47731299716323576e+01
-1.47731300418162323e+01
4.77675044858555275e-05
3.56533311286728134e-05
-1.25315220181633497e-05
5.43775533170671842e-05
-4.69161304260747215e-05
-3.49211505781280981e-05
1.23332564578117354e-05
-5.45665656211069032e-05
Weirdly, the other version of BiCGSTAB that I wrote, which is almost certainly incorrect, managed to solve the problem in 20 iterations with a relative residual error of 6.30e-9, and got a different solution than any of these methods.
AztecOO's Gauss-Seidel implementation reorders the linear system by default. I tried running Belos + Ifpack2 with the reordering option. This requires the following parameter settings:
precType = "SCHWARZ";
precParams->set ("schwarz: subdomain solver name", "RELAXATION");
{
  ParameterList& relaxParams =
    precParams->sublist ("schwarz: subdomain solver parameters", false);
  relaxParams.set ("relaxation: type", "Symmetric Gauss-Seidel");
}
precParams->set ("schwarz: overlap level", int (0));
precParams->set ("schwarz: use reordering", true);
However, this did not help. I got a relative residual error of 124, and the following solution:
0.00000000000000000e+00
0.00000000000000000e+00
1.78451073609599884e-10
-5.50751482580689631e-10
0.00000000000000000e+00
0.00000000000000000e+00
-2.55356299135419606e-14
-1.10506410560070033e-15
-7.46289448017023460e-08
7.86869445157378777e-08
0.00000000000000000e+00
0.00000000000000000e+00
-7.92209706723667750e-08
6.54848673065089315e-07
0.00000000000000000e+00
0.00000000000000000e+00
-7.96266786658743748e-17
2.00537267640978910e-16
-8.05600102999251663e-15
1.61591269201176099e-12
3.41217204913931490e-16
-9.83831779173299537e-17
5.38076565942402496e-10
-1.79886791684680247e-07
9.38363397143668342e-11
-1.37798014022924932e-10
-6.40007715052251236e-02
4.16964483507910946e-06
4.64299075368219110e-06
-3.12078708361021944e-08
7.39596448310157028e-06
-2.43685662199579915e-04
My very last option is to try AztecOO's BiCGSTAB with Symmetric Gauss-Seidel preconditioning. I will work on this now.
AztecOO's BiCGSTAB with no scaling and no preconditioner did not converge either, and got nearly the same residual norm.
*******************************************************
***** Problem: Epetra::CrsMatrix
***** Preconditioned BICGSTAB solution
***** No preconditioning
***** No scaling
*******************************************************
iter: 0 residual = 1.000000e+00
iter: 1 residual = 2.116599e+00
iter: 2 residual = 7.737321e-01
iter: 3 residual = 7.376869e-01
iter: 4 residual = 7.386195e-01
iter: 5 residual = 8.283596e-01
iter: 6 residual = 7.585078e-01
iter: 7 residual = 7.520084e-01
iter: 8 residual = 6.167340e-01
iter: 9 residual = 6.682432e+00
iter: 10 residual = 7.471754e-01
iter: 11 residual = 5.883619e-01
iter: 12 residual = 8.319245e-01
iter: 13 residual = 6.173222e-01
iter: 14 residual = 8.792509e-01
iter: 15 residual = 8.765567e-01
iter: 16 residual = 1.132873e+00
iter: 17 residual = 9.353814e-01
iter: 18 residual = 9.196676e-01
iter: 19 residual = 8.960372e-01
iter: 20 residual = 8.857277e-01
iter: 21 residual = 8.795586e-01
iter: 22 residual = 8.312507e-01
iter: 23 residual = 8.033678e-01
iter: 24 residual = 7.429115e-01
iter: 25 residual = 6.641268e-01
***************************************************************
Warning: maximum number of iterations exceeded
without convergence
Solver: bicgstab
number of iterations: 25
Recursive residual = 3.5175e-01
Calculated Norms Requested Norm
-------------------------------------------- --------------
||r||_2 / ||r0||_2: 6.641268e-01 1.000000e-08
***************************************************************
Solution time: 0.003654 (sec.)
total iterations: 25
Solver:
Solver type: AztecOO BiCGSTAB
Preconditioner type: DEFAULT
Convergence tolerance: 1e-08
Maximum number of iterations: 25
Results:
Converged: 0
Number of iterations: 25
Achieved tolerance: 0.664127
I'm calling this not a bug.
@hkthorn FYI, the very first "hand-rolled BiCGSTAB" that I wrote, that appeared to converge just fine, gives a bogus actual solution.
Solver:
Solver type: Custom BiCGSTAB
Preconditioner type: NONE
Convergence tolerance: 1e-08
Maximum number of iterations: 25
Results:
Converged: 1
Number of iterations: 20
Achieved tolerance: 6.29728e-09
Explicit ||R||_2 / ||B||_2: 3.11918
Subsequent hand-rolled implementations of BiCGSTAB that I wrote behave more reasonably.
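The gap between "Achieved tolerance" and the explicit residual above is why it is worth recomputing ||b - A*x||_2 / ||b||_2 from the returned solution rather than trusting the solver's recursively updated residual. Here is a minimal sketch of that check; the dense Mat type and the name explicitRelativeResidual are assumptions for illustration, and b is assumed to be nonzero.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;  // dense rows; stand-in for a sparse operator

// Returns ||b - A*x||_2 / ||b||_2, computed directly from A, b, and x.
// Assumes b is nonzero.
double explicitRelativeResidual(const Mat& A, const Vec& b, const Vec& x) {
  assert(A.size() == b.size());
  double rr = 0.0, bb = 0.0;
  for (std::size_t i = 0; i < A.size(); ++i) {
    assert(A[i].size() == x.size());
    double ri = b[i];
    for (std::size_t j = 0; j < x.size(); ++j) ri -= A[i][j] * x[j];
    rr += ri * ri;
    bb += b[i] * b[i];
  }
  return std::sqrt(rr) / std::sqrt(bb);
}
```

A solver that reports convergence while this value stays near or above 1 has drifted away from the true residual.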
Of course it's not a bug. It's a feature!
@trilinos/belos @vbrunini @jclause
We noticed that Belos' BiCGSTAB implementation was failing to converge for a small (32 x 32) unpreconditioned nonsymmetric (but not badly so) linear system. The linear system has real eigenvalues and RCOND about 1.0e9. Both Belos' GMRES and AztecOO's BiCGSTAB handle it just fine.
I wrote my own Tpetra-only BiCGSTAB implementation based on the "Templates" book, in particular the Matlab code here. It converged just fine, in a number of iterations comparable to Belos' GMRES.
Motivation and Context
Many users of our application specify BiCGSTAB as a default solver in their input decks. Its memory use does not depend on the number of iterations / restart length, but it can handle nonsymmetric linear systems. This makes it a good compromise between CG and GMRES.
We would like BiCGSTAB to work to achieve our goal of replacing use of AztecOO with the Tpetra-based solver stack.
Possible Solution
One option is to use belos/tpetra/src/solvers to inject my Tpetra-only solver into Belos::SolverFactory. Another is to fix Belos::BiCGStab{Iter,SolMgr}. (It's not clear which class is causing the problem.) It may be easier just to use the known working Tpetra-only implementation as a guide to rewrite the existing implementation.