Multi-mass solver not reaching target residual, but claims to have converged.

jpfoley commented 12 years ago

We're having issues with the single-precision multi-mass staggered inverter. We specify a target residual, and the inverter claims to have converged before the residual printed in QUDA (cumulative and "real") reaches the target value. This seems to be a relatively new bug. Is the inverter checking that just one of the solution vectors reaches the target value? Did Balint have similar issues with the clover inverter?

maddyscientist commented 12 years ago

I'm looking at this now. Going to compare the iteration count and residual convergence history between multi-shift and regular CG. For a single shift they should be identical; any deviation would suggest that something is amiss in the multi-shift solver. Hopefully I can reproduce this on 1 GPU from the tests.
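To make the single-shift comparison concrete, here is a minimal pure-Python sketch. It is not QUDA code: the toy operator `apply_A` (a 1-D periodic Laplacian plus a mass term) and the helper names `cg` and `multishift_cg` are assumptions made for illustration. The shifted system is driven from the base CG coefficients through the usual zeta recurrence, with its residual taken as zeta times the base residual. In exact arithmetic its residual history should match plain CG run directly on the shifted operator.

```python
import random

def apply_A(x, m2):
    """Toy SPD operator (assumption): 1-D periodic Laplacian plus m2 on the diagonal."""
    n = len(x)
    return [(2.0 + m2) * x[i] - x[i - 1] - x[(i + 1) % n] for i in range(n)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cg(b, m2, n_iter):
    """Plain CG on (A + m2); returns the solution and the iterated |r|^2 history."""
    x, r, p = [0.0] * len(b), list(b), list(b)
    rr, hist = dot(r, r), []
    for _ in range(n_iter):
        Ap = apply_A(p, m2)
        alpha = rr / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * ai for ri, ai in zip(r, Ap)]
        rr_new = dot(r, r)
        beta = rr_new / rr
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rr = rr_new
        hist.append(rr)
    return x, hist

def multishift_cg(b, m2, shift, n_iter):
    """CG on (A + m2) that also solves (A + m2 + shift) with no extra matvecs."""
    r, p = list(b), list(b)
    xs, ps = [0.0] * len(b), list(b)      # shifted solution and search direction
    rr, hist_s = dot(r, r), []
    zeta_old = zeta = 1.0                 # shifted residual = zeta * base residual
    alpha_old, beta = 1.0, 0.0            # coefficients from the previous step
    for _ in range(n_iter):
        Ap = apply_A(p, m2)
        alpha = rr / dot(p, Ap)
        zeta_new = (zeta * zeta_old * alpha_old) / (
            alpha * beta * (zeta_old - zeta)
            + zeta_old * alpha_old * (1.0 + shift * alpha))
        alpha_s = alpha * zeta_new / zeta
        xs = [xi + alpha_s * pi for xi, pi in zip(xs, ps)]
        r = [ri - alpha * ai for ri, ai in zip(r, Ap)]
        rr_new = dot(r, r)
        beta = rr_new / rr
        beta_s = beta * (zeta_new / zeta) ** 2
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        ps = [zeta_new * ri + beta_s * pi for ri, pi in zip(r, ps)]
        hist_s.append(zeta_new * zeta_new * rr_new)   # shifted |r|^2
        zeta_old, zeta, alpha_old, rr = zeta, zeta_new, alpha, rr_new
    return xs, hist_s
```

Calling `multishift_cg(b, 0.25, 0.5, 8)` and `cg(b, 0.75, 8)` on the same right-hand side should give residual histories that agree up to rounding. A larger deviation in a real solver would point at the shifted recurrence rather than at the underlying CG.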

maddyscientist commented 12 years ago

OK, there are a number of issues with the single-precision multi-shift CG that can be improved upon:

It is not a problem with the multi-shift solver per se; rather, the underlying CG process is broken when one tries to do a single-precision CG with no reliable updates. The current multi-shift solver doesn't use reliable updates, so it has problems.
Even the regular CG solver with reliable updates has problems: it can get stuck in an infinite loop, since it can keep trying to converge to an accuracy that is not achievable given the precision that the solver is using.
Moreover, even when reliable updates work and the solver reaches the target convergence, the iteration count can increase hugely over regular double-precision CG. This is because the recurrence relations suffer partial breakdown, and CG is more sensitive to this than BiCGstab.
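The first point can be illustrated with a small pure-Python sketch. It is an assumption-laden toy, not QUDA code: `apply_A` is a stand-in operator, `cg_single_storage` is a hypothetical name, and single precision is only emulated by storing every vector in an `array("f")` while scalars stay in double. The solver stops on the iterated residual alone, and the true residual is then recomputed in double precision from the stored solution.

```python
import math
import random
from array import array

def f32(values):
    """Round a sequence of floats to single-precision storage."""
    return array("f", values)

def apply_A(x, m2):
    """Toy SPD operator (assumption): 1-D periodic Laplacian plus m2 on the diagonal."""
    n = len(x)
    return [(2.0 + m2) * x[i] - x[i - 1] - x[(i + 1) % n] for i in range(n)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cg_single_storage(b, m2, tol, max_iter=500):
    """CG with float32 vector storage and no reliable updates.

    Returns (claimed_converged, iterations, iterated_rel, true_rel).
    """
    b = f32(b)
    x, r, p = f32([0.0] * len(b)), f32(b), f32(b)
    b2 = dot(b, b)
    rr, it = b2, 0
    # Convergence is judged on the iterated residual only.
    while it < max_iter and rr > tol * tol * b2:
        Ap = f32(apply_A(p, m2))
        alpha = rr / dot(p, Ap)
        x = f32(xi + alpha * pi for xi, pi in zip(x, p))
        r = f32(ri - alpha * ai for ri, ai in zip(r, Ap))
        rr_new = dot(r, r)
        beta = rr_new / rr
        p = f32(ri + beta * pi for ri, pi in zip(r, p))
        rr = rr_new
        it += 1
    # True residual b - A x, evaluated in double precision.
    true_r = [bi - ai for bi, ai in zip(b, apply_A(x, m2))]
    claimed = rr <= tol * tol * b2
    return claimed, it, math.sqrt(rr / b2), math.sqrt(dot(true_r, true_r) / b2)
```

With a tolerance well below single-precision resolution (for example `tol=1e-10`), this toy solver is expected to report convergence on the iterated residual while the true residual stays orders of magnitude above the target, which is the same symptom as in the report above.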

I believe there are solutions to all of this, by doing the following:

Use reliable updates in the multi-shift solver. This is not to do mixed-precision, rather to keep the single-precision errors in check.
Have the solver check, whenever it does a reliable update, whether the new residual is greater than the previously computed reliably updated residual. If we are not at the limit of precision, this should never be the case; if we are, it implies that the solver can do no more useful work, so we exit the solver. An implementation in this style means that we can pass whatever target tolerance we like to the solver, but the solver will only do useful work.
The CG recurrence relations can be stabilized by using a different computation method for beta_k: instead of the standard formula, one should use the Polak-Ribiere formula

beta_k = (r_{k+1}, r_{k+1} - r_k) / (r_k, r_k)
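The three proposals can be combined in one loop. The following is a minimal pure-Python sketch under stated assumptions, not the QUDA implementation: `apply_A` is a toy operator, `cg_reliable_pr` is a hypothetical name, everything runs in double precision, `delta` is applied to residual norms (so it appears squared), and "no improvement" is treated the same as "greater than" for the breakout test. In exact arithmetic successive CG residuals are orthogonal, so the Polak-Ribiere beta reduces to the usual (r_{k+1}, r_{k+1}) / (r_k, r_k); the extra term only matters once rounding breaks that orthogonality.

```python
import math
import random

def apply_A(x, m2):
    """Toy SPD operator (assumption): 1-D periodic Laplacian plus m2 on the diagonal."""
    n = len(x)
    return [(2.0 + m2) * x[i] - x[i - 1] - x[(i + 1) % n] for i in range(n)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cg_reliable_pr(b, m2, tol, delta=0.1, max_iter=10000):
    """CG with reliable updates, breakout, and the Polak-Ribiere beta.

    Returns (x, iterations, reason, true_relative_residual).
    """
    x, r, p = [0.0] * len(b), list(b), list(b)
    b2 = dot(b, b)
    stop = tol * tol * b2
    rr = b2
    rr_max = rr        # largest iterated |r|^2 since the last reliable update
    rr_reliable = rr   # |r|^2 from the last reliable update
    reason, it = "max_iter", 0
    while it < max_iter:
        Ap = apply_A(p, m2)
        alpha = rr / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r_new = [ri - alpha * ai for ri, ai in zip(r, Ap)]
        rr_new = dot(r_new, r_new)
        it += 1
        if rr_new <= stop or rr_new < delta * delta * rr_max:
            # Reliable update: replace the iterated residual by b - A x.
            r_new = [bi - ai for bi, ai in zip(b, apply_A(x, m2))]
            rr_new = dot(r_new, r_new)
            if rr_new <= stop:
                reason = "converged"   # only declared on the true residual
                break
            if rr_new >= rr_reliable:
                reason = "breakout"    # precision limit: no more useful work
                break
            rr_reliable = rr_max = rr_new
        else:
            rr_max = max(rr_max, rr_new)
        # Polak-Ribiere: beta = (r_new, r_new - r) / (r, r)
        beta = (rr_new - dot(r_new, r)) / rr
        p = [ri + beta * pi for ri, pi in zip(r_new, p)]
        r, rr = r_new, rr_new
    true_r = [bi - ai for bi, ai in zip(b, apply_A(x, m2))]
    return x, it, reason, math.sqrt(dot(true_r, true_r) / b2)
```

Passing a reachable tolerance such as `tol=1e-10` should end with reason `"converged"`, while an unreachable one such as `tol=1e-30` should exit through the breakout branch instead of spinning until `max_iter`.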

I will do some more tests on this, but here are some initial results:

Pure single-precision CG, V = 24^4, random lattice, tol = 1e-6

Regular CG

mass     iter     true relative residual
0.1      310      1.616129e-05
0.01     3112     4.007415e-03
0.001    31147    1.244143e+00

Reliably updated CG, reliable_delta = 0.1, no breakout

mass     iter     true relative residual
0.1      326      1.732718e-06
0.01     50000    4.550707e-05   (stuck in infinite loop, reached max iter)
0.001    50000    1.489694e-03   (stuck in infinite loop, reached max iter)

Reliably updated CG, reliable_delta = 0.1, with breakout and the Polak-Ribiere formula

mass     iter     true relative residual
0.1      326      1.672067e-06
0.01     3008     4.320843e-05
0.001    20893    1.394405e-03

It is easy to see that the last strategy is the way to go: it combines accuracy with a short iteration count. The question is whether we can put this into the multi-shift solver. I think it will work, because all we are doing here is improving solver robustness and correctness. It might make the shifted recurrences less stable, but this is likely to be less of an issue than the current problem, since the shifted systems will converge better because of their reduced condition number.

maddyscientist commented 12 years ago

I have started a multishift branch to experiment with this strategy and to clean up some of the multi-shift interface.

rbabich commented 12 years ago

Excellent work diagnosing that.

lattice/quda issue #64