lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda

Multi-mass solver not reaching target residual, but claims to have converged. #64

Closed jpfoley closed 11 years ago

jpfoley commented 12 years ago

We're having issues with the single-precision multi-mass staggered inverter. We specify a target residual, and the inverter claims to have converged before the residuals printed by QUDA (cumulative and "real") reach the target value. This seems to be a relatively new bug. Is the inverter checking that just one of the solution vectors reaches the target value? Did Balint have similar issues with the clover inverter?
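One way to check this independently of the solver's own report is to recompute the true relative residual for every shift from the returned solutions. The sketch below is a minimal pure-Python illustration, not QUDA code: the names `true_relative_residuals` and `all_shifts_converged` are hypothetical, and it assumes each shifted system has the form (A + sigma) x = b, with a small dense matrix standing in for the operator.

```python
import math

def matvec(A, x):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))

def true_relative_residuals(A, b, shifts, solutions):
    """Return |b - (A + sigma) x_sigma| / |b| for each shift sigma (hypothetical helper)."""
    b_norm = norm(b)
    out = []
    for sigma, x in zip(shifts, solutions):
        Ax = matvec(A, x)
        r = [bi - axi - sigma * xi for bi, axi, xi in zip(b, Ax, x)]
        out.append(norm(r) / b_norm)
    return out

def all_shifts_converged(A, b, shifts, solutions, tol):
    """True only if every shifted solution meets the target, not just one of them."""
    return all(res <= tol for res in true_relative_residuals(A, b, shifts, solutions))

# Toy example: the second shift's solution is deliberately inaccurate.
A = [[2.0, 0.0], [0.0, 4.0]]
b = [1.0, 1.0]
shifts = [0.0, 1.0]
solutions = [[0.5, 0.25], [0.3, 0.2]]
print(true_relative_residuals(A, b, shifts, solutions))
print(all_shifts_converged(A, b, shifts, solutions, 1e-6))
```

A check like this reports "converged" only when the worst shift meets the target, which is the behaviour the question above is probing.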

maddyscientist commented 12 years ago

I'm looking at this now. I'm going to compare the iteration count and residual convergence history between multi-shift and regular CG. For a single shift they should be identical; any deviation would suggest that something is amiss in the multi-shift solver. Hopefully I can reproduce this on 1 GPU from the tests.
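The comparison described above can be sketched on a toy problem. The following is a minimal pure-Python illustration under stated assumptions, not the QUDA implementation: `cg` and `multishift_cg` are hypothetical names, the operator is a small dense symmetric positive-definite matrix, the base system is the unshifted one (sigma = 0), and the shifted updates use the standard textbook multi-shift CG recurrences.

```python
def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def cg(A, b, sigma, tol, max_iter=100):
    """Plain CG on (A + sigma) x = b; returns (x, iterations)."""
    x, r, p = [0.0] * len(b), list(b), list(b)
    rr = dot(r, r)
    stop = tol * tol * dot(b, b)
    k = 0
    while rr > stop and k < max_iter:
        Ap = [api + sigma * pi for api, pi in zip(matvec(A, p), p)]
        alpha = rr / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rr_new = dot(r, r)
        beta = rr_new / rr
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rr = rr_new
        k += 1
    return x, k

def multishift_cg(A, b, shifts, tol, max_iter=100):
    """Multi-shift CG: one Krylov space built for A, reused for every A + sigma."""
    n = len(b)
    r, p = list(b), list(b)
    rr = dot(r, r)
    stop = tol * tol * dot(b, b)
    xs = [[0.0] * n for _ in shifts]
    ps = [list(b) for _ in shifts]
    zeta = [1.0] * len(shifts)       # shifted residual is zeta * r
    zeta_old = [1.0] * len(shifts)
    alpha_old, beta_old = 1.0, 0.0
    k = 0
    while rr > stop and k < max_iter:   # stops on the base (sigma = 0) residual
        Ap = matvec(A, p)
        alpha = rr / dot(p, Ap)
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rr_new = dot(r, r)
        beta = rr_new / rr
        for s, sigma in enumerate(shifts):
            denom = (alpha * beta_old * (zeta_old[s] - zeta[s])
                     + zeta_old[s] * alpha_old * (1.0 + sigma * alpha))
            zeta_new = zeta[s] * zeta_old[s] * alpha_old / denom
            ratio = zeta_new / zeta[s]
            xs[s] = [xi + alpha * ratio * pi for xi, pi in zip(xs[s], ps[s])]
            ps[s] = [zeta_new * ri + beta * ratio * ratio * pi
                     for ri, pi in zip(r, ps[s])]
            zeta_old[s], zeta[s] = zeta[s], zeta_new
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        alpha_old, beta_old, rr = alpha, beta, rr_new
        k += 1
    return xs, k

A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
xs, k_multi = multishift_cg(A, b, [0.0, 0.5], 1e-10)
x_ref, k_ref = cg(A, b, 0.0, 1e-10)
print("iterations:", k_multi, k_ref)
```

For the zero shift the two solvers perform the same recurrence, so a mismatch in iteration count or solution on a test like this would point at the multi-shift bookkeeping rather than at CG itself.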

maddyscientist commented 12 years ago

Ok there are a number of issues with both the single-precision multishift CG that can be improved upon:

I believe there are solutions to all of these, by doing the following:

beta_k = (r_{k+1}, r_{k+1} - r_k) / (r_k, r_k)
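To make the formula above concrete, here is a minimal pure-Python CG sketch in which beta can be computed either with the Polak-Ribiere form above or with the usual (r_{k+1}, r_{k+1}) / (r_k, r_k). This is an illustration on a small dense matrix, not QUDA code; `cg_beta` and its `formula` argument are hypothetical names. In exact arithmetic successive CG residuals are orthogonal, so the two forms agree; they differ only when that orthogonality is lost, for example through rounding or after the residual has been replaced.

```python
def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def cg_beta(A, b, tol, formula="polak-ribiere", max_iter=100):
    """CG on A x = b with a selectable beta formula; returns (x, iterations)."""
    x, r, p = [0.0] * len(b), list(b), list(b)
    rr = dot(r, r)
    stop = tol * tol * dot(b, b)
    k = 0
    while rr > stop and k < max_iter:
        Ap = matvec(A, p)
        alpha = rr / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r_new = [ri - alpha * api for ri, api in zip(r, Ap)]
        if formula == "polak-ribiere":
            # beta_k = (r_{k+1}, r_{k+1} - r_k) / (r_k, r_k)
            diff = [rn - ro for rn, ro in zip(r_new, r)]
            beta = dot(r_new, diff) / rr
        else:
            # standard form: beta_k = (r_{k+1}, r_{k+1}) / (r_k, r_k)
            beta = dot(r_new, r_new) / rr
        p = [rn + beta * pi for rn, pi in zip(r_new, p)]
        r = r_new
        rr = dot(r, r)
        k += 1
    return x, k

A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x_pr, k_pr = cg_beta(A, b, 1e-12, "polak-ribiere")
x_std, k_std = cg_beta(A, b, 1e-12, "standard")
print(k_pr, k_std)
```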

I will do some tests on this, but here are some initial results:

Pure single-precision CG, V = 24^4, random lattice, tol = 1e-6

Regular CG

| mass | iter | true relative residual |
|---|---|---|
| 0.1 | 310 | 1.616129e-05 |
| 0.01 | 3112 | 4.007415e-03 |
| 0.001 | 31147 | 1.244143e+00 |

Reliably updated CG, reliable_delta = 0.1, no breakout

| mass | iter | true relative residual |
|---|---|---|
| 0.1 | 326 | 1.732718e-06 |
| 0.01 | 50000 | 4.550707e-05 (stuck in infinite loop, reached max iter) |
| 0.001 | 50000 | 1.489694e-03 (stuck in infinite loop, reached max iter) |

Reliably updated CG, reliable_delta = 0.1, with breakout and the Polak-Ribiere formula

| mass | iter | true relative residual |
|---|---|---|
| 0.1 | 326 | 1.672067e-06 |
| 0.01 | 3008 | 4.320843e-05 |
| 0.001 | 20893 | 1.394405e-03 |

The last strategy is clearly the way to go: it combines accuracy with a short iteration count. The question is whether we can put this into the multi-shift solver. I think it will work, because all we are doing here is improving solver robustness and correctness. It might make the shifted recurrence less stable, but this is likely to be less of an issue than the current problem, since the shifted systems will converge better because of their reduced condition number.

maddyscientist commented 12 years ago

I have started a multishift branch to experiment with this strategy and to clean up some of the multi-shift interface.

rbabich commented 12 years ago

Excellent work diagnosing that.