lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda

staggered multi shift diverges with milc #128

Closed mathiaswagner closed 9 years ago

mathiaswagner commented 10 years ago

I tried to run 'su3_leapfrog' and 'su3_rhmc_hisq' using QUDA 0.7 (136a4ca8d74f6f87f17a286a17a29e5fc6d0130c) and MILC from lattice/milc.

It seems the multishift inverter does not converge.

For the hisq test from milc (ks_imp_rhmc/su3_rhmc_hisq.1.sample-in) the solver diverges.

ERROR: Solver appears to have diverged (solver.cpp:86 in PrintStats())

For asqtad (su3_leapfrog) I think the problem is similar; at least that is my first guess.

Here is a short clip from the output:

WARNING: MultiShiftCG: Shift 1, updated residual 6.687971e-03 is greater than previous residual -9.371372e-03
WARNING: MultiShiftCG: Shift 1, updated residual 4.692231e-03 is greater than previous residual -6.687971e-03
WARNING: MultiShiftCG: Shift 1, updated residual 3.387602e-03 is greater than previous residual -4.692231e-03
WARNING: MultiShiftCG: Shift 1, updated residual 2.356393e-03 is greater than previous residual -3.387602e-03
WARNING: MultiShiftCG: Shift 1, updated residual 1.702878e-03 is greater than previous residual -2.356393e-03
WARNING: MultiShiftCG: Shift 1, updated residual 1.189301e-03 is greater than previous residual -1.702878e-03
WARNING: MultiShiftCG: Shift 1, updated residual 8.511159e-04 is greater than previous residual -1.189301e-03
WARNING: MultiShiftCG: Shift 1, updated residual 5.993229e-04 is greater than previous residual -8.511159e-04
MultiShift CG: Converged after 93 iterations
 shift=0, relative residua: iterated = nan, true = nan
 shift=1, relative residua: iterated = 9.532641e-07, true = nan
 shift=2, relative residua: iterated = 8.035050e-07, true = nan
 shift=3, relative residua: iterated = 5.507116e-07, true = nan
 shift=4, relative residua: iterated = 4.357880e-07, true = nan
 shift=5, relative residua: iterated = 7.541470e-07, true = nan
 shift=6, relative residua: iterated = 1.287412e-07, true = nan
 shift=7, relative residua: iterated = 1.153158e-07, true = nan
 shift=8, relative residua: iterated = 1.501501e-08, true = nan
 NOT converged final_rsq= nan (cf 1e-12) rel = 1 (cf 0) restarts = 0 iters= 93
 NOT converged final_rsq= nan (cf 1e-12) rel = 1 (cf 0) restarts = 5 iters= 8750
 OK converged final_rsq= 8.4e-13 (cf 1e-12) rel = 1 (cf 0) restarts = 1 iters= 34
 OK converged final_rsq= 4.4e-13 (cf 1e-12) rel = 1 (cf 0) restarts = 2 iters= 27
 OK converged final_rsq= 3.6e-13 (cf 1e-12) rel = 1 (cf 0) restarts = 2 iters= 20
 OK converged final_rsq= 9.6e-13 (cf 1e-12) rel = 1 (cf 0) restarts = 1 iters= 13
 OK converged final_rsq= 1.4e-13 (cf 1e-12) rel = 1 (cf 0) restarts = 1 iters= 10
 OK converged final_rsq= 6.2e-13 (cf 1e-12) rel = 1 (cf 0) restarts = 1 iters= 7
 OK converged final_rsq= 6.3e-14 (cf 1e-12) rel = 1 (cf 0) restarts = 1 iters= 6
 OK converged final_rsq= 5.1e-15 (cf 1e-12) rel = 1 (cf 0) restarts = 1 iters= 4
GRSOURCE: sum = 3.9346405411e+05
WARNING: MultiShiftCG: Shift 6, updated residual 8.063661e-05 is greater than previous residual -8.911129e-04
WARNING: MultiShiftCG: Shift 5, updated residual 4.715314e-04 is greater than previous residual -2.348983e-03
WARNING: MultiShiftCG: Shift 4, updated residual 6.657001e-03 is greater than previous residual -1.897417e-02
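
For cross-checking a multi-shift solve on a small problem, here is a minimal CPU sketch of the textbook multi-shift CG recurrence in NumPy. It is not QUDA's MultiShiftCG; the function name `multishift_cg`, the random test matrix, and the shift values are all made up for illustration. It assumes a symmetric positive definite matrix and non-negative shifts, so the unshifted system converges slowest and controls the stopping test.

```python
import numpy as np

def multishift_cg(A, b, shifts, tol=1e-10, max_iter=500):
    """Solve (A + s*I) x_s = b for every s in `shifts` from one Krylov space.

    Assumes A is symmetric positive definite and all shifts are >= 0.
    Returns the solutions (one row per shift) and the |r|/|b| history
    of the unshifted system.
    """
    shifts = np.asarray(shifts, dtype=float)
    x = np.zeros((len(shifts), b.size))      # one solution per shift
    p_s = np.tile(b, (len(shifts), 1))       # one search direction per shift
    r, p = b.copy(), b.copy()                # residual and direction, unshifted system
    zeta_old = np.ones(len(shifts))          # residual scale factors: r_s = zeta * r
    zeta = np.ones(len(shifts))
    beta_old, alpha_old = 1.0, 0.0
    b2 = r2 = float(r @ r)
    history = [1.0]
    for _ in range(max_iter):
        Ap = A @ p
        beta = -r2 / float(p @ Ap)           # sign convention: beta = -(r,r)/(p,Ap)
        denom = (beta * alpha_old * (zeta_old - zeta)
                 + zeta_old * beta_old * (1.0 - shifts * beta))
        zeta_new = zeta * zeta_old * beta_old / denom
        beta_s = beta * zeta_new / zeta
        x -= beta_s[:, None] * p_s           # update every shifted solution
        r = r + beta * Ap                    # only the unshifted residual is stored
        r2_new = float(r @ r)
        alpha = r2_new / r2
        p = r + alpha * p
        alpha_s = alpha * zeta_new * beta_s / (zeta * beta)
        p_s = zeta_new[:, None] * r + alpha_s[:, None] * p_s
        zeta_old, zeta = zeta, zeta_new
        beta_old, alpha_old, r2 = beta, alpha, r2_new
        history.append(float(np.sqrt(r2 / b2)))
        if history[-1] < tol:
            break
    return x, history

# Small hypothetical test problem (not a lattice operator).
rng = np.random.default_rng(0)
M = rng.standard_normal((30, 30))
A = M.T @ M + np.eye(30)
b = rng.standard_normal(30)
shifts = [0.0, 0.1, 1.0]

x, history = multishift_cg(A, b, shifts)
# True residuals, recomputed from the solutions rather than from the recurrence.
true_res = [np.linalg.norm(b - (A + s * np.eye(30)) @ xs) / np.linalg.norm(b)
            for s, xs in zip(shifts, x)]
print(len(history) - 1, true_res)
```

Comparing the iterated residual in `history` against `true_res` is the same kind of check that the "iterated" and "true" columns in the log perform; a NaN in the true residual while the iterated one looks fine means the recurrence and the actual solution have drifted apart.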

mathiaswagner commented 10 years ago

Just tried the tifr-reduc branch (6e6a6813ad7013e8a97c2df43c07a31a2e3bee07), i.e. before the 0.7 merge, and for the hisq example I get the same error.

maddyscientist commented 10 years ago

Is this still a problem? Can we close this?

mathiaswagner commented 9 years ago

Carleton reproduced this issue with MILC 7.7.12 and the current version of quda-0.7:

I compiled with maximum verbosity. Here are some more details. The multishift CG first improves |r|/|b|, then the residual blows up.

Carleton

MultiShift CG: 0 iterations, <r,r> = 1.990714e+06, |r|/|b| = 1.000000e+00
MultiShift CG: 1 iterations, <r,r> = 4.146271e+05, |r|/|b| = 4.563777e-01
MultiShift CG: Shift 1 converged after 2 iterations
MultiShift CG: Shift 2 converged after 2 iterations
MultiShift CG: Shift 3 converged after 2 iterations
MultiShift CG: Shift 4 converged after 2 iterations
MultiShift CG: Shift 5 converged after 2 iterations
MultiShift CG: Shift 6 converged after 2 iterations
MultiShift CG: Shift 7 converged after 2 iterations
MultiShift CG: 2 iterations, <r,r> = 1.990714e+06, |r|/|b| = 1.000000e+00
MultiShift CG: Shift 1 converged after 3 iterations
MultiShift CG: Shift 2 converged after 3 iterations
MultiShift CG: Shift 3 converged after 3 iterations
MultiShift CG: 3 iterations, <r,r> = 1.990714e+06, |r|/|b| = 1.000000e+00
MultiShift CG: Shift 1 converged after 4 iterations
MultiShift CG: Shift 2 converged after 4 iterations
MultiShift CG: 4 iterations, <r,r> = 1.990714e+06, |r|/|b| = 1.000000e+00
MultiShift CG: Shift 1 converged after 5 iterations
MultiShift CG: 5 iterations, <r,r> = 1.990714e+06, |r|/|b| = 1.000000e+00
MultiShift CG: 6 iterations, <r,r> = 1.990714e+06, |r|/|b| = 1.000000e+00
MultiShift CG: 7 iterations, <r,r> = 4.146271e+05, |r|/|b| = 4.563777e-01
MultiShift CG: 8 iterations, <r,r> = 4.016667e+33, |r|/|b| = 4.491883e+13
MultiShift CG: 9 iterations, <r,r> = nan, |r|/|b| = nan
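
A quick way to read the iteration lines: the first number on each line behaves like the squared residual norm, since the square root of its ratio to the iteration-0 value reproduces the printed |r|/|b|. The sketch below checks that on values copied from the log and flags the first iteration where the residual blows up. The helper names and the blow-up factor of 1e3 are arbitrary choices for illustration, and taking the iteration-0 value as |b|^2 assumes a zero initial guess.

```python
import math

# (iteration, first printed value, |r|/|b|) copied from the log above
history = [
    (0, 1.990714e+06, 1.000000e+00),
    (1, 4.146271e+05, 4.563777e-01),
    (2, 1.990714e+06, 1.000000e+00),
    (7, 4.146271e+05, 4.563777e-01),
    (8, 4.016667e+33, 4.491883e+13),
    (9, math.nan, math.nan),
]

# Assumption: zero initial guess, so the iteration-0 value equals |b|^2.
b2 = history[0][1]

def consistent(r2, rel, rel_tol=1e-5):
    """True if sqrt(r2 / b2) matches the printed |r|/|b| to within rounding."""
    return math.isclose(math.sqrt(r2 / b2), rel, rel_tol=rel_tol)

def first_divergence(rows, factor=1e3):
    """Return the first iteration whose |r|/|b| is NaN or exceeds
    `factor` times the best value seen so far, or None if there is none."""
    best = math.inf
    for it, _, rel in rows:
        if math.isnan(rel) or rel > factor * best:
            return it
        best = min(best, rel)
    return None

checks = [consistent(r2, rel) for _, r2, rel in history if not math.isnan(rel)]
print(all(checks), first_divergence(history))
```

On these values the blow-up is flagged at iteration 8, one step before the NaN appears, and the residual has already bounced back to |r|/|b| = 1 at iteration 2 after the first improvement.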

maddyscientist commented 9 years ago

This issue has now been fixed as of 2a96da3f2da60b6f076aba78fb63a53f7c9a84bf.