lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda

Differences between the cumulative and true residuals in the multi-mass solver on Keeneland and Titan. #71

Closed by jpfoley 11 years ago

jpfoley commented 12 years ago

Steve Gottlieb has been running benchmark jobs on Keeneland lately, and he got in touch this week to say that the multi-mass solver was exhibiting some odd behaviour. At the end of the inversions, the cumulative and true residuals reported by QUDA differ significantly: the true residual is often several orders of magnitude larger than the reported cumulative value. Steve is running the pure double-precision solver. I looked back at the results Balint generated on Titan, and I now see that they show similar behaviour. The difference in residuals is not as pronounced in Balint's data, but then his target residual is larger than the value used in Steve's tests. I reran Steve's test job on four 2070 GPUs and found almost exact agreement between the residual estimates. I'm not sure whether this is really a bug in QUDA and, if so, whether it only shows up on 2090s, or whether something else is misconfigured on Keeneland and Titan.
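
For anyone not familiar with the terminology, here is a minimal single-mass CG sketch (illustrative only, not QUDA code; the `MatVec`, `cg` and `trueResidual` names are made up) of where the two numbers come from: the cumulative residual is the norm carried along by the CG recursion, while the true residual is recomputed from scratch as ||b - A x|| after convergence is declared. The multi-shift solver carries one cumulative residual per shift, but the distinction is the same.

```cpp
// Minimal single-mass CG sketch (hypothetical, not QUDA code) illustrating the
// difference between the cumulative (iterated) residual and the true residual.
#include <cmath>
#include <functional>
#include <vector>

using Vec = std::vector<double>;
using MatVec = std::function<void(Vec &, const Vec &)>; // y = A x

static double dot(const Vec &a, const Vec &b) {
  double s = 0.0;
  for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
  return s;
}

// Returns the *cumulative* residual norm carried along by the CG recursion.
double cg(const MatVec &A, Vec &x, const Vec &b, double tol, int maxit) {
  Vec r = b, p = b, Ap(b.size());            // x is assumed zero on entry
  double r2 = dot(r, r);
  for (int k = 0; k < maxit && std::sqrt(r2) > tol; ++k) {
    A(Ap, p);
    double alpha = r2 / dot(p, Ap);
    double r2_old = r2;
    for (size_t i = 0; i < x.size(); ++i) {
      x[i] += alpha * p[i];
      r[i] -= alpha * Ap[i];                 // residual updated recursively...
    }
    r2 = dot(r, r);                          // ...so errors can accumulate
    double beta = r2 / r2_old;
    for (size_t i = 0; i < p.size(); ++i) p[i] = r[i] + beta * p[i];
  }
  return std::sqrt(r2);
}

// The *true* residual is recomputed from scratch after the solve: ||b - A x||.
double trueResidual(const MatVec &A, const Vec &x, const Vec &b) {
  Vec Ax(x.size());
  A(Ax, x);
  double s = 0.0;
  for (size_t i = 0; i < x.size(); ++i) s += (b[i] - Ax[i]) * (b[i] - Ax[i]);
  return std::sqrt(s);
}
```

In exact arithmetic the two agree, so a discrepancy of several orders of magnitude points at something feeding bad data into the recursion rather than ordinary rounding drift.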

jpfoley commented 12 years ago

This seems to be the GPUDirect problem back again. I ran some multi-GPU jobs on Keeneland this morning and found that everything was fine on a single node, but I had problems when running across multiple nodes. I disabled GPUDirect in the build and ran again, and this time everything looks okay. I have run on two and four nodes with two GPUs per node. The runtime fix that worked on Dsg and Dirac does not seem to work here, so QUDA needs to be compiled without GPUDirect. I am waiting for Steve to confirm my findings; when he does, I will close this issue and update the issue relating to our problems with GPUDirect.
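
For anyone following along, roughly what "disabling GPUDirect" means operationally: the inter-node halo exchange falls back to staging through ordinary host memory instead of letting the network driver touch the CUDA buffers directly. A rough sketch under those assumptions (`exchangeHalo` and its arguments are hypothetical, not QUDA's interface, and the 2012-era GPUDirect path actually works via pinned buffers shared with the IB driver rather than device pointers handed straight to MPI):

```cpp
// Illustrative sketch (not QUDA code) of a halo exchange with and without a
// GPUDirect-style fast path. Without it, data is staged through host memory.
#include <mpi.h>
#include <cuda_runtime.h>
#include <vector>

void exchangeHalo(const double *d_sendbuf, double *d_recvbuf, int count,
                  int dest, int src, bool use_gpudirect) {
  if (use_gpudirect) {
    // Buffers visible to both CUDA and the network stack are passed to MPI
    // without an intermediate copy (requires driver/MPI support).
    MPI_Sendrecv(d_sendbuf, count, MPI_DOUBLE, dest, 0,
                 d_recvbuf, count, MPI_DOUBLE, src, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  } else {
    // Fallback: stage through host memory. device -> host, MPI, host -> device.
    std::vector<double> h_send(count), h_recv(count);
    cudaMemcpy(h_send.data(), d_sendbuf, count * sizeof(double),
               cudaMemcpyDeviceToHost);
    MPI_Sendrecv(h_send.data(), count, MPI_DOUBLE, dest, 0,
                 h_recv.data(), count, MPI_DOUBLE, src, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    cudaMemcpy(d_recvbuf, h_recv.data(), count * sizeof(double),
               cudaMemcpyHostToDevice);
  }
}
```

The staged path costs extra PCIe traffic, but if the direct path is what is corrupting the transfers, it trades bandwidth for correctness.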

jpfoley commented 12 years ago

Unfortunately, the problem on Keeneland is a bit more complicated than I thought. GPUDirect seems to be an issue, but it's not the only one. It turns out that Steve was also having problems with the mixed-precision multi-shift solver. I discovered that this is really a problem with the single-precision solver that is called internally; everything is fine if we run pure double-precision multi-shift. To make sure this isn't a recent bug, I also tested the quda-0.4.0 release code and found the same problem. To recap, the issue is that the cumulative residual norms in the solver decrease as expected until convergence is declared, but the true residuals calculated afterwards are much larger than the cumulative residual norms (in some cases, by several orders of magnitude).
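
A quick way to see the symptom independently of the solver's own bookkeeping is to recompute the shifted residuals r_i = b - (A + sigma_i) x_i after the solve and compare them against what the solver reported. A hedged sketch (`checkShiftedResiduals` and all its arguments are hypothetical, not a QUDA interface):

```cpp
// Hypothetical post-solve check (not a QUDA API) of the true residual for
// every shift of a multi-shift solve: r_i = b - (A + sigma_i) x_i.
#include <cmath>
#include <cstdio>
#include <functional>
#include <vector>

using Vec = std::vector<double>;
using MatVec = std::function<void(Vec &, const Vec &)>; // y = A x

void checkShiftedResiduals(const MatVec &A, const std::vector<Vec> &x,
                           const Vec &b, const std::vector<double> &shifts,
                           const std::vector<double> &reported) {
  for (size_t s = 0; s < shifts.size(); ++s) {
    Vec Ax(b.size());
    A(Ax, x[s]);
    double r2 = 0.0;
    for (size_t i = 0; i < b.size(); ++i) {
      double r = b[i] - (Ax[i] + shifts[s] * x[s][i]); // (A + sigma_s) x_s
      r2 += r * r;
    }
    // A true residual orders of magnitude above the reported cumulative one
    // is exactly the symptom described in this issue.
    std::printf("shift %zu: true |r| = %e, cumulative |r| = %e\n",
                s, std::sqrt(r2), reported[s]);
  }
}
```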

maddyscientist commented 12 years ago

Can you have a go at reproducing this error using the internal tests? E.g., just run a single-precision multi-shift solver and look for evidence of this.

I'm closing in on BQCD work, so I will have spare cycles to help with this in a day or two hopefully.

jpfoley commented 12 years ago

Sure. I want to say that naive internal tests didn't produce this error, but I need to check that.

maddyscientist commented 11 years ago

I think we can close this issue now. The new multi-shift solver has been merged into master and, subject to verification, the GPUDirect issue has been fixed. Again, commit 3b21a83a978e279b9599498119f5fbce7a6ae150 is the one to test with.