lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda

Clover first solve succeeds second fails #74

Closed bjoo closed 11 years ago

bjoo commented 12 years ago

In a recent master branch we observed the following behaviour at JLab in the clover solver:

This was tested at commit ID 0929692d2b043eca408a78fac4b70cf4620411cb. This behaviour is not present in version a0c7a3b1a36a2ea152140ae9a29961949a8044f5 (which was identified by Frank as stable).

Specific output:

First solve OK:

BiCGstab: 2181 iterations, r2 = 2.350111e-13
BiCGstab: 2182 iterations, r2 = 2.314445e-13
BiCGstab: Reliable updates = 21
BiCGstab: Converged after 2182 iterations, relative residua: iterated = 4.984836e-07, true = 7.308019e-07
Solution = 2.970304
Reconstructed: CUDA solution = 2.970304, CPU copy = 2.970303
Cuda Space Required Spinor: 0.1640625 GiB  Gauge: 0 GiB  InvClover: 0 GiB
QUDA_BICGSTAB_CLOVER_SOLVER: time=46.575336 s  Performance=1648.30932630335 GFLOPS  Total Time (incl. load gauge)=51.613748 s
QUDA_BICGSTAB_CLOVER_SOLVER: 2182 iterations. Rsd = 1.274569e-06  Relative Rsd = 7.56500540505804e-07

Second solve: only 606 iterations, and QUDA and Chroma disagree:

BiCGstab: 606 iterations, r2 = 2.035114e-13
BiCGstab: Reliable updates = 10
BiCGstab: Converged after 606 iterations, relative residua: iterated = 4.677403e-07, true = 4.998825e-07
Solution = 1.833076
Reconstructed: CUDA solution = 1.833076, CPU copy = 1.833076
Cuda Space Required Spinor: 0.1640625 GiB  Gauge: 0 GiB  InvClover: 0 GiB
QUDA_BICGSTAB_CLOVER_SOLVER: time=5.873358 s  Performance=3641.36232313031 GFLOPS  Total Time (incl. load gauge)=6.105013 s
ERROR: QUDA Solver residuum is outside tolerance: QUDA resid=0.0460195576642859  Desired = 5e-07  Max Tolerated = 5e-06
QUDA_BICGSTAB_CLOVER_SOLVER: 606 iterations. Rsd = 0.07754742  Relative Rsd = 0.0460195576642859

Similar behaviour was reported from Edge recently.

maddyscientist commented 12 years ago

Balint, if I am to debug this, can you give me an appropriate chroma build package, so that I can build and run in ignorance and swap out different versions of quda as I see fit until the change that causes this bug is tracked down?

maddyscientist commented 12 years ago

Ok, I think now is the time to test this. I have just merged in my BQCD branch, and I can see no problems with successive solves there. Balint, can you do a fresh pull and check that this issue has gone?

maddyscientist commented 11 years ago

All final Chroma issues should now be fixed as of commit 1b77cc988b5a11aae67b7bcc400c7d1df7800c54. Leaving this open until Balint confirms this.

maddyscientist commented 11 years ago

I have finally managed to reproduce the originally reported issue in the QUDA tests. The successive solve problem only occurs with multiple GPUs on separate nodes, i.e., over InfiniBand. The problem occurs for all precision combinations.

Working on locating the source now.

maddyscientist commented 11 years ago

I still haven't isolated this yet, but I have found the problem occurs with Wilson as well as clover, and it happens with MPI as well as QMP back ends. Need to sleep now.

jpfoley commented 11 years ago

This sounds pretty similar to the problems we were having with GPUDirect. Where were you running?


maddyscientist commented 11 years ago

I was running on an internal cluster, using two nodes with one M2090 per node. Having interactive access is a huge bonus to this type of debugging.

I just tried to reproduce this with the staggered invert test and I cannot: successive solves always converge. This doesn't rule out the possibility that the GPU Direct bug is related to this one.

Testing now with GPU Direct disabled.....

maddyscientist commented 11 years ago

Ok, with GPU Direct disabled, the bug goes away. I guess this is related to the other GPU Direct issues. This is with OpenMPI 1.5.4 and CUDA 4.0. Continuing to investigate....

jpfoley commented 11 years ago

You ran with Rolf's flag, I guess. I think Steve was seeing this same problem on Keeneland. The runtime flag seems to work on some platforms, but not on others.


bjoo commented 11 years ago

Hi Mike, I can also do a quick rebuild with MVAPICH2, with and without GPU Direct, if having a different MPI helps.

Best,
B


maddyscientist commented 11 years ago

I hadn't run with Rolf's flag before. I did now, and found that the issue goes away, so this is definitely the same problem.

maddyscientist commented 11 years ago

Just made huge progress. The current QMP and MPI backends use cudaHostAlloc to create pinned memory, which is then also pinned for use by IB. The alternative is simply to malloc the memory and then use cudaHostRegister to pin it for CUDA. There should be no difference... but doing the latter makes the issue go away.
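For reference, here is a minimal, self-contained sketch of the two allocation strategies described above. It is not QUDA's actual comms code; the buffer size is illustrative.

    #include <cstdlib>
    #include <cuda_runtime.h>

    int main() {
      const size_t bytes = 1 << 20;  // illustrative buffer size

      // Strategy 1 (current QMP/MPI backends): CUDA allocates the pinned
      // host buffer directly.
      void *buf_alloc = nullptr;
      cudaHostAlloc(&buf_alloc, bytes, cudaHostAllocDefault);

      // Strategy 2 (the workaround): allocate with malloc, then register
      // (pin) the existing allocation with CUDA.
      void *buf_reg = malloc(bytes);
      cudaHostRegister(buf_reg, bytes, cudaHostRegisterDefault);

      // ... either buffer can now be used for host<->device copies, and by
      // the IB layer when GPU Direct is enabled ...

      // Cleanup mirrors the allocation path.
      cudaHostUnregister(buf_reg);
      free(buf_reg);
      cudaFreeHost(buf_alloc);
      return 0;
    }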

The reason this does not show up with Frank's stable version is that the FaceBuffer is now recreated with every invertQuda call, whereas previously it was reused between invertQuda calls. This is consistent with the observation that, with current master, the first solve works correctly but subsequent solves do not. There appears to be something wrong with re-allocating pinned memory directly when the FaceBuffer is recreated.
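To make the lifetime difference concrete, here is a hedged sketch of reusing one pinned communications buffer across solves versus recreating it per solve. PinnedCommsBuffer and solve are hypothetical stand-ins, not QUDA's actual FaceBuffer/invertQuda interface.

    #include <cstddef>
    #include <cuda_runtime.h>

    // Hypothetical stand-in for a comms object that owns pinned host memory.
    struct PinnedCommsBuffer {
      void *ptr = nullptr;
      explicit PinnedCommsBuffer(size_t bytes) { cudaHostAlloc(&ptr, bytes, cudaHostAllocDefault); }
      ~PinnedCommsBuffer() { cudaFreeHost(ptr); }
    };

    // Placeholder for a solve whose halo exchange would use buf.ptr over IB.
    void solve(PinnedCommsBuffer &buf) { (void)buf; }

    int main() {
      const size_t bytes = 1 << 20;

      // Previous behaviour: the buffer is pinned once and reused for every solve.
      PinnedCommsBuffer persistent(bytes);
      solve(persistent);
      solve(persistent);

      // Current master: the buffer is destroyed and re-pinned for each solve,
      // which is the pattern that exposed the failure on the second solve.
      for (int i = 0; i < 2; ++i) {
        PinnedCommsBuffer perSolve(bytes);
        solve(perSolve);
      }
      return 0;
    }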

I'm looking into this more, but it appears I have a fix, even if I don't totally understand it yet.

maddyscientist commented 11 years ago

Balint, please try commit 3b21a83a978e279b9599498119f5fbce7a6ae150 when you have a chance. I believe this fixes the issue.

bjoo commented 11 years ago

I tried a variety of modes here with WEAK_FIELD tests on 16 GTX480 GPUs (24x24x24x128 lattice spread over a virtual geometry of 1x1x1x16; CentOS 5.5, CUDA 4.2, MVAPICH2 1.8, driver version 304.43 (certified)):

HALF(12)-SINGLE(12) - OK
SINGLE(12)-SINGLE(12) - OK
HALF(12)-SINGLE(18) - OK
HALF(12)-DOUBLE(18) - OK
SINGLE(12)-DOUBLE(18) - OK
DOUBLE(18)-DOUBLE(18) - OK

(Clearly there are more combinations to test, e.g. the 8 reconstruct, etc., but I think these are potentially the most important ones.)

In addition, I ran a user job that made something like 192 calls to QUDA inversions using HALF(12)-DOUBLE(18), in 16 full propagator calculations with I/O in between the propagator calculations, and that job also ran through fine.

At this time I am happy to sign off on this and close this issue. If user experience reveals further problems, we can open a new issue. Great job on sorting this out. It was a nasty one.