QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda

persistent deflation subspace triggers errorQuda due to missing allocation for eigenvalues #929

Closed · kostrzewa closed this issue 4 years ago

kostrzewa commented 4 years ago

@cpviolator @maddyscientist

I somehow manage to trigger the size check when trying to use the persistent deflation subspace:

MG level 2 (GPU): ERROR: Requesting 2048 eigenvalues with only storage allocated for 0 (rank 0, host nid02954, /users/bartek/code/2019_11_21/quda_develop/lib/eigensolve_quda.cpp:386 in computeEvals())
MG level 2 (GPU):        last kernel called was (name=N4quda4blas5Norm2Id6float2S2_EE,volume=4x4x4x2,aux=vol=128,stride=128,precision=4,Ns=2,Nc=24,TwistFlavour=1)

It seems that this check was added at the end of October (this is the latest develop, after LU_rotate was merged):

$ git blame lib/eigensolve_quda.cpp
[...]
917123e267 (cpviolator     2019-05-18 12:30:22 -0700  384)   {
e13a6a7eae (maddyscientist 2019-10-28 17:17:16 -0700  385)     if (size > (int)evecs.size()) errorQuda("Requesting %d eigenvectors with only storage allocated for %lu", size, evecs.size());
e13a6a7eae (maddyscientist 2019-10-28 17:17:16 -0700  386)     if (size > (int)evals.size()) errorQuda("Requesting %d eigenvalues with only storage allocated for %lu", size, evals.size());
e13a6a7eae (maddyscientist 2019-10-28 17:17:16 -0700  387) 
55782743d1 (cpviolator     2019-09-12 15:45:32 -0700  388)     ColorSpinorParam csParam(*evecs[0]);

and I'm wondering whether the requirements for preserving the subspace changed at the same time.

I can see that the check itself was added in e13a6a7eae0efc0879eea02b91a7d7a28c2d94e1.
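
To make the failure mode a bit more concrete, here is a small standalone sketch of the same kind of guard (illustrative types and names, not the actual QUDA call path): the re-used space apparently hands the eigenvectors back to the eigensolver while the eigenvalue container stays empty, so the second check fires.

#include <complex>
#include <cstdio>
#include <vector>

// Illustrative stand-in for the guard in computeEvals(): the storage is owned by
// the caller, so a transferred deflation space that carries the vectors but an
// empty eigenvalue container trips the second check with exactly this message.
void compute_evals_sketch(const std::vector<const void *> &evecs,
                          std::vector<std::complex<double>> &evals, int size)
{
  if (size > (int)evecs.size())
    printf("ERROR: Requesting %d eigenvectors with only storage allocated for %zu\n", size, evecs.size());
  else if (size > (int)evals.size())
    printf("ERROR: Requesting %d eigenvalues with only storage allocated for %zu\n", size, evals.size());
  // ... otherwise the eigenvalues would be computed into evals ...
}

int main()
{
  std::vector<const void *> evecs(2048, nullptr); // placeholders: the transferred vectors are present
  std::vector<std::complex<double>> evals;        // but no eigenvalue storage was allocated
  compute_evals_sketch(evecs, evals, 2048);       // reproduces the reported error message
}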

When I trigger the issue, what I see is the following:

# QUDA: Updating MG Preconditioner Setup for gauge 500
# QUDA: Deflation subspace for gauge 500 will be re-used!
MG level 0 (GPU): Resetting level 0
MG level 0 (GPU): Transfer: block orthogonalizing
MG level 0 (GPU): Block Orthogonalizing 8192 blocks of 36864 length and width 24 repeating 1 times
MG level 0 (GPU): Creating coarse Dirac operator
MG level 0 (GPU): Computing Y field......
MG level 0 (GPU): Doing bi-directional link coarsening
MG level 0 (GPU): Running link coarsening on the GPU
MG level 0 (GPU): V2 = 6.291458e+06
MG level 0 (GPU): Computing TMCAV
MG level 0 (GPU): AV2 = 9.765187e+06
MG level 0 (GPU): Computing forward 0 UV and VUV
MG level 0 (GPU): UV2[0] = 6.291473e+06
MG level 0 (GPU): Y2[4] (atomic) = 7.641924e+05
MG level 0 (GPU): Y2[4] = 7.641924e+05
MG level 0 (GPU): Computing forward 1 UV and VUV
MG level 0 (GPU): UV2[1] = 6.291472e+06
MG level 0 (GPU): Y2[5] (atomic) = 7.641607e+05
MG level 0 (GPU): Y2[5] = 7.641607e+05
MG level 0 (GPU): Computing forward 2 UV and VUV
MG level 0 (GPU): UV2[2] = 6.291472e+06
MG level 0 (GPU): Y2[6] (atomic) = 7.639333e+05
MG level 0 (GPU): Y2[6] = 7.639335e+05
MG level 0 (GPU): Computing forward 3 UV and VUV
MG level 0 (GPU): UV2[3] = 6.291472e+06
MG level 0 (GPU): Y2[7] (atomic) = 7.638894e+05
MG level 0 (GPU): Y2[7] = 7.638894e+05
MG level 0 (GPU): Computing backward 0 UV and VUV
MG level 0 (GPU): UAV2[0] = 9.766770e+06
MG level 0 (GPU): Y2[0] (atomic) = 7.642173e+05
MG level 0 (GPU): Y2[0] = 7.642173e+05
MG level 0 (GPU): Computing backward 1 UV and VUV
MG level 0 (GPU): UAV2[1] = 9.767763e+06
MG level 0 (GPU): Y2[1] (atomic) = 7.645393e+05
MG level 0 (GPU): Y2[1] = 7.645393e+05
MG level 0 (GPU): Computing backward 2 UV and VUV
MG level 0 (GPU): UAV2[2] = 9.767596e+06
MG level 0 (GPU): Y2[2] (atomic) = 7.642292e+05
MG level 0 (GPU): Y2[2] = 7.642292e+05
MG level 0 (GPU): Computing backward 3 UV and VUV
MG level 0 (GPU): UAV2[3] = 9.769158e+06
MG level 0 (GPU): Y2[3] (atomic) = 7.645721e+05
MG level 0 (GPU): Y2[3] = 7.645722e+05
MG level 0 (GPU): X2 = 2.213295e+06
MG level 0 (GPU): Summing diagonal contribution to coarse clover
MG level 0 (GPU): X2 = 1.341718e+06
MG level 0 (GPU): ....done computing Y field
MG level 0 (GPU): Computing Yhat field......
MG level 0 (GPU): Batched matrix inversion completed in 0.018598 seconds with GFLOPS = 194.461994
MG level 0 (GPU): Xinv = 5.281667e+07
MG level 0 (GPU): Yhat Max = 4.797610e+00
MG level 0 (GPU): Yhat[0] = 3.854430e+07 (4.049026e+00 5.806168e+00 = 7.138728e-01 x 8.133336e+00)
MG level 0 (GPU): Yhat[1] = 3.855097e+07 (3.746425e+00 5.298562e+00 = 6.514624e-01 x 8.133336e+00)
MG level 0 (GPU): Yhat[2] = 3.853976e+07 (4.806037e+00 5.878066e+00 = 7.227128e-01 x 8.133336e+00)
MG level 0 (GPU): Yhat[3] = 3.853925e+07 (3.459894e+00 6.522995e+00 = 8.020073e-01 x 8.133336e+00)
MG level 0 (GPU): Yhat[4] = 3.853850e+07 (4.219619e+00 5.320880e+00 = 6.542063e-01 x 8.133336e+00)
MG level 0 (GPU): Yhat[5] = 3.853346e+07 (3.429812e+00 5.901252e+00 = 7.255635e-01 x 8.133336e+00)
MG level 0 (GPU): Yhat[6] = 3.853807e+07 (4.179774e+00 6.037147e+00 = 7.422720e-01 x 8.133336e+00)
MG level 0 (GPU): Yhat[7] = 3.853408e+07 (4.090363e+00 6.136233e+00 = 7.544546e-01 x 8.133336e+00)
MG level 0 (GPU): ....done computing Yhat field
MG level 0 (GPU): Coarse Dirac operator done
MG level 0 (GPU): Creating smoother
MG level 0 (GPU): Creating a CA-GCR solver
MG level 0 (GPU): Smoother done
MG level 1 (GPU): Resetting level 1
MG level 1 (GPU): Extracting deflation space size 4096 to MG
MG level 1 (GPU): Transfer: block orthogonalizing
MG level 1 (GPU): Block Orthogonalizing 512 blocks of 9216 length and width 24 repeating 1 times
MG level 1 (GPU): Creating coarse Dirac operator
MG level 1 (GPU): Computing Y field......
MG level 1 (GPU): Doing bi-directional link coarsening
MG level 1 (GPU): Running link coarsening on the GPU
MG level 1 (GPU): V2 = 3.932161e+05
MG level 1 (GPU): Computing forward 0 UV and VUV
MG level 1 (GPU): UV2[0] = 6.777509e+05
MG level 1 (GPU): Y2[4] (atomic) = 1.693279e+05
MG level 1 (GPU): Y2[4] = 1.693279e+05
MG level 1 (GPU): Computing forward 1 UV and VUV
MG level 1 (GPU): UV2[1] = 6.779548e+05
MG level 1 (GPU): Y2[5] (atomic) = 1.689860e+05
MG level 1 (GPU): Y2[5] = 1.689861e+05
MG level 1 (GPU): Computing forward 2 UV and VUV
MG level 1 (GPU): UV2[2] = 6.784520e+05
MG level 1 (GPU): Y2[6] (atomic) = 1.692619e+05
MG level 1 (GPU): Y2[6] = 1.692619e+05
MG level 1 (GPU): Computing forward 3 UV and VUV
MG level 1 (GPU): UV2[3] = 6.781088e+05
MG level 1 (GPU): Y2[7] (atomic) = 1.692430e+05
MG level 1 (GPU): Y2[7] = 1.692431e+05
MG level 1 (GPU): Computing backward 0 UV and VUV
MG level 1 (GPU): UAV2[0] = 1.041606e+06
MG level 1 (GPU): Y2[0] (atomic) = 1.695219e+05
MG level 1 (GPU): Y2[0] = 1.695219e+05
MG level 1 (GPU): Computing backward 1 UV and VUV
MG level 1 (GPU): UAV2[1] = 1.042432e+06
MG level 1 (GPU): Y2[1] (atomic) = 1.694291e+05
MG level 1 (GPU): Y2[1] = 1.694291e+05
MG level 1 (GPU): Computing backward 2 UV and VUV
MG level 1 (GPU): UAV2[2] = 1.041265e+06
MG level 1 (GPU): Y2[2] (atomic) = 1.692224e+05
MG level 1 (GPU): Y2[2] = 1.692224e+05
MG level 1 (GPU): Computing backward 3 UV and VUV
MG level 1 (GPU): UAV2[3] = 1.042074e+06
MG level 1 (GPU): Y2[3] (atomic) = 1.697555e+05
MG level 1 (GPU): Y2[3] = 1.697555e+05
MG level 1 (GPU): X2 = 4.596876e+04
MG level 1 (GPU): Summing diagonal contribution to coarse clover
MG level 1 (GPU): X2 = 2.080047e+05
MG level 1 (GPU): ....done computing Y field
MG level 1 (GPU): Computing Yhat field......
MG level 1 (GPU): Batched matrix inversion completed in 0.000594 seconds with GFLOPS = 380.534949
MG level 1 (GPU): Xinv = 1.487310e+06
MG level 1 (GPU): Yhat Max = 5.467275e+00
MG level 1 (GPU): Yhat[0] = 2.411073e+06 (4.135077e+00 1.357078e+01 = 1.768291e+00 x 7.674519e+00)
MG level 1 (GPU): Yhat[1] = 2.410160e+06 (4.322745e+00 1.268666e+01 = 1.653089e+00 x 7.674519e+00)
MG level 1 (GPU): Yhat[2] = 2.410167e+06 (4.922928e+00 1.819552e+01 = 2.370900e+00 x 7.674519e+00)
MG level 1 (GPU): Yhat[3] = 2.414582e+06 (5.500666e+00 1.860227e+01 = 2.423900e+00 x 7.674519e+00)
MG level 1 (GPU): Yhat[4] = 2.410500e+06 (3.822162e+00 1.417934e+01 = 1.847587e+00 x 7.674519e+00)
MG level 1 (GPU): Yhat[5] = 2.408479e+06 (3.945853e+00 1.235659e+01 = 1.610080e+00 x 7.674519e+00)
MG level 1 (GPU): Yhat[6] = 2.410118e+06 (5.106160e+00 1.676650e+01 = 2.184697e+00 x 7.674519e+00)
MG level 1 (GPU): Yhat[7] = 2.411318e+06 (4.156650e+00 1.390870e+01 = 1.812322e+00 x 7.674519e+00)
MG level 1 (GPU): ....done computing Yhat field
MG level 1 (GPU): Coarse Dirac operator done
MG level 1 (GPU): Creating smoother
MG level 1 (GPU): Creating a CA-GCR solver
MG level 1 (GPU): Creating a CA-GCR solver
MG level 1 (GPU): Smoother done
MG level 2 (GPU): Creating level 2
MG level 2 (GPU): Creating smoother
MG level 2 (GPU): Smoother done
MG level 2 (GPU): Setup of level 2 done
MG level 1 (GPU): Creating coarse solver wrapper
MG level 1 (GPU): Creating a CA-GCR solver
MG level 1 (GPU): Transferring deflation space size 4096 to coarse solver
MG level 2 (GPU): Creating TR Lanczos eigensolver
MG level 2 (GPU): ERROR: Requesting 2048 eigenvalues with only storage allocated for 0 (rank 0, host nid02954, /users/bartek/code/2019_11_21/quda_develop/lib/eigensolve_quda.cpp:386 in computeEvals())
MG level 2 (GPU):        last kernel called was (name=N4quda4blas5Norm2Id6float2S2_EE,volume=4x4x4x2,aux=vol=128,stride=128,precision=4,Ns=2,Nc=24,TwistFlavour=1)

This is after a successful set of inversions with a different quark mass.

maddyscientist commented 4 years ago

I've reproduced this with the internal tests. Sorry for the regression. Will try to get this fixed shortly...

kostrzewa commented 4 years ago

No worries :) Just as an FYI, @pittlerf and @marcuspetschlies have found that 3fa55816a49365941258f473f5379fe3bcd68f26 works fine.

pittlerf commented 4 years ago

Hi, I am using commit 3fa55816a49365941258f473f5379fe3bcd68f26 with PLEGMA. In a few cases (4 times in 29 runs) I observe divergence in the solver. This always happens after 12 source positions have been done successfully, when at the 13th source position the inverter switches from flavor UP to flavor DN. I quote the error message:

Going to invert DN for component 0
ERROR: Solver appears to have diverged (rank 0, host nid04932, /users/fpittler/code/quda/lib/solver.cpp:306 in PrintStats())
       last kernel called was (name=N4quda4blas13cabxpyzaxnormId6float26float4EE,volume=16x32x16x32,aux=vol=262144,stride=262144,precision=4,Ns=4,Nc=3,TwistFlavour=1)

cpviolator commented 4 years ago

@pittlerf @kostrzewa @maddyscientist I'm going to take a look at this now. I will use the latest develop branch to try to reproduce both issues.

cpviolator commented 4 years ago

@kostrzewa @pittlerf

I pushed some fixes to develop. I was able to reproduce the first error, where no space for the eigenvalues was created.

@pittlerf May I ask, when you flip the sign of 'mu' in your PLEGMA workflow, do you flip the sign in BOTH the QudaInvertParam and QudaMultigridParam structures? I flipped the sign in the test and saw no divergences. May I also ask, do you see good performance of the solver during the successful inversions after flipping the sign? If I cannot reproduce this error with a random gauge field, and we are sure the correct information is being passed to the solver, I will spin up on SUMMIT and try with your physical lattices.
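
For reference, here is a minimal sketch of what I mean by flipping the sign in both structures (the function and field names below are those of the quda.h interface; the wrapper itself and its variable names are just illustrative, not the PLEGMA code):

#include <quda.h>

// Hedged sketch: flip the twisted-mass term for both the outer solver and the
// MG setup before re-using the persistent subspace for the other flavor.
void flip_mu_and_invert(void *mg_preconditioner, QudaInvertParam &inv_param,
                        QudaMultigridParam &mg_param, void *out, void *in)
{
  inv_param.mu = -inv_param.mu;                      // outer solver sees the new sign
  mg_param.invert_param->mu = inv_param.mu;          // MG setup parameters must agree
  updateMultigridQuda(mg_preconditioner, &mg_param); // refresh the setup, re-using the space
  invertQuda(out, in, &inv_param);                   // solve with the updated preconditioner
}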

The fixes have been pushed here: https://github.com/lattice/quda/tree/hotfix/persistent_defl_fixes

pittlerf commented 4 years ago

Hi,

I can confirm that commit 4c307d7945eda7993bd68c73c085418d935820f1 fixes the issue. I did a successful run on cyclamen and have a run in the queue for the physical-point lattice on Piz Daint. We do indeed update both QudaInvertParam and QudaMultigridParam. Regarding the second error, after discussing with @sbacchio we are under the impression that it was due to one particular GPU on Piz Daint (nid04974): all of our runs that showed the second error involved that node.

maddyscientist commented 4 years ago

Glad to hear that @cpviolator's fix works. I've opened https://github.com/lattice/quda/pull/930 to get this merged into develop.