kostrzewa closed this issue 4 years ago
I've reproduced this with the internal tests. Sorry for the regression. Will try to get this fixed shortly...
No worries :) Just as an FYI, @pittlerf and @marcuspetschlies have found that 3fa55816a49365941258f473f5379fe3bcd68f26 works fine.
Hi, I am using commit 3fa55816a49365941258f473f5379fe3bcd68f26 with PLEGMA. In a few cases (4 out of 29 runs) I observe divergence in the solver. This always happens after 12 source positions have completed successfully, when at the 13th source position the inverter switches from flavor UP to flavor DN. I quote the error message:
Going to invert DN for component 0
ERROR: Solver appears to have diverged (rank 0, host nid04932, /users/fpittler/code/quda/lib/solver.cpp:306 in PrintStats())
last kernel called was (name=N4quda4blas13cabxpyzaxnormId6float26float4EE,volume=16x32x16x32,aux=vol=262144,stride=262144,precision=4,Ns=4,Nc=3,TwistFlavour=1)
@pittlerf @kostrzewa @maddyscientist I'm going to take a look at this now. I will use the latest develop branch to try to reproduce both issues.
@kostrzewa @pittlerf
I pushed some fixes to develop. I was able to reproduce the first error, where no space for the eigenvalues was created.
@pittlerf May I ask: when you flip the sign of 'mu' in your PLEGMA workflow, do you flip the sign in BOTH the QudaInvertParam and QudaMultigridParam structures? I flipped the sign in the test and saw no divergences. May I also ask whether you see good performance of the solver during the successful inversions after flipping the sign? If I cannot reproduce this error with a random gauge field, and we are sure the correct information is being passed to the solver, I will spin up on SUMMIT and try with your physical lattices.
The fixes have been pushed here: https://github.com/lattice/quda/tree/hotfix/persistent_defl_fixes
Hi,
I can confirm that commit 4c307d7945eda7993bd68c73c085418d935820f1 fixes the issue. I did a successful run on cyclamen and have a run in the queue for the physical-point lattice on PizDaint. We do update both QudaInvertParam and QudaMultigridParam. As for the second error discussed with @sbacchio, we are under the impression that it was due to one of the GPUs on PizDaint (nid04974): that GPU was involved in all of our runs that showed the second error.
Glad to hear that @cpviolator's fix works. I've opened https://github.com/lattice/quda/pull/930 to get this merged into develop.
@cpviolator @maddyscientist
I somehow managed to trigger the size check when trying to use the persistent deflation subspace:
It seems that this was added at the end of October (this is the latest develop branch, after LU_rotate was merged):
and I'm wondering whether the requirements for preserving the subspace changed at the same time.
I can see that the check itself was added in e13a6a7eae0efc0879eea02b91a7d7a28c2d94e1.
When I trigger the issue, what I see is the following:
This is after a successful set of inversions with a different quark mass.