Closed kostrzewa closed 5 years ago
I think I nailed this bug already. Try my development branch feature/ndeg-twisted-clover.
FYI: I'm fixing various minor things that have cropped up since the code merge in my development branch (as well as implementing non-degenerate twisted clover fermions).
Sure enough, I was also able to reproduce it locally on a K20m with QUDA_TEX=ON, using a completely different lattice geometry.
# QUDA: Performing MG Preconditioner Setup
MG level 1 (GPU): Using curandStateMRG32k3a
MG level 1 (GPU): Allocated array of random numbers with rng_size: 2.81 MB
MG level 1 (GPU): ERROR: Failed to clear error state an illegal memory access was encountered
(rank 0, host lnode04, /qbigwork2/bartek/code/bleeding_edge/quda_develop/lib/tune.cpp:751 in tuneLaunch())
MG level 1 (GPU): last kernel called was (name=N4quda10CalculateYILb0EfLi4ELi3ELi2ELi24ENS_13CalculateYArgIfLi4ELi2ELi3ELi24ENS_5gauge10FieldOrderIfLi48ELi2EL21QudaGaugeFieldOrder_s2ELb1EsLb0EEENS3_IfLi48ELi2ELS4_2ELb1EiLb0EEENS3_IfLi3ELi1ELS4_2ELb1EfLb1EEENS_11colorspinor12FieldOrderCBIfLi4ELi3ELi24EL16QudaFieldOrder_s2EssLb0ELb0ELb1EEESB_SB_NS_6clover10FieldOrderIfLi3ELi4EL22QudaCloverFieldOrder_s4EEEEEEE,volume=16x16x16x10,aux=GPU-offline,vol=40960,stride=20480,precision=2,Ns=4,Nc=72,TwistFlavour=1,comm=0001,computeTmcAVDynamic,GPU-device)
Sounds good, I'll be patient then and try your feature/ndeg-twisted-clover branch :)
Specifically here, the bug I found was in the coarsening of the twisted-clover matrix with dynamic clover enabled. I nailed that bug with this commit https://github.com/lattice/quda/commit/363390badd0f1c14e84e5470d9942c225673f90d (and made the code much simpler and doubled its speed as well for good measure 😉 )
I confirm that with the feature/ndeg-twisted-clover branch everything works as expected. Cheers!
Great, thanks for the confirmation. I'll probably just file a pull request today or tomorrow with the hotfixes from my development branch rather than wait until I finish non-degenerate twisted clover. I'll look into issue #792 today as well.
It seems that the accessor-based Dslash used in the MG for twisted-clover has an issue when running on PizDaint (with both QUDA_TEX=ON and QUDA_TEX=OFF). In this test, we use the develop branch and 50ea7fdb2919e7e628a29744a4a640470a55b53e as head commit. Solver launches are via tmLQCD.
Double-single and double-half CG work fine, but just to make sure, I'm currently building with QUDA_LEGACY_DSLASH=ON to check if it's something not directly related to the new Dslash implementation.
I'm also testing on our local cluster to see if I can reproduce the issue.
The strange thing is that, as I reported during the development of the feature/dwf-rewrite branch, with commit f71ead214d2282556b8f5b067c29528d1fa25daa of that branch I had a working texture and non-texture accessor-based twisted clover MG running on PizDaint, giving perfect results.
@maddyscientist please don't delete the feature/dwf-rewrite branch just yet. Maybe I can find the time at some point to git-bisect this bugger.