lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda

PizDaint twisted-clover tuneLaunch() failure with accessor-based Dslash in some part of MG #793

Closed: kostrzewa closed this issue 5 years ago

kostrzewa commented 5 years ago

It seems that the accessor-based Dslash used in the MG for twisted-clover has an issue when running on PizDaint (with both QUDA_TEX=ON and QUDA_TEX=OFF). In this test, we use the develop branch with 50ea7fdb2919e7e628a29744a4a640470a55b53e as the head commit. Solver launches go through tmLQCD.

# 2 kappa mu = 0.000200774200, kappa = 0.139426500000, c_sw = 1.690000000000
# Using even/odd preconditioning!
# QUDA: Using single prec. as sloppy!
# QUDA: Called _loadGaugeQuda
# QUDA: Theta boundary conditions will be applied to gauge field
# QUDA: Using MG!
# QUDA: Using EO preconditioning!
# QUDA: Time for loadCloverQuda: 9.4900e+00
# QUDA: mu = 0.000720000143, kappa = 0.139426500000, csw = 1.690000000000
# QUDA: Performing MG Preconditioner Setup
MG level 1 (GPU): Using curandStateMRG32k3a
MG level 1 (GPU): Allocated array of random numbers with rng_size: 36.00 MB
MG level 1 (GPU): ERROR: Failed to clear error state unspecified launch failure
 (rank 0, host nid05446, /users/bartek/code/2019_04_17/quda_develop/lib/tune.cpp:751 in tuneLaunch())
MG level 1 (GPU):        last kernel called was (name=N4quda10CalculateYILb0EfLi4ELi3ELi2ELi24ENS_13CalculateYArgIfLi4ELi2ELi3ELi24ENS_5gauge10FieldOrderIfLi48ELi2EL21QudaGaugeFieldOrder_s2ELb1EsLb0EEENS3_IfLi48ELi2ELS4_2ELb1EiLb0EEENS3_IfLi3ELi1ELS4_2ELb1EfLb0EEENS_11colorspinor12FieldOrderCBIfLi4ELi3ELi24EL16QudaFieldOrder_s2EssLb0ELb0ELb0EEESB_SB_NS_6clover10FieldOrderIfLi3ELi4EL22QudaCloverFieldOrder_s4EEEEEEE,volume=64x32x16x16,aux=GPU-offline,vol=524288,stride=262144,precision=2,Ns=4,Nc=72,TwistFlavour=1,comm=0111,computeTmcAVDynamic,GPU-device)
Rank 0 [Thu Apr 18 15:22:33 2019] [c8-2c1s1n2] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
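As a quick sanity check on the kernel metadata in that error line, here is a minimal sketch that splits the comma-separated fields and compares them against each other. The helper name `parse_kernel_meta` is hypothetical and not part of QUDA or tmLQCD, and the long mangled `name=` field is left out for brevity. The stride turns out to be exactly half of `vol`; reading that as a per-parity (even/odd) site count is an assumption, not something the log states.

```python
from math import prod


def parse_kernel_meta(meta: str) -> dict:
    """Split 'key=value,key=value,flag' into a dict; bare flags map to True."""
    fields = {}
    for item in meta.split(","):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


# Metadata copied from the PizDaint log above (mangled name= field omitted).
meta = (
    "volume=64x32x16x16,aux=GPU-offline,vol=524288,stride=262144,"
    "precision=2,Ns=4,Nc=72,TwistFlavour=1,comm=0111,"
    "computeTmcAVDynamic,GPU-device"
)

fields = parse_kernel_meta(meta)
dims = [int(d) for d in fields["volume"].split("x")]

# Does the reported site count match the product of the local dimensions?
print("vol matches volume:", prod(dims) == int(fields["vol"]))
# Is the stride exactly half of the site count?
print("stride is vol/2:", 2 * int(fields["stride"]) == int(fields["vol"]))
# Bare flags such as computeTmcAVDynamic are stored as True.
print("dynamic clover path:", fields.get("computeTmcAVDynamic", False))
```

The same helper can be pointed at the metadata of any other "last kernel called was" line to compare runs field by field.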

Double-single and double-half CG work fine, but just to make sure, I'm currently building with QUDA_LEGACY_DSLASH=ON to check whether the problem is something not directly related to the new Dslash implementation.

I'm also testing on our local cluster if I can reproduce the issue.

The strange thing is that, as I reported during the development of the feature/dwf-rewrite branch, with commit f71ead214d2282556b8f5b067c29528d1fa25daa of that branch I had a working texture and non-texture accessor-based twisted-clover MG running on PizDaint, giving perfect results.

@maddyscientist please don't delete the feature/dwf-rewrite branch just yet. Maybe I can find the time at some point to git-bisect this bugger.

maddyscientist commented 5 years ago

I think I nailed this bug already. Try my development branch feature/ndeg-twisted-clover.

maddyscientist commented 5 years ago

FYI: I'm fixing various minor things that have cropped up since the code merge in my development branch (as well as implementing non-degenerate twisted clover fermions).

kostrzewa commented 5 years ago

Sure enough, I was also able to reproduce it locally on a K20m with QUDA_TEX=ON, using a completely different lattice geometry.

# QUDA: Performing MG Preconditioner Setup
MG level 1 (GPU): Using curandStateMRG32k3a
MG level 1 (GPU): Allocated array of random numbers with rng_size: 2.81 MB
MG level 1 (GPU): ERROR: Failed to clear error state an illegal memory access was encountered
 (rank 0, host lnode04, /qbigwork2/bartek/code/bleeding_edge/quda_develop/lib/tune.cpp:751 in tuneLaunch())
MG level 1 (GPU):        last kernel called was (name=N4quda10CalculateYILb0EfLi4ELi3ELi2ELi24ENS_13CalculateYArgIfLi4ELi2ELi3ELi24ENS_5gauge10FieldOrderIfLi48ELi2EL21QudaGaugeFieldOrder_s2ELb1EsLb0EEENS3_IfLi48ELi2ELS4_2ELb1EiLb0EEENS3_IfLi3ELi1ELS4_2ELb1EfLb1EEENS_11colorspinor12FieldOrderCBIfLi4ELi3ELi24EL16QudaFieldOrder_s2EssLb0ELb0ELb1EEESB_SB_NS_6clover10FieldOrderIfLi3ELi4EL22QudaCloverFieldOrder_s4EEEEEEE,volume=16x16x16x10,aux=GPU-offline,vol=40960,stride=20480,precision=2,Ns=4,Nc=72,TwistFlavour=1,comm=0001,computeTmcAVDynamic,GPU-device)

> FYI: I'm fixing various minor things that have cropped up since the code merge in my development branch (as well as implementing non-degenerate twisted clover fermions).

Sounds good, I'll be patient then and try your feature/ndeg-twisted-clover branch :)

maddyscientist commented 5 years ago

Specifically here, the bug I found was in the coarsening of the twisted-clover matrix with dynamic clover enabled. I nailed that bug with this commit https://github.com/lattice/quda/commit/363390badd0f1c14e84e5470d9942c225673f90d (and made the code much simpler and doubled its speed as well for good measure 😉 )

kostrzewa commented 5 years ago

I confirm that with the feature/ndeg-twisted-clover branch everything works as expected. Cheers!

maddyscientist commented 5 years ago

Great, thanks for the confirmation. I'll probably just file a pull request today or tomorrow with the hotfixes from my development branch rather than wait until I finish non-degenerate twisted clover. Will check into issue #792 as well today.