lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda

Wilson MG fails with cuda 11.6 #1307

Open jxy opened 2 years ago

jxy commented 2 years ago

Using commit dd6207e6e, CUDA 11.6.2, gcc 12.1.0, running on

CUDA Driver version = 11060
CUDA Runtime version = 11060
Found device 0: NVIDIA A100-PCIE-40GB
Using device 0: NVIDIA A100-PCIE-40GB

with the command

./tests/invert_test --dim 16 16 16 16 \
--prec double --prec-sloppy single --prec-precondition half --prec-null half --mg-smoother-halo-prec half \
--dslash-type wilson --solve-type direct-pc --verbosity verbose --nsrc 17 --kappa 0.1 \
--inv-type gcr --inv-multigrid true --mg-levels 3 \
--mg-coarse-solve-type 0 direct-pc --mg-verbosity 0 verbose \
--mg-setup-inv 0 cgnr --mg-setup-maxiter 0 1000 --mg-setup-tol 0 1e-5 \
--mg-setup-inv 1 cgnr --mg-setup-maxiter 1 1000 --mg-setup-tol 1 1e-5 \
--mg-coarse-solve-type 0 direct-pc --mg-smoother-solve-type 0 direct-pc \
--mg-block-size 0 2 2 2 2 --mg-nvec 0 24 --mg-n-block-ortho 0 2 \
--mg-coarse-solve-type 1 direct-pc --mg-smoother-solve-type 0 direct-pc \
--mg-smoother 0 ca-gcr --mg-nu-pre 0 0 --mg-nu-post 0 2 \
--mg-smoother-tol 0 1e-10 --mg-coarse-solver-tol 1 0.25 --mg-coarse-solver-maxiter 1 16 \
--mg-coarse-solver 1 gcr --mg-verbosity 1 verbose \
--mg-block-size 1 2 2 2 2 --mg-nvec 1 32 --mg-n-block-ortho 1 2 \
--mg-coarse-solve-type 2 direct-pc --mg-smoother-solve-type 1 direct-pc \
--mg-smoother 1 ca-gcr --mg-nu-pre 1 0 --mg-nu-post 1 2 \
--mg-smoother-tol 1 1e-10 --mg-coarse-solver-tol 2 0.25 --mg-coarse-solver-maxiter 2 16 \
--mg-coarse-solver 2 ca-gcr --mg-verbosity 2 verbose

gives this error:

MG level 1 (GPU): Creating level 1
MG level 1 (GPU): Creating transfer operator
MG level 1 (GPU): Transfer: using block size 2 x 2 x 2 x 2
MG level 1 (GPU): Transfer: block orthogonalizing
MG level 1 (GPU): Block Orthogonalizing 512 blocks of 12288 length and width 32 repeating 2 times, two_pass = 1
MG level 1 (GPU): Transfer operator done
MG level 1 (GPU): Creating coarse Dirac operator
MG level 1 (GPU): Computing Y field......
MG level 1 (GPU): Doing bi-directional link coarsening
MG level 1 (GPU): Running link coarsening on the GPU
MG level 1 (GPU): V2 = 1.638400e+04
MG level 1 (GPU): Saving 1085 sets of cached parameters to ./tunecache.tsv
MG level 1 (GPU): Computing forward 0 UV and VUV
MG level 1 (GPU): UV2[0] = 7.057161e+03
MG level 1 (GPU): ERROR: qudaEventSynchronize_ returned CUDA_ERROR_ILLEGAL_ADDRESS
 (timer.h:100 in stop())
 (rank 0, host gpu07, quda_api.cpp:72 in void quda::target::cuda::set_driver_error(CUresult, const char*, const char*, const char*, const char*, bool)())
MG level 1 (GPU):        last kernel called was (name=N4quda10CalculateYILb1EL19QudaFieldLocation_s2ENS_13CalculateYArgILb1EfLi2ELi2ELi24ELi32ENS_5gauge10FieldOrderIfLi64ELi2EL21QudaGaugeFieldOrder_s13ELb1EsEENS4_IfLi64ELi2ELS5_13ELb1EiEENS4_IfLi48ELi2ELS5_13ELb1EsEENS_11colorspinor12FieldOrderCBIfLi2ELi24ELi32EL16QudaFieldOrder_s9EssLb0ELb0EEENSA_IfLi4ELi24ELi32ELSB_9EssLb0ELb0EEESC_S8_EEEE,volume=8x8x8x8,aux=GPU-offline,vol=4096,parity=2,precision=2,order=9,Ns=2,Nc=768,comm=0000,computeVUV,MMA,dim=0,dir=fwd,nFace=1,bidirectional,GPU-device,coarse_vol=4x4x4x4)
MG level 1 (GPU): Saving 1085 sets of cached parameters to ./tunecache_error.tsv

It works fine when changing the precision from half to single in the above command. Recompiling the code with CUDA 11.4.0 and gcc 9.2.0 also runs fine.
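That is (presumably) replacing the three half settings

--prec-precondition half --prec-null half --mg-smoother-halo-prec half

with

--prec-precondition single --prec-null single --mg-smoother-halo-prec single

leaving the rest of the command unchanged.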

maddyscientist commented 2 years ago

Just noting that CUDA 11.7u1 has now been released, which should fix this issue: https://developer.nvidia.com/cuda-downloads

I still plan to take a look and see if I can add a workaround for 11.6.
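For anyone who needs to patch around the affected compilers locally in the meantime, a compile-time guard on nvcc's version macros is one plausible shape for such a workaround. The sketch below is hypothetical, not QUDA's actual code (the macro and function names are made up); it treats 11.6.x and the original 11.7 release as affected and 11.7u1 (nvcc V11.7.99) and later as fixed:

// Hypothetical workaround sketch -- not QUDA's actual code; the macro and
// function names below are made up. nvcc defines __CUDACC_VER_MAJOR__,
// __CUDACC_VER_MINOR__ and __CUDACC_VER_BUILD__, e.g. 11/7/64 for the
// original CUDA 11.7 and 11/7/99 for 11.7u1.
#if defined(__CUDACC__) && __CUDACC_VER_MAJOR__ == 11 &&                     \
    (__CUDACC_VER_MINOR__ == 6 ||                                            \
     (__CUDACC_VER_MINOR__ == 7 && __CUDACC_VER_BUILD__ < 99))
#define BROKEN_HALF_MMA_COARSENING 1
#endif

void computeVUVHalfMMA();  // hypothetical: the half-precision MMA path
void computeVUVFallback(); // hypothetical: a slower but safe path

void computeVUV()
{
#ifdef BROKEN_HALF_MMA_COARSENING
  // Avoid the kernel that the affected nvcc releases miscompile.
  computeVUVFallback();
#else
  computeVUVHalfMMA();
#endif
}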

eromero-vlc commented 1 year ago

BTW, I'm getting the same error with the default environment on Perlmutter, which uses CUDA 11.7.64 and gcc 11.2.

maddyscientist commented 1 year ago

11.7.64 is the original release of 11.7, not the updated release. You will need to switch to 11.8 or update to 11.7u1.
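One subtlety when checking which release you have: the driver and runtime version numbers that QUDA prints at startup carry only major and minor digits (11070 for any 11.7.x), so they cannot distinguish 11.7.64 from 11.7u1. A minimal probe of what the API reports, assuming a standard CUDA install:

// Minimal version probe; compile with nvcc. Both 11.7.64 and 11.7u1
// should report the runtime as 11070 here; to tell them apart, check
// the compiler itself: `nvcc --version` prints V11.7.64 for the
// original release and V11.7.99 for update 1.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
  int driver = 0, runtime = 0;
  cudaDriverGetVersion(&driver);   // highest CUDA version the driver supports
  cudaRuntimeGetVersion(&runtime); // version of the linked CUDA runtime
  std::printf("CUDA driver %d, runtime %d\n", driver, runtime);
  return 0;
}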

kostrzewa commented 1 year ago

> Just noting that CUDA 11.7u1 has now been released, which should fix this issue: https://developer.nvidia.com/cuda-downloads

Is there any chance that this can be worked around without having to ask centers to upgrade to 11.7u1, 11.8 or newer? EasyBuild has not updated the OpenMPI / UCX-CUDA / CUDA / GCC combo regularly, and as a consequence, on many machines that I run on, one is left with basically three choices.

While 11.3 works, and that is what I use most of the time (except for 11.5 in one place), the rest of the software stack that 11.3 is made available with is from early 2020, is not maintained, and things are beginning to break.

Going via EasyBuild, the only upgrade path I can see for centers is CUDA 12, and I suspect that many are hesitant to take this step, even if asked nicely by some LQCD users.