lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda

Out of bounds error when running multi-GPU/partitioned HISQ MG with long links dropped #1512

Open weinbe2 opened 1 week ago

weinbe2 commented 1 week ago

In brief, there is an out-of-bounds (OOB) error when running HISQ MG with long links dropped, and it can be triggered without ever descending to a true coarse level. It only appears with non-zero partitioning; I haven't tested whether true multi-GPU runs are affected. There are no issues when "normal" HISQ MG is run (improved staggered on the pseudo-fine level as well), which suggests that something goes awry when switching between the improved staggered (outer level) and unimproved staggered (inner level) operators.

The error does not hit until the first solve, i.e. after setup has completed. More specifically, it triggers when returning to the fine level from the pseudo-fine level, i.e. when switching from applying the unimproved operator back to applying the improved operator. Whether and when it hits depends on the local volume: there is no error on ~16^4, but it hits on the first iteration on ~24^4 and larger. It does at least seem to be deterministic for a fixed command, including the volume.

The error occurs regardless of whether tuning is enabled.

A command that triggers it is as follows:

mpirun -np 1 ./staggered_invert_test \
  --mass 0.1 \
  --dim 24 24 24 24 --gridsize 1 1 1 1 --partition 8 \
  --dslash-type asqtad --tol 1e-5 \
  --verbosity verbose --solve-type direct --solution-type mat --inv-type gcr \
  --inv-multigrid true --mg-levels 2 --mg-coarse-solve-type 0 direct --mg-staggered-coarsen-type kd-optimized-drop-long \
  --mg-block-size 0 1 1 1 1 --mg-nvec 0 3 \
  --nsrc 1 --niter 25 \
  --mg-smoother 0 ca-gcr --mg-smoother-solve-type 0 direct --mg-nu-pre 0 0 --mg-nu-post 0 8 \
  --mg-smoother 1 ca-gcr --mg-smoother-solve-type 1 direct --mg-nu-pre 1 0 --mg-nu-post 1 8 \
  --mg-coarse-solver 1 gcr --mg-coarse-solve-type 1 direct --mg-coarse-solver-tol 1 0.25 --mg-coarse-solver-maxiter 1 16 \
  --mg-verbosity 0 verbose --mg-verbosity 1 verbose

This is trimmed down roughly as far as possible. The various combinations of mat and direct are non-default but required for HISQ MG as currently implemented. As noted above, you never actually need to enter a true coarse solve to trigger the error, but you do still need to compile with Nc = 24 for the KD-operator construction.
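Since the failure depends on the local volume, a small scan over the local extent may help pin down where it first appears. The following is only a sketch: the flags are copied from the command above, while `QUDA_TEST_BIN`, `DRY_RUN`, `repro_cmd` and the log file names are hypothetical names introduced here. Only 16 and 24 were reported above; the intermediate extent 20 is an untested guess.

```shell
#!/bin/sh
# Sketch: rerun the reproducer for several L^4 local volumes.
# QUDA_TEST_BIN and DRY_RUN are hypothetical knobs, not QUDA options.
QUDA_TEST_BIN="${QUDA_TEST_BIN:-./staggered_invert_test}"
DRY_RUN="${DRY_RUN:-1}"   # 1: only print the commands; 0: actually run them

# Print the reproducer command line for local extent $1 in all four dimensions.
repro_cmd() {
  echo "mpirun -np 1 ${QUDA_TEST_BIN}" \
    "--mass 0.1" \
    "--dim $1 $1 $1 $1 --gridsize 1 1 1 1 --partition 8" \
    "--dslash-type asqtad --tol 1e-5" \
    "--verbosity verbose --solve-type direct --solution-type mat --inv-type gcr" \
    "--inv-multigrid true --mg-levels 2 --mg-coarse-solve-type 0 direct" \
    "--mg-staggered-coarsen-type kd-optimized-drop-long" \
    "--mg-block-size 0 1 1 1 1 --mg-nvec 0 3" \
    "--nsrc 1 --niter 25" \
    "--mg-smoother 0 ca-gcr --mg-smoother-solve-type 0 direct --mg-nu-pre 0 0 --mg-nu-post 0 8" \
    "--mg-smoother 1 ca-gcr --mg-smoother-solve-type 1 direct --mg-nu-pre 1 0 --mg-nu-post 1 8" \
    "--mg-coarse-solver 1 gcr --mg-coarse-solve-type 1 direct" \
    "--mg-coarse-solver-tol 1 0.25 --mg-coarse-solver-maxiter 1 16" \
    "--mg-verbosity 0 verbose --mg-verbosity 1 verbose"
}

for L in 16 20 24; do
  cmd="$(repro_cmd "${L}")"
  if [ "${DRY_RUN}" = "1" ]; then
    echo "${cmd}"
  elif sh -c "${cmd}" > "repro_L${L}.log" 2>&1; then
    echo "L=${L}: completed"
  else
    echo "L=${L}: failed, see repro_L${L}.log"
  fi
done
```

With the default `DRY_RUN=1` the script only prints the three command lines, which makes it easy to check the flags before spending GPU time.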

A representative error message is:

QMP m0,n1@viking-prod-259.nvidia.com error: abort: 1
MG level 0 (GPU): ERROR: qudaEventQuery_ returned CUDA_ERROR_ILLEGAL_ADDRESS
 (dslash_policy.hpp:398 in operator()())
 (rank 0, host viking-prod-259.nvidia.com, quda_api.cpp:72 in void quda::target::cuda::set_driver_error(CUresult, const char*, const char*, const char*, const char*, bool)())
MG level 0 (GPU):        last kernel called was (name=N4quda9StaggeredINS_12StaggeredArgIfLi3ELi4EL21QudaReconstructType_s18ELS2_18ELb1EL20QudaStaggeredPhase_s1EEEEE,volume=24x24x24x24,aux=policy_kernel=interior,GPU-offline,vol=331776,parity=2,precision=4,order=2,Ns=1,Nc=3,commDim=0001,xpay,n_rhs=1,comm=0001)
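The kernel name in the last line is a mangled C++ type name. A sketch of decoding it, assuming the binutils `c++filt` tool is installed; the `-t` flag is used because this is a bare type name rather than a full `_Z`-prefixed symbol. `demangle_type` is a hypothetical helper name introduced here.

```shell
#!/bin/sh
# Sketch: demangle a kernel type name taken from a QUDA error message.
demangle_type() {
  if command -v c++filt >/dev/null 2>&1; then
    c++filt -t "$1"
  else
    echo "$1"   # fall back to the mangled name if c++filt is unavailable
  fi
}

demangle_type 'N4quda9StaggeredINS_12StaggeredArgIfLi3ELi4EL21QudaReconstructType_s18ELS2_18ELb1EL20QudaStaggeredPhase_s1EEEEE'
```

The decoded name should be a `quda::Staggered<quda::StaggeredArg<...>>` instantiation, which can then be compared against the template arguments that appear in the stack trace later in this thread.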

My cmake command was:

cmake -DQUDA_DIRAC_DEFAULT_OFF=ON \
      -DQUDA_DIRAC_STAGGERED=ON \
      -DCMAKE_BUILD_TYPE=DEVEL \
      -DQUDA_BACKWARDS=ON \
      -D CMAKE_INSTALL_PREFIX=/scratch/local/install \
      -DQUDA_PRECISION=12 \
      -DQUDA_RECONSTRUCT=4 \
      -DQUDA_MPI=ON \
      -DQUDA_FAST_COMPILE_DSLASH=ON \
      -DQUDA_FAST_COMPILE_REDUCE=ON \
      -DQUDA_GPU_ARCH=sm_80 \
      -DQUDA_MULTIGRID=ON \
      -DQUDA_MULTIGRID_NVEC_LIST="24" \
      [quda]

weinbe2 commented 1 week ago

When running with export CUDA_LAUNCH_BLOCKING=1, I can see it's being triggered by the ghost packing kernel:

MG level 0 (GPU): ERROR: qudaLaunchKernel returned an illegal memory access was encountered
 (/home/scratch.eweinberg_sw/2024-08-29QudaMilc/quda/lib/targets/cuda/quda_api.cpp:152 in qudaLaunchKernel())
 (rank 0, host ipp1-1776.nvidia.com, quda_api.cpp:58 in void quda::target::cuda::set_runtime_error(cudaError_t, const char*, const char*, const char*, const char*, bool)())
MG level 0 (GPU):        last kernel called was (name=N4quda4PackIfLi3ELb0EEE,volume=24x24x24x24,aux=policy_kernel,vol=331776,parity=2,precision=4,order=2,Ns=1,Nc=3,n_rhs=1,comm=0001,topo=1111,nFace=3,device-device,striped)
Stack trace (most recent call last):
#28   Object "[0xffffffffffffffff]", at 0xffffffffffffffff, in
#27   Object "./staggered_invert_test", at 0x55dae0b93034, in
#26   Object "/lib/x86_64-linux-gnu/libc.so.6", at 0x153a75b4fe3f, in __libc_start_main
#25   Object "/lib/x86_64-linux-gnu/libc.so.6", at 0x153a75b4fd8f, in
#24   Object "./staggered_invert_test", at 0x55dae0b92bf9, in
#23   Object "./staggered_invert_test", at 0x55dae0b9863f, in
#22   Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a7748233a, in invertQuda
#21   Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a773278d9, in quda::solve(std::vector<void*, std::allocator<void*> > const&, std::vector<void*, std::allocator<void*> > const&, QudaInvertParam_s&, quda::GaugeField const&)
#20   Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a77324f18, in quda::solve(quda::vector_ref<quda::ColorSpinorField> const&, quda::vector_ref<quda::ColorSpinorField> const&, quda::Dirac&, quda::Dirac&, quda::Dirac&, quda::Dirac&, QudaInvertParam_s&)
#19   Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a77440931, in quda::GCR::operator()(quda::vector_ref<quda::ColorSpinorField> const&, quda::vector_ref<quda::ColorSpinorField const> const&)
#18   Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a773b5b8c, in quda::MG::operator()(quda::vector_ref<quda::ColorSpinorField> const&, quda::vector_ref<quda::ColorSpinorField const> const&)
#17   Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a7741ce00, in quda::CAGCR::operator()(quda::vector_ref<quda::ColorSpinorField> const&, quda::vector_ref<quda::ColorSpinorField const> const&)
#16   Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a77329c1c, in quda::DiracM::operator()(quda::vector_ref<quda::ColorSpinorField> const&, quda::vector_ref<quda::ColorSpinorField const> const&) const
#15   Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a774ea226, in quda::DiracImprovedStaggered::M(quda::vector_ref<quda::ColorSpinorField> const&, quda::vector_ref<quda::ColorSpinorField const> const&) const
#14   Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a772a72da, in quda::ApplyImprovedStaggered(quda::vector_ref<quda::ColorSpinorField> const&, quda::vector_ref<quda::ColorSpinorField const> const&, quda::GaugeField const&, quda::GaugeField const&, double, quda::vector_ref<quda::ColorSpinorField const> const&, int, bool, int const*, quda::TimeProfile&)
#13   Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a772b38db, in void quda::instantiate<quda::ImprovedStaggeredApply, quda::ReconstructStaggered, float, 3, quda::GaugeField const&, double&, int&, bool&, int const*&, quda::TimeProfile&>(quda::vector_ref<quda::ColorSpinorField> const&, quda::vector_ref<quda::ColorSpinorField const> const&, quda::vector_ref<quda::ColorSpinorField const> const&, quda::GaugeField const&, quda::GaugeField const&, double&, int&, bool&, int const*&, quda::TimeProfile&)
#12   Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a772b3336, in quda::ImprovedStaggeredApply<float, 3, (QudaReconstructType_s)18>::ImprovedStaggeredApply(quda::vector_ref<quda::ColorSpinorField> const&, quda::vector_ref<quda::ColorSpinorField const> const&, quda::vector_ref<quda::ColorSpinorField const> const&, quda::GaugeField const&, quda::GaugeField const&, double, int, bool, int const*, quda::TimeProfile&)
#11   Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a772af731, in quda::dslash::DslashPolicyTune<quda::Staggered<quda::StaggeredArg<float, 3, 4, (QudaReconstructType_s)18, (QudaReconstructType_s)18, true, (QudaStaggeredPhase_s)1> > >::DslashPolicyTune(quda::Staggered<quda::StaggeredArg<float, 3, 4, (QudaReconstructType_s)18, (QudaReconstructType_s)18, true, (QudaStaggeredPhase_s)1> >&, quda::vector_ref<quda::ColorSpinorField const> const&, quda::ColorSpinorField const&, quda::TimeProfile&)
#10   Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a772aee07, in quda::dslash::DslashPolicyTune<quda::Staggered<quda::StaggeredArg<float, 3, 4, (QudaReconstructType_s)18, (QudaReconstructType_s)18, true, (QudaStaggeredPhase_s)1> > >::apply(quda::qudaStream_t const&)
#9    Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a772be19c, in quda::dslash::DslashBasic<quda::Staggered<quda::StaggeredArg<float, 3, 4, (QudaReconstructType_s)18, (QudaReconstructType_s)18, true, (QudaStaggeredPhase_s)1> > >::operator()(quda::Staggered<quda::StaggeredArg<float, 3, 4, (QudaReconstructType_s)18, (QudaReconstructType_s)18, true, (QudaStaggeredPhase_s)1> >&, quda::vector_ref<quda::ColorSpinorField const> const&, quda::ColorSpinorField const&, quda::TimeProfile&)
#8    Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a772b66f0, in void quda::dslash::issuePack<quda::Staggered<quda::StaggeredArg<float, 3, 4, (QudaReconstructType_s)18, (QudaReconstructType_s)18, true, (QudaStaggeredPhase_s)1> > >(quda::ColorSpinorField const&, quda::vector_ref<quda::ColorSpinorField const> const&, quda::Staggered<quda::StaggeredArg<float, 3, 4, (QudaReconstructType_s)18, (QudaReconstructType_s)18, true, (QudaStaggeredPhase_s)1> > const&, int, quda::MemoryLocation, int, int)
#7    Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a774bb3fe, in quda::ColorSpinorField::packGhost(int, QudaParity_s, int, quda::qudaStream_t const&, quda::MemoryLocation*, quda::MemoryLocation, bool, double, double, double, int, quda::vector_ref<quda::ColorSpinorField const> const&) const
#6    Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a772d7f4e, in quda::PackGhost(void**, quda::ColorSpinorField const&, quda::vector_ref<quda::ColorSpinorField const> const&, quda::MemoryLocation, int, bool, int, bool, double, double, double, int, quda::qudaStream_t const&)
#5    Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a772f6792, in quda::GhostPack<float, 3>::GhostPack(quda::ColorSpinorField const&, quda::vector_ref<quda::ColorSpinorField const> const&, void**, quda::MemoryLocation, int, bool, int, bool, double, double, double, int, quda::qudaStream_t const&)
#4    Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a772f5a3a, in quda::Pack<float, 3, false>::apply(quda::qudaStream_t const&)
#3    Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a77591756, in quda::qudaLaunchKernel(void const*, quda::TuneParam const&, quda::qudaStream_t const&, void const*)
#2    Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a7758ff71, in quda::target::cuda::set_runtime_error(cudaError, char const*, char const*, char const*, char const*, bool)
#1    Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a774b061b, in errorQuda_(char const*, char const*, int, ...)
#0    Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a7750c63d, in quda::comm_abort(int)
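
The trace localizes the fault to the Pack kernel launch but not to a specific access. One possible next step, which has not been done above, is to rerun under NVIDIA's compute-sanitizer with the memcheck tool, which reports the faulting thread and address for out-of-bounds device memory accesses. A minimal sketch; `run_memcheck` and `DRY_RUN` are hypothetical names introduced here, and the flag list is shortened, so the full set from the reproducer command should be passed in practice.

```shell
#!/bin/sh
# Sketch: wrap the test binary in compute-sanitizer under a single MPI rank.
# DRY_RUN is a hypothetical knob: 1 prints the command, 0 runs it.
run_memcheck() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "mpirun -np 1 compute-sanitizer --tool memcheck $*"
  else
    mpirun -np 1 compute-sanitizer --tool memcheck "$@"
  fi
}

# Shortened argument list for illustration; append the remaining reproducer flags.
run_memcheck ./staggered_invert_test --dim 24 24 24 24 --gridsize 1 1 1 1 --partition 8
```

Expect a substantial slowdown under memcheck, so it is worth using the smallest volume that still fails.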