Open weinbe2 opened 1 week ago
When running with export CUDA_LAUNCH_BLOCKING=1, I can see it's being triggered by the ghost packing kernel:
MG level 0 (GPU): ERROR: qudaLaunchKernel returned an illegal memory access was encountered
(/home/scratch.eweinberg_sw/2024-08-29QudaMilc/quda/lib/targets/cuda/quda_api.cpp:152 in qudaLaunchKernel())
(rank 0, host ipp1-1776.nvidia.com, quda_api.cpp:58 in void quda::target::cuda::set_runtime_error(cudaError_t, const char*, const char*, const char*, const char*, bool)())
MG level 0 (GPU): last kernel called was (name=N4quda4PackIfLi3ELb0EEE,volume=24x24x24x24,aux=policy_kernel,vol=331776,parity=2,precision=4,order=2,Ns=1,Nc=3,n_rhs=1,comm=0001,topo=1111,nFace=3,device-device,striped)
Stack trace (most recent call last):
#28 Object "[0xffffffffffffffff]", at 0xffffffffffffffff, in
#27 Object "./staggered_invert_test", at 0x55dae0b93034, in
#26 Object "/lib/x86_64-linux-gnu/libc.so.6", at 0x153a75b4fe3f, in __libc_start_main
#25 Object "/lib/x86_64-linux-gnu/libc.so.6", at 0x153a75b4fd8f, in
#24 Object "./staggered_invert_test", at 0x55dae0b92bf9, in
#23 Object "./staggered_invert_test", at 0x55dae0b9863f, in
#22 Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a7748233a, in invertQuda
#21 Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a773278d9, in quda::solve(std::vector<void*, std::allocator<void*> > const&, std::vector<void*, std::allocator<void*> > const&, QudaInvertParam_s&, quda::GaugeField const&)
#20 Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a77324f18, in quda::solve(quda::vector_ref<quda::ColorSpinorField> const&, quda::vector_ref<quda::ColorSpinorField> const&, quda::Dirac&, quda::Dirac&, quda::Dirac&, quda::Dirac&, QudaInvertParam_s&)
#19 Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a77440931, in quda::GCR::operator()(quda::vector_ref<quda::ColorSpinorField> const&, quda::vector_ref<quda::ColorSpinorField const> const&)
#18 Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a773b5b8c, in quda::MG::operator()(quda::vector_ref<quda::ColorSpinorField> const&, quda::vector_ref<quda::ColorSpinorField const> const&)
#17 Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a7741ce00, in quda::CAGCR::operator()(quda::vector_ref<quda::ColorSpinorField> const&, quda::vector_ref<quda::ColorSpinorField const> const&)
#16 Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a77329c1c, in quda::DiracM::operator()(quda::vector_ref<quda::ColorSpinorField> const&, quda::vector_ref<quda::ColorSpinorField const> const&) const
#15 Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a774ea226, in quda::DiracImprovedStaggered::M(quda::vector_ref<quda::ColorSpinorField> const&, quda::vector_ref<quda::ColorSpinorField const> const&) const
#14 Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a772a72da, in quda::ApplyImprovedStaggered(quda::vector_ref<quda::ColorSpinorField> const&, quda::vector_ref<quda::ColorSpinorField const> const&, quda::GaugeField const&, quda::GaugeField const&, double, quda::vector_ref<quda::ColorSpinorField const> const&, int, bool, int const*, quda::TimeProfile&)
#13 Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a772b38db, in void quda::instantiate<quda::ImprovedStaggeredApply, quda::ReconstructStaggered, float, 3, quda::GaugeField const&, double&, int&, bool&, int const*&, quda::TimeProfile&>(quda::vector_ref<quda::ColorSpinorField> const&, quda::vector_ref<quda::ColorSpinorField const> const&, quda::vector_ref<quda::ColorSpinorField const> const&, quda::GaugeField const&, quda::GaugeField const&, double&, int&, bool&, int const*&, quda::TimeProfile&)
#12 Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a772b3336, in quda::ImprovedStaggeredApply<float, 3, (QudaReconstructType_s)18>::ImprovedStaggeredApply(quda::vector_ref<quda::ColorSpinorField> const&, quda::vector_ref<quda::ColorSpinorField const> const&, quda::vector_ref<quda::ColorSpinorField const> const&, quda::GaugeField const&, quda::GaugeField const&, double, int, bool, int const*, quda::TimeProfile&)
#11 Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a772af731, in quda::dslash::DslashPolicyTune<quda::Staggered<quda::StaggeredArg<float, 3, 4, (QudaReconstructType_s)18, (QudaReconstructType_s)18, true, (QudaStaggeredPhase_s)1> > >::DslashPolicyTune(quda::Staggered<quda::StaggeredArg<float, 3, 4, (QudaReconstructType_s)18, (QudaReconstructType_s)18, true, (QudaStaggeredPhase_s)1> >&, quda::vector_ref<quda::ColorSpinorField const> const&, quda::ColorSpinorField const&, quda::TimeProfile&)
#10 Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a772aee07, in quda::dslash::DslashPolicyTune<quda::Staggered<quda::StaggeredArg<float, 3, 4, (QudaReconstructType_s)18, (QudaReconstructType_s)18, true, (QudaStaggeredPhase_s)1> > >::apply(quda::qudaStream_t const&)
#9 Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a772be19c, in quda::dslash::DslashBasic<quda::Staggered<quda::StaggeredArg<float, 3, 4, (QudaReconstructType_s)18, (QudaReconstructType_s)18, true, (QudaStaggeredPhase_s)1> > >::operator()(quda::Staggered<quda::StaggeredArg<float, 3, 4, (QudaReconstructType_s)18, (QudaReconstructType_s)18, true, (QudaStaggeredPhase_s)1> >&, quda::vector_ref<quda::ColorSpinorField const> const&, quda::ColorSpinorField const&, quda::TimeProfile&)
#8 Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a772b66f0, in void quda::dslash::issuePack<quda::Staggered<quda::StaggeredArg<float, 3, 4, (QudaReconstructType_s)18, (QudaReconstructType_s)18, true, (QudaStaggeredPhase_s)1> > >(quda::ColorSpinorField const&, quda::vector_ref<quda::ColorSpinorField const> const&, quda::Staggered<quda::StaggeredArg<float, 3, 4, (QudaReconstructType_s)18, (QudaReconstructType_s)18, true, (QudaStaggeredPhase_s)1> > const&, int, quda::MemoryLocation, int, int)
#7 Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a774bb3fe, in quda::ColorSpinorField::packGhost(int, QudaParity_s, int, quda::qudaStream_t const&, quda::MemoryLocation*, quda::MemoryLocation, bool, double, double, double, int, quda::vector_ref<quda::ColorSpinorField const> const&) const
#6 Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a772d7f4e, in quda::PackGhost(void**, quda::ColorSpinorField const&, quda::vector_ref<quda::ColorSpinorField const> const&, quda::MemoryLocation, int, bool, int, bool, double, double, double, int, quda::qudaStream_t const&)
#5 Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a772f6792, in quda::GhostPack<float, 3>::GhostPack(quda::ColorSpinorField const&, quda::vector_ref<quda::ColorSpinorField const> const&, void**, quda::MemoryLocation, int, bool, int, bool, double, double, double, int, quda::qudaStream_t const&)
#4 Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a772f5a3a, in quda::Pack<float, 3, false>::apply(quda::qudaStream_t const&)
#3 Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a77591756, in quda::qudaLaunchKernel(void const*, quda::TuneParam const&, quda::qudaStream_t const&, void const*)
#2 Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a7758ff71, in quda::target::cuda::set_runtime_error(cudaError, char const*, char const*, char const*, char const*, bool)
#1 Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a774b061b, in errorQuda_(char const*, char const*, int, ...)
#0 Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a7750c63d, in quda::comm_abort(int)
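As a sanity check on the log above, the name= field in the "last kernel called" line is an Itanium-ABI mangled type name, and it can be demangled to confirm that it matches frame #4 of the stack trace. The following is a minimal sketch, assuming a GCC or Clang toolchain that provides abi::__cxa_demangle; the helper name demangle is hypothetical and not part of QUDA.

```cpp
#include <cassert>
#include <cstdlib>
#include <cxxabi.h>  // abi::__cxa_demangle (GCC/Clang, Itanium C++ ABI)
#include <string>

// Hypothetical helper (not a QUDA function): demangle an Itanium-ABI name.
// Returns the input unchanged if demangling fails.
std::string demangle(const char *mangled)
{
  int status = 0;
  char *out = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
  if (status != 0 || out == nullptr) return mangled;
  std::string result(out);
  std::free(out);  // the buffer returned by __cxa_demangle is malloc'ed
  return result;
}
```

Calling demangle("N4quda4PackIfLi3ELb0EEE") should give quda::Pack<float, 3, false>, i.e. the same kernel class whose apply() appears in frame #4.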
In brief, there is an out-of-bounds (oob) error when running HISQ MG with long links dropped, and it can be triggered without ever dropping to a true coarser level. It only appears with non-zero partitioning; I haven't tested whether a true multi-GPU run is affected. There are no issues when "normal" HISQ MG is run (improved staggered on the pseudo-fine level as well), suggesting that something is going awry when switching between the improved staggered (outer level) and unimproved staggered (inner level) operators.
The error does not hit until the first solve, i.e. after setup has completed. More specifically, it triggers when returning to the fine level from the pseudo-fine level, i.e. when switching from applying the unimproved operator back to applying the improved operator. The time at which it hits (when it does) depends on the local volume: no error on ~16^4, but it hits on the first iteration on ~24^4+. It does seem to be deterministic for a fixed command (including volume), at least.
This error hits regardless of whether tuning is enabled.
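To put rough numbers on the improved/unimproved switch described above, here is a small arithmetic sketch of how much data one packed ghost face holds for the parameters in the log (volume=24x24x24x24, Nc=3, precision=4, nFace=3). This is not a diagnosis. It assumes a simplified size formula (face sites x nFace x Nc x 2 reals x bytes per real, both parities, no padding) that may differ from QUDA's actual buffer layout, and it assumes the unimproved operator uses nFace=1 because it has only one-hop terms. The function name ghost_face_bytes is hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical helper (not a QUDA function): bytes in one packed ghost face of
// a staggered color vector, for one direction of dimension d, both parities.
// Simplified assumption: no padding and no extra norm/scale arrays.
std::size_t ghost_face_bytes(const std::array<int, 4> &dim, int d, int n_face,
                             int n_color, std::size_t bytes_per_real)
{
  std::size_t face_sites = 1;
  for (int i = 0; i < 4; i++) {
    if (i != d) face_sites *= static_cast<std::size_t>(dim[i]);  // 3-d face volume
  }
  // n_face slices deep, n_color complex numbers per site, 2 reals per complex
  return static_cast<std::size_t>(n_face) * face_sites
       * static_cast<std::size_t>(n_color) * 2 * bytes_per_real;
}
```

Under these assumptions, a 24^4 local volume gives 995328 bytes per face at nFace=3 versus 331776 bytes at nFace=1, so any buffer or indexing state left over from the nFace=1 operator would be a factor of three too small for the nFace=3 pack. Since every dimension is 24 here, the result does not depend on which dimension comm=0001 refers to.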
A command that triggers it is as follows:
This is roughly trimmed down as much as possible; the various combinations of mat and direct are non-default but required for HISQ MG as currently implemented. As noted above, you never actually need to enter a true coarse solve to trigger the error, but you do still need to compile with Nc = 24 for the KD-operator construction.
A representative error message is:
My cmake command was: