Open pgrete opened 3 months ago
We've seen this error in AMReX codes due to a HIP compiler bug (e.g.: https://github.com/AMReX-Astro/Microphysics/issues/1386#issuecomment-1854829106)
Adding -mllvm -amdgpu-function-calls=true to the HIP compiler flags works around that issue. Does that help for this case?
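For reference, a minimal sketch of one way to pass that flag to a CMake-based build. It assumes the project takes its C++ flags from the CXXFLAGS environment variable at first configure; the build directory name is hypothetical.

```shell
# Workaround flag quoted above, appended to any CXXFLAGS that are already set.
HIP_WORKAROUND="-mllvm -amdgpu-function-calls=true"
export CXXFLAGS="${CXXFLAGS:+${CXXFLAGS} }${HIP_WORKAROUND}"
echo "CXXFLAGS=${CXXFLAGS}"

# Then configure from a fresh build directory (name is an assumption), e.g.:
#   cmake -S . -B build-hip -DCMAKE_CXX_COMPILER=hipcc
```

CMake only reads CXXFLAGS when it first configures a build tree, so an existing build directory would need to be reconfigured from scratch for the flag to take effect.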
Which compiler are you using? I just tried with Cray (which I've been using so far) and it didn't help.
I think I've only used hipcc/amdclang++ for HIP builds recently (i.e., -DCMAKE_CXX_COMPILER=hipcc). But I think I had the PrgEnv-cray modules loaded, so I don't know what it's actually doing 🤷.
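To see what those compiler commands actually resolve to with a given set of modules loaded, something like the following sketch might help. The report_tool helper is hypothetical, and the list of tools is just the ones mentioned in this thread plus the Cray CC wrapper.

```shell
# Print where a compiler command resolves to and the first lines of its version banner.
report_tool() {
    tool="$1"
    if tool_path=$(command -v "$tool" 2>/dev/null); then
        echo "$tool -> $tool_path"
        "$tool" --version 2>&1 | head -n 3
    else
        echo "$tool not found in PATH"
    fi
}

for t in hipcc amdclang++ CC; do
    report_tool "$t"
done
```

Running this in the same shell as the build, after the module load commands, should show whether hipcc comes from the ROCm module or from somewhere else.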
Although we only saw this problem for very large kernels (e.g., with reaction networks), so it may not be related.
I've also tried https://rocm.docs.amd.com/en/latest/conceptual/using-gpu-sanitizer.html#compiling-for-address-sanitizer to debug these memory errors. This sometimes worked, but it also produces some false positives with global vars...
I've now tried the WarpX recommendations, i.e.,
# from https://warpx.readthedocs.io/en/latest/install/hpc/frontier.html#frontier-olcf
module load cmake/3.23.2
module load craype-accel-amd-gfx90a
module load rocm/5.2.0 # waiting for 5.6 for next bump
module load cray-mpich
module load cce/15.0.0
module load ninja
module load hdf5/1.14.0
# compiler environment hints
export CC=$(which hipcc)
export CXX=$(which hipcc)
export FC=$(which ftn)
export CFLAGS="-I${ROCM_PATH}/include"
export CXXFLAGS="-I${ROCM_PATH}/include -Wno-pass-failed"
export LDFLAGS="-L${ROCM_PATH}/lib -lamdhip64 ${PE_MPICH_GTL_DIR_amd_gfx90a} -lmpi_gtl_hsa"
export MPICH_GPU_SUPPORT_ENABLED=1
Still the same issue.
Ah, well, nevermind :/
Does running with -DENABLE_ASAN=ON show anything?
On Frontier I see many instances of the following error
:0:rocdevice.cpp :2660: 556940992572 us: 32834: [tid:0x7f9e41945700] Device::callbackQueue aborting with error : HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION: The agent attempted to access memory beyond the largest legal address. code: 0x29
when running the following input file and current develop (b28c738).
Changing
to
shows no issues.
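Since the error shows up many times per run, a small helper that counts occurrences in a captured log can make it easier to compare builds or inputs. This is only a sketch: the function name, the log file name, and the run command in the comment are all hypothetical.

```shell
# Count lines in a log that report the HSA aperture violation.
count_aperture_violations() {
    # grep -c prints the number of matching lines; it exits non-zero when
    # there are none, so "|| true" keeps the helper from failing in that case.
    grep -c "HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION" "$1" || true
}

# Example usage (run command and file name are assumptions):
#   srun ./my_app -i my_input.in 2>&1 | tee run.log
#   count_aperture_violations run.log
```

A count of 0 after changing the input or the compiler flags would match the "shows no issues" observation above.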