parthenon-hpc-lab / parthenon

Parthenon AMR infrastructure
https://parthenon-hpc-lab.github.io/parthenon/

HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION #1101

Open pgrete opened 3 months ago

pgrete commented 3 months ago

On Frontier I see the following error (or, to be more specific, many of them):

:0:rocdevice.cpp :2660: 556940992572 us: 32834: [tid:0x7f9e41945700] Device::callbackQueue aborting with error : HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION: The agent attempted to access memory beyond the largest legal address. code: 0x29

when running the following input file

$ srun -N 128 -n 1024 -c 1 --gpus-per-node=8 --gpu-bind=closest /ccs/proj/ast146/pgrete/src/athenapk/external/parthenon/build-bisect-latest-next-dev/example/advection/advection-example -i parthinput.advection_smaller
<parthenon/job>
problem_id = advection

<parthenon/mesh>
refinement  = static
nghost = 2

nx1        = 1024       # Number of zones in X1-direction
x1min      =-3.2     # minimum value of X1
x1max      = 3.2     # maximum value of X1
ix1_bc     = periodic        # inner-X1 boundary flag
ox1_bc     = periodic        # outer-X1 boundary flag

nx2        = 1024       # Number of zones in X2-direction
x2min      =-3.2     # minimum value of X2
x2max      = 3.2     # maximum value of X2
ix2_bc     = periodic        # inner-X2 boundary flag
ox2_bc     = periodic        # outer-X2 boundary flag

nx3        = 1024       # Number of zones in X3-direction
x3min      =-3.2     # minimum value of X3
x3max      = 3.2     # maximum value of X3
ix3_bc     = periodic        # inner-X3 boundary flag
ox3_bc     = periodic        # outer-X3 boundary flag

<parthenon/meshblock>
nx1        = 128        # Number of zones in X1-direction
nx2        = 128        # Number of zones in X2-direction
nx3        = 128        # Number of zones in X3-direction

<parthenon/static_refinement4>
x1min = -0.4 
x1max =  0.4
x2min = -0.4
x2max =  0.4
x3min = -0.4
x3max =  0.4
level = 4

<parthenon/static_refinement5>
x1min = -0.2 
x1max =  0.2
x2min = -0.2
x2max =  0.2
x3min = -0.2
x3max =  0.2
level = 5

#<parthenon/static_refinement6>
#x1min = -0.1125 
#x1max =  0.1125
#x2min = -0.1125
#x2max =  0.1125
#x3min = -0.1125
#x3max =  0.1125
#level = 6

<parthenon/time>
tlim = 1.0
integrator = rk1
nlim = 100
ncycle_out_mesh = -100000

<Advection>
cfl = 0.30
vx = 1.0
vy = 2.0
vz = 3.0
profile = smooth_gaussian
ang_2 = 0.0
ang_3 = 0.0
ang_2_vert = false
ang_3_vert = false
amp = 1.0 

num_vars = 5
#vec_size = 5

refine_tol = 1.01    # control the package specific refinement tagging function
derefine_tol = 1.001
compute_error = true

<parthenon/output0>
file_type = rst 
dt = 1.0 

This is with current develop (b28c738).

Changing

num_vars = 5
#vec_size = 5

to

#num_vars = 5
vec_size = 5

makes the issue disappear.

BenWibking commented 3 months ago

We've seen this error in AMReX codes due to a HIP compiler bug (e.g.: https://github.com/AMReX-Astro/Microphysics/issues/1386#issuecomment-1854829106)

Adding -mllvm -amdgpu-function-calls=true to the HIP compiler flags works around that issue. Does that help for this case?
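
For a CMake build, passing the flag could look something like this (an untested sketch; adjust the compiler and options to your setup):

# append the workaround to the compiler flags before configuring
export CXXFLAGS="${CXXFLAGS} -mllvm -amdgpu-function-calls=true"
cmake -S . -B build -DCMAKE_CXX_COMPILER=hipcc
cmake --build build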

pgrete commented 3 months ago

Which compiler are you using? I just tried with Cray (which I've been using so far) and it didn't help.

BenWibking commented 3 months ago

I think I've used only hipcc/amdclang++ for HIP builds recently (i.e., -DCMAKE_CXX_COMPILER=hipcc). But I think I had the PrgEnv-cray modules loaded, so I don't know what it's actually doing 🤷 .
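
If it helps to check what the wrapper actually resolves to, hipconfig ships with ROCm (a quick sketch):

hipcc --version     # reports the underlying clang version
hipconfig --full    # prints HIP platform, compiler, and install paths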

BenWibking commented 3 months ago

We only saw this problem for very large kernels (e.g., with reaction networks), though, so it may not be related.

BenWibking commented 3 months ago

I've also tried https://rocm.docs.amd.com/en/latest/conceptual/using-gpu-sanitizer.html#compiling-for-address-sanitizer to debug these memory errors. This sometimes worked, but it also produced some false positives with global vars...
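
If memory serves, the setup from that page looks roughly like this (a sketch based on the linked ROCm docs; flags may differ between ROCm versions, and gfx90a is assumed for Frontier's MI250X):

# build host and device code with ASan, targeting an xnack-enabled arch
export CXXFLAGS="${CXXFLAGS} -fsanitize=address -shared-libsan --offload-arch=gfx90a:xnack+"
export LDFLAGS="${LDFLAGS} -fsanitize=address -shared-libsan"

# enable XNACK at run time so device-side violations can be caught
export HSA_XNACK=1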

pgrete commented 3 months ago

I now tried the WarpX recommendations, i.e.,

# from https://warpx.readthedocs.io/en/latest/install/hpc/frontier.html#frontier-olcf

module load cmake/3.23.2
module load craype-accel-amd-gfx90a
module load rocm/5.2.0  # waiting for 5.6 for next bump
module load cray-mpich
module load cce/15.0.0
module load ninja
module load hdf5/1.14.0

# compiler environment hints
export CC=$(which hipcc)
export CXX=$(which hipcc)
export FC=$(which ftn)
export CFLAGS="-I${ROCM_PATH}/include"
export CXXFLAGS="-I${ROCM_PATH}/include -Wno-pass-failed"
export LDFLAGS="-L${ROCM_PATH}/lib -lamdhip64 ${PE_MPICH_GTL_DIR_amd_gfx90a} -lmpi_gtl_hsa"

export MPICH_GPU_SUPPORT_ENABLED=1

Still the same issue.

BenWibking commented 3 months ago

Ah, well, nevermind :/

BenWibking commented 3 months ago

Does running with -DENABLE_ASAN=ON show anything?
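
For example (a sketch; assumes the build picks up ENABLE_ASAN and that the problem still reproduces under the sanitizer):

cmake -S . -B build-asan -DCMAKE_CXX_COMPILER=hipcc -DENABLE_ASAN=ON
cmake --build build-asan
# rerun the same job and look for an AddressSanitizer report in stderr
srun -N 128 -n 1024 -c 1 --gpus-per-node=8 --gpu-bind=closest \
  ./build-asan/example/advection/advection-example -i parthinput.advection_smaller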