parthenon-hpc-lab / parthenon

Parthenon AMR infrastructure
https://parthenon-hpc-lab.github.io/parthenon/

`MPI_Comm_dup` error on init on Frontier #1102

Open pgrete opened 5 months ago

pgrete commented 5 months ago

New day, new issues. I just tried the latest AMD software stack on Frontier:

module load cpe/23.12
module load PrgEnv-amd
module load amd/5.7.1
module load craype-accel-amd-gfx90a cmake cray-hdf5-parallel cray-python ninja
export MPICH_GPU_SUPPORT_ENABLED=1

and this results in non-functional code (e.g., the advection example):

Assertion failed in file ../src/mpid/common/cray/cray_gpu_ops.c at line 188: mpi_errno == MPI_SUCCESS
/opt/cray/pe/lib64/libmpi_amd.so.12(MPL_backtrace_show+0x26) [0x7fffebab367b]
/opt/cray/pe/lib64/libmpi_amd.so.12(+0x22bf374) [0x7fffeb4d9374]
/opt/cray/pe/lib64/libmpi_amd.so.12(+0x2725368) [0x7fffeb93f368]
/opt/cray/pe/lib64/libmpi_amd.so.12(+0x2168420) [0x7fffeb382420]
/opt/cray/pe/lib64/libmpi_amd.so.12(+0x1fa237c) [0x7fffeb1bc37c]
/opt/cray/pe/lib64/libmpi_amd.so.12(+0x1fa028c) [0x7fffeb1ba28c]
/opt/cray/pe/lib64/libmpi_amd.so.12(+0x6d4cf1) [0x7fffe98eecf1]
/opt/cray/pe/lib64/libmpi_amd.so.12(PMPI_Comm_dup+0x174) [0x7fffe98eef34]
/sw/frontier/spack-envs/base/opt/cray-sles15-zen3/cce-15.0.0/darshan-runtime-3.4.0-t6el25xrwgfg5j65rdrhrs3qjp4ojssp/lib/libdarshan.so.0(darshan_core_initialize+0xa8) [0x7fffebbd3f68]
/sw/frontier/spack-envs/base/opt/cray-sles15-zen3/cce-15.0.0/darshan-runtime-3.4.0-t6el25xrwgfg5j65rdrhrs3qjp4ojssp/lib/libdarshan.so.0(MPI_Init+0x7d) [0x7fffebbd3d0d]
/ccs/proj/ast146/pgrete/src/athenapk/external/parthenon/build-bisect-def-atomics-benfix-cpe2312/example/advection/advection-example() [0x335280a]
/ccs/proj/ast146/pgrete/src/athenapk/external/parthenon/build-bisect-def-atomics-benfix-cpe2312/example/advection/advection-example() [0x3050e40]
/lib64/libc.so.6(__libc_start_main+0xef) [0x7fffe89f924d]
/ccs/proj/ast146/pgrete/src/athenapk/external/parthenon/build-bisect-def-atomics-benfix-cpe2312/example/advection/advection-example() [0x2f4ce6a]
MPICH ERROR [Rank 0] [job id 2015481.11] [Tue Jun 11 08:41:29 2024] [frontier00491] - Abort(1): Internal error

srun: error: frontier00491: task 0: Exited with exit code 1
srun: Terminating StepId=2015481.11

pgrete commented 5 months ago

Same issue with PrgEnv-cray
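
One detail in the backtrace above may help narrow this down: the failing `PMPI_Comm_dup` call comes from `darshan_core_initialize` in `libdarshan.so.0`, which wraps `MPI_Init`, so the abort happens during `MPI_Init` and before any of the application's own MPI calls. A minimal sketch for checking whether Darshan is involved follows. The helper name `links_darshan` is hypothetical, and the module name `darshan-runtime` is an assumption inferred from the library path in the trace (confirm it with `module list`).

```shell
#!/bin/sh
# Hypothetical helper: reads `ldd` output on stdin and succeeds (exit 0)
# if a Darshan runtime library appears among the shared objects.
links_darshan() {
    grep -q 'libdarshan'
}

# Usage sketch (not run here; the binary path is a placeholder):
#   ldd ./example/advection/advection-example | links_darshan \
#       && echo "Darshan is linked into this binary"
#
# If it is, one experiment is to drop Darshan from the environment,
# rebuild, and rerun. The module name below is an assumption:
#   module unload darshan-runtime
```

If the error disappears without Darshan, that would point at the interaction between Darshan's `MPI_Init` wrapper and the GPU-aware MPICH path (`MPICH_GPU_SUPPORT_ENABLED=1`) rather than at Parthenon itself. If it persists, Darshan is likely just the first caller to hit the problem.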