pgrete opened this issue 5 months ago
New day, new issues. I just tried the latest AMD software stack on Frontier:
module load cpe/23.12
module load PrgEnv-amd
module load amd/5.7.1
module load craype-accel-amd-gfx90a cmake cray-hdf5-parallel cray-python ninja
export MPICH_GPU_SUPPORT_ENABLED=1
and this results in non-functional code (e.g., the advection example aborts during MPI_Init):
Assertion failed in file ../src/mpid/common/cray/cray_gpu_ops.c at line 188: mpi_errno == MPI_SUCCESS
/opt/cray/pe/lib64/libmpi_amd.so.12(MPL_backtrace_show+0x26) [0x7fffebab367b]
/opt/cray/pe/lib64/libmpi_amd.so.12(+0x22bf374) [0x7fffeb4d9374]
/opt/cray/pe/lib64/libmpi_amd.so.12(+0x2725368) [0x7fffeb93f368]
/opt/cray/pe/lib64/libmpi_amd.so.12(+0x2168420) [0x7fffeb382420]
/opt/cray/pe/lib64/libmpi_amd.so.12(+0x1fa237c) [0x7fffeb1bc37c]
/opt/cray/pe/lib64/libmpi_amd.so.12(+0x1fa028c) [0x7fffeb1ba28c]
/opt/cray/pe/lib64/libmpi_amd.so.12(+0x6d4cf1) [0x7fffe98eecf1]
/opt/cray/pe/lib64/libmpi_amd.so.12(PMPI_Comm_dup+0x174) [0x7fffe98eef34]
/sw/frontier/spack-envs/base/opt/cray-sles15-zen3/cce-15.0.0/darshan-runtime-3.4.0-t6el25xrwgfg5j65rdrhrs3qjp4ojssp/lib/libdarshan.so.0(darshan_core_initialize+0xa8) [0x7fffebbd3f68]
/sw/frontier/spack-envs/base/opt/cray-sles15-zen3/cce-15.0.0/darshan-runtime-3.4.0-t6el25xrwgfg5j65rdrhrs3qjp4ojssp/lib/libdarshan.so.0(MPI_Init+0x7d) [0x7fffebbd3d0d]
/ccs/proj/ast146/pgrete/src/athenapk/external/parthenon/build-bisect-def-atomics-benfix-cpe2312/example/advection/advection-example() [0x335280a]
/ccs/proj/ast146/pgrete/src/athenapk/external/parthenon/build-bisect-def-atomics-benfix-cpe2312/example/advection/advection-example() [0x3050e40]
/lib64/libc.so.6(__libc_start_main+0xef) [0x7fffe89f924d]
/ccs/proj/ast146/pgrete/src/athenapk/external/parthenon/build-bisect-def-atomics-benfix-cpe2312/example/advection/advection-example() [0x2f4ce6a]
MPICH ERROR [Rank 0] [job id 2015481.11] [Tue Jun 11 08:41:29 2024] [frontier00491] - Abort(1): Internal error
srun: error: frontier00491: task 0: Exited with exit code 1
srun: Terminating StepId=2015481.11
The same issue occurs with PrgEnv-cray.