icl-utk-edu / slate

SLATE is a distributed, GPU-accelerated, dense linear algebra library targeting current and upcoming high-performance computing (HPC) systems. It is developed as part of the U.S. Department of Energy Exascale Computing Project (ECP).
https://icl.utk.edu/slate/
BSD 3-Clause "New" or "Revised" License

Segmentation fault in slate::gesv when using CUDA-aware MPI #154

Open liamscarlett opened 8 months ago

liamscarlett commented 8 months ago

Description: I have been successfully running gesv with GPU-aware MPI on an AMD machine with HIP (Setonix @ Pawsey Supercomputing Centre, Australia), but I get segmentation faults when trying to do the same on NVIDIA GPUs (both on Gadi @ NCI Australia using CUDA-aware Open MPI, and on Frontera @ TACC using CUDA-aware Intel MPI).

I am running SLATE's provided gesv example (examples/ex06_linear_system_lu.cc), modified only to set the target to devices in the gesv options. With SLATE_GPU_AWARE_MPI=0 it runs fine, but with SLATE_GPU_AWARE_MPI=1 I get a segmentation fault with the following backtrace (on Frontera):

rank 1: void test_lu() [with scalar_type = float]
rank 2: void test_lu() [with scalar_type = float]
rank 3: void test_lu() [with scalar_type = float]
mpi_size 4, grid_p 2, grid_q 2
rank 0: void test_lu() [with scalar_type = float]
[c197-101:24026:0:24026] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x2b1588ec2c10)
[c197-101:24025:0:24025] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x2b4360ec2c10)
==== backtrace (tid:  24026) ====
 0 0x000000000004cb95 ucs_debug_print_backtrace()  ???:0
 1 0x000000000089fa45 bdw_memcpy_write()  /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/shm/posix/eager/include/intel_transport_memcpy.h:128
 2 0x000000000089fa45 bdw_memcpy_write()  /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/shm/posix/eager/include/intel_transport_memcpy.h:123
 3 0x000000000089bce9 write_to_cell()  /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/shm/posix/eager/include/intel_transport_memcpy.h:326
 4 0x000000000089bce9 send_cell()  /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/shm/posix/eager/include/intel_transport_send.h:890
 5 0x00000000008959a4 MPIDI_POSIX_eager_send()  /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/shm/posix/eager/include/intel_transport_send.h:1540
 6 0x00000000004a0989 MPIDI_POSIX_eager_send()  /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/shm/posix/eager/include/posix_eager_impl.h:37
 7 0x00000000004a0989 MPIDI_POSIX_am_isend()  /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/shm/src/../src/../posix/posix_am.h:220
 8 0x00000000004a0989 MPIDI_SHM_am_isend()  /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/shm/src/../src/shm_am.h:49
 9 0x00000000004a0989 MPIDIG_isend_impl()  /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/generic/mpidig_send.h:116
10 0x00000000004a176d MPIDIG_am_isend()  /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/generic/mpidig_send.h:172
11 0x00000000004a176d MPIDIG_mpi_isend()  /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/generic/mpidig_send.h:233
12 0x00000000004a176d MPIDI_POSIX_mpi_isend()  /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/shm/src/../src/../posix/posix_send.h:59
13 0x00000000004a176d MPIDI_SHM_mpi_isend()  /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/shm/src/../src/shm_p2p.h:187
14 0x00000000004a176d MPIDI_isend_unsafe()  /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/ch4_send.h:314
15 0x00000000004a176d MPIDI_isend_safe()  /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/ch4_send.h:609
16 0x00000000004a176d MPID_Isend()  /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/ch4_send.h:828
17 0x00000000004a176d PMPI_Isend()  /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpi/pt2pt/isend.c:132
18 0x000000000042fd7a slate::Tile<float>::isend()  ???:0
19 0x000000000043b6f4 slate::BaseMatrix<float>::tileIbcastToSet()  ???:0
20 0x00000000004c0855 slate::BaseMatrix<float>::listBcast<(slate::Target)68>()  ???:0
21 0x000000000063a0c6 slate::impl::getrf<(slate::Target)68, float>()  getrf.cc:0
22 0x00000000000163ec GOMP_taskwait()  /admin/build/admin/rpms/frontera/BUILD/gcc-9.1.0/x86_64-pc-linux-gnu/libgomp/../.././libgomp/task.c:1537
23 0x000000000061eac1 slate::impl::getrf<(slate::Target)68, float>()  getrf.cc:0
24 0x0000000000012a22 GOMP_parallel()  /admin/build/admin/rpms/frontera/BUILD/gcc-9.1.0/x86_64-pc-linux-gnu/libgomp/../.././libgomp/parallel.c:171
25 0x0000000000620256 slate::impl::getrf<(slate::Target)68, float>()  ???:0
26 0x00000000006204ff slate::getrf<float>()  ???:0
27 0x00000000006038d7 slate::gesv<float>()  ???:0
28 0x00000000004199de test_lu<float>()  ???:0
29 0x000000000041809d main()  ???:0
30 0x0000000000022555 __libc_start_main()  ???:0
31 0x0000000000411709 _start()  ???:0
=================================

and on Gadi:

mpi_size 4, grid_p 2, grid_q 2
rank 0: void test_lu() [with scalar_type = float]
rank 1: void test_lu() [with scalar_type = float]
rank 2: void test_lu() [with scalar_type = float]
rank 3: void test_lu() [with scalar_type = float]
OMP: Info #277: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
[1702299288.183060] [gadi-gpu-v100-0002:4184291:0]         cuda_md.c:162  UCX  ERROR cuMemGetAddressRange(0x15449a8c2e10) error: named symbol not found
OMP: Info #277: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
[1702299288.193415] [gadi-gpu-v100-0002:4184290:0]         cuda_md.c:162  UCX  ERROR cuMemGetAddressRange(0x1493128c2e10) error: named symbol not found
OMP: Info #277: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
OMP: Info #277: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
[gadi-gpu-v100-0002:4184290:0:4184290] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x1493128c2e20)
[gadi-gpu-v100-0002:4184291:0:4184291] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x15449a8c2e20)
[1702299288.438796] [gadi-gpu-v100-0002:4184288:0]         cuda_md.c:162  UCX  ERROR cuMemGetAddressRange(0x147bf7cc2440) error: named symbol not found
[gadi-gpu-v100-0002:4184288:0:4184288] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x147bf7cc2440)
==== backtrace (tid:4184291) ====
 0 0x0000000000012cf0 __funlockfile()  :0
 1 0x00000000000cf006 __memmove_avx_unaligned_erms()  :0
 2 0x0000000000089039 ucp_eager_only_handler()  ???:0
 3 0x000000000001687d uct_mm_iface_progress()  :0
 4 0x000000000004951a ucp_worker_progress()  ???:0
 5 0x00000000000ab8cf mca_pml_ucx_recv()  /jobfs/53639599.gadi-pbs/0/openmpi/4.1.4/source/openmpi-4.1.4/ompi/mca/pml/ucx/pml_ucx.c:646
 6 0x0000000000208915 PMPI_Recv()  /jobfs/53639599.gadi-pbs/0/openmpi/4.1.4/build/gcc/ompi/precv.c:82
 7 0x000000000065a550 slate::Tile<float>::recv()  /jobfs/91791886.gadi-pbs/0/slate/2023.06.00/source/slate-2023.06.00/./include/slate/Tile.hh:1094
 8 0x0000000000b0a1ca slate::BaseMatrix<float>::tileIbcastToSet()  /jobfs/91791886.gadi-pbs/0/slate/2023.06.00/source/slate-2023.06.00/./include/slate/BaseMatrix.hh:2403
 9 0x0000000000b0a1ca ???()  /half-root/usr/include/c++/8/bits/shared_ptr_base.h:1013
10 0x0000000000b0a1ca ???()  /jobfs/91791886.gadi-pbs/0/slate/2023.06.00/source/slate-2023.06.00/./include/slate/BaseMatrix.hh:492
11 0x0000000000b0a1ca slate::BaseMatrix<float>::tileIbcastToSet()  /jobfs/91791886.gadi-pbs/0/slate/2023.06.00/source/slate-2023.06.00/./include/slate/BaseMatrix.hh:2404
12 0x0000000000c5d690 slate::BaseMatrix<float>::listBcast<(slate::Target)68>()  /jobfs/91791886.gadi-pbs/0/slate/2023.06.00/source/slate-2023.06.00/./include/slate/BaseMatrix.hh:1979
13 0x000000000101dcf3 slate::impl::getrf<(slate::Target)68, float>()  /jobfs/91791886.gadi-pbs/0/slate/2023.06.00/source/slate-2023.06.00/src/getrf.cc:106
14 0x0000000000109f09 _INTERNAL8bc508f1::__kmp_invoke_task()  /nfs/site/proj/openmp/promo/20230428/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.tcm0.t1..h1.w1-anompbdwlin05/../../src/kmp_tasking.cpp:1856
15 0x000000000011094b __kmp_omp_task()  /nfs/site/proj/openmp/promo/20230428/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.tcm0.t1..h1.w1-anompbdwlin05/../../src/kmp_tasking.cpp:1974
16 0x0000000000101007 __kmpc_omp_task_with_deps()  /nfs/site/proj/openmp/promo/20230428/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.tcm0.t1..h1.w1-anompbdwlin05/../../src/kmp_taskdeps.cpp:734
17 0x000000000101d2e0 L__ZN5slate4impl5getrfILNS_6TargetE68EfEEvRNS_6MatrixIT0_EERSt6vectorIS7_INS_5PivotESaIS8_EESaISA_EERKSt3mapINS_6OptionENS_11OptionValueESt4lessISF_ESaISt4pairIKSF_SG_EEE_84__par_region0_2_615()  /jobfs/91791886.gadi-pbs/0/slate/2023.06.00/source/slate-2023.06.00/src/getrf.cc:93
18 0x0000000000163493 __kmp_invoke_microtask()  ???:0
19 0x00000000000d1ca4 _INTERNAL49d8b4ea::__kmp_serial_fork_call()  /nfs/site/proj/openmp/promo/20230428/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.tcm0.t1..h1.w1-anompbdwlin05/../../src/kmp_runtime.cpp:2004
20 0x00000000000d1ca4 __kmp_fork_call()  /nfs/site/proj/openmp/promo/20230428/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.tcm0.t1..h1.w1-anompbdwlin05/../../src/kmp_runtime.cpp:2329
21 0x0000000000089d23 __kmpc_fork_call()  /nfs/site/proj/openmp/promo/20230428/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.tcm0.t1..h1.w1-anompbdwlin05/../../src/kmp_csupport.cpp:350
22 0x000000000101c2bb slate::impl::getrf<(slate::Target)68, float>()  /jobfs/91791886.gadi-pbs/0/slate/2023.06.00/source/slate-2023.06.00/src/getrf.cc:84
23 0x0000000001014fa5 slate::getrf<float>()  /jobfs/91791886.gadi-pbs/0/slate/2023.06.00/source/slate-2023.06.00/src/getrf.cc:341
24 0x0000000000f3597e slate::gesv<float>()  /jobfs/91791886.gadi-pbs/0/slate/2023.06.00/source/slate-2023.06.00/src/gesv.cc:95
25 0x000000000041cb18 test_lu<float>()  /scratch/d35/lhs573/test_slate/ex06_linear_system_lu.cc:28
26 0x000000000041ab6c main()  /scratch/d35/lhs573/test_slate/ex06_linear_system_lu.cc:132
27 0x000000000003ad85 __libc_start_main()  ???:0
28 0x000000000041a98e _start()  ???:0
=================================

Steps To Reproduce

  1. Modify the ex06_linear_system_lu.cc example to pass the option {{slate::Option::Target, slate::Target::Devices}} to the slate::gesv call (see the sketch after this list)
  2. export SLATE_GPU_AWARE_MPI=1
  3. Run a multi-GPU job on an NVIDIA machine
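
For reference, a minimal sketch of the modification in step 1, assuming the same SLATE API that ex06_linear_system_lu.cc uses. The matrix sizes, block size, process grid, and function name here are illustrative placeholders, and the data-fill step is omitted; the only change relative to the stock example is passing Target::Devices to slate::gesv.

// Sketch only: build distributed matrices as in the example, then request
// GPU execution by passing Target::Devices in the options map.
#include <slate/slate.hh>
#include <mpi.h>

template <typename scalar_type>
void solve_on_devices( int64_t n, int64_t nrhs, int64_t nb,
                       int grid_p, int grid_q )
{
    // n x n system with nrhs right-hand sides, nb x nb tiles,
    // distributed over a grid_p x grid_q process grid (illustrative values).
    slate::Matrix<scalar_type> A( n, n,    nb, grid_p, grid_q, MPI_COMM_WORLD );
    slate::Matrix<scalar_type> B( n, nrhs, nb, grid_p, grid_q, MPI_COMM_WORLD );
    A.insertLocalTiles();
    B.insertLocalTiles();
    // ... fill A and B with data as in the example ...

    slate::Pivots pivots;
    // The modification described above: run the solve on GPU devices.
    slate::gesv( A, pivots, B, {
        { slate::Option::Target, slate::Target::Devices },
    } );
}

With SLATE_GPU_AWARE_MPI=1 this path sends/receives tile buffers that reside in device memory, which is where the backtraces above fail inside MPI.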

Environment: the details below are for the Frontera (TACC) machine.

mgates3 commented 8 months ago

Thanks for the detailed bug report. We will need to investigate.

lzjia-jia commented 1 month ago

Which make.inc configuration did you use when installing SLATE with HIP?