pmodels / yaksa

Yaksa: High-performance Noncontiguous Data Management

AddressSanitizer: SEGV `src/util/yaksu_handle_pool.c:181` in `yaksu_handle_pool_elem_get()` #245

Open Jacobfaib opened 1 year ago

Jacobfaib commented 1 year ago

The error

Caught signal 11 (Segmentation fault: address not mapped to object at address 0x170)
==== backtrace (tid:1229674) ====
 0  /home/ac.jfaibussowitsch/petsc/arch-cuda-debug/lib/libucs.so.0(ucs_debug_print_backtrace+0x33) [0x7fea6f92bcad]
 1  /home/ac.jfaibussowitsch/petsc/arch-cuda-debug/lib/libucs.so.0(ucs_handle_error+0x77) [0x7fea6f92ce0f]
 2  /home/ac.jfaibussowitsch/petsc/arch-cuda-debug/lib/libucs.so.0(+0x37bca) [0x7fea6f92cbca]
 3  /home/ac.jfaibussowitsch/petsc/arch-cuda-debug/lib/libucs.so.0(+0x37d2c) [0x7fea6f92cd2c]
 4  /home/ac.jfaibussowitsch/petsc/arch-cuda-debug/lib/libmpi.so.0(+0xbc3d12) [0x7fea2a9a1d12]
 5  /home/ac.jfaibussowitsch/petsc/arch-cuda-debug/lib/libmpi.so.0(+0xbd6a69) [0x7fea2a9b4a69]
 6  /home/ac.jfaibussowitsch/petsc/arch-cuda-debug/lib/libmpi.so.0(+0xbd376e) [0x7fea2a9b176e]
 7  /home/ac.jfaibussowitsch/petsc/arch-cuda-debug/lib/libmpi.so.0(+0xa1084b) [0x7fea2a7ee84b]
 8  /home/ac.jfaibussowitsch/petsc/arch-cuda-debug/lib/libmpi.so.0(+0xa4ab68) [0x7fea2a828b68]
 9  /home/ac.jfaibussowitsch/petsc/arch-cuda-debug/lib/libmpi.so.0(+0xa4a8bb) [0x7fea2a8288bb]
10  /home/ac.jfaibussowitsch/petsc/arch-cuda-debug/lib/libmpi.so.0(+0x84fad5) [0x7fea2a62dad5]
11  /home/ac.jfaibussowitsch/petsc/arch-cuda-debug/lib/libmpi.so.0(PMPI_Init+0x27) [0x7fea2a62db72]
12  ./yaksa_test(+0x125f) [0x559bf2c2325f]
13  /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7fea283a6d90]
14  /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7fea283a6e40]
15  ./yaksa_test(_start+0x25) [0x559bf2c230e5]
=================================
AddressSanitizer:DEADLYSIGNAL
=================================================================
==1229674==ERROR: AddressSanitizer: SEGV on unknown address 0x25320012c36a (pc 0x7fea2a9a1d12 bp 0x7fff52366090 sp 0x7fff52366040 T0)
==1229674==The signal is caused by a READ memory access.
    #0 0x7fea2a9a1d12 in yaksu_handle_pool_elem_get src/util/yaksu_handle_pool.c:181
    #1 0x7fea2a9b4a68 in yaksi_type_get src/frontend/types/yaksi_type.c:49
    #2 0x7fea2a9b176d in yaksa_type_create_contig src/frontend/types/yaksa_contig.c:76
    #3 0x7fea2a7ee84a in MPIR_Typerep_init src/mpi/datatype/typerep/src/typerep_yaksa_init.c:420
    #4 0x7fea2a828b67 in MPII_Init_thread src/mpi/init/mpir_init.c:165
    #5 0x7fea2a8288ba in MPIR_Init_impl src/mpi/init/mpir_init.c:102
    #6 0x7fea2a62dad4 in internal_Init src/binding/c/c_binding.c:45678
    #7 0x7fea2a62db71 in PMPI_Init src/binding/c/c_binding.c:45730
    #8 0x559bf2c2325e in main /home/ac.jfaibussowitsch/petsc/src/ksp/ksp/tests/yaksa_test.c:8
    #9 0x7fea283a6d8f in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
    #10 0x7fea283a6e3f in __libc_start_main_impl ../csu/libc-start.c:392
    #11 0x559bf2c230e4 in _start (/scratch/jfaibussowitsch/petsc/src/ksp/ksp/tests/yaksa_test+0x10e4)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV src/util/yaksu_handle_pool.c:181 in yaksu_handle_pool_elem_get
==1229674==ABORTING

To reproduce

// mpicc -fsanitize=address yaksa_segv.c -o yaksa_segv
#include <mpi.h>

int main(int argc, char *argv[])
{
  // The crash happens inside MPI_Init, before any user code runs.
  MPI_Init(&argc, &argv);
  MPI_Finalize();
  return 0;
}

The problem

$ gdb ./yaksa_segv
...
Thread 1 "bench_debug" received signal SIGSEGV, Segmentation fault.
0x00007fff973a1d12 in yaksu_handle_pool_elem_get (pool=0x0, handle=38, data=0x7fffffffbf28) at src/util/yaksu_handle_pool.c:181
181         assert(handle_pool->handle_cache[handle]);
(gdb) p handle_pool
$1 = (handle_pool_s *) 0x0

yaksa_config.log mpich_config.log

Jacobfaib commented 1 year ago

I have traced this problem to interference between ASAN and the CUDA runtime (https://github.com/google/sanitizers/issues/629). The interference causes CUDA allocation functions to fail, and yaksa apparently does not check for that failure, so the problem only manifests later as the segfault above. Yaksa should check the result of every allocation call and report an error cleanly when one fails.
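To make the request concrete, here is a minimal sketch of the kind of check I mean. It is not yaksa's code: every name (`example_pool`, `example_pool_create`, `example_pool_get`, the `EXAMPLE_*` codes) is hypothetical, and a plain C allocator passed as a function pointer stands in for the CUDA allocation call so the failure path can be exercised without a GPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical error codes, not yaksa's. */
enum { EXAMPLE_OK = 0, EXAMPLE_ERR_NOMEM = 1, EXAMPLE_ERR_BAD_ARG = 2 };

/* The allocator is injected so a failing one can be substituted.
 * It is assumed to return memory that free() can release. */
typedef void *(*example_alloc_fn)(size_t nbytes);

typedef struct {
    void **slots;
    size_t nslots;
} example_pool;

/* Stand-in for an allocation call that fails, as the CUDA ones did under ASAN. */
void *example_failing_alloc(size_t nbytes)
{
    (void) nbytes;
    return NULL;
}

/* Create a pool; every allocation result is checked and reported. */
int example_pool_create(example_alloc_fn alloc_fn, size_t nslots, example_pool **out)
{
    assert(out != NULL);
    *out = NULL;

    example_pool *pool = alloc_fn(sizeof(*pool));
    if (pool == NULL)
        return EXAMPLE_ERR_NOMEM;

    pool->slots = alloc_fn(nslots * sizeof(void *));
    if (pool->slots == NULL) {
        free(pool);
        return EXAMPLE_ERR_NOMEM;
    }
    for (size_t i = 0; i < nslots; i++)
        pool->slots[i] = NULL;
    pool->nslots = nslots;

    *out = pool;
    return EXAMPLE_OK;
}

/* Look up a handle; a NULL pool or out-of-range handle is an error, not a crash. */
int example_pool_get(const example_pool *pool, size_t handle, void **data)
{
    assert(data != NULL);
    if (pool == NULL || handle >= pool->nslots)
        return EXAMPLE_ERR_BAD_ARG;
    *data = pool->slots[handle];
    return EXAMPLE_OK;
}
```

With something along these lines, calling `example_pool_create(example_failing_alloc, 64, &pool)` returns `EXAMPLE_ERR_NOMEM` right where the allocation fails, and a later lookup on the NULL pool returns `EXAMPLE_ERR_BAD_ARG` instead of dereferencing it.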

The workaround for users is to set the ASAN option protect_shadow_gap=0, either in the environment via

$ ASAN_OPTIONS=protect_shadow_gap=0 ./user_app

or in source via

extern "C" const char *__asan_default_options() { return "protect_shadow_gap=0"; }

Perhaps yaksa could set this itself when it is built with CUDA support and detects ASAN. ASAN can be detected at compile time via

#ifndef __has_feature
  #define __has_feature(x) 0
#endif

#if __has_feature(address_sanitizer) || defined(__SANITIZE_ADDRESS__)
  // ASAN active
#endif
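
Putting the two pieces together, a rough sketch of what this could look like in C follows. The helper names `example_asan_active` and `example_asan_options` are made up for illustration, and whether yaksa should define the hook at all is an open question: I assume here that an application defining its own `__asan_default_options` would clash with a definition in the library. In C, no `extern "C"` is needed.

```c
/* Fallback for compilers that do not provide __has_feature. */
#ifndef __has_feature
#define __has_feature(x) 0
#endif

/* Clang reports ASAN via __has_feature, GCC via __SANITIZE_ADDRESS__. */
#if __has_feature(address_sanitizer) || defined(__SANITIZE_ADDRESS__)
#define EXAMPLE_ASAN_ACTIVE 1
#else
#define EXAMPLE_ASAN_ACTIVE 0
#endif

/* Hypothetical helper: 1 if this file was compiled with ASAN, else 0. */
int example_asan_active(void)
{
    return EXAMPLE_ASAN_ACTIVE;
}

/* Hypothetical helper: the option string that works around the CUDA issue. */
const char *example_asan_options(void)
{
    return "protect_shadow_gap=0";
}

#if EXAMPLE_ASAN_ACTIVE
/* ASAN calls this hook at startup to pick up default options.
 * It is only compiled when ASAN is active. */
const char *__asan_default_options(void)
{
    return example_asan_options();
}
#endif
```

A real version would presumably also guard the hook on the CUDA backend being enabled, using whatever macro yaksa's build system defines for that.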