open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Segfault running mpich lock_dt rma test #2203

Status: Open. matcabral opened this issue 8 years ago.

matcabral commented 8 years ago
host-36 /tmp> mpirun -np 2  -mca pml ob1 -mca btl self,vader,sm  ./lock_dt
[host-36:91324] *** Process received signal ***
[host-36:91324] Signal: Segmentation fault (11)
[host-36:91324] Signal code: Address not mapped (1)
[host-36:91324] Failing at address: 0x1eaa0c0
[host-36:91324] [ 0] /lib64/libpthread.so.0(+0xf130)[0x7f4e3d99e130]
[host-36:91324] [ 1] /lib64/libc.so.6(+0x147ce4)[0x7f4e3d715ce4]
[host-36:91324] [ 2] /tmp/matcabral_ompi_2.x/lib/libopen-pal.so.20(+0x5a5db)[0x7f4e3d0235db]
[host-36:91324] [ 3] /tmp/matcabral_ompi_2.x/lib/libopen-pal.so.20(opal_generic_simple_unpack+0x35c)[0x7f4e3d024c52]
[host-36:91324] [ 4] /tmp/matcabral_ompi_2.x/lib/libopen-pal.so.20(opal_convertor_unpack+0x2ab)[0x7f4e3d014966]
[host-36:91324] [ 5] /tmp/matcabral_ompi_2.x/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_request_progress_frag+0x185)[0x7f4e331c4b6d]
[host-36:91324] [ 6] /tmp/matcabral_ompi_2.x/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_frag_callback_frag+0x71)[0x7f4e331c0d47]
[host-36:91324] [ 7] /tmp/matcabral_ompi_2.x/lib/openmpi/mca_btl_vader.so(mca_btl_vader_poll_handle_frag+0x18d)[0x7f4e33bf178f]
[host-36:91324] [ 8] /tmp/matcabral_ompi_2.x/lib/openmpi/mca_btl_vader.so(+0x49ca)[0x7f4e33bef9ca]
[host-36:91324] [ 9] /tmp/matcabral_ompi_2.x/lib/openmpi/mca_btl_vader.so(+0x6a0d)[0x7f4e33bf1a0d]
[host-36:91324] [10] /tmp/matcabral_ompi_2.x/lib/libopen-pal.so.20(opal_progress+0xa9)[0x7f4e3cffb611]
[host-36:91324] [11] /tmp/matcabral_ompi_2.x/lib/libmpi.so.20(+0x55cbd)[0x7f4e3dc00cbd]
[host-36:91324] [12] /tmp/matcabral_ompi_2.x/lib/libmpi.so.20(ompi_request_default_wait+0x27)[0x7f4e3dc00cf7]
[host-36:91324] [13] /tmp/matcabral_ompi_2.x/lib/libmpi.so.20(+0xdfd19)[0x7f4e3dc8ad19]
[host-36:91324] [14] /tmp/matcabral_ompi_2.x/lib/libmpi.so.20(ompi_coll_base_barrier_intra_two_procs+0x81)[0x7f4e3dc8b3d4]
[host-36:91324] [15] /tmp/matcabral_ompi_2.x/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_barrier_intra_dec_fixed+0x57)[0x7f4e32f99f40]
[host-36:91324] [16] /tmp/matcabral_ompi_2.x/lib/libmpi.so.20(MPI_Barrier+0xff)[0x7f4e3dc1dec2]
[host-36:91324] [17] ./lock_dt[0x402944]
[host-36:91324] [18] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f4e3d5efaf5]
[host-36:91324] [19] ./lock_dt[0x402549] 
[host-36:91324] *** End of error message ***

This is with:

 git remote -v && git branch && git log -1
origin  https://github.com/open-mpi/ompi.git (fetch)
origin  https://github.com/open-mpi/ompi.git (push)
  master
* v2.0.x
commit 97511a1eed8432ab378d024013d58c7259c5a3a8
Merge: b05f6e3 b9bbc49
Author: Howard Pritchard <hppritcha@gmail.com>
Date:   Tue Oct 11 10:09:58 2016 -0600
awlauria commented 7 years ago

I just tested this with master and am still hitting the segmentation fault. The backtrace shows the crash happening inside malloc(), which points to heap corruption. Here is valgrind output showing that corruption:

==23510== Invalid write of size 8
==23510==    at 0x4C2E253: memcpy@@GLIBC_2.14 (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==23510==    by 0x5B33C4E: unpack_contiguous_loop (opal_datatype_unpack.h:96)
==23510==    by 0x5B35212: opal_generic_simple_unpack (opal_datatype_unpack.c:384)
==23510==    by 0x5B236F6: opal_convertor_unpack (opal_convertor.c:323)
==23510==    by 0xF53BA32: mca_pml_ob1_recv_request_progress_frag (pml_ob1_recvreq.c:509)
==23510==    by 0xF5371B2: mca_pml_ob1_recv_frag_callback_frag (pml_ob1_recvfrag.c:394)
==23510==    by 0xEF16588: mca_btl_vader_poll_handle_frag (btl_vader_component.c:590)
==23510==    by 0xEF14515: mca_btl_vader_check_fboxes (btl_vader_fbox.h:235)
==23510==    by 0xEF168F5: mca_btl_vader_component_progress (btl_vader_component.c:689)
==23510==    by 0x5B09698: opal_progress (opal_progress.c:222)
==23510==    by 0x4E90D21: ompi_request_wait_completion (request.h:392)
==23510==    by 0x4E90D5B: ompi_request_default_wait (req_wait.c:42)
==23510==  Address 0x1a40f990 is 0 bytes after a block of size 32,768 alloc'd
==23510==    at 0x4C29BFD: malloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==23510==    by 0x5B59B75: opal_malloc (malloc.c:101)
==23510==    by 0x107DEAF9: ompi_osc_gacc_long_start (osc_pt2pt_data_move.c:984)
==23510==    by 0x107DFD87: process_get_acc_long (osc_pt2pt_data_move.c:1288)
==23510==    by 0x107E0B87: process_frag (osc_pt2pt_data_move.c:1579)
==23510==    by 0x107E0E76: ompi_osc_pt2pt_process_receive (osc_pt2pt_data_move.c:1657)
==23510==    by 0x107D8B38: component_progress (osc_pt2pt_component.c:169)
==23510==    by 0x5B09698: opal_progress (opal_progress.c:222)
==23510==    by 0x4E90D21: ompi_request_wait_completion (request.h:392)
==23510==    by 0x4E90D5B: ompi_request_default_wait (req_wait.c:42)
==23510==    by 0x4F371F0: ompi_coll_base_sendrecv_zero (coll_base_barrier.c:63)
==23510==    by 0x4F378AB: ompi_coll_base_barrier_intra_two_procs (coll_base_barrier.c:299)

I'm not well versed in this code, so any help/pointers as to a root cause would be appreciated.