Open matcabral opened 8 years ago
I just tested this with master and am still hitting the segmentation fault. The backtrace shows it happening in malloc(), which points to heap corruption. Here's some valgrind output showing the corruption:
==23510== Invalid write of size 8
==23510==    at 0x4C2E253: memcpy@@GLIBC_2.14 (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==23510==    by 0x5B33C4E: unpack_contiguous_loop (opal_datatype_unpack.h:96)
==23510==    by 0x5B35212: opal_generic_simple_unpack (opal_datatype_unpack.c:384)
==23510==    by 0x5B236F6: opal_convertor_unpack (opal_convertor.c:323)
==23510==    by 0xF53BA32: mca_pml_ob1_recv_request_progress_frag (pml_ob1_recvreq.c:509)
==23510==    by 0xF5371B2: mca_pml_ob1_recv_frag_callback_frag (pml_ob1_recvfrag.c:394)
==23510==    by 0xEF16588: mca_btl_vader_poll_handle_frag (btl_vader_component.c:590)
==23510==    by 0xEF14515: mca_btl_vader_check_fboxes (btl_vader_fbox.h:235)
==23510==    by 0xEF168F5: mca_btl_vader_component_progress (btl_vader_component.c:689)
==23510==    by 0x5B09698: opal_progress (opal_progress.c:222)
==23510==    by 0x4E90D21: ompi_request_wait_completion (request.h:392)
==23510==    by 0x4E90D5B: ompi_request_default_wait (req_wait.c:42)
==23510==  Address 0x1a40f990 is 0 bytes after a block of size 32,768 alloc'd
==23510==    at 0x4C29BFD: malloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==23510==    by 0x5B59B75: opal_malloc (malloc.c:101)
==23510==    by 0x107DEAF9: ompi_osc_gacc_long_start (osc_pt2pt_data_move.c:984)
==23510==    by 0x107DFD87: process_get_acc_long (osc_pt2pt_data_move.c:1288)
==23510==    by 0x107E0B87: process_frag (osc_pt2pt_data_move.c:1579)
==23510==    by 0x107E0E76: ompi_osc_pt2pt_process_receive (osc_pt2pt_data_move.c:1657)
==23510==    by 0x107D8B38: component_progress (osc_pt2pt_component.c:169)
==23510==    by 0x5B09698: opal_progress (opal_progress.c:222)
==23510==    by 0x4E90D21: ompi_request_wait_completion (request.h:392)
==23510==    by 0x4E90D5B: ompi_request_default_wait (req_wait.c:42)
==23510==    by 0x4F371F0: ompi_coll_base_sendrecv_zero (coll_base_barrier.c:63)
==23510==    by 0x4F378AB: ompi_coll_base_barrier_intra_two_procs (coll_base_barrier.c:299)
I'm not well versed in this code, so any help/pointers as to a root cause would be appreciated.
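For anyone reading the trace: "0 bytes after a block of size 32,768" means the 8-byte memcpy started exactly at the end of the buffer allocated in ompi_osc_gacc_long_start, i.e. at offset 32768 of a 32768-byte allocation. As a rough illustration only (this is not Open MPI code; `checked_unpack` and its parameters are hypothetical names I made up), the bounds condition that the failing write violates can be sketched as:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, NOT part of Open MPI: copy one fragment of `len`
 * bytes into a receive buffer of `dst_size` bytes at `offset`, but only
 * if the whole fragment fits inside the allocation. */
int checked_unpack(char *dst, size_t dst_size, size_t offset,
                   const char *src, size_t len)
{
    assert(dst != NULL && src != NULL);

    /* Reject a write that would start or end past the allocation.
     * Written this way to avoid overflow in offset + len. */
    if (offset > dst_size || len > dst_size - offset) {
        return -1;
    }

    memcpy(dst + offset, src, len);
    return 0;
}
```

With a 32768-byte buffer, an 8-byte fragment at offset 32760 passes this check, while the same fragment at offset 32768 (the case valgrind reports) is rejected. Whether the real problem is an undersized allocation or an oversized incoming fragment is exactly what I can't tell from the trace.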
This is with: