openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

Bus error on UCX 1.3.0 #2601

Open agrippa opened 6 years ago

agrippa commented 6 years ago

Hi all,

We're seeing the following error message during finalization of an OpenSHMEM program running on top of UCX 1.3.0:

Caught signal 7 (Bus error: nonexistent physical address)

The generated core dump contains the following stack trace:

#0 uct_mm_ep_update_cached_tail (ep=0x119dd020, ep=0x119dd020) at sm/mm/mm_ep.c:202
#1 uct_mm_ep_flush (tl_ep=0x119dd020, flags=0, comp=) at sm/mm/mm_ep.c:420
#2 0x0000ffff7a095a5c in uct_ep_flush (comp=0x119fb788, flags=, ep=0x119dd020) at /home/hpp/ucx/src/uct/api/uct.h:2050
#3 ucp_ep_flush_progress (req=req@entry=0x119fb700) at rma/flush.c:48
#4 0x0000ffff7a095fa8 in ucp_ep_flush_internal (ep=ep@entry=0x119dcfb0, uct_flags=uct_flags@entry=0, req_cb=req_cb@entry=0x0, req_flags=req_flags@entry=0, flushed_cb=flushed_cb@entry=0xffff7a089ac8 <ucp_ep_close_flushed_callback>) at rma/flush.c:215
#5 0x0000ffff7a08bfb4 in ucp_ep_close_nb (ep=0x119dcfb0, mode=mode@entry=1) at core/ucp_ep.c:614
#6 0x0000ffff7a2be698 in blocking_ep_disconnect (ep=) at ucx-init.c:279
#7 disconnect_all_endpoints () at ucx-init.c:317
#8 shmemc_ucx_finalize () at ucx-init.c:457
#9 0x0000ffff7a2be004 in shmemc_finalize () at shmemc-init.c:56
#10 0x0000ffff7a2f7038 in finalize_helper () at init.c:54
#11 shmem_finalize () at init.c:156
#12 0x0000000000400b8c in main ()

The OpenSHMEM program that triggered this error is a simple hello world:

#include <stdio.h>
#include <shmem.h>

int main(int argc, char **argv) {
    shmem_init();
    int pe = shmem_my_pe();
    int npes = shmem_n_pes();
    printf("Hi from %d / %d\n", pe, npes);
    shmem_finalize();
    return 0;
}

We're running on top of the OSSS implementation of OpenSHMEM (https://bitbucket.org/sbuopenshmem/osss-ucx/commits/all?search=) using the xpmem transport. This is a small shared memory run, 4 PEs in a single box running on 4 ARM cores.

Please let me know what other information I can provide.

agrippa commented 6 years ago

@hppritcha @tonycurtis

yosefe commented 6 years ago

A bus error usually means the remote segment was detached before another process finished writing to it. deregister_memory_regions() should probably be called only after some out-of-band barrier.

tonycurtis commented 6 years ago

Related to https://github.com/openucx/ucx/issues/2050 ?