agrippa opened this issue 6 years ago (status: Open)
@hppritcha @tonycurtis
A bus error usually means the remote segment was detached before another process finished writing to it. deregister_memory_regions() should probably be called only after some out-of-band barrier, so that no PE is still touching a peer's segment when it goes away.
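To illustrate the failure mode, here is a minimal single-process sketch in plain POSIX C. It is an analogy, not the UCX or xpmem code path: a file-backed MAP_SHARED mapping stands in for the remote segment, and truncating the file stands in for the owner detaching too early. The helper name write_after_backing_removed() and the /tmp path are made up for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf bus_jmp;

/* On SIGBUS, jump back to the caller instead of dumping core. */
static void on_sigbus(int sig)
{
    (void)sig;
    siglongjmp(bus_jmp, 1);
}

/*
 * Hypothetical helper (not part of UCX or OSSS-UCX).
 * Returns 1 if writing to a shared mapping after its backing pages were
 * removed raised SIGBUS, 0 if the write went through, -1 on setup failure.
 */
int write_after_backing_removed(void)
{
    char path[] = "/tmp/busdemo-XXXXXX";
    long page = sysconf(_SC_PAGESIZE);
    struct sigaction sa, old;
    volatile char *seg;
    int fd, result;

    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                       /* file lives on through fd only */
    if (ftruncate(fd, page) != 0) {
        close(fd);
        return -1;
    }
    seg = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (seg == MAP_FAILED) {
        close(fd);
        return -1;
    }
    seg[0] = 1;                         /* backing pages exist: write succeeds */

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigbus;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGBUS, &sa, &old);

    if (sigsetjmp(bus_jmp, 1) == 0) {
        /* Stand-in for the owner tearing down its segment too early. */
        if (ftruncate(fd, 0) != 0) {
            result = -1;
        } else {
            seg[0] = 2;                 /* peer is still writing: faults here */
            result = 0;
        }
    } else {
        result = 1;                     /* arrived here via the SIGBUS handler */
    }

    sigaction(SIGBUS, &old, NULL);
    munmap((void *)seg, (size_t)page);
    close(fd);
    return result;
}
```

On Linux I would expect write_after_backing_removed() to return 1. In the real run the owner and the writer are different PEs, which is why an out-of-band barrier before the teardown step should close the window.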
Related to https://github.com/openucx/ucx/issues/2050 ?
Hi all,
We're seeing the following error message during finalization of an OpenSHMEM program running on top of UCX 1.3.0:
Caught signal 7 (Bus error: nonexistent physical address)
The generated core dump contains the following stack trace:
#0 uct_mm_ep_update_cached_tail (ep=0x119dd020, ep=0x119dd020) at sm/mm/mm_ep.c:202
#1 uct_mm_ep_flush (tl_ep=0x119dd020, flags=0, comp=<optimized out>) at sm/mm/mm_ep.c:420
#2 0x0000ffff7a095a5c in uct_ep_flush (comp=0x119fb788, flags=<optimized out>, ep=0x119dd020) at /home/hpp/ucx/src/uct/api/uct.h:2050
#3 ucp_ep_flush_progress (req=req@entry=0x119fb700) at rma/flush.c:48
#4 0x0000ffff7a095fa8 in ucp_ep_flush_internal (ep=ep@entry=0x119dcfb0, uct_flags=uct_flags@entry=0, req_cb=req_cb@entry=0x0, req_flags=req_flags@entry=0,
#5 0x0000ffff7a08bfb4 in ucp_ep_close_nb (ep=0x119dcfb0, mode=mode@entry=1) at core/ucp_ep.c:614
#6 0x0000ffff7a2be698 in blocking_ep_disconnect (ep=<optimized out>) at ucx-init.c:279
#7 disconnect_all_endpoints () at ucx-init.c:317
#8 shmemc_ucx_finalize () at ucx-init.c:457
#9 0x0000ffff7a2be004 in shmemc_finalize () at shmemc-init.c:56
#10 0x0000ffff7a2f7038 in finalize_helper () at init.c:54
#11 shmem_finalize () at init.c:156
#12 0x0000000000400b8c in main ()
The OpenSHMEM program that triggered this error is a simple hello world:
We're running on top of the OSSS implementation of OpenSHMEM (https://bitbucket.org/sbuopenshmem/osss-ucx/commits/all?search=) using the xpmem transport. This is a small shared-memory run: 4 PEs in a single box, running on 4 ARM cores.
Please let me know what other information I can provide.