openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

segfault in ucs_sysv_shmget_format_error #4657

Open bureddy opened 4 years ago

bureddy commented 4 years ago

I have seen this only once and have not been able to reproduce it.

[IP0ac71017:23093:0:23985] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x340)
==== backtrace (tid:  23985) ====
0 0x000000000003926d __GI_getenv()  :0
1 0x0000000000030dde __dcigettext()  :0
2 0x000000000008cbfe __GI___strerror_r()  :0
3 0x000000000008cb3f strerror()  ???:0
4 0x0000000000059038 ucs_sysv_shmget_format_error()  /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.7.x/src/ucs/sys/sys.
5 0x0000000000059038 ucs_sysv_alloc()  /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.7.x/src/ucs/sys/sys.c:748
6 0x000000000000dcbf uct_mem_alloc()  /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.7.x/src/uct/base/uct_mem.c:224
7 0x000000000000e0cb uct_iface_mem_alloc()  /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.7.x/src/uct/base/uct_mem.c:29
8 0x000000000000e1b3 uct_iface_mp_chunk_alloc()  /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.7.x/src/uct/base/uct_mem
9 0x000000000004b27c ucs_mpool_grow()  /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.7.x/src/ucs/datastruct/mpool.c:189
10 0x000000000004b420 ucs_mpool_get_grow()  /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.7.x/src/ucs/datastruct/mpool.c
11 0x000000000003ebbc uct_ud_iface_get_tx_skb()  /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.7.x/src/uct/ib/ud/base/ud
12 0x000000000003ebbc uct_ud_ep_prepare_crep()  /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.7.x/src/uct/ib/ud/base/ud_
13 0x000000000003ebbc uct_ud_ep_do_pending_ctl()  /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.7.x/src/uct/ib/ud/base/u
14 0x000000000003fc71 uct_ud_ep_do_pending()  /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.7.x/src/uct/ib/ud/base/ud_ep
15 0x000000000004920c ucs_arbiter_dispatch_nonempty()  /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.7.x/src/ucs/datastr
16 0x0000000000043c0d ucs_arbiter_dispatch()  /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.7.x/src/ucs/datastruct/arbit
17 0x0000000000043c0d uct_ud_iface_progress_pending()  /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.7.x/src/uct/ib/ud/b
18 0x0000000000043c0d uct_ud_mlx5_iface_async_progress()  /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.7.x/src/uct/ib/u
19 0x000000000003b1ff uct_ud_iface_async_progress()  /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.7.x/src/uct/ib/ud/bas
20 0x000000000003b1ff uct_ud_iface_timer()  /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.7.x/src/uct/ib/ud/base/ud_ifac
21 0x00000000000415e1 ucs_async_handler_dispatch()  /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.7.x/src/ucs/async/asyn
22 0x00000000000415e1 ucs_async_dispatch_handlers()  /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.7.x/src/ucs/async/asy
23 0x00000000000417d1 ucs_async_dispatch_timerq()  /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.7.x/src/ucs/async/async
24 0x0000000000043f74 ucs_async_thread_func()  /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.7.x/src/ucs/async/thread.c:
25 0x0000000000007e65 start_thread()  pthread_create.c:0
26 0x00000000000fe88d __clone()  ???:0

Setup and versions

Microsoft HBv2

angainor commented 4 years ago

@bureddy @yosefe I see this quite often when running OSU benchmarks with HPCX 2.6 on an EDR system with AMD CPUs. It sometimes shows up when I set UCX_TLS=self,shm,rc. Do you have any ideas?

time mpirun -mca oob_tcp_if_include ib0 --map-by core --bind-to core  -mca pml ucx -x UCX_TLS=self,shm,rc -mca coll_hcoll_enable 1 --mca coll hcoll,libnbc,tuned,basic osu_barrier -f 
[b1232:77724:0:78357] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x580)
==== backtrace (tid:  78357) ====
 0 0x0000000000050b95 ucs_debug_print_backtrace()  /build-result/src/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.8.x/src/ucs/debug/debug.c:625
 1 0x00000000000392ad __GI_getenv()  :0
 2 0x0000000000030e1e __dcigettext()  :0
 3 0x000000000008cc3e __GI___strerror_r()  :0
 4 0x000000000008cb7f strerror()  ???:0
 5 0x0000000000059f88 ucs_sysv_shmget_format_error()  /build-result/src/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.8.x/src/ucs/sys/sys.c:702
 6 0x0000000000059f88 ucs_sysv_alloc()  /build-result/src/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.8.x/src/ucs/sys/sys.c:768
 7 0x000000000000ef5f uct_mem_alloc()  /build-result/src/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.8.x/src/uct/base/uct_mem.c:224
 8 0x000000000000f36b uct_iface_mem_alloc()  /build-result/src/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.8.x/src/uct/base/uct_mem.c:290
 9 0x000000000000f453 uct_iface_mp_chunk_alloc()  /build-result/src/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.8.x/src/uct/base/uct_mem.c:356
10 0x000000000004bd4c ucs_mpool_grow()  /build-result/src/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.8.x/src/ucs/datastruct/mpool.c:189
11 0x000000000004bef0 ucs_mpool_get_grow()  /build-result/src/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.8.x/src/ucs/datastruct/mpool.c:237
12 0x0000000000048756 ucs_mpool_get_inline()  /build-result/src/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.8.x/src/ucs/datastruct/mpool.inl:23
13 0x0000000000048ada uct_ud_mlx5_iface_poll_rx()  /build-result/src/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.8.x/src/uct/ib/ud/accel/ud_mlx5.c:440
14 0x000000000004014f uct_ud_iface_async_progress()  /build-result/src/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.8.x/src/uct/ib/ud/base/ud_iface.c:864
15 0x000000000004014f uct_ud_iface_timer()  /build-result/src/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.8.x/src/uct/ib/ud/base/ud_iface.c:879
16 0x0000000000041ffa ucs_async_handler_invoke()  /build-result/src/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.8.x/src/ucs/async/async.c:224
17 0x0000000000041ffa ucs_async_dispatch_handlers()  /build-result/src/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.8.x/src/ucs/async/async.c:271
18 0x00000000000421f1 ucs_async_dispatch_timerq()  /build-result/src/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.8.x/src/ucs/async/async.c:298
19 0x00000000000449f4 ucs_async_thread_func()  /build-result/src/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.8.x/src/ucs/async/thread.c:138
20 0x0000000000007ea5 start_thread()  pthread_create.c:0
21 0x00000000000fe8cd __clone()  ???:0
=================================

angainor commented 4 years ago

These are followed by some hcoll init errors:

[LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed
[1589305033.243797] [b3143:198226:0]       mm_posix.c:195  UCX  ERROR open(file_name=/proc/198207/fd/22 flags=0x0) failed: No such file or directory
[b3143.betzy.sigma2.no:198226] pml_ucx.c:780  Error: ucx send failed: Shared memory error
[LOG_CAT_COMMPATTERNS]   isend failed in  comm_allreduce_pml at iterations 4

[LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed
[b3143.betzy.sigma2.no:198177] Error: coll_hcoll_module.c:301 - mca_coll_hcoll_comm_query() Hcol library init failed

jhgoebbert commented 4 weeks ago

I would expect that the shmget system call (https://man7.org/linux/man-pages/man2/shmget.2.html) is failing, so for some reason no shared memory could be allocated: https://github.com/openucx/ucx/blob/v1.17.x/src/ucs/sys/sys.c#L900
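
To make that concrete, here is a minimal sketch of how a failed shmget() could be reported without going through strerror(), which is the call chain (strerror -> __dcigettext -> getenv) that faults in both backtraces above. This is not UCX code: `shmget_errno_hint` and `try_shmget` are hypothetical helpers, and the idea that avoiding strerror() on the async thread sidesteps the crash is an assumption, not a confirmed fix. The errno meanings are taken from the shmget(2) man page.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Hypothetical helper (not part of UCX): map the errno values documented in
 * shmget(2) to a fixed hint string, so no locale or environment lookup is
 * needed to build the message. */
static const char *shmget_errno_hint(int err)
{
    switch (err) {
    case ENOMEM: return "no memory for segment overhead";
    case ENOSPC: return "SHMMNI or SHMALL limit reached";
    case EINVAL: return "size is below SHMMIN or above SHMMAX";
    case EPERM:  return "SHM_HUGETLB requested without privilege";
    case EACCES: return "no permission for the segment";
    default:     return "see shmget(2)";
    }
}

/* Hypothetical wrapper: try to create a private SysV segment.
 * Returns the shmid on success, or -1 with a description in msg. */
static int try_shmget(size_t size, int flags, char *msg, size_t msg_len)
{
    int shmid;
    int err;

    assert(msg != NULL && msg_len > 0);

    shmid = shmget(IPC_PRIVATE, size, flags);
    if (shmid < 0) {
        err = errno; /* save before any other call can overwrite it */
        snprintf(msg, msg_len,
                 "shmget(size=%zu, flags=0x%x) failed: errno %d (%s)",
                 size, (unsigned)flags, err, shmget_errno_hint(err));
        return -1;
    }

    return shmid;
}
```

A caller would use it as `try_shmget(1 << 20, IPC_CREAT | 0600, msg, sizeof(msg))` and, on success, release the segment with `shmctl(shmid, IPC_RMID, NULL)` once it is no longer needed. If the message points at SHMMNI, SHMALL, or SHMMAX, the corresponding kernel limits on the node are the first thing to check.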