Open bureddy opened 4 years ago
@bureddy @yosefe I see this quite often when running OSU benchmarks with HPCX 2.6 on an EDR system with AMD CPUs. I sometimes see this error when I set UCX_TLS=self,shm,rc. Do you have any ideas?
time mpirun -mca oob_tcp_if_include ib0 --map-by core --bind-to core -mca pml ucx -x UCX_TLS=self,shm,rc -mca coll_hcoll_enable 1 --mca coll hcoll,libnbc,tuned,basic osu_barrier -f
[b1232:77724:0:78357] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x580)
==== backtrace (tid: 78357) ====
0 0x0000000000050b95 ucs_debug_print_backtrace() /build-result/src/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.8.x/src/ucs/debug/debug.c:625
1 0x00000000000392ad __GI_getenv() :0
2 0x0000000000030e1e __dcigettext() :0
3 0x000000000008cc3e __GI___strerror_r() :0
4 0x000000000008cb7f strerror() ???:0
5 0x0000000000059f88 ucs_sysv_shmget_format_error() /build-result/src/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.8.x/src/ucs/sys/sys.c:702
6 0x0000000000059f88 ucs_sysv_alloc() /build-result/src/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.8.x/src/ucs/sys/sys.c:768
7 0x000000000000ef5f uct_mem_alloc() /build-result/src/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.8.x/src/uct/base/uct_mem.c:224
8 0x000000000000f36b uct_iface_mem_alloc() /build-result/src/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.8.x/src/uct/base/uct_mem.c:290
9 0x000000000000f453 uct_iface_mp_chunk_alloc() /build-result/src/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.8.x/src/uct/base/uct_mem.c:356
10 0x000000000004bd4c ucs_mpool_grow() /build-result/src/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.8.x/src/ucs/datastruct/mpool.c:189
11 0x000000000004bef0 ucs_mpool_get_grow() /build-result/src/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.8.x/src/ucs/datastruct/mpool.c:237
12 0x0000000000048756 ucs_mpool_get_inline() /build-result/src/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.8.x/src/ucs/datastruct/mpool.inl:23
13 0x0000000000048ada uct_ud_mlx5_iface_poll_rx() /build-result/src/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.8.x/src/uct/ib/ud/accel/ud_mlx5.c:440
14 0x000000000004014f uct_ud_iface_async_progress() /build-result/src/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.8.x/src/uct/ib/ud/base/ud_iface.c:864
15 0x000000000004014f uct_ud_iface_timer() /build-result/src/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.8.x/src/uct/ib/ud/base/ud_iface.c:879
16 0x0000000000041ffa ucs_async_handler_invoke() /build-result/src/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.8.x/src/ucs/async/async.c:224
17 0x0000000000041ffa ucs_async_dispatch_handlers() /build-result/src/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.8.x/src/ucs/async/async.c:271
18 0x00000000000421f1 ucs_async_dispatch_timerq() /build-result/src/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.8.x/src/ucs/async/async.c:298
19 0x00000000000449f4 ucs_async_thread_func() /build-result/src/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.8.x/src/ucs/async/thread.c:138
20 0x0000000000007ea5 start_thread() pthread_create.c:0
21 0x00000000000fe8cd __clone() ???:0
=================================
These are followed by some hcoll init errors:
[LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed
[1589305033.243797] [b3143:198226:0] mm_posix.c:195 UCX ERROR open(file_name=/proc/198207/fd/22 flags=0x0) failed: No such file or directory
[b3143.betzy.sigma2.no:198226] pml_ucx.c:780 Error: ucx send failed: Shared memory error
[LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 4
[LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed
[b3143.betzy.sigma2.no:198177] Error: coll_hcoll_module.c:301 - mca_coll_hcoll_comm_query() Hcol library init failed
I would expect that the shmget system call ( https://man7.org/linux/man-pages/man2/shmget.2.html ) is failing, so for some reason no shared memory could be allocated:
https://github.com/openucx/ucx/blob/v1.17.x/src/ucs/sys/sys.c#L900
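If shmget is the call that fails, one quick thing to rule out is the kernel's SysV shared-memory limits. Per the shmget(2) man page, a request larger than SHMMAX fails with EINVAL, and hitting SHMMNI or SHMALL fails with ENOSPC. Below is a minimal sketch, not anything UCX does itself, that reads those limits on Linux. The helper names and the 8 MiB request size are assumptions made up for illustration; the actual size UCX asked for is not shown in the log above.

```python
import os


def read_sysv_shm_limits(proc_dir="/proc/sys/kernel"):
    """Return the SysV shared-memory limits found under proc_dir as ints.

    Hypothetical helper for illustration; proc_dir is a parameter so the
    function can be pointed at a test directory.
    """
    limits = {}
    for name in ("shmmax", "shmall", "shmmni"):
        path = os.path.join(proc_dir, name)
        try:
            with open(path) as f:
                limits[name] = int(f.read().split()[0])
        except (OSError, ValueError, IndexError):
            # Missing or unreadable entry: leave it out rather than guess.
            continue
    return limits


def exceeds_shmmax(request_bytes, limits):
    """True if a single segment of request_bytes is larger than shmmax."""
    return "shmmax" in limits and request_bytes > limits["shmmax"]


if __name__ == "__main__":
    limits = read_sysv_shm_limits()
    print(limits)
    # 8 MiB is an arbitrary example size, not a value taken from UCX.
    print(exceeds_shmmax(8 * 1024 * 1024, limits))
```

Running this on the affected nodes, together with `ipcs -m` to see how many segments are already in use, would at least show whether the limits are unusually low there. It does not explain the segfault itself, which the backtrace places inside strerror() while the shmget error was being formatted.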
Seen it only once; not able to reproduce.
Setup and versions
Microsoft HBv2