Open shimmybalsam opened 2 years ago
@shimmybalsam can you pls post the full output?
@yosefe full error output:
mpirun -x UCC_CL_BASIC_TLS=ucp -x UCC_CLS=basic --bind-to core -x UCC_TL_UCP_TUNE=inf -x LD_LIBRARY_PATH=/global/home/users/sbalsam/ucc_shm/lib:/global/home/users/sbalsam/ucx/build/install/lib:$LD_LIBRARY_PATH -np 28 --map-by core ucc_test_mpi --colls allreduce --mtypes host -m 4:16384 --onesided 0
[1657713098.977001] [helios017:1959767:0] wireup.c:1087 UCX ERROR old: am_lane 0 wireup_msg_lane 3 cm_lane <none> keepalive_lane <none> reachable_mds 0x7fe
[1657713098.977038] [helios017:1959767:0] wireup.c:1097 UCX ERROR old: lane[0]: 7:sysv/memory.0 md[2] -> md[2]/sysv/sysdev[255] am am_bw#0
[1657713098.977044] [helios017:1959767:0] wireup.c:1097 UCX ERROR old: lane[1]: 31:xpmem/memory.0 md[10] -> md[10]/xpmem/sysdev[255] rkey_ptr
[1657713098.977048] [helios017:1959767:0] wireup.c:1097 UCX ERROR old: lane[2]: 21:rc_mlx5/mlx5_2:1.0 md[6] -> md[4]/ib/sysdev[255] rma_bw#0
[1657713098.977052] [helios017:1959767:0] wireup.c:1097 UCX ERROR old: lane[3]: 11:rc_mlx5/mlx5_0:1.0 md[4] -> md[6]/ib/sysdev[255] rma_bw#1 wireup
[1657713098.977056] [helios017:1959767:0] wireup.c:1087 UCX ERROR new: am_lane 0 wireup_msg_lane 3 cm_lane <none> keepalive_lane <none> reachable_mds 0x7fe
[1657713098.977060] [helios017:1959767:0] wireup.c:1097 UCX ERROR new: lane[0]: 7:sysv/memory.0 md[2] -> md[2]/sysv/sysdev[255] am am_bw#0
[1657713098.977065] [helios017:1959767:0] wireup.c:1097 UCX ERROR new: lane[1]: 31:xpmem/memory.0 md[10] -> md[10]/xpmem/sysdev[255] rkey_ptr
[1657713098.977069] [helios017:1959767:0] wireup.c:1097 UCX ERROR new: lane[2]: 21:rc_mlx5/mlx5_2:1.0 md[6] -> md[4]/ib/sysdev[255] rma_bw#0
[1657713098.977073] [helios017:1959767:0] wireup.c:1097 UCX ERROR new: lane[3]: 26:rc_mlx5/mlx5_3:1.0 md[7] -> md[5]/ib/sysdev[255] rma_bw#1 wireup
[helios017:1959767:0:1959767] wireup.c:1388 Fatal: endpoint reconfiguration not supported yet
[1657713098.979514] [helios017:1959759:0] wireup.c:1087 UCX ERROR old: am_lane 0 wireup_msg_lane 3 cm_lane <none> keepalive_lane <none> reachable_mds 0x7fe
[1657713098.979538] [helios017:1959759:0] wireup.c:1097 UCX ERROR old: lane[0]: 7:sysv/memory.0 md[2] -> md[2]/sysv/sysdev[255] am am_bw#0
[1657713098.979544] [helios017:1959759:0] wireup.c:1097 UCX ERROR old: lane[1]: 31:xpmem/memory.0 md[10] -> md[10]/xpmem/sysdev[255] rkey_ptr
[1657713098.979548] [helios017:1959759:0] wireup.c:1097 UCX ERROR old: lane[2]: 21:rc_mlx5/mlx5_2:1.0 md[6] -> md[4]/ib/sysdev[255] rma_bw#0
[1657713098.979552] [helios017:1959759:0] wireup.c:1097 UCX ERROR old: lane[3]: 11:rc_mlx5/mlx5_0:1.0 md[4] -> md[6]/ib/sysdev[255] rma_bw#1 wireup
[1657713098.979556] [helios017:1959759:0] wireup.c:1087 UCX ERROR new: am_lane 0 wireup_msg_lane 3 cm_lane <none> keepalive_lane <none> reachable_mds 0x7fe
[1657713098.979560] [helios017:1959759:0] wireup.c:1097 UCX ERROR new: lane[0]: 7:sysv/memory.0 md[2] -> md[2]/sysv/sysdev[255] am am_bw#0
[1657713098.979565] [helios017:1959759:0] wireup.c:1097 UCX ERROR new: lane[1]: 31:xpmem/memory.0 md[10] -> md[10]/xpmem/sysdev[255] rkey_ptr
[1657713098.979568] [helios017:1959759:0] wireup.c:1097 UCX ERROR new: lane[2]: 21:rc_mlx5/mlx5_2:1.0 md[6] -> md[4]/ib/sysdev[255] rma_bw#0
[1657713098.979571] [helios017:1959759:0] wireup.c:1097 UCX ERROR new: lane[3]: 26:rc_mlx5/mlx5_3:1.0 md[7] -> md[5]/ib/sysdev[255] rma_bw#1 wireup
[helios017:1959759:0:1959759] wireup.c:1388 Fatal: endpoint reconfiguration not supported yet
[1657713098.980056] [helios017:1959753:0] wireup.c:1087 UCX ERROR old: am_lane 0 wireup_msg_lane 3 cm_lane <none> keepalive_lane <none> reachable_mds 0x7fe
[1657713098.980080] [helios017:1959753:0] wireup.c:1097 UCX ERROR old: lane[0]: 7:sysv/memory.0 md[2] -> md[2]/sysv/sysdev[255] am am_bw#0
[1657713098.980086] [helios017:1959753:0] wireup.c:1097 UCX ERROR old: lane[1]: 31:xpmem/memory.0 md[10] -> md[10]/xpmem/sysdev[255] rkey_ptr
[1657713098.980090] [helios017:1959753:0] wireup.c:1097 UCX ERROR old: lane[2]: 21:rc_mlx5/mlx5_2:1.0 md[6] -> md[4]/ib/sysdev[255] rma_bw#0
[1657713098.980095] [helios017:1959753:0] wireup.c:1097 UCX ERROR old: lane[3]: 11:rc_mlx5/mlx5_0:1.0 md[4] -> md[6]/ib/sysdev[255] rma_bw#1 wireup
[1657713098.980098] [helios017:1959753:0] wireup.c:1087 UCX ERROR new: am_lane 0 wireup_msg_lane 3 cm_lane <none> keepalive_lane <none> reachable_mds 0x7fe
[1657713098.980103] [helios017:1959753:0] wireup.c:1097 UCX ERROR new: lane[0]: 7:sysv/memory.0 md[2] -> md[2]/sysv/sysdev[255] am am_bw#0
[1657713098.980107] [helios017:1959753:0] wireup.c:1097 UCX ERROR new: lane[1]: 31:xpmem/memory.0 md[10] -> md[10]/xpmem/sysdev[255] rkey_ptr
[1657713098.980110] [helios017:1959753:0] wireup.c:1097 UCX ERROR new: lane[2]: 21:rc_mlx5/mlx5_2:1.0 md[6] -> md[4]/ib/sysdev[255] rma_bw#0
[1657713098.980113] [helios017:1959753:0] wireup.c:1097 UCX ERROR new: lane[3]: 26:rc_mlx5/mlx5_3:1.0 md[7] -> md[5]/ib/sysdev[255] rma_bw#1 wireup
[helios017:1959753:0:1959753] wireup.c:1388 Fatal: endpoint reconfiguration not supported yet
[1657713098.993884] [helios017:1959773:a] wireup.c:1087 UCX ERROR old: am_lane 0 wireup_msg_lane 3 cm_lane <none> keepalive_lane <none> reachable_mds 0x7fe
[1657713098.993913] [helios017:1959773:a] wireup.c:1097 UCX ERROR old: lane[0]: 7:sysv/memory.0 md[2] -> md[2]/sysv/sysdev[255] am am_bw#0
[1657713098.993917] [helios017:1959773:a] wireup.c:1097 UCX ERROR old: lane[1]: 31:xpmem/memory.0 md[10] -> md[10]/xpmem/sysdev[255] rkey_ptr
[1657713098.993921] [helios017:1959773:a] wireup.c:1097 UCX ERROR old: lane[2]: 21:rc_mlx5/mlx5_2:1.0 md[6] -> md[4]/ib/sysdev[255] rma_bw#0
[1657713098.993924] [helios017:1959773:a] wireup.c:1097 UCX ERROR old: lane[3]: 11:rc_mlx5/mlx5_0:1.0 md[4] -> md[6]/ib/sysdev[255] rma_bw#1 wireup
[1657713098.993928] [helios017:1959773:a] wireup.c:1087 UCX ERROR new: am_lane 0 wireup_msg_lane 3 cm_lane <none> keepalive_lane <none> reachable_mds 0x7fe
[1657713098.993931] [helios017:1959773:a] wireup.c:1097 UCX ERROR new: lane[0]: 7:sysv/memory.0 md[2] -> md[2]/sysv/sysdev[255] am am_bw#0
[1657713098.993934] [helios017:1959773:a] wireup.c:1097 UCX ERROR new: lane[1]: 31:xpmem/memory.0 md[10] -> md[10]/xpmem/sysdev[255] rkey_ptr
[1657713098.993937] [helios017:1959773:a] wireup.c:1097 UCX ERROR new: lane[2]: 21:rc_mlx5/mlx5_2:1.0 md[6] -> md[4]/ib/sysdev[255] rma_bw#0
[1657713098.993940] [helios017:1959773:a] wireup.c:1097 UCX ERROR new: lane[3]: 26:rc_mlx5/mlx5_3:1.0 md[7] -> md[5]/ib/sysdev[255] rma_bw#1 wireup
[helios017:1959773:a:1959833] wireup.c:1388 Fatal: endpoint reconfiguration not supported yet
[1657713099.000867] [helios017:1959769:a] wireup.c:1087 UCX ERROR old: am_lane 0 wireup_msg_lane 3 cm_lane <none> keepalive_lane <none> reachable_mds 0x7fe
[1657713099.000897] [helios017:1959769:a] wireup.c:1097 UCX ERROR old: lane[0]: 7:sysv/memory.0 md[2] -> md[2]/sysv/sysdev[255] am am_bw#0
[1657713099.000901] [helios017:1959769:a] wireup.c:1097 UCX ERROR old: lane[1]: 31:xpmem/memory.0 md[10] -> md[10]/xpmem/sysdev[255] rkey_ptr
[1657713099.000905] [helios017:1959769:a] wireup.c:1097 UCX ERROR old: lane[2]: 21:rc_mlx5/mlx5_2:1.0 md[6] -> md[4]/ib/sysdev[255] rma_bw#0
[1657713099.000909] [helios017:1959769:a] wireup.c:1097 UCX ERROR old: lane[3]: 11:rc_mlx5/mlx5_0:1.0 md[4] -> md[6]/ib/sysdev[255] rma_bw#1 wireup
[1657713099.000912] [helios017:1959769:a] wireup.c:1087 UCX ERROR new: am_lane 0 wireup_msg_lane 3 cm_lane <none> keepalive_lane <none> reachable_mds 0x7fe
[1657713099.000915] [helios017:1959769:a] wireup.c:1097 UCX ERROR new: lane[0]: 7:sysv/memory.0 md[2] -> md[2]/sysv/sysdev[255] am am_bw#0
[1657713099.000918] [helios017:1959769:a] wireup.c:1097 UCX ERROR new: lane[1]: 31:xpmem/memory.0 md[10] -> md[10]/xpmem/sysdev[255] rkey_ptr
[1657713099.000921] [helios017:1959769:a] wireup.c:1097 UCX ERROR new: lane[2]: 21:rc_mlx5/mlx5_2:1.0 md[6] -> md[4]/ib/sysdev[255] rma_bw#0
[1657713099.000924] [helios017:1959769:a] wireup.c:1097 UCX ERROR new: lane[3]: 26:rc_mlx5/mlx5_3:1.0 md[7] -> md[5]/ib/sysdev[255] rma_bw#1 wireup
[helios017:1959769:a:1959834] wireup.c:1388 Fatal: endpoint reconfiguration not supported yet
/global/home/users/sbalsam/ucx/contrib/../src/ucp/wireup/wireup.c: [ ucp_wireup_init_lanes() ]
...
1385 NULL, cm_idx, UCS_LOG_LEVEL_ERROR);
1386 ucp_wireup_print_config(worker, &key, "new", NULL,
1387 cm_idx, UCS_LOG_LEVEL_ERROR);
==> 1388 ucs_fatal("endpoint reconfiguration not supported yet");
1389 }
1390
1391 ep->cfg_index = new_cfg_index;
/global/home/users/sbalsam/ucx/contrib/../src/ucp/wireup/wireup.c: [ ucp_wireup_init_lanes() ]
...
1385 NULL, cm_idx, UCS_LOG_LEVEL_ERROR);
1386 ucp_wireup_print_config(worker, &key, "new", NULL,
1387 cm_idx, UCS_LOG_LEVEL_ERROR);
==> 1388 ucs_fatal("endpoint reconfiguration not supported yet");
1389 }
1390
1391 ep->cfg_index = new_cfg_index;
/global/home/users/sbalsam/ucx/contrib/../src/ucp/wireup/wireup.c: [ ucp_wireup_init_lanes() ]
...
1385 NULL, cm_idx, UCS_LOG_LEVEL_ERROR);
1386 ucp_wireup_print_config(worker, &key, "new", NULL,
1387 cm_idx, UCS_LOG_LEVEL_ERROR);
==> 1388 ucs_fatal("endpoint reconfiguration not supported yet");
1389 }
1390
1391 ep->cfg_index = new_cfg_index;
/global/home/users/sbalsam/ucx/contrib/../src/ucp/wireup/wireup.c: [ ucp_wireup_init_lanes() ]
...
/global/home/users/sbalsam/ucx/contrib/../src/ucp/wireup/wireup.c: [ ucp_wireup_init_lanes() ]
...
1385 NULL, cm_idx, UCS_LOG_LEVEL_ERROR);
1386 ucp_wireup_print_config(worker, &key, "new", NULL,
1387 cm_idx, UCS_LOG_LEVEL_ERROR);
==> 1388 ucs_fatal("endpoint reconfiguration not supported yet");
1389 }
1390
1391 ep->cfg_index = new_cfg_index;
1385 NULL, cm_idx, UCS_LOG_LEVEL_ERROR);
1386 ucp_wireup_print_config(worker, &key, "new", NULL,
1387 cm_idx, UCS_LOG_LEVEL_ERROR);
==> 1388 ucs_fatal("endpoint reconfiguration not supported yet");
1389 }
1390
1391 ep->cfg_index = new_cfg_index;
==== backtrace (tid:1959833) ====
0 0x000000000009f169 ucp_wireup_init_lanes() /global/home/users/sbalsam/ucx/contrib/../src/ucp/wireup/wireup.c:1388
1 0x000000000009d09a ucp_wireup_process_request() /global/home/users/sbalsam/ucx/contrib/../src/ucp/wireup/wireup.c:583
2 0x000000000009d9a1 ucp_wireup_msg_handler() /global/home/users/sbalsam/ucx/contrib/../src/ucp/wireup/wireup.c:836
3 0x000000000005d7a1 uct_iface_invoke_am() /global/home/users/sbalsam/ucx/contrib/../src/uct/base/uct_iface.h:878
4 0x000000000005d7a1 uct_ud_ep_process_rx() /global/home/users/sbalsam/ucx/contrib/../src/uct/ib/ud/base/ud_ep.c:1049
5 0x0000000000066ac2 uct_ud_mlx5_iface_poll_rx() /global/home/users/sbalsam/ucx/contrib/../src/uct/ib/ud/accel/ud_mlx5.c:510
6 0x0000000000066ac2 uct_ud_mlx5_iface_async_progress() /global/home/users/sbalsam/ucx/contrib/../src/uct/ib/ud/accel/ud_mlx5.c:586
7 0x0000000000057fdb uct_ud_iface_async_progress() /global/home/users/sbalsam/ucx/contrib/../src/uct/ib/ud/base/ud_iface.c:256
8 0x0000000000057fdb uct_ud_iface_async_handler() /global/home/users/sbalsam/ucx/contrib/../src/uct/ib/ud/base/ud_iface.c:267
9 0x000000000004bdac ucs_async_handler_invoke() /global/home/users/sbalsam/ucx/contrib/../src/ucs/async/async.c:252
10 0x000000000004bdac ucs_async_handler_dispatch() /global/home/users/sbalsam/ucx/contrib/../src/ucs/async/async.c:274
11 0x000000000004bf75 ucs_async_dispatch_handlers() /global/home/users/sbalsam/ucx/contrib/../src/ucs/async/async.c:306
12 0x000000000004eb26 ucs_async_thread_ev_handler() /global/home/users/sbalsam/ucx/contrib/../src/ucs/async/thread.c:88
13 0x000000000006ba53 ucs_event_set_wait() /global/home/users/sbalsam/ucx/contrib/../src/ucs/sys/event_set.c:215
14 0x000000000004ec6c ucs_async_thread_func() /global/home/users/sbalsam/ucx/contrib/../src/ucs/async/thread.c:131
15 0x00000000000081cf start_thread() ???:0
16 0x0000000000039d83 __GI___clone() :0
=================================
[helios017:1959773] *** Process received signal ***
[helios017:1959773] Signal: Aborted (6)
[helios017:1959773] Signal code: (-6)
[helios017:1959773] [ 0] /lib64/libpthread.so.0(+0x12ce0)[0x154b4d4c3ce0]
[helios017:1959773] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x154b4c305a4f]
[helios017:1959773] [ 2] /lib64/libc.so.6(abort+0x127)[0x154b4c2d8db5]
[helios017:1959773] [ 3] /global/home/users/sbalsam/ucx/build/install/lib/libucs.so.0(+0x5e345)[0x154b4dd57345]
[helios017:1959773] [ 4] /global/home/users/sbalsam/ucx/build/install/lib/libucs.so.0(+0x5e419)[0x154b4dd57419]
[helios017:1959773] [ 5] /global/home/users/sbalsam/ucx/build/install/lib/libucp.so.0(ucp_wireup_init_lanes+0x359)[0x154b3fbcb169]
[helios017:1959773] [ 6] /global/home/users/sbalsam/ucx/build/install/lib/libucp.so.0(+0x9d09a)[0x154b3fbc909a]
[helios017:1959773] [ 7] /global/home/users/sbalsam/ucx/build/install/lib/libucp.so.0(+0x9d9a1)[0x154b3fbc99a1]
[helios017:1959773] [ 8] /global/home/users/sbalsam/ucx/build/install/lib/ucx/libuct_ib.so.0(uct_ud_ep_process_rx+0x1f1)[0x154b3ee887a1]
[helios017:1959773] [ 9] /global/home/users/sbalsam/ucx/build/install/lib/ucx/libuct_ib.so.0(+0x66ac2)[0x154b3ee91ac2]
[helios017:1959773] [10] /global/home/users/sbalsam/ucx/build/install/lib/ucx/libuct_ib.so.0(+0x57fdb)[0x154b3ee82fdb]
[helios017:1959773] [11] /global/home/users/sbalsam/ucx/build/install/lib/libucs.so.0(+0x4bdac)[0x154b4dd44dac]
[helios017:1959773] [12] /global/home/users/sbalsam/ucx/build/install/lib/libucs.so.0(ucs_async_dispatch_handlers+0xe5)[0x154b4dd44f75]
[helios017:1959773] [13] /global/home/users/sbalsam/ucx/build/install/lib/libucs.so.0(+0x4eb26)[0x154b4dd47b26]
[helios017:1959773] [14] /global/home/users/sbalsam/ucx/build/install/lib/libucs.so.0(ucs_event_set_wait+0xa3)[0x154b4dd64a53]
[helios017:1959773] [15] /global/home/users/sbalsam/ucx/build/install/lib/libucs.so.0(+0x4ec6c)[0x154b4dd47c6c]
[helios017:1959773] [16] /lib64/libpthread.so.0(+0x81cf)[0x154b4d4b91cf]
[helios017:1959773] [17] /lib64/libc.so.6(clone+0x43)[0x154b4c2f0d83]
[helios017:1959773] *** End of error message ***
==== backtrace (tid:1959834) ====
0 0x000000000009f169 ucp_wireup_init_lanes() /global/home/users/sbalsam/ucx/contrib/../src/ucp/wireup/wireup.c:1388
1 0x000000000009d09a ucp_wireup_process_request() /global/home/users/sbalsam/ucx/contrib/../src/ucp/wireup/wireup.c:583
2 0x000000000009d9a1 ucp_wireup_msg_handler() /global/home/users/sbalsam/ucx/contrib/../src/ucp/wireup/wireup.c:836
3 0x000000000005d7a1 uct_iface_invoke_am() /global/home/users/sbalsam/ucx/contrib/../src/uct/base/uct_iface.h:878
4 0x000000000005d7a1 uct_ud_ep_process_rx() /global/home/users/sbalsam/ucx/contrib/../src/uct/ib/ud/base/ud_ep.c:1049
5 0x0000000000066ac2 uct_ud_mlx5_iface_poll_rx() /global/home/users/sbalsam/ucx/contrib/../src/uct/ib/ud/accel/ud_mlx5.c:510
6 0x0000000000066ac2 uct_ud_mlx5_iface_async_progress() /global/home/users/sbalsam/ucx/contrib/../src/uct/ib/ud/accel/ud_mlx5.c:586
7 0x000000000005804b uct_ud_iface_async_progress() /global/home/users/sbalsam/ucx/contrib/../src/uct/ib/ud/base/ud_iface.c:256
8 0x000000000005804b uct_ud_iface_timer() /global/home/users/sbalsam/ucx/contrib/../src/uct/ib/ud/base/ud_iface.c:286
9 0x000000000004bdac ucs_async_handler_invoke() /global/home/users/sbalsam/ucx/contrib/../src/ucs/async/async.c:252
10 0x000000000004bdac ucs_async_handler_dispatch() /global/home/users/sbalsam/ucx/contrib/../src/ucs/async/async.c:274
11 0x000000000004bf75 ucs_async_dispatch_handlers() /global/home/users/sbalsam/ucx/contrib/../src/ucs/async/async.c:306
12 0x000000000004c130 ucs_async_dispatch_timerq() /global/home/users/sbalsam/ucx/contrib/../src/ucs/async/async.c:333
13 0x000000000004ecb5 ucs_async_thread_func() /global/home/users/sbalsam/ucx/contrib/../src/ucs/async/thread.c:142
14 0x00000000000081cf start_thread() ???:0
15 0x0000000000039d83 __GI___clone() :0
=================================
[helios017:1959769] *** Process received signal ***
[helios017:1959769] Signal: Aborted (6)
[helios017:1959769] Signal code: (-6)
[helios017:1959769] [ 0] /lib64/libpthread.so.0(+0x12ce0)[0x1477964b8ce0]
[helios017:1959769] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x1477952faa4f]
[helios017:1959769] [ 2] /lib64/libc.so.6(abort+0x127)[0x1477952cddb5]
[helios017:1959769] [ 3] /global/home/users/sbalsam/ucx/build/install/lib/libucs.so.0(+0x5e345)[0x147796d4c345]
[helios017:1959769] [ 4] /global/home/users/sbalsam/ucx/build/install/lib/libucs.so.0(+0x5e419)[0x147796d4c419]
[helios017:1959769] [ 5] /global/home/users/sbalsam/ucx/build/install/lib/libucp.so.0(ucp_wireup_init_lanes+0x359)[0x14778cd57169]
[helios017:1959769] [ 6] /global/home/users/sbalsam/ucx/build/install/lib/libucp.so.0(+0x9d09a)[0x14778cd5509a]
[helios017:1959769] [ 7] /global/home/users/sbalsam/ucx/build/install/lib/libucp.so.0(+0x9d9a1)[0x14778cd559a1]
[helios017:1959769] [ 8] /global/home/users/sbalsam/ucx/build/install/lib/ucx/libuct_ib.so.0(uct_ud_ep_process_rx+0x1f1)[0x147787dd47a1]
[helios017:1959769] [ 9] /global/home/users/sbalsam/ucx/build/install/lib/ucx/libuct_ib.so.0(+0x66ac2)[0x147787dddac2]
[helios017:1959769] [10] /global/home/users/sbalsam/ucx/build/install/lib/ucx/libuct_ib.so.0(+0x5804b)[0x147787dcf04b]
[helios017:1959769] [11] /global/home/users/sbalsam/ucx/build/install/lib/libucs.so.0(+0x4bdac)[0x147796d39dac]
[helios017:1959769] [12] /global/home/users/sbalsam/ucx/build/install/lib/libucs.so.0(ucs_async_dispatch_handlers+0xe5)[0x147796d39f75]
[helios017:1959769] [13] /global/home/users/sbalsam/ucx/build/install/lib/libucs.so.0(ucs_async_dispatch_timerq+0xd0)[0x147796d3a130]
[helios017:1959769] [14] /global/home/users/sbalsam/ucx/build/install/lib/libucs.so.0(+0x4ecb5)[0x147796d3ccb5]
[helios017:1959769] [15] /lib64/libpthread.so.0(+0x81cf)[0x1477964ae1cf]
[helios017:1959769] [16] /lib64/libc.so.6(clone+0x43)[0x1477952e5d83]
[helios017:1959769] *** End of error message ***
==== backtrace (tid:1959767) ====
0 0x000000000009f169 ucp_wireup_init_lanes() /global/home/users/sbalsam/ucx/contrib/../src/ucp/wireup/wireup.c:1388
1 0x000000000009d09a ucp_wireup_process_request() /global/home/users/sbalsam/ucx/contrib/../src/ucp/wireup/wireup.c:583
2 0x000000000009d9a1 ucp_wireup_msg_handler() /global/home/users/sbalsam/ucx/contrib/../src/ucp/wireup/wireup.c:836
3 0x000000000005d7a1 uct_iface_invoke_am() /global/home/users/sbalsam/ucx/contrib/../src/uct/base/uct_iface.h:878
4 0x000000000005d7a1 uct_ud_ep_process_rx() /global/home/users/sbalsam/ucx/contrib/../src/uct/ib/ud/base/ud_ep.c:1049
5 0x0000000000066517 uct_ud_mlx5_iface_poll_rx() /global/home/users/sbalsam/ucx/contrib/../src/uct/ib/ud/accel/ud_mlx5.c:510
6 0x0000000000066517 uct_ud_mlx5_iface_progress() /global/home/users/sbalsam/ucx/contrib/../src/uct/ib/ud/accel/ud_mlx5.c:567
7 0x0000000000046422 ucs_callbackq_dispatch() /global/home/users/sbalsam/ucx/contrib/../src/ucs/datastruct/callbackq.h:211
8 0x0000000000046422 uct_worker_progress() /global/home/users/sbalsam/ucx/contrib/../src/uct/api/uct.h:2647
9 0x0000000000046422 ucp_worker_progress() /global/home/users/sbalsam/ucx/contrib/../src/ucp/core/ucp_worker.c:2804
10 0x0000000000016008 ucc_tl_ucp_test() /.autodirect/mtrsysgwork/sbalsam/ucc/build/src/components/tl/ucp/../../../../../src/components/tl/ucp/allreduce/../tl_ucp_coll.h:266
11 0x0000000000016008 ucc_tl_ucp_allreduce_knomial_progress() /.autodirect/mtrsysgwork/sbalsam/ucc/build/src/components/tl/ucp/../../../../../src/components/tl/ucp/allreduce/allreduce_knomial.c:129
12 0x000000000000f60b ucc_pq_st_progress() /.autodirect/mtrsysgwork/sbalsam/ucc/build/src/../../src/core/ucc_progress_queue_st.c:31
13 0x000000000000a7ae ucc_progress_queue() /.autodirect/mtrsysgwork/sbalsam/ucc/build/src/../../src/core/ucc_progress_queue.h:46
14 0x00000000004077ea UccTestMpi::create_ucc_team() /.autodirect/mtrsysgwork/sbalsam/ucc/build/test/mpi/../../../test/mpi/test_mpi.cc:214
15 0x000000000040964a UccTestMpi::create_teams() /.autodirect/mtrsysgwork/sbalsam/ucc/build/test/mpi/../../../test/mpi/test_mpi.cc:234
16 0x000000000040964a UccTestMpi::create_teams() /.autodirect/mtrsysgwork/sbalsam/ucc/build/test/mpi/../../../test/mpi/test_mpi.cc:156
17 0x0000000000405575 main() /.autodirect/mtrsysgwork/sbalsam/ucc/build/test/mpi/../../../test/mpi/main.cc:522
18 0x000000000003aca3 __libc_start_main() ???:0
19 0x000000000040611f _start() ???:0
=================================
[helios017:1959767] *** Process received signal ***
[helios017:1959767] Signal: Aborted (6)
[helios017:1959767] Signal code: (-6)
[helios017:1959767] [ 0] /lib64/libpthread.so.0(+0x12ce0)[0x14c9ac6b8ce0]
[helios017:1959767] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x14c9ab4faa4f]
[helios017:1959767] [ 2] /lib64/libc.so.6(abort+0x127)[0x14c9ab4cddb5]
[helios017:1959767] [ 3] /global/home/users/sbalsam/ucx/build/install/lib/libucs.so.0(+0x5e345)[0x14c9acf4c345]
[helios017:1959767] [ 4] /global/home/users/sbalsam/ucx/build/install/lib/libucs.so.0(+0x5e419)[0x14c9acf4c419]
[helios017:1959767] [ 5] /global/home/users/sbalsam/ucx/build/install/lib/libucp.so.0(ucp_wireup_init_lanes+0x359)[0x14c99ed9f169]
[helios017:1959767] [ 6] /global/home/users/sbalsam/ucx/build/install/lib/libucp.so.0(+0x9d09a)[0x14c99ed9d09a]
[helios017:1959767] [ 7] /global/home/users/sbalsam/ucx/build/install/lib/libucp.so.0(+0x9d9a1)[0x14c99ed9d9a1]
[helios017:1959767] [ 8] /global/home/users/sbalsam/ucx/build/install/lib/ucx/libuct_ib.so.0(uct_ud_ep_process_rx+0x1f1)[0x14c99e05c7a1]
[helios017:1959767] [ 9] /global/home/users/sbalsam/ucx/build/install/lib/ucx/libuct_ib.so.0(+0x66517)[0x14c99e065517]
[helios017:1959767] [10] /global/home/users/sbalsam/ucx/build/install/lib/libucp.so.0(ucp_worker_progress+0x22)[0x14c99ed46422]
[helios017:1959767] [11] /global/home/users/sbalsam/ucc_shm/lib/ucc/libucc_tl_ucp.so(ucc_tl_ucp_allreduce_knomial_progress+0x488)[0x14c95a3dd008]
[helios017:1959767] [12] /global/home/users/sbalsam/ucc_shm/lib/libucc.so.1(+0xf60b)[0x14c9ad2a060b]
[helios017:1959767] [13] /global/home/users/sbalsam/ucc_shm/lib/libucc.so.1(ucc_context_progress+0x3e)[0x14c9ad29b7ae]
[helios017:1959767] [14] ucc_test_mpi[0x4077ea]
[helios017:1959767] [15] ucc_test_mpi[0x40964a]
[helios017:1959767] [16] ucc_test_mpi[0x405575]
[helios017:1959767] [17] /lib64/libc.so.6(__libc_start_main+0xf3)[0x14c9ab4e6ca3]
[helios017:1959767] [18] ucc_test_mpi[0x40611f]
[helios017:1959767] *** End of error message ***
==== backtrace (tid:1959759) ====
0 0x000000000009f169 ucp_wireup_init_lanes() /global/home/users/sbalsam/ucx/contrib/../src/ucp/wireup/wireup.c:1388
1 0x000000000009d09a ucp_wireup_process_request() /global/home/users/sbalsam/ucx/contrib/../src/ucp/wireup/wireup.c:583
2 0x000000000009d9a1 ucp_wireup_msg_handler() /global/home/users/sbalsam/ucx/contrib/../src/ucp/wireup/wireup.c:836
3 0x000000000005d7a1 uct_iface_invoke_am() /global/home/users/sbalsam/ucx/contrib/../src/uct/base/uct_iface.h:878
4 0x000000000005d7a1 uct_ud_ep_process_rx() /global/home/users/sbalsam/ucx/contrib/../src/uct/ib/ud/base/ud_ep.c:1049
5 0x0000000000066517 uct_ud_mlx5_iface_poll_rx() /global/home/users/sbalsam/ucx/contrib/../src/uct/ib/ud/accel/ud_mlx5.c:510
6 0x0000000000066517 uct_ud_mlx5_iface_progress() /global/home/users/sbalsam/ucx/contrib/../src/uct/ib/ud/accel/ud_mlx5.c:567
7 0x0000000000056553 ucs_callbackq_slow_proxy() /global/home/users/sbalsam/ucx/contrib/../src/ucs/datastruct/callbackq.c:404
8 0x0000000000046422 ucs_callbackq_dispatch() /global/home/users/sbalsam/ucx/contrib/../src/ucs/datastruct/callbackq.h:211
9 0x0000000000046422 uct_worker_progress() /global/home/users/sbalsam/ucx/contrib/../src/uct/api/uct.h:2647
10 0x0000000000046422 ucp_worker_progress() /global/home/users/sbalsam/ucx/contrib/../src/ucp/core/ucp_worker.c:2804
11 0x0000000000016008 ucc_tl_ucp_test() /.autodirect/mtrsysgwork/sbalsam/ucc/build/src/components/tl/ucp/../../../../../src/components/tl/ucp/allreduce/../tl_ucp_coll.h:266
12 0x0000000000016008 ucc_tl_ucp_allreduce_knomial_progress() /.autodirect/mtrsysgwork/sbalsam/ucc/build/src/components/tl/ucp/../../../../../src/components/tl/ucp/allreduce/allreduce_knomial.c:129
13 0x000000000000f60b ucc_pq_st_progress() /.autodirect/mtrsysgwork/sbalsam/ucc/build/src/../../src/core/ucc_progress_queue_st.c:31
14 0x000000000000a7ae ucc_progress_queue() /.autodirect/mtrsysgwork/sbalsam/ucc/build/src/../../src/core/ucc_progress_queue.h:46
15 0x000000000000ce14 ucc_team_alloc_id() /.autodirect/mtrsysgwork/sbalsam/ucc/build/src/../../src/core/ucc_team.c:581
16 0x000000000000ce14 ucc_team_create_test_single() /.autodirect/mtrsysgwork/sbalsam/ucc/build/src/../../src/core/ucc_team.c:418
17 0x0000000000407803 UccTestMpi::create_ucc_team() /.autodirect/mtrsysgwork/sbalsam/ucc/build/test/mpi/../../../test/mpi/test_mpi.cc:213
18 0x000000000040964a UccTestMpi::create_teams() /.autodirect/mtrsysgwork/sbalsam/ucc/build/test/mpi/../../../test/mpi/test_mpi.cc:234
19 0x000000000040964a UccTestMpi::create_teams() /.autodirect/mtrsysgwork/sbalsam/ucc/build/test/mpi/../../../test/mpi/test_mpi.cc:156
20 0x0000000000405575 main() /.autodirect/mtrsysgwork/sbalsam/ucc/build/test/mpi/../../../test/mpi/main.cc:522
21 0x000000000003aca3 __libc_start_main() ???:0
22 0x000000000040611f _start() ???:0
=================================
[helios017:1959759] *** Process received signal ***
[helios017:1959759] Signal: Aborted (6)
[helios017:1959759] Signal code: (-6)
[helios017:1959759] [ 0] /lib64/libpthread.so.0(+0x12ce0)[0x15099bde5ce0]
[helios017:1959759] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x15099ac27a4f]
[helios017:1959759] [ 2] /lib64/libc.so.6(abort+0x127)[0x15099abfadb5]
[helios017:1959759] [ 3] ==== backtrace (tid:1959753) ====
0 0x000000000009f169 ucp_wireup_init_lanes() /global/home/users/sbalsam/ucx/contrib/../src/ucp/wireup/wireup.c:1388
/global/home/users/sbalsam/ucx/build/install/lib/libucs.so.0(+0x5e345)[0x15099c679345]
[helios017:1959759] [ 4] /global/home/users/sbalsam/ucx/build/install/lib/libucs.so.0(+0x5e419)[0x15099c679419]
1 0x000000000009d09a ucp_wireup_process_request() /global/home/users/sbalsam/ucx/contrib/../src/ucp/wireup/wireup.c:583
2 0x000000000009d9a1 ucp_wireup_msg_handler() /global/home/users/sbalsam/ucx/contrib/../src/ucp/wireup/wireup.c:836
3 0x000000000005d7a1 uct_iface_invoke_am() /global/home/users/sbalsam/ucx/contrib/../src/uct/base/uct_iface.h:878
4 0x000000000005d7a1 uct_ud_ep_process_rx() /global/home/users/sbalsam/ucx/contrib/../src/uct/ib/ud/base/ud_ep.c:1049
5 0x0000000000066517 uct_ud_mlx5_iface_poll_rx() /global/home/users/sbalsam/ucx/contrib/../src/uct/ib/ud/accel/ud_mlx5.c:510
6 0x0000000000066517 uct_ud_mlx5_iface_progress() /global/home/users/sbalsam/ucx/contrib/../src/uct/ib/ud/accel/ud_mlx5.c:567
7 0x0000000000056553 ucs_callbackq_slow_proxy() /global/home/users/sbalsam/ucx/contrib/../src/ucs/datastruct/callbackq.c:404
8 0x0000000000046422 ucs_callbackq_dispatch() /global/home/users/sbalsam/ucx/contrib/../src/ucs/datastruct/callbackq.h:211
9 0x0000000000046422 uct_worker_progress() /global/home/users/sbalsam/ucx/contrib/../src/uct/api/uct.h:2647
10 0x0000000000046422 ucp_worker_progress() /global/home/users/sbalsam/ucx/contrib/../src/ucp/core/ucp_worker.c:2804
[helios017:1959759] [ 5] /global/home/users/sbalsam/ucx/build/install/lib/libucp.so.0(ucp_wireup_init_lanes+0x359)[0x15098e574169]
[helios017:1959759] [ 6] 11 0x0000000000016008 ucc_tl_ucp_test() /.autodirect/mtrsysgwork/sbalsam/ucc/build/src/components/tl/ucp/../../../../../src/components/tl/ucp/allreduce/../tl_ucp_coll.h:266
12 0x0000000000016008 ucc_tl_ucp_allreduce_knomial_progress() /.autodirect/mtrsysgwork/sbalsam/ucc/build/src/components/tl/ucp/../../../../../src/components/tl/ucp/allreduce/allreduce_knomial.c:129
13 0x000000000000f60b ucc_pq_st_progress() /.autodirect/mtrsysgwork/sbalsam/ucc/build/src/../../src/core/ucc_progress_queue_st.c:31
14 0x000000000000a7ae ucc_progress_queue() /.autodirect/mtrsysgwork/sbalsam/ucc/build/src/../../src/core/ucc_progress_queue.h:46
15 0x000000000000ce14 ucc_team_alloc_id() /.autodirect/mtrsysgwork/sbalsam/ucc/build/src/../../src/core/ucc_team.c:581
16 0x000000000000ce14 ucc_team_create_test_single() /.autodirect/mtrsysgwork/sbalsam/ucc/build/src/../../src/core/ucc_team.c:418
17 0x0000000000407803 UccTestMpi::create_ucc_team() /.autodirect/mtrsysgwork/sbalsam/ucc/build/test/mpi/../../../test/mpi/test_mpi.cc:213
18 0x000000000040964a UccTestMpi::create_teams() /.autodirect/mtrsysgwork/sbalsam/ucc/build/test/mpi/../../../test/mpi/test_mpi.cc:234
19 0x000000000040964a UccTestMpi::create_teams() /.autodirect/mtrsysgwork/sbalsam/ucc/build/test/mpi/../../../test/mpi/test_mpi.cc:156
20 0x0000000000405575 main() /.autodirect/mtrsysgwork/sbalsam/ucc/build/test/mpi/../../../test/mpi/main.cc:522
21 0x000000000003aca3 __libc_start_main() ???:0
22 0x000000000040611f _start() ???:0
=================================
/global/home/users/sbalsam/ucx/build/install/lib/libucp.so.0(+0x9d09a)[0x15098e57209a]
[helios017:1959759] [ 7] /global/home/users/sbalsam/ucx/build/install/lib/libucp.so.0(+0x9d9a1)[0x15098e5729a1]
[helios017:1959759] [ 8] /global/home/users/sbalsam/ucx/build/install/lib/ucx/libuct_ib.so.0(uct_ud_ep_process_rx+0x1f1)[0x15098d8317a1]
[helios017:1959759] [ 9] /global/home/users/sbalsam/ucx/build/install/lib/ucx/libuct_ib.so.0(+0x66517)[0x15098d83a517]
[helios017:1959759] [10] /global/home/users/sbalsam/ucx/build/install/lib/libucs.so.0(+0x56553)[0x15099c671553]
[helios017:1959759] [11] /global/home/users/sbalsam/ucx/build/install/lib/libucp.so.0(ucp_worker_progress+0x22)[0x15098e51b422]
[helios017:1959759] [12] /global/home/users/sbalsam/ucc_shm/lib/ucc/libucc_tl_ucp.so(ucc_tl_ucp_allreduce_knomial_progress+0x488)[0x1509515f8008]
[helios017:1959759] [13] /global/home/users/sbalsam/ucc_shm/lib/libucc.so.1(+0xf60b)[0x15099c9cd60b]
[helios017:1959759] [14] /global/home/users/sbalsam/ucc_shm/lib/libucc.so.1(ucc_context_progress+0x3e)[0x15099c9c87ae]
[helios017:1959753] *** Process received signal ***
[helios017:1959753] Signal: Aborted (6)
[helios017:1959753] Signal code: (-6)
[helios017:1959759] [15] /global/home/users/sbalsam/ucc_shm/lib/libucc.so.1(ucc_team_create_test_single+0x454)[0x15099c9cae14]
[helios017:1959759] [16] ucc_test_mpi[0x407803]
[helios017:1959759] [17] ucc_test_mpi[0x40964a]
[helios017:1959759] [18] ucc_test_mpi[0x405575]
[helios017:1959759] [19] /lib64/libc.so.6(__libc_start_main+0xf3)[0x15099ac13ca3]
[helios017:1959759] [20] ucc_test_mpi[0x40611f]
[helios017:1959759] *** End of error message ***
[helios017:1959753] [ 0] /lib64/libpthread.so.0(+0x12ce0)[0x14a6e2dc5ce0]
[helios017:1959753] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x14a6e1c07a4f]
[helios017:1959753] [ 2] /lib64/libc.so.6(abort+0x127)[0x14a6e1bdadb5]
[helios017:1959753] [ 3] /global/home/users/sbalsam/ucx/build/install/lib/libucs.so.0(+0x5e345)[0x14a6e3659345]
[helios017:1959753] [ 4] /global/home/users/sbalsam/ucx/build/install/lib/libucs.so.0(+0x5e419)[0x14a6e3659419]
[helios017:1959753] [ 5] /global/home/users/sbalsam/ucx/build/install/lib/libucp.so.0(ucp_wireup_init_lanes+0x359)[0x14a6d9664169]
[helios017:1959753] [ 6] /global/home/users/sbalsam/ucx/build/install/lib/libucp.so.0(+0x9d09a)[0x14a6d966209a]
[helios017:1959753] [ 7] /global/home/users/sbalsam/ucx/build/install/lib/libucp.so.0(+0x9d9a1)[0x14a6d96629a1]
[helios017:1959753] [ 8] /global/home/users/sbalsam/ucx/build/install/lib/ucx/libuct_ib.so.0(uct_ud_ep_process_rx+0x1f1)[0x14a6d89217a1]
[helios017:1959753] [ 9] /global/home/users/sbalsam/ucx/build/install/lib/ucx/libuct_ib.so.0(+0x66517)[0x14a6d892a517]
[helios017:1959753] [10] /global/home/users/sbalsam/ucx/build/install/lib/libucs.so.0(+0x56553)[0x14a6e3651553]
[helios017:1959753] [11] /global/home/users/sbalsam/ucx/build/install/lib/libucp.so.0(ucp_worker_progress+0x22)[0x14a6d960b422]
[helios017:1959753] [12] /global/home/users/sbalsam/ucc_shm/lib/ucc/libucc_tl_ucp.so(ucc_tl_ucp_allreduce_knomial_progress+0x488)[0x14a694082008]
[helios017:1959753] [13] /global/home/users/sbalsam/ucc_shm/lib/libucc.so.1(+0xf60b)[0x14a6e39ad60b]
[helios017:1959753] [14] /global/home/users/sbalsam/ucc_shm/lib/libucc.so.1(ucc_context_progress+0x3e)[0x14a6e39a87ae]
[helios017:1959753] [15] /global/home/users/sbalsam/ucc_shm/lib/libucc.so.1(ucc_team_create_test_single+0x454)[0x14a6e39aae14]
[helios017:1959753] [16] ucc_test_mpi[0x407803]
[helios017:1959753] [17] ucc_test_mpi[0x40964a]
[helios017:1959753] [18] ucc_test_mpi[0x405575]
[helios017:1959753] [19] /lib64/libc.so.6(__libc_start_main+0xf3)[0x14a6e1bf3ca3]
[helios017:1959753] [20] ucc_test_mpi[0x40611f]
[helios017:1959753] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 27 with PID 1959773 on node helios017 exited on signal 6 (Aborted).
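For anyone reading the dump above: in every process, the "old" and "new" configurations print identical am_lane lines and identical lane[0] to lane[2]; only lane[3] changes (rc_mlx5/mlx5_0 in "old", rc_mlx5/mlx5_3 in "new"). Below is a minimal Python sketch for checking this mechanically on a saved log. It is not part of UCX; the function name `diff_lanes` and the regular expression are assumptions of this sketch, based only on the line format visible in this output.

```python
import re

# Matches UCX wireup lane lines as they appear in this log, e.g.
#   "... UCX ERROR old: lane[3]: 11:rc_mlx5/mlx5_0:1.0 md[4] -> ..."
# (pattern inferred from the output above, not from UCX documentation)
LANE_RE = re.compile(r"UCX ERROR (old|new): lane\[(\d+)\]: (.*)$")


def diff_lanes(log_lines):
    """Return {lane_index: (old_desc, new_desc)} for lanes that differ."""
    lanes = {"old": {}, "new": {}}
    for line in log_lines:
        m = LANE_RE.search(line)
        if m:
            lanes[m.group(1)][int(m.group(2))] = m.group(3).strip()
    changed = {}
    for idx in sorted(set(lanes["old"]) | set(lanes["new"])):
        old, new = lanes["old"].get(idx), lanes["new"].get(idx)
        if old != new:
            changed[idx] = (old, new)
    return changed


# Lane lines copied from process 1959767 in the output above.
sample = [
    "[1657713098.977038] [helios017:1959767:0] wireup.c:1097 UCX ERROR old: lane[0]: 7:sysv/memory.0 md[2] -> md[2]/sysv/sysdev[255] am am_bw#0",
    "[1657713098.977044] [helios017:1959767:0] wireup.c:1097 UCX ERROR old: lane[1]: 31:xpmem/memory.0 md[10] -> md[10]/xpmem/sysdev[255] rkey_ptr",
    "[1657713098.977048] [helios017:1959767:0] wireup.c:1097 UCX ERROR old: lane[2]: 21:rc_mlx5/mlx5_2:1.0 md[6] -> md[4]/ib/sysdev[255] rma_bw#0",
    "[1657713098.977052] [helios017:1959767:0] wireup.c:1097 UCX ERROR old: lane[3]: 11:rc_mlx5/mlx5_0:1.0 md[4] -> md[6]/ib/sysdev[255] rma_bw#1 wireup",
    "[1657713098.977060] [helios017:1959767:0] wireup.c:1097 UCX ERROR new: lane[0]: 7:sysv/memory.0 md[2] -> md[2]/sysv/sysdev[255] am am_bw#0",
    "[1657713098.977065] [helios017:1959767:0] wireup.c:1097 UCX ERROR new: lane[1]: 31:xpmem/memory.0 md[10] -> md[10]/xpmem/sysdev[255] rkey_ptr",
    "[1657713098.977069] [helios017:1959767:0] wireup.c:1097 UCX ERROR new: lane[2]: 21:rc_mlx5/mlx5_2:1.0 md[6] -> md[4]/ib/sysdev[255] rma_bw#0",
    "[1657713098.977073] [helios017:1959767:0] wireup.c:1097 UCX ERROR new: lane[3]: 26:rc_mlx5/mlx5_3:1.0 md[7] -> md[5]/ib/sysdev[255] rma_bw#1 wireup",
]

changed = diff_lanes(sample)
for idx, (old, new) in changed.items():
    print(f"lane[{idx}] old: {old}")
    print(f"lane[{idx}] new: {new}")
```

The sketch keeps only the last "old"/"new" description seen per lane index, so for a log with several processes interleaved it is probably best to filter by PID first.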
@yosefe this flag (-x UCX_IB_PREFER_NEAREST_DEVICE=n) is also needed on helios for perf tests when running multi-node.
Describe the bug
The UCC MPI test (ucc_test_mpi) fails to run on the helios machine at hpcadvisorycouncil unless the flag -x UCX_IB_PREFER_NEAREST_DEVICE=n is added. On the Rome machine at hpcadvisorycouncil, for example, the same test runs without the extra flag.
Steps to Reproduce
Allocate 1 helios node on hpcadvisorycouncil
clone UCC
source hpcx-gcc-redhat7/hpcx-init.sh
hpcx_load
Note: in the commands below, set LD_LIBRARY_PATH to point to your local UCC/UCX library directories.
will fail: mpirun -x UCC_CL_BASIC_TLS=ucp -x UCC_CLS=basic --bind-to core -x UCC_TL_UCP_TUNE=inf -x LD_LIBRARY_PATH=/global/home/users/sbalsam/ucc_shm/lib:/global/home/users/sbalsam/ucx/build/install/lib:$LD_LIBRARY_PATH -np 28 --map-by core ucc_test_mpi --colls allreduce --mtypes host -m 4:16384 --onesided 0
will succeed: mpirun -x UCX_IB_PREFER_NEAREST_DEVICE=n -x UCC_CL_BASIC_TLS=ucp -x UCC_CLS=basic --bind-to core -x UCC_TL_UCP_TUNE=inf -x LD_LIBRARY_PATH=/global/home/users/sbalsam/ucc_shm/lib:/global/home/users/sbalsam/ucx/build/install/lib:$LD_LIBRARY_PATH -np 28 --map-by core ucc_test_mpi --colls allreduce --mtypes host -m 4:16384 --onesided 0
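Since the two commands differ only in the extra -x UCX_IB_PREFER_NEAREST_DEVICE=n and both need LD_LIBRARY_PATH adjusted per user, a small helper can generate them from one place. This is a rough sketch, not something from the UCC repo: `build_cmd` and the `/path/to/...` directories are hypothetical placeholders, and the script only prints the command lines rather than launching mpirun.

```python
import os
import shlex


def build_cmd(lib_dirs, workaround=False):
    """Build the mpirun argument list from this issue.

    lib_dirs: local UCC/UCX library directories (user-specific).
    workaround: if True, add -x UCX_IB_PREFER_NEAREST_DEVICE=n.
    """
    paths = list(lib_dirs)
    inherited = os.environ.get("LD_LIBRARY_PATH")
    if inherited:
        paths.append(inherited)  # keep whatever the environment already had

    cmd = ["mpirun"]
    if workaround:
        cmd += ["-x", "UCX_IB_PREFER_NEAREST_DEVICE=n"]
    cmd += [
        "-x", "UCC_CL_BASIC_TLS=ucp",
        "-x", "UCC_CLS=basic",
        "--bind-to", "core",
        "-x", "UCC_TL_UCP_TUNE=inf",
        "-x", "LD_LIBRARY_PATH=" + ":".join(paths),
        "-np", "28",
        "--map-by", "core",
        "ucc_test_mpi",
        "--colls", "allreduce",
        "--mtypes", "host",
        "-m", "4:16384",
        "--onesided", "0",
    ]
    return cmd


# Placeholder paths; replace with your own UCC/UCX install locations.
libs = ["/path/to/ucc/lib", "/path/to/ucx/lib"]
failing = build_cmd(libs)
passing = build_cmd(libs, workaround=True)

print("will fail:   ", shlex.join(failing))
print("will succeed:", shlex.join(passing))
```

The printed lines can be pasted into a shell on the allocated node; `shlex.join` requires Python 3.8 or newer.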