openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

UCC mpitest needs -x UCX_IB_PREFER_NEAREST_DEVICE=n flag to run on helios #8384

Open shimmybalsam opened 2 years ago

shimmybalsam commented 2 years ago

Describe the bug

UCC mpitest fails to run on the helios machine at hpcadvisorycouncil unless the flag -x UCX_IB_PREFER_NEAREST_DEVICE=n is added. On the Rome machine at hpcadvisorycouncil, for example, mpitest runs without the extra flag.

Steps to Reproduce

1. Allocate 1 helios node on hpcadvisorycouncil
2. clone UCC
3. source hpcx-gcc-redhat7/hpcx-init.sh
4. hpcx_load

Note: in the commands below, set LD_LIBRARY_PATH to point to your local ucc/ucx installs.

will fail: mpirun -x UCC_CL_BASIC_TLS=ucp -x UCC_CLS=basic --bind-to core -x UCC_TL_UCP_TUNE=inf -x LD_LIBRARY_PATH=/global/home/users/sbalsam/ucc_shm/lib:/global/home/users/sbalsam/ucx/build/install/lib:$LD_LIBRARY_PATH -np 28 --map-by core ucc_test_mpi --colls allreduce --mtypes host -m 4:16384 --onesided 0

will succeed: mpirun -x UCX_IB_PREFER_NEAREST_DEVICE=n -x UCC_CL_BASIC_TLS=ucp -x UCC_CLS=basic --bind-to core -x UCC_TL_UCP_TUNE=inf -x LD_LIBRARY_PATH=/global/home/users/sbalsam/ucc_shm/lib:/global/home/users/sbalsam/ucx/build/install/lib:$LD_LIBRARY_PATH -np 28 --map-by core ucc_test_mpi --colls allreduce --mtypes host -m 4:16384 --onesided 0

yosefe commented 2 years ago

@shimmybalsam can you pls post the full output?

shimmybalsam commented 2 years ago

@yosefe full error output:

mpirun -x UCC_CL_BASIC_TLS=ucp -x UCC_CLS=basic --bind-to core -x UCC_TL_UCP_TUNE=inf -x LD_LIBRARY_PATH=/global/home/users/sbalsam/ucc_shm/lib:/global/home/users/sbalsam/ucx/build/install/lib:$LD_LIBRARY_PATH -np 28 --map-by core ucc_test_mpi --colls allreduce --mtypes host -m 4:16384 --onesided 0
[1657713098.977001] [helios017:1959767:0]          wireup.c:1087 UCX  ERROR   old: am_lane 0 wireup_msg_lane 3 cm_lane <none> keepalive_lane <none> reachable_mds 0x7fe
[1657713098.977038] [helios017:1959767:0]          wireup.c:1097 UCX  ERROR   old: lane[0]:  7:sysv/memory.0 md[2]           -> md[2]/sysv/sysdev[255] am am_bw#0
[1657713098.977044] [helios017:1959767:0]          wireup.c:1097 UCX  ERROR   old: lane[1]: 31:xpmem/memory.0 md[10]          -> md[10]/xpmem/sysdev[255] rkey_ptr
[1657713098.977048] [helios017:1959767:0]          wireup.c:1097 UCX  ERROR   old: lane[2]: 21:rc_mlx5/mlx5_2:1.0 md[6]      -> md[4]/ib/sysdev[255] rma_bw#0
[1657713098.977052] [helios017:1959767:0]          wireup.c:1097 UCX  ERROR   old: lane[3]: 11:rc_mlx5/mlx5_0:1.0 md[4]      -> md[6]/ib/sysdev[255] rma_bw#1 wireup
[1657713098.977056] [helios017:1959767:0]          wireup.c:1087 UCX  ERROR   new: am_lane 0 wireup_msg_lane 3 cm_lane <none> keepalive_lane <none> reachable_mds 0x7fe
[1657713098.977060] [helios017:1959767:0]          wireup.c:1097 UCX  ERROR   new: lane[0]:  7:sysv/memory.0 md[2]           -> md[2]/sysv/sysdev[255] am am_bw#0
[1657713098.977065] [helios017:1959767:0]          wireup.c:1097 UCX  ERROR   new: lane[1]: 31:xpmem/memory.0 md[10]          -> md[10]/xpmem/sysdev[255] rkey_ptr
[1657713098.977069] [helios017:1959767:0]          wireup.c:1097 UCX  ERROR   new: lane[2]: 21:rc_mlx5/mlx5_2:1.0 md[6]      -> md[4]/ib/sysdev[255] rma_bw#0
[1657713098.977073] [helios017:1959767:0]          wireup.c:1097 UCX  ERROR   new: lane[3]: 26:rc_mlx5/mlx5_3:1.0 md[7]      -> md[5]/ib/sysdev[255] rma_bw#1 wireup
[helios017:1959767:0:1959767]      wireup.c:1388 Fatal: endpoint reconfiguration not supported yet
[1657713098.979514] [helios017:1959759:0]          wireup.c:1087 UCX  ERROR   old: am_lane 0 wireup_msg_lane 3 cm_lane <none> keepalive_lane <none> reachable_mds 0x7fe
[1657713098.979538] [helios017:1959759:0]          wireup.c:1097 UCX  ERROR   old: lane[0]:  7:sysv/memory.0 md[2]           -> md[2]/sysv/sysdev[255] am am_bw#0
[1657713098.979544] [helios017:1959759:0]          wireup.c:1097 UCX  ERROR   old: lane[1]: 31:xpmem/memory.0 md[10]          -> md[10]/xpmem/sysdev[255] rkey_ptr
[1657713098.979548] [helios017:1959759:0]          wireup.c:1097 UCX  ERROR   old: lane[2]: 21:rc_mlx5/mlx5_2:1.0 md[6]      -> md[4]/ib/sysdev[255] rma_bw#0                                                                                
[1657713098.979552] [helios017:1959759:0]          wireup.c:1097 UCX  ERROR   old: lane[3]: 11:rc_mlx5/mlx5_0:1.0 md[4]      -> md[6]/ib/sysdev[255] rma_bw#1 wireup                                                                         
[1657713098.979556] [helios017:1959759:0]          wireup.c:1087 UCX  ERROR   new: am_lane 0 wireup_msg_lane 3 cm_lane <none> keepalive_lane <none> reachable_mds 0x7fe                                                                      
[1657713098.979560] [helios017:1959759:0]          wireup.c:1097 UCX  ERROR   new: lane[0]:  7:sysv/memory.0 md[2]           -> md[2]/sysv/sysdev[255] am am_bw#0                                                                            
[1657713098.979565] [helios017:1959759:0]          wireup.c:1097 UCX  ERROR   new: lane[1]: 31:xpmem/memory.0 md[10]          -> md[10]/xpmem/sysdev[255] rkey_ptr                                                                           
[1657713098.979568] [helios017:1959759:0]          wireup.c:1097 UCX  ERROR   new: lane[2]: 21:rc_mlx5/mlx5_2:1.0 md[6]      -> md[4]/ib/sysdev[255] rma_bw#0                                                                                
[1657713098.979571] [helios017:1959759:0]          wireup.c:1097 UCX  ERROR   new: lane[3]: 26:rc_mlx5/mlx5_3:1.0 md[7]      -> md[5]/ib/sysdev[255] rma_bw#1 wireup                                                                         
[helios017:1959759:0:1959759]      wireup.c:1388 Fatal: endpoint reconfiguration not supported yet                                                                                                                                           
[1657713098.980056] [helios017:1959753:0]          wireup.c:1087 UCX  ERROR   old: am_lane 0 wireup_msg_lane 3 cm_lane <none> keepalive_lane <none> reachable_mds 0x7fe                                                                      
[1657713098.980080] [helios017:1959753:0]          wireup.c:1097 UCX  ERROR   old: lane[0]:  7:sysv/memory.0 md[2]           -> md[2]/sysv/sysdev[255] am am_bw#0                                                                            
[1657713098.980086] [helios017:1959753:0]          wireup.c:1097 UCX  ERROR   old: lane[1]: 31:xpmem/memory.0 md[10]          -> md[10]/xpmem/sysdev[255] rkey_ptr                                                                           
[1657713098.980090] [helios017:1959753:0]          wireup.c:1097 UCX  ERROR   old: lane[2]: 21:rc_mlx5/mlx5_2:1.0 md[6]      -> md[4]/ib/sysdev[255] rma_bw#0                                                                                
[1657713098.980095] [helios017:1959753:0]          wireup.c:1097 UCX  ERROR   old: lane[3]: 11:rc_mlx5/mlx5_0:1.0 md[4]      -> md[6]/ib/sysdev[255] rma_bw#1 wireup                                                                         
[1657713098.980098] [helios017:1959753:0]          wireup.c:1087 UCX  ERROR   new: am_lane 0 wireup_msg_lane 3 cm_lane <none> keepalive_lane <none> reachable_mds 0x7fe                                                                      
[1657713098.980103] [helios017:1959753:0]          wireup.c:1097 UCX  ERROR   new: lane[0]:  7:sysv/memory.0 md[2]           -> md[2]/sysv/sysdev[255] am am_bw#0                                                                            
[1657713098.980107] [helios017:1959753:0]          wireup.c:1097 UCX  ERROR   new: lane[1]: 31:xpmem/memory.0 md[10]          -> md[10]/xpmem/sysdev[255] rkey_ptr                                                                           
[1657713098.980110] [helios017:1959753:0]          wireup.c:1097 UCX  ERROR   new: lane[2]: 21:rc_mlx5/mlx5_2:1.0 md[6]      -> md[4]/ib/sysdev[255] rma_bw#0                                                                                
[1657713098.980113] [helios017:1959753:0]          wireup.c:1097 UCX  ERROR   new: lane[3]: 26:rc_mlx5/mlx5_3:1.0 md[7]      -> md[5]/ib/sysdev[255] rma_bw#1 wireup
[helios017:1959753:0:1959753]      wireup.c:1388 Fatal: endpoint reconfiguration not supported yet
[1657713098.993884] [helios017:1959773:a]          wireup.c:1087 UCX  ERROR   old: am_lane 0 wireup_msg_lane 3 cm_lane <none> keepalive_lane <none> reachable_mds 0x7fe
[1657713098.993913] [helios017:1959773:a]          wireup.c:1097 UCX  ERROR   old: lane[0]:  7:sysv/memory.0 md[2]           -> md[2]/sysv/sysdev[255] am am_bw#0
[1657713098.993917] [helios017:1959773:a]          wireup.c:1097 UCX  ERROR   old: lane[1]: 31:xpmem/memory.0 md[10]          -> md[10]/xpmem/sysdev[255] rkey_ptr
[1657713098.993921] [helios017:1959773:a]          wireup.c:1097 UCX  ERROR   old: lane[2]: 21:rc_mlx5/mlx5_2:1.0 md[6]      -> md[4]/ib/sysdev[255] rma_bw#0
[1657713098.993924] [helios017:1959773:a]          wireup.c:1097 UCX  ERROR   old: lane[3]: 11:rc_mlx5/mlx5_0:1.0 md[4]      -> md[6]/ib/sysdev[255] rma_bw#1 wireup
[1657713098.993928] [helios017:1959773:a]          wireup.c:1087 UCX  ERROR   new: am_lane 0 wireup_msg_lane 3 cm_lane <none> keepalive_lane <none> reachable_mds 0x7fe
[1657713098.993931] [helios017:1959773:a]          wireup.c:1097 UCX  ERROR   new: lane[0]:  7:sysv/memory.0 md[2]           -> md[2]/sysv/sysdev[255] am am_bw#0
[1657713098.993934] [helios017:1959773:a]          wireup.c:1097 UCX  ERROR   new: lane[1]: 31:xpmem/memory.0 md[10]          -> md[10]/xpmem/sysdev[255] rkey_ptr
[1657713098.993937] [helios017:1959773:a]          wireup.c:1097 UCX  ERROR   new: lane[2]: 21:rc_mlx5/mlx5_2:1.0 md[6]      -> md[4]/ib/sysdev[255] rma_bw#0
[1657713098.993940] [helios017:1959773:a]          wireup.c:1097 UCX  ERROR   new: lane[3]: 26:rc_mlx5/mlx5_3:1.0 md[7]      -> md[5]/ib/sysdev[255] rma_bw#1 wireup
[helios017:1959773:a:1959833]      wireup.c:1388 Fatal: endpoint reconfiguration not supported yet
[1657713099.000867] [helios017:1959769:a]          wireup.c:1087 UCX  ERROR   old: am_lane 0 wireup_msg_lane 3 cm_lane <none> keepalive_lane <none> reachable_mds 0x7fe
[1657713099.000897] [helios017:1959769:a]          wireup.c:1097 UCX  ERROR   old: lane[0]:  7:sysv/memory.0 md[2]           -> md[2]/sysv/sysdev[255] am am_bw#0
[1657713099.000901] [helios017:1959769:a]          wireup.c:1097 UCX  ERROR   old: lane[1]: 31:xpmem/memory.0 md[10]          -> md[10]/xpmem/sysdev[255] rkey_ptr
[1657713099.000905] [helios017:1959769:a]          wireup.c:1097 UCX  ERROR   old: lane[2]: 21:rc_mlx5/mlx5_2:1.0 md[6]      -> md[4]/ib/sysdev[255] rma_bw#0
[1657713099.000909] [helios017:1959769:a]          wireup.c:1097 UCX  ERROR   old: lane[3]: 11:rc_mlx5/mlx5_0:1.0 md[4]      -> md[6]/ib/sysdev[255] rma_bw#1 wireup
[1657713099.000912] [helios017:1959769:a]          wireup.c:1087 UCX  ERROR   new: am_lane 0 wireup_msg_lane 3 cm_lane <none> keepalive_lane <none> reachable_mds 0x7fe
[1657713099.000915] [helios017:1959769:a]          wireup.c:1097 UCX  ERROR   new: lane[0]:  7:sysv/memory.0 md[2]           -> md[2]/sysv/sysdev[255] am am_bw#0
[1657713099.000918] [helios017:1959769:a]          wireup.c:1097 UCX  ERROR   new: lane[1]: 31:xpmem/memory.0 md[10]          -> md[10]/xpmem/sysdev[255] rkey_ptr
[1657713099.000921] [helios017:1959769:a]          wireup.c:1097 UCX  ERROR   new: lane[2]: 21:rc_mlx5/mlx5_2:1.0 md[6]      -> md[4]/ib/sysdev[255] rma_bw#0
[1657713099.000924] [helios017:1959769:a]          wireup.c:1097 UCX  ERROR   new: lane[3]: 26:rc_mlx5/mlx5_3:1.0 md[7]      -> md[5]/ib/sysdev[255] rma_bw#1 wireup
[helios017:1959769:a:1959834]      wireup.c:1388 Fatal: endpoint reconfiguration not supported yet
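
In the configuration dumps above, the "old" and "new" lane lists are identical except for lane[3], which moves from rc_mlx5/mlx5_0:1.0 to rc_mlx5/mlx5_3:1.0. The following is a minimal Python sketch for extracting that difference from such output automatically. The regular expression is an assumption based only on the line layout shown in this log, and `changed_lanes` is a hypothetical helper name, not part of UCX or UCC.

```python
import re
from collections import defaultdict

# Assumed layout, taken from the log lines in this report:
# [timestamp] [host:pid:tid]  wireup.c:NNNN UCX  ERROR   old|new: lane[N]: IDX:resource ...
LANE_RE = re.compile(
    r"\[(?P<host>[^:\]]+):(?P<pid>\d+):\w+\]\s+wireup\.c:\d+\s+UCX\s+ERROR\s+"
    r"(?P<which>old|new): lane\[(?P<lane>\d+)\]:\s+\d+:(?P<resource>\S+)"
)


def changed_lanes(log_text):
    """Return {pid: {lane: (old_resource, new_resource)}} for lanes that differ."""
    configs = defaultdict(lambda: {"old": {}, "new": {}})
    for line in log_text.splitlines():
        m = LANE_RE.search(line)
        if m:
            configs[int(m["pid"])][m["which"]][int(m["lane"])] = m["resource"]

    diffs = {}
    for pid, cfg in configs.items():
        lanes = {
            lane: (cfg["old"].get(lane), cfg["new"].get(lane))
            for lane in sorted(set(cfg["old"]) | set(cfg["new"]))
            if cfg["old"].get(lane) != cfg["new"].get(lane)
        }
        if lanes:
            diffs[pid] = lanes
    return diffs


# Four lines copied from the output above (lanes 2 and 3 of process 1959767).
SAMPLE = """\
[1657713098.977048] [helios017:1959767:0]          wireup.c:1097 UCX  ERROR   old: lane[2]: 21:rc_mlx5/mlx5_2:1.0 md[6]      -> md[4]/ib/sysdev[255] rma_bw#0
[1657713098.977052] [helios017:1959767:0]          wireup.c:1097 UCX  ERROR   old: lane[3]: 11:rc_mlx5/mlx5_0:1.0 md[4]      -> md[6]/ib/sysdev[255] rma_bw#1 wireup
[1657713098.977069] [helios017:1959767:0]          wireup.c:1097 UCX  ERROR   new: lane[2]: 21:rc_mlx5/mlx5_2:1.0 md[6]      -> md[4]/ib/sysdev[255] rma_bw#0
[1657713098.977073] [helios017:1959767:0]          wireup.c:1097 UCX  ERROR   new: lane[3]: 26:rc_mlx5/mlx5_3:1.0 md[7]      -> md[5]/ib/sysdev[255] rma_bw#1 wireup
"""

if __name__ == "__main__":
    for pid, lanes in changed_lanes(SAMPLE).items():
        for lane, (old, new) in lanes.items():
            # prints: pid 1959767: lane[3] rc_mlx5/mlx5_0:1.0 -> rc_mlx5/mlx5_3:1.0
            print(f"pid {pid}: lane[{lane}] {old} -> {new}")
```

Fed the full output above instead of the four sample lines, the same lane[3] change should be reported for each of the five processes that logged a configuration dump.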

/global/home/users/sbalsam/ucx/contrib/../src/ucp/wireup/wireup.c: [ ucp_wireup_init_lanes() ]
      ...
     1385                                 NULL, cm_idx, UCS_LOG_LEVEL_ERROR);
     1386         ucp_wireup_print_config(worker, &key, "new", NULL,
     1387                                 cm_idx, UCS_LOG_LEVEL_ERROR);
==>  1388         ucs_fatal("endpoint reconfiguration not supported yet");
     1389     }
     1390 
     1391     ep->cfg_index = new_cfg_index;

/global/home/users/sbalsam/ucx/contrib/../src/ucp/wireup/wireup.c: [ ucp_wireup_init_lanes() ]
      ...
     1385                                 NULL, cm_idx, UCS_LOG_LEVEL_ERROR);
     1386         ucp_wireup_print_config(worker, &key, "new", NULL,
     1387                                 cm_idx, UCS_LOG_LEVEL_ERROR);
==>  1388         ucs_fatal("endpoint reconfiguration not supported yet");
     1389     }
     1390 
     1391     ep->cfg_index = new_cfg_index;

/global/home/users/sbalsam/ucx/contrib/../src/ucp/wireup/wireup.c: [ ucp_wireup_init_lanes() ]
      ...
     1385                                 NULL, cm_idx, UCS_LOG_LEVEL_ERROR);
     1386         ucp_wireup_print_config(worker, &key, "new", NULL,
     1387                                 cm_idx, UCS_LOG_LEVEL_ERROR);
==>  1388         ucs_fatal("endpoint reconfiguration not supported yet");
     1389     }
     1390 
     1391     ep->cfg_index = new_cfg_index;

/global/home/users/sbalsam/ucx/contrib/../src/ucp/wireup/wireup.c: [ ucp_wireup_init_lanes() ]
      ...
     1385                                 NULL, cm_idx, UCS_LOG_LEVEL_ERROR);
     1386         ucp_wireup_print_config(worker, &key, "new", NULL,
     1387                                 cm_idx, UCS_LOG_LEVEL_ERROR);
==>  1388         ucs_fatal("endpoint reconfiguration not supported yet");
     1389     }
     1390 
     1391     ep->cfg_index = new_cfg_index;

/global/home/users/sbalsam/ucx/contrib/../src/ucp/wireup/wireup.c: [ ucp_wireup_init_lanes() ]
      ...
     1385                                 NULL, cm_idx, UCS_LOG_LEVEL_ERROR);
     1386         ucp_wireup_print_config(worker, &key, "new", NULL,
     1387                                 cm_idx, UCS_LOG_LEVEL_ERROR);
==>  1388         ucs_fatal("endpoint reconfiguration not supported yet");
     1389     }
     1390 
     1391     ep->cfg_index = new_cfg_index;

==== backtrace (tid:1959833) ====
 0 0x000000000009f169 ucp_wireup_init_lanes()  /global/home/users/sbalsam/ucx/contrib/../src/ucp/wireup/wireup.c:1388
 1 0x000000000009d09a ucp_wireup_process_request()  /global/home/users/sbalsam/ucx/contrib/../src/ucp/wireup/wireup.c:583
 2 0x000000000009d9a1 ucp_wireup_msg_handler()  /global/home/users/sbalsam/ucx/contrib/../src/ucp/wireup/wireup.c:836
 3 0x000000000005d7a1 uct_iface_invoke_am()  /global/home/users/sbalsam/ucx/contrib/../src/uct/base/uct_iface.h:878
 4 0x000000000005d7a1 uct_ud_ep_process_rx()  /global/home/users/sbalsam/ucx/contrib/../src/uct/ib/ud/base/ud_ep.c:1049
 5 0x0000000000066ac2 uct_ud_mlx5_iface_poll_rx()  /global/home/users/sbalsam/ucx/contrib/../src/uct/ib/ud/accel/ud_mlx5.c:510
 6 0x0000000000066ac2 uct_ud_mlx5_iface_async_progress()  /global/home/users/sbalsam/ucx/contrib/../src/uct/ib/ud/accel/ud_mlx5.c:586
 7 0x0000000000057fdb uct_ud_iface_async_progress()  /global/home/users/sbalsam/ucx/contrib/../src/uct/ib/ud/base/ud_iface.c:256
 8 0x0000000000057fdb uct_ud_iface_async_handler()  /global/home/users/sbalsam/ucx/contrib/../src/uct/ib/ud/base/ud_iface.c:267
 9 0x000000000004bdac ucs_async_handler_invoke()  /global/home/users/sbalsam/ucx/contrib/../src/ucs/async/async.c:252
10 0x000000000004bdac ucs_async_handler_dispatch()  /global/home/users/sbalsam/ucx/contrib/../src/ucs/async/async.c:274
11 0x000000000004bf75 ucs_async_dispatch_handlers()  /global/home/users/sbalsam/ucx/contrib/../src/ucs/async/async.c:306
12 0x000000000004eb26 ucs_async_thread_ev_handler()  /global/home/users/sbalsam/ucx/contrib/../src/ucs/async/thread.c:88
13 0x000000000006ba53 ucs_event_set_wait()  /global/home/users/sbalsam/ucx/contrib/../src/ucs/sys/event_set.c:215
14 0x000000000004ec6c ucs_async_thread_func()  /global/home/users/sbalsam/ucx/contrib/../src/ucs/async/thread.c:131
15 0x00000000000081cf start_thread()  ???:0
16 0x0000000000039d83 __GI___clone()  :0
=================================
[helios017:1959773] *** Process received signal ***
[helios017:1959773] Signal: Aborted (6)
[helios017:1959773] Signal code:  (-6)
[helios017:1959773] [ 0] /lib64/libpthread.so.0(+0x12ce0)[0x154b4d4c3ce0]
[helios017:1959773] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x154b4c305a4f]
[helios017:1959773] [ 2] /lib64/libc.so.6(abort+0x127)[0x154b4c2d8db5]
[helios017:1959773] [ 3] /global/home/users/sbalsam/ucx/build/install/lib/libucs.so.0(+0x5e345)[0x154b4dd57345]
[helios017:1959773] [ 4] /global/home/users/sbalsam/ucx/build/install/lib/libucs.so.0(+0x5e419)[0x154b4dd57419]
[helios017:1959773] [ 5] /global/home/users/sbalsam/ucx/build/install/lib/libucp.so.0(ucp_wireup_init_lanes+0x359)[0x154b3fbcb169]
[helios017:1959773] [ 6] /global/home/users/sbalsam/ucx/build/install/lib/libucp.so.0(+0x9d09a)[0x154b3fbc909a]
[helios017:1959773] [ 7] /global/home/users/sbalsam/ucx/build/install/lib/libucp.so.0(+0x9d9a1)[0x154b3fbc99a1]
[helios017:1959773] [ 8] /global/home/users/sbalsam/ucx/build/install/lib/ucx/libuct_ib.so.0(uct_ud_ep_process_rx+0x1f1)[0x154b3ee887a1]
[helios017:1959773] [ 9] /global/home/users/sbalsam/ucx/build/install/lib/ucx/libuct_ib.so.0(+0x66ac2)[0x154b3ee91ac2]
[helios017:1959773] [10] /global/home/users/sbalsam/ucx/build/install/lib/ucx/libuct_ib.so.0(+0x57fdb)[0x154b3ee82fdb]
[helios017:1959773] [11] /global/home/users/sbalsam/ucx/build/install/lib/libucs.so.0(+0x4bdac)[0x154b4dd44dac]
[helios017:1959773] [12] /global/home/users/sbalsam/ucx/build/install/lib/libucs.so.0(ucs_async_dispatch_handlers+0xe5)[0x154b4dd44f75]
[helios017:1959773] [13] /global/home/users/sbalsam/ucx/build/install/lib/libucs.so.0(+0x4eb26)[0x154b4dd47b26]
[helios017:1959773] [14] /global/home/users/sbalsam/ucx/build/install/lib/libucs.so.0(ucs_event_set_wait+0xa3)[0x154b4dd64a53]
[helios017:1959773] [15] /global/home/users/sbalsam/ucx/build/install/lib/libucs.so.0(+0x4ec6c)[0x154b4dd47c6c]
[helios017:1959773] [16] /lib64/libpthread.so.0(+0x81cf)[0x154b4d4b91cf]
[helios017:1959773] [17] /lib64/libc.so.6(clone+0x43)[0x154b4c2f0d83]
[helios017:1959773] *** End of error message ***
==== backtrace (tid:1959834) ====
 0 0x000000000009f169 ucp_wireup_init_lanes()  /global/home/users/sbalsam/ucx/contrib/../src/ucp/wireup/wireup.c:1388
 1 0x000000000009d09a ucp_wireup_process_request()  /global/home/users/sbalsam/ucx/contrib/../src/ucp/wireup/wireup.c:583
 2 0x000000000009d9a1 ucp_wireup_msg_handler()  /global/home/users/sbalsam/ucx/contrib/../src/ucp/wireup/wireup.c:836
 3 0x000000000005d7a1 uct_iface_invoke_am()  /global/home/users/sbalsam/ucx/contrib/../src/uct/base/uct_iface.h:878
 4 0x000000000005d7a1 uct_ud_ep_process_rx()  /global/home/users/sbalsam/ucx/contrib/../src/uct/ib/ud/base/ud_ep.c:1049
 5 0x0000000000066ac2 uct_ud_mlx5_iface_poll_rx()  /global/home/users/sbalsam/ucx/contrib/../src/uct/ib/ud/accel/ud_mlx5.c:510
 6 0x0000000000066ac2 uct_ud_mlx5_iface_async_progress()  /global/home/users/sbalsam/ucx/contrib/../src/uct/ib/ud/accel/ud_mlx5.c:586
 7 0x000000000005804b uct_ud_iface_async_progress()  /global/home/users/sbalsam/ucx/contrib/../src/uct/ib/ud/base/ud_iface.c:256
 8 0x000000000005804b uct_ud_iface_timer()  /global/home/users/sbalsam/ucx/contrib/../src/uct/ib/ud/base/ud_iface.c:286
 9 0x000000000004bdac ucs_async_handler_invoke()  /global/home/users/sbalsam/ucx/contrib/../src/ucs/async/async.c:252
10 0x000000000004bdac ucs_async_handler_dispatch()  /global/home/users/sbalsam/ucx/contrib/../src/ucs/async/async.c:274
11 0x000000000004bf75 ucs_async_dispatch_handlers()  /global/home/users/sbalsam/ucx/contrib/../src/ucs/async/async.c:306
12 0x000000000004c130 ucs_async_dispatch_timerq()  /global/home/users/sbalsam/ucx/contrib/../src/ucs/async/async.c:333
13 0x000000000004ecb5 ucs_async_thread_func()  /global/home/users/sbalsam/ucx/contrib/../src/ucs/async/thread.c:142
14 0x00000000000081cf start_thread()  ???:0
15 0x0000000000039d83 __GI___clone()  :0
=================================
[helios017:1959769] *** Process received signal ***
[helios017:1959769] Signal: Aborted (6)
[helios017:1959769] Signal code:  (-6)
[helios017:1959769] [ 0] /lib64/libpthread.so.0(+0x12ce0)[0x1477964b8ce0]
[helios017:1959769] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x1477952faa4f]
[helios017:1959769] [ 2] /lib64/libc.so.6(abort+0x127)[0x1477952cddb5]
[helios017:1959769] [ 3] /global/home/users/sbalsam/ucx/build/install/lib/libucs.so.0(+0x5e345)[0x147796d4c345]
[helios017:1959769] [ 4] /global/home/users/sbalsam/ucx/build/install/lib/libucs.so.0(+0x5e419)[0x147796d4c419]
[helios017:1959769] [ 5] /global/home/users/sbalsam/ucx/build/install/lib/libucp.so.0(ucp_wireup_init_lanes+0x359)[0x14778cd57169]
[helios017:1959769] [ 6] /global/home/users/sbalsam/ucx/build/install/lib/libucp.so.0(+0x9d09a)[0x14778cd5509a]
[helios017:1959769] [ 7] /global/home/users/sbalsam/ucx/build/install/lib/libucp.so.0(+0x9d9a1)[0x14778cd559a1]
[helios017:1959769] [ 8] /global/home/users/sbalsam/ucx/build/install/lib/ucx/libuct_ib.so.0(uct_ud_ep_process_rx+0x1f1)[0x147787dd47a1]
[helios017:1959769] [ 9] /global/home/users/sbalsam/ucx/build/install/lib/ucx/libuct_ib.so.0(+0x66ac2)[0x147787dddac2]
[helios017:1959769] [10] /global/home/users/sbalsam/ucx/build/install/lib/ucx/libuct_ib.so.0(+0x5804b)[0x147787dcf04b]
[helios017:1959769] [11] /global/home/users/sbalsam/ucx/build/install/lib/libucs.so.0(+0x4bdac)[0x147796d39dac]
[helios017:1959769] [12] /global/home/users/sbalsam/ucx/build/install/lib/libucs.so.0(ucs_async_dispatch_handlers+0xe5)[0x147796d39f75]
[helios017:1959769] [13] /global/home/users/sbalsam/ucx/build/install/lib/libucs.so.0(ucs_async_dispatch_timerq+0xd0)[0x147796d3a130]
[helios017:1959769] [14] /global/home/users/sbalsam/ucx/build/install/lib/libucs.so.0(+0x4ecb5)[0x147796d3ccb5]
[helios017:1959769] [15] /lib64/libpthread.so.0(+0x81cf)[0x1477964ae1cf]
[helios017:1959769] [16] /lib64/libc.so.6(clone+0x43)[0x1477952e5d83]
[helios017:1959769] *** End of error message ***
==== backtrace (tid:1959767) ====
 0 0x000000000009f169 ucp_wireup_init_lanes()  /global/home/users/sbalsam/ucx/contrib/../src/ucp/wireup/wireup.c:1388
 1 0x000000000009d09a ucp_wireup_process_request()  /global/home/users/sbalsam/ucx/contrib/../src/ucp/wireup/wireup.c:583
 2 0x000000000009d9a1 ucp_wireup_msg_handler()  /global/home/users/sbalsam/ucx/contrib/../src/ucp/wireup/wireup.c:836
 3 0x000000000005d7a1 uct_iface_invoke_am()  /global/home/users/sbalsam/ucx/contrib/../src/uct/base/uct_iface.h:878
 4 0x000000000005d7a1 uct_ud_ep_process_rx()  /global/home/users/sbalsam/ucx/contrib/../src/uct/ib/ud/base/ud_ep.c:1049
 5 0x0000000000066517 uct_ud_mlx5_iface_poll_rx()  /global/home/users/sbalsam/ucx/contrib/../src/uct/ib/ud/accel/ud_mlx5.c:510
 6 0x0000000000066517 uct_ud_mlx5_iface_progress()  /global/home/users/sbalsam/ucx/contrib/../src/uct/ib/ud/accel/ud_mlx5.c:567
 7 0x0000000000046422 ucs_callbackq_dispatch()  /global/home/users/sbalsam/ucx/contrib/../src/ucs/datastruct/callbackq.h:211
 8 0x0000000000046422 uct_worker_progress()  /global/home/users/sbalsam/ucx/contrib/../src/uct/api/uct.h:2647
 9 0x0000000000046422 ucp_worker_progress()  /global/home/users/sbalsam/ucx/contrib/../src/ucp/core/ucp_worker.c:2804
10 0x0000000000016008 ucc_tl_ucp_test()  /.autodirect/mtrsysgwork/sbalsam/ucc/build/src/components/tl/ucp/../../../../../src/components/tl/ucp/allreduce/../tl_ucp_coll.h:266
11 0x0000000000016008 ucc_tl_ucp_allreduce_knomial_progress()  /.autodirect/mtrsysgwork/sbalsam/ucc/build/src/components/tl/ucp/../../../../../src/components/tl/ucp/allreduce/allreduce_knomial.c:129
12 0x000000000000f60b ucc_pq_st_progress()  /.autodirect/mtrsysgwork/sbalsam/ucc/build/src/../../src/core/ucc_progress_queue_st.c:31
13 0x000000000000a7ae ucc_progress_queue()  /.autodirect/mtrsysgwork/sbalsam/ucc/build/src/../../src/core/ucc_progress_queue.h:46
14 0x00000000004077ea UccTestMpi::create_ucc_team()  /.autodirect/mtrsysgwork/sbalsam/ucc/build/test/mpi/../../../test/mpi/test_mpi.cc:214
15 0x000000000040964a UccTestMpi::create_teams()  /.autodirect/mtrsysgwork/sbalsam/ucc/build/test/mpi/../../../test/mpi/test_mpi.cc:234
16 0x000000000040964a UccTestMpi::create_teams()  /.autodirect/mtrsysgwork/sbalsam/ucc/build/test/mpi/../../../test/mpi/test_mpi.cc:156
17 0x0000000000405575 main()  /.autodirect/mtrsysgwork/sbalsam/ucc/build/test/mpi/../../../test/mpi/main.cc:522
18 0x000000000003aca3 __libc_start_main()  ???:0
19 0x000000000040611f _start()  ???:0
=================================
[helios017:1959767] *** Process received signal ***
[helios017:1959767] Signal: Aborted (6)
[helios017:1959767] Signal code:  (-6)
[helios017:1959767] [ 0] /lib64/libpthread.so.0(+0x12ce0)[0x14c9ac6b8ce0]
[helios017:1959767] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x14c9ab4faa4f]
[helios017:1959767] [ 2] /lib64/libc.so.6(abort+0x127)[0x14c9ab4cddb5]
[helios017:1959767] [ 3] /global/home/users/sbalsam/ucx/build/install/lib/libucs.so.0(+0x5e345)[0x14c9acf4c345]
[helios017:1959767] [ 4] /global/home/users/sbalsam/ucx/build/install/lib/libucs.so.0(+0x5e419)[0x14c9acf4c419]
[helios017:1959767] [ 5] /global/home/users/sbalsam/ucx/build/install/lib/libucp.so.0(ucp_wireup_init_lanes+0x359)[0x14c99ed9f169]
[helios017:1959767] [ 6] /global/home/users/sbalsam/ucx/build/install/lib/libucp.so.0(+0x9d09a)[0x14c99ed9d09a]
[helios017:1959767] [ 7] /global/home/users/sbalsam/ucx/build/install/lib/libucp.so.0(+0x9d9a1)[0x14c99ed9d9a1]
[helios017:1959767] [ 8] /global/home/users/sbalsam/ucx/build/install/lib/ucx/libuct_ib.so.0(uct_ud_ep_process_rx+0x1f1)[0x14c99e05c7a1]
[helios017:1959767] [ 9] /global/home/users/sbalsam/ucx/build/install/lib/ucx/libuct_ib.so.0(+0x66517)[0x14c99e065517]
[helios017:1959767] [10] /global/home/users/sbalsam/ucx/build/install/lib/libucp.so.0(ucp_worker_progress+0x22)[0x14c99ed46422]
[helios017:1959767] [11] /global/home/users/sbalsam/ucc_shm/lib/ucc/libucc_tl_ucp.so(ucc_tl_ucp_allreduce_knomial_progress+0x488)[0x14c95a3dd008]
[helios017:1959767] [12] /global/home/users/sbalsam/ucc_shm/lib/libucc.so.1(+0xf60b)[0x14c9ad2a060b]
[helios017:1959767] [13] /global/home/users/sbalsam/ucc_shm/lib/libucc.so.1(ucc_context_progress+0x3e)[0x14c9ad29b7ae]
[helios017:1959767] [14] ucc_test_mpi[0x4077ea]
[helios017:1959767] [15] ucc_test_mpi[0x40964a]
[helios017:1959767] [16] ucc_test_mpi[0x405575]
[helios017:1959767] [17] /lib64/libc.so.6(__libc_start_main+0xf3)[0x14c9ab4e6ca3]
[helios017:1959767] [18] ucc_test_mpi[0x40611f]
[helios017:1959767] *** End of error message ***
==== backtrace (tid:1959759) ====
 0 0x000000000009f169 ucp_wireup_init_lanes()  /global/home/users/sbalsam/ucx/contrib/../src/ucp/wireup/wireup.c:1388
 1 0x000000000009d09a ucp_wireup_process_request()  /global/home/users/sbalsam/ucx/contrib/../src/ucp/wireup/wireup.c:583
 2 0x000000000009d9a1 ucp_wireup_msg_handler()  /global/home/users/sbalsam/ucx/contrib/../src/ucp/wireup/wireup.c:836
 3 0x000000000005d7a1 uct_iface_invoke_am()  /global/home/users/sbalsam/ucx/contrib/../src/uct/base/uct_iface.h:878
 4 0x000000000005d7a1 uct_ud_ep_process_rx()  /global/home/users/sbalsam/ucx/contrib/../src/uct/ib/ud/base/ud_ep.c:1049
 5 0x0000000000066517 uct_ud_mlx5_iface_poll_rx()  /global/home/users/sbalsam/ucx/contrib/../src/uct/ib/ud/accel/ud_mlx5.c:510
 6 0x0000000000066517 uct_ud_mlx5_iface_progress()  /global/home/users/sbalsam/ucx/contrib/../src/uct/ib/ud/accel/ud_mlx5.c:567
 7 0x0000000000056553 ucs_callbackq_slow_proxy()  /global/home/users/sbalsam/ucx/contrib/../src/ucs/datastruct/callbackq.c:404
 8 0x0000000000046422 ucs_callbackq_dispatch()  /global/home/users/sbalsam/ucx/contrib/../src/ucs/datastruct/callbackq.h:211
 9 0x0000000000046422 uct_worker_progress()  /global/home/users/sbalsam/ucx/contrib/../src/uct/api/uct.h:2647
10 0x0000000000046422 ucp_worker_progress()  /global/home/users/sbalsam/ucx/contrib/../src/ucp/core/ucp_worker.c:2804
11 0x0000000000016008 ucc_tl_ucp_test()  /.autodirect/mtrsysgwork/sbalsam/ucc/build/src/components/tl/ucp/../../../../../src/components/tl/ucp/allreduce/../tl_ucp_coll.h:266
12 0x0000000000016008 ucc_tl_ucp_allreduce_knomial_progress()  /.autodirect/mtrsysgwork/sbalsam/ucc/build/src/components/tl/ucp/../../../../../src/components/tl/ucp/allreduce/allreduce_knomial.c:129
13 0x000000000000f60b ucc_pq_st_progress()  /.autodirect/mtrsysgwork/sbalsam/ucc/build/src/../../src/core/ucc_progress_queue_st.c:31
14 0x000000000000a7ae ucc_progress_queue()  /.autodirect/mtrsysgwork/sbalsam/ucc/build/src/../../src/core/ucc_progress_queue.h:46
15 0x000000000000ce14 ucc_team_alloc_id()  /.autodirect/mtrsysgwork/sbalsam/ucc/build/src/../../src/core/ucc_team.c:581
16 0x000000000000ce14 ucc_team_create_test_single()  /.autodirect/mtrsysgwork/sbalsam/ucc/build/src/../../src/core/ucc_team.c:418
17 0x0000000000407803 UccTestMpi::create_ucc_team()  /.autodirect/mtrsysgwork/sbalsam/ucc/build/test/mpi/../../../test/mpi/test_mpi.cc:213
18 0x000000000040964a UccTestMpi::create_teams()  /.autodirect/mtrsysgwork/sbalsam/ucc/build/test/mpi/../../../test/mpi/test_mpi.cc:234
19 0x000000000040964a UccTestMpi::create_teams()  /.autodirect/mtrsysgwork/sbalsam/ucc/build/test/mpi/../../../test/mpi/test_mpi.cc:156
20 0x0000000000405575 main()  /.autodirect/mtrsysgwork/sbalsam/ucc/build/test/mpi/../../../test/mpi/main.cc:522
21 0x000000000003aca3 __libc_start_main()  ???:0
22 0x000000000040611f _start()  ???:0
=================================
==== backtrace (tid:1959753) ====
 0 0x000000000009f169 ucp_wireup_init_lanes()  /global/home/users/sbalsam/ucx/contrib/../src/ucp/wireup/wireup.c:1388
 1 0x000000000009d09a ucp_wireup_process_request()  /global/home/users/sbalsam/ucx/contrib/../src/ucp/wireup/wireup.c:583
 2 0x000000000009d9a1 ucp_wireup_msg_handler()  /global/home/users/sbalsam/ucx/contrib/../src/ucp/wireup/wireup.c:836
 3 0x000000000005d7a1 uct_iface_invoke_am()  /global/home/users/sbalsam/ucx/contrib/../src/uct/base/uct_iface.h:878
 4 0x000000000005d7a1 uct_ud_ep_process_rx()  /global/home/users/sbalsam/ucx/contrib/../src/uct/ib/ud/base/ud_ep.c:1049
 5 0x0000000000066517 uct_ud_mlx5_iface_poll_rx()  /global/home/users/sbalsam/ucx/contrib/../src/uct/ib/ud/accel/ud_mlx5.c:510
 6 0x0000000000066517 uct_ud_mlx5_iface_progress()  /global/home/users/sbalsam/ucx/contrib/../src/uct/ib/ud/accel/ud_mlx5.c:567
 7 0x0000000000056553 ucs_callbackq_slow_proxy()  /global/home/users/sbalsam/ucx/contrib/../src/ucs/datastruct/callbackq.c:404
 8 0x0000000000046422 ucs_callbackq_dispatch()  /global/home/users/sbalsam/ucx/contrib/../src/ucs/datastruct/callbackq.h:211
 9 0x0000000000046422 uct_worker_progress()  /global/home/users/sbalsam/ucx/contrib/../src/uct/api/uct.h:2647
10 0x0000000000046422 ucp_worker_progress()  /global/home/users/sbalsam/ucx/contrib/../src/ucp/core/ucp_worker.c:2804
11 0x0000000000016008 ucc_tl_ucp_test()  /.autodirect/mtrsysgwork/sbalsam/ucc/build/src/components/tl/ucp/../../../../../src/components/tl/ucp/allreduce/../tl_ucp_coll.h:266
12 0x0000000000016008 ucc_tl_ucp_allreduce_knomial_progress()  /.autodirect/mtrsysgwork/sbalsam/ucc/build/src/components/tl/ucp/../../../../../src/components/tl/ucp/allreduce/allreduce_knomial.c:129
13 0x000000000000f60b ucc_pq_st_progress()  /.autodirect/mtrsysgwork/sbalsam/ucc/build/src/../../src/core/ucc_progress_queue_st.c:31
14 0x000000000000a7ae ucc_progress_queue()  /.autodirect/mtrsysgwork/sbalsam/ucc/build/src/../../src/core/ucc_progress_queue.h:46
15 0x000000000000ce14 ucc_team_alloc_id()  /.autodirect/mtrsysgwork/sbalsam/ucc/build/src/../../src/core/ucc_team.c:581
16 0x000000000000ce14 ucc_team_create_test_single()  /.autodirect/mtrsysgwork/sbalsam/ucc/build/src/../../src/core/ucc_team.c:418
17 0x0000000000407803 UccTestMpi::create_ucc_team()  /.autodirect/mtrsysgwork/sbalsam/ucc/build/test/mpi/../../../test/mpi/test_mpi.cc:213
18 0x000000000040964a UccTestMpi::create_teams()  /.autodirect/mtrsysgwork/sbalsam/ucc/build/test/mpi/../../../test/mpi/test_mpi.cc:234
19 0x000000000040964a UccTestMpi::create_teams()  /.autodirect/mtrsysgwork/sbalsam/ucc/build/test/mpi/../../../test/mpi/test_mpi.cc:156
20 0x0000000000405575 main()  /.autodirect/mtrsysgwork/sbalsam/ucc/build/test/mpi/../../../test/mpi/main.cc:522
21 0x000000000003aca3 __libc_start_main()  ???:0
22 0x000000000040611f _start()  ???:0
=================================
[helios017:1959759] *** Process received signal ***
[helios017:1959759] Signal: Aborted (6)
[helios017:1959759] Signal code:  (-6)
[helios017:1959759] [ 0] /lib64/libpthread.so.0(+0x12ce0)[0x15099bde5ce0]
[helios017:1959759] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x15099ac27a4f]
[helios017:1959759] [ 2] /lib64/libc.so.6(abort+0x127)[0x15099abfadb5]
[helios017:1959759] [ 3] /global/home/users/sbalsam/ucx/build/install/lib/libucs.so.0(+0x5e345)[0x15099c679345]
[helios017:1959759] [ 4] /global/home/users/sbalsam/ucx/build/install/lib/libucs.so.0(+0x5e419)[0x15099c679419]
[helios017:1959759] [ 5] /global/home/users/sbalsam/ucx/build/install/lib/libucp.so.0(ucp_wireup_init_lanes+0x359)[0x15098e574169]
[helios017:1959759] [ 6] /global/home/users/sbalsam/ucx/build/install/lib/libucp.so.0(+0x9d09a)[0x15098e57209a]
[helios017:1959759] [ 7] /global/home/users/sbalsam/ucx/build/install/lib/libucp.so.0(+0x9d9a1)[0x15098e5729a1]
[helios017:1959759] [ 8] /global/home/users/sbalsam/ucx/build/install/lib/ucx/libuct_ib.so.0(uct_ud_ep_process_rx+0x1f1)[0x15098d8317a1]
[helios017:1959759] [ 9] /global/home/users/sbalsam/ucx/build/install/lib/ucx/libuct_ib.so.0(+0x66517)[0x15098d83a517]
[helios017:1959759] [10] /global/home/users/sbalsam/ucx/build/install/lib/libucs.so.0(+0x56553)[0x15099c671553]
[helios017:1959759] [11] /global/home/users/sbalsam/ucx/build/install/lib/libucp.so.0(ucp_worker_progress+0x22)[0x15098e51b422]
[helios017:1959759] [12] /global/home/users/sbalsam/ucc_shm/lib/ucc/libucc_tl_ucp.so(ucc_tl_ucp_allreduce_knomial_progress+0x488)[0x1509515f8008]
[helios017:1959759] [13] /global/home/users/sbalsam/ucc_shm/lib/libucc.so.1(+0xf60b)[0x15099c9cd60b]
[helios017:1959759] [14] /global/home/users/sbalsam/ucc_shm/lib/libucc.so.1(ucc_context_progress+0x3e)[0x15099c9c87ae]
[helios017:1959753] *** Process received signal ***
[helios017:1959753] Signal: Aborted (6)
[helios017:1959753] Signal code:  (-6)
[helios017:1959759] [15] /global/home/users/sbalsam/ucc_shm/lib/libucc.so.1(ucc_team_create_test_single+0x454)[0x15099c9cae14]
[helios017:1959759] [16] ucc_test_mpi[0x407803]
[helios017:1959759] [17] ucc_test_mpi[0x40964a]
[helios017:1959759] [18] ucc_test_mpi[0x405575]
[helios017:1959759] [19] /lib64/libc.so.6(__libc_start_main+0xf3)[0x15099ac13ca3]
[helios017:1959759] [20] ucc_test_mpi[0x40611f]
[helios017:1959759] *** End of error message ***
[helios017:1959753] [ 0] /lib64/libpthread.so.0(+0x12ce0)[0x14a6e2dc5ce0]
[helios017:1959753] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x14a6e1c07a4f]
[helios017:1959753] [ 2] /lib64/libc.so.6(abort+0x127)[0x14a6e1bdadb5]
[helios017:1959753] [ 3] /global/home/users/sbalsam/ucx/build/install/lib/libucs.so.0(+0x5e345)[0x14a6e3659345]
[helios017:1959753] [ 4] /global/home/users/sbalsam/ucx/build/install/lib/libucs.so.0(+0x5e419)[0x14a6e3659419]
[helios017:1959753] [ 5] /global/home/users/sbalsam/ucx/build/install/lib/libucp.so.0(ucp_wireup_init_lanes+0x359)[0x14a6d9664169]
[helios017:1959753] [ 6] /global/home/users/sbalsam/ucx/build/install/lib/libucp.so.0(+0x9d09a)[0x14a6d966209a]
[helios017:1959753] [ 7] /global/home/users/sbalsam/ucx/build/install/lib/libucp.so.0(+0x9d9a1)[0x14a6d96629a1]
[helios017:1959753] [ 8] /global/home/users/sbalsam/ucx/build/install/lib/ucx/libuct_ib.so.0(uct_ud_ep_process_rx+0x1f1)[0x14a6d89217a1]
[helios017:1959753] [ 9] /global/home/users/sbalsam/ucx/build/install/lib/ucx/libuct_ib.so.0(+0x66517)[0x14a6d892a517]
[helios017:1959753] [10] /global/home/users/sbalsam/ucx/build/install/lib/libucs.so.0(+0x56553)[0x14a6e3651553]
[helios017:1959753] [11] /global/home/users/sbalsam/ucx/build/install/lib/libucp.so.0(ucp_worker_progress+0x22)[0x14a6d960b422]
[helios017:1959753] [12] /global/home/users/sbalsam/ucc_shm/lib/ucc/libucc_tl_ucp.so(ucc_tl_ucp_allreduce_knomial_progress+0x488)[0x14a694082008]
[helios017:1959753] [13] /global/home/users/sbalsam/ucc_shm/lib/libucc.so.1(+0xf60b)[0x14a6e39ad60b]
[helios017:1959753] [14] /global/home/users/sbalsam/ucc_shm/lib/libucc.so.1(ucc_context_progress+0x3e)[0x14a6e39a87ae]
[helios017:1959753] [15] /global/home/users/sbalsam/ucc_shm/lib/libucc.so.1(ucc_team_create_test_single+0x454)[0x14a6e39aae14]
[helios017:1959753] [16] ucc_test_mpi[0x407803]
[helios017:1959753] [17] ucc_test_mpi[0x40964a]
[helios017:1959753] [18] ucc_test_mpi[0x405575]
[helios017:1959753] [19] /lib64/libc.so.6(__libc_start_main+0xf3)[0x14a6e1bf3ca3]
[helios017:1959753] [20] ucc_test_mpi[0x40611f]
[helios017:1959753] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 27 with PID 1959773 on node helios017 exited on signal 6 (Aborted).

shimmybalsam commented 2 years ago

@yosefe the -x UCX_IB_PREFER_NEAREST_DEVICE=n flag is also needed on helios for perf tests when running multi-node.