Open tonycurtis opened 2 years ago
@tonycurtis can you pls prove more details on the hang including command line output and backtrace?
Here's the UCX_LOG_LEVEL=info output:
[1653170535.008386] [er02:593424:0] ucp_context.c:1855 UCX INFO Version 1.14.0 (loaded from /home/arcurtis/opt/x86_64/ucx/git/lib/libucp.so.0)
[1653170535.009223] [er04:601925:0] ucp_context.c:1855 UCX INFO Version 1.14.0 (loaded from /home/arcurtis/opt/x86_64/ucx/git/lib/libucp.so.0)
[1653170535.020528] [er01:595504:0] ucp_context.c:1855 UCX INFO Version 1.14.0 (loaded from /home/arcurtis/opt/x86_64/ucx/git/lib/libucp.so.0)
[1653170535.170537] [er04:601925:0] parser.c:1916 UCX INFO UCX_* env variable: UCX_LOG_LEVEL=info
[1653170535.185026] [er02:593424:0] parser.c:1916 UCX INFO UCX_* env variable: UCX_LOG_LEVEL=info
[1653170535.195224] [er01:595504:0] parser.c:1916 UCX INFO UCX_* env variable: UCX_LOG_LEVEL=info
[1653170536.392986] [er03:589768:0] ucp_context.c:1855 UCX INFO Version 1.14.0 (loaded from /home/arcurtis/opt/x86_64/ucx/git/lib/libucp.so.0)
[1653170536.546282] [er03:589768:0] parser.c:1916 UCX INFO UCX_* env variable: UCX_LOG_LEVEL=info
[1653170536.552579] [er02:593424:0] ucp_worker.c:1782 UCX INFO ep_cfg[0]: rma(rc_mlx5/mlx5_0:1) amo(rc_mlx5/mlx5_0:1)
[1653170536.553352] [er04:601925:0] ucp_worker.c:1782 UCX INFO ep_cfg[0]: rma(rc_mlx5/mlx5_0:1) amo(rc_mlx5/mlx5_0:1)
[1653170536.553669] [er03:589768:0] ucp_worker.c:1782 UCX INFO ep_cfg[0]: rma(rc_mlx5/mlx5_0:1) amo(rc_mlx5/mlx5_0:1)
[1653170536.554232] [er01:595504:0] ucp_worker.c:1782 UCX INFO ep_cfg[0]: rma(xpmem/memory) amo(rc_mlx5/mlx5_0:1)
[1653170536.563677] [er03:589768:0] ucp_worker.c:1782 UCX INFO ep_cfg[1]: rma(xpmem/memory) amo(rc_mlx5/mlx5_0:1)
[1653170536.568505] [er02:593424:0] ucp_worker.c:1782 UCX INFO ep_cfg[1]: rma(xpmem/memory) amo(rc_mlx5/mlx5_0:1)
[1653170536.569692] [er01:595504:0] ucp_worker.c:1782 UCX INFO ep_cfg[1]: rma(rc_mlx5/mlx5_0:1) amo(rc_mlx5/mlx5_0:1)
[1653170536.571107] [er04:601925:0] ucp_worker.c:1782 UCX INFO ep_cfg[1]: rma(xpmem/memory) amo(rc_mlx5/mlx5_0:1)
<and then nothing>
No problem with 1 PE/rank, but as soon as 2 with rc/dc in the mix, hangs
Here's the UCX_LOG_LEVEL=debug output for 2 PEs:
$ UCX_TLS=rc oshrun -n 2 ./a.out
[1653171612.558177] [er-head:875750:0] debug.c:1146 UCX DEBUG using signal stack 0x7ffff7fbb000 size 141824
[1653171612.558821] [er-head:875751:0] debug.c:1146 UCX DEBUG using signal stack 0x7ffff7fbb000 size 141824
[1653171612.571083] [er-head:875750:0] cpu.c:338 UCX DEBUG measured tsc frequency 3092.653 MHz after 1.00 ms
[1653171612.571106] [er-head:875750:0] init.c:118 UCX DEBUG /home/arcurtis/opt/x86_64/ucx/git/lib/libucs.so.0 loaded at 0x7ffff5af7000
[1653171612.571124] [er-head:875750:0] init.c:120 UCX DEBUG cmd line: ./a.out
[1653171612.571138] [er-head:875750:0] module.c:72 UCX DEBUG ucs library path: /home/arcurtis/opt/x86_64/ucx/git/lib/libucs.so.0
[1653171612.571146] [er-head:875750:0] module.c:280 UCX DEBUG loading modules for ucs
[1653171612.572388] [er-head:875751:0] cpu.c:338 UCX DEBUG measured tsc frequency 3093.330 MHz after 0.33 ms
[1653171612.572408] [er-head:875751:0] init.c:118 UCX DEBUG /home/arcurtis/opt/x86_64/ucx/git/lib/libucs.so.0 loaded at 0x7ffff5af7000
[1653171612.572422] [er-head:875751:0] init.c:120 UCX DEBUG cmd line: ./a.out
[1653171612.572433] [er-head:875751:0] module.c:72 UCX DEBUG ucs library path: /home/arcurtis/opt/x86_64/ucx/git/lib/libucs.so.0
[1653171612.572439] [er-head:875751:0] module.c:280 UCX DEBUG loading modules for ucs
[1653171612.580809] [er-head:875750:0] time.c:22 UCX DEBUG arch clock frequency: 3092653000.00 Hz
[1653171612.580812] [er-head:875751:0] time.c:22 UCX DEBUG arch clock frequency: 3093330303.03 Hz
[1653171612.580849] [er-head:875750:0] ucp_context.c:1855 UCX INFO Version 1.14.0 (loaded from /home/arcurtis/opt/x86_64/ucx/git/lib/libucp.so.0)
[1653171612.580857] [er-head:875750:0] ucp_context.c:1636 UCX DEBUG estimated number of endpoints is 2
[1653171612.580861] [er-head:875750:0] ucp_context.c:1643 UCX DEBUG estimated number of endpoints per node is 2
[1653171612.580865] [er-head:875750:0] ucp_context.c:1650 UCX DEBUG estimated bcopy bandwidth is 5251268608.000000
[1653171612.580848] [er-head:875751:0] ucp_context.c:1855 UCX INFO Version 1.14.0 (loaded from /home/arcurtis/opt/x86_64/ucx/git/lib/libucp.so.0)
[1653171612.580856] [er-head:875751:0] ucp_context.c:1636 UCX DEBUG estimated number of endpoints is 2
[1653171612.580859] [er-head:875751:0] ucp_context.c:1643 UCX DEBUG estimated number of endpoints per node is 2
[1653171612.580865] [er-head:875751:0] ucp_context.c:1650 UCX DEBUG estimated bcopy bandwidth is 5251268608.000000
[1653171612.580875] [er-head:875750:0] ucp_context.c:1704 UCX DEBUG allocation method[0] is md 'sysv'
[1653171612.580880] [er-head:875750:0] ucp_context.c:1704 UCX DEBUG allocation method[1] is md 'posix'
[1653171612.580886] [er-head:875750:0] ucp_context.c:1716 UCX DEBUG allocation method[2] is 'huge'
[1653171612.580890] [er-head:875750:0] ucp_context.c:1716 UCX DEBUG allocation method[3] is 'thp'
[1653171612.580893] [er-head:875750:0] ucp_context.c:1704 UCX DEBUG allocation method[4] is md '*'
[1653171612.580875] [er-head:875751:0] ucp_context.c:1704 UCX DEBUG allocation method[0] is md 'sysv'
[1653171612.580878] [er-head:875751:0] ucp_context.c:1704 UCX DEBUG allocation method[1] is md 'posix'
[1653171612.580885] [er-head:875751:0] ucp_context.c:1716 UCX DEBUG allocation method[2] is 'huge'
[1653171612.580889] [er-head:875751:0] ucp_context.c:1716 UCX DEBUG allocation method[3] is 'thp'
[1653171612.580891] [er-head:875751:0] ucp_context.c:1704 UCX DEBUG allocation method[4] is md '*'
[1653171612.580895] [er-head:875751:0] ucp_context.c:1716 UCX DEBUG allocat modules for uct_ib
[1653171612.583846] [er-head:875750:0] mm_xpmem.c:90 UCX DEBUG xpmem version: 155653
[1653171612.583861] [er-head:875751:0] mm_xpmem.c:90 UCX DEBUG xpmem version: 155653
[1653171612.583901] [er-head:875750:0] ucp_context.c:1370 UCX DEBUG closing md self because it has no selected transport resources
[1653171612.583903] [er-head:875751:0] ucp_context.c:1370 UCX DEBUG closing md self because it has no selected transport resources
[1653171612.584355] [er-head:875750:0] ucp_context.c:1370 UCX DEBUG closing md tcp because it has no selected transport resources
[1653171612.584355] [er-head:875751:0] ucp_context.c:1370 UCX DEBUG closing md tcp because it has no selected transport resources
[1653171612.584385] [er-head:875751:0] ucp_context.c:1370 UCX DEBUG closing md sysv because it has no selected transport resources
[1653171612.584392] [er-head:875750:0] ucp_context.c:1370 UCX DEBUG closing md sysv because it has no selected transport resources
[1653171612.584433] [er-head:875751:0] ucp_context.c:1370 UCX DEBUG closing md posix because it has no selected transport resources
[1653171612.584441] [er-head:875750:0] ucp_context.c:1370 UCX DEBUG closing md posix because it has no selected transport resources
[1653171612.590555] [er-head:875750:0] ib_device.c:615 UCX DEBUG mlx5_0: PF sysfs path is '/sys/devices/pci0000:40/0000:40:01.1/0000:41:00.0'
[1653171612.590569] [er-head:875750:0] topo.c:175 UCX DEBUG added sys_dev 0 for bus id 41:00.0
[1653171612.590576] [er-head:875750:0] ib_device.c:646 UCX DEBUG mlx5_0: bdf_name 0000:41:00.0 sys_dev 0
[1653171612.590599] [er-head:875750:0] ib_device.c:670 UCX DEBUG mlx5_0: vendor_id 0x15b3 device_id 4119
[1653171612.590958] [er-head:875750:0] ib_mlx5dv_md.c:710 UCX DEBUG mlx5_0: disable ODP because it's not supported for DEVX QP
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
[1653171612.593199] [er-head:875750:0] async.c:231 UCX DEBUG added async handler 0x670020 [id=10 ref 1] uct_ib_async_event_handler() to hash
[1653171612.593294] [er-head:875750:0] async.c:493 UCX DEBUG listening to async event fd 10 events 0x1 mode thread_spinlock
[1653171612.593307] [er-head:875750:0] ib_device.c:834 UCX DEBUG initialized device 'mlx5_0' (InfiniBand channel adapter) with 1 ports
[1653171612.593510] [er-head:875750:0] ib_md.c:1592 UCX DEBUG mlx5_0: cuda GPUDirect RDMA is disabled
[1653171612.593517] [er-head:875750:0] ib_md.c:1592 UCX DEBUG mlx5_0: rocm GPUDirect RDMA is disabled
[1653171612.593535] [er-head:875750:0] mpool.c:98 UCX DEBUG mpool rcache_mp: align 8, maxelems 4294967295, elemsize 144
[1653171612.599511] [er-head:875751:0] ib_device.c:615 UCX DEBUG mlx5_0: PF sysfs path is '/sys/devices/pci0000:40/0000:40:01.1/0000:41:00.0'
[1653171612.599535] [er-head:875751:0] topo.c:175 UCX DEBUG added sys_dev 0 for bus id 41:00.0
[1653171612.599540] [er-head:875751:0] ib_device.c:646 UCX DEBUG mlx5_0: bdf_name 0000:41:00.0 sys_dev 0
[1653171612.599561] [er-head:875751:0] ib_device.c:670 UCX DEBUG mlx5_0: vendor_id 0x15b3 device_id 4119
[1653171612.599871] [er-head:875751:0] ib_mlx5dv_md.c:710 UCX DEBUG mlx5_0: disable ODP because it's not supported for DEVX QP
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
[1653171612.602034] [er-head:875751:0] async.c:231 UCX DEBUG added async handler 0x670020 [id=10 ref 1] uct_ib_async_event_handler() to hash
[1653171612.602137] [er-head:875751:0] async.c:493 UCX DEBUG listening to async event fd 10 events 0x1 mode thread_spinlock
[1653171612.602149] [er-head:875751:0] ib_device.c:834 UCX DEBUG initialized device 'mlx5_0' (InfiniBand channel adapter) with 1 ports
[1653171612.602271] [er-head:875751:0] ib_md.c:1592 UCX DEBUG mlx5_0: cuda GPUDirect RDMA is disabled
[1653171612.602278] [er-head:875751:0] ib_md.c:1592 UCX DEBUG mlx5_0: rocm GPUDirect RDMA is disabled
[1653171612.602293] [er-head:875751:0] mpool.c:98 UCX DEBUG mpool rcache_mp: align 8, maxelems 4294967295, elemsize 144
[1653171612.742054] [er-head:875750:0] async.c:231 UCX DEBUG added async handler 0x671520 [id=14 ref 1] ucs_rcache_invalidate_handler() to hash
[1653171612.742082] [er-head:875750:0] async.c:493 UCX DEBUG listening to async event fd 14 events 0x1 mode thread_spinlock
[1653171612.742224] [er-head:875750:0] module.c:280 UCX DEBUG loading modules for ucm
[1653171612.742239] [er-head:875750:0] ib_md.c:1314 UCX DEBUG mlx5_0: using registration cache
[1653171612.742247] [er-head:875750:0] mpool.c:98 UCX DEBUG mpool devx dbrec: align 64, maxelems 4294967295, elemsize 40
[1653171612.742545] [er-head:875750:0] ib_mlx5dv_md.c:1065 UCX DEBUG mlx5_0: opened DEVX md
[1653171612.742550] [er-head:875750:0] ib_md.c:1541 UCX DEBUG mlx5_0: md open by 'uct_ib_mlx5_devx_md_ops' is successful
[1653171612.748182] [er-head:875750:0] ib_device.c:615 UCX DEBUG mlx5_1: PF sysfs path is '/sys/devices/pci0000:40/0000:40:01.1/0000:41:00.1'
[1653171612.748192] [er-head:875750:0] topo.c:175 UCX DEBUG added sys_dev 1 for bus id 41:00.1
[1653171612.748198] [er-head:875750:0] ib_device.c:646 UCX DEBUG mlx5_1: bdf_name 0000:41:00.1 sys_dev 1
[1653171612.748218] [er-head:875750:0] ib_device.c:670 UCX DEBUG mlx5_1: vendor_id 0x15b3 device_id 4119
[1653171612.748543] [er-head:875750:0] ib_mlx5dv_md.c:710 UCX DEBUG mlx5_1: disable ODP because it's not supported for DEVX QP
[1653171612.748722] [er-head:875750:0] async.c:231 UCX DEBUG added async handler 0x71f7e0 [id=17 ref 1] uct_ib_async_event_handler() to hash
[1653171612.748732] [er-head:875750:0] async.c:493 UCX DEBUG listening to async event fd 17 events 0x1 mode thread_spinlock
[1653171612.748738] [er-head:875750:0] ib_device.c:834 UCX DEBUG initialized device 'mlx5_1' (InfiniBand channel adapter) with 1 ports
[1653171612.748878] [er-head:875750:0] ib_md.c:1592 UCX DEBUG mlx5_1: cuda GPUDirect RDMA is disabled
[1653171612.748884] [er-head:875750:0] ib_md.c:1592 UCX DEBUG mlx5_1: rocm GPUDirect RDMA is disabled
[1653171612.748890] [er-head:875750:0] mpool.c:98 UCX DEBUG mpool rcache_mp: align 8, maxelems 4294967295, elemsize 144
[1653171612.748965] [er-head:875750:0] ib_md.c:1314 UCX DEBUG mlx5_1: using registration cache
[1653171612.748970] [er-head:875750:0] mpool.c:98 UCX DEBUG mpool devx dbrec: align 64, maxelems 4294967295, elemsize 40
[1653171612.749151] [er-head:875750:0] ib_mlx5dv_md.c:1065 UCX DEBUG mlx5_1: opened DEVX md
[1653171612.749155] [er-head:875750:0] ib_md.c:1541 UCX DEBUG mlx5_1: md open by 'uct_ib_mlx5_devx_md_ops' is successful
[1653171612.750308] [er-head:875750:0] ucp_context.c:1370 UCX DEBUG closing md cma because it has no selected transport resources
[1653171612.750345] [er-head:875750:0] mpool.c:98 UCX DEBUG mpool rcache_mp: align 8, maxelems 4294967295, elemsize 144
[1653171612.750422] [er-head:875750:0] ucp_context.c:1370 UCX DEBUG closing md knem because it has no selected transport resources
[1653171612.750441] [er-head:875750:0] pgtable.c:618 UCX DEBUG purge empty page table
[1653171612.750447] [er-head:875750:0] mpool.c:154 UCX DEBUG mpool rcache_mp destroyed
[1653171612.750459] [er-head:875750:0] mm_xpmem.c:90 UCX DEBUG xpmem version: 155653
[1653171612.750491] [er-head:875750:0] ucp_context.c:1370 UCX DEBUG closing md xpmem because it has no selected transport resources
[1653171612.750511] [er-head:875750:0] mpool.c:98 UCX DEBUG mpool rcache_mp: align 8, maxelems 4294967295, elemsize 144
[1653171612.750566] [er-head:875750:0] ucp_context.c:1924 UCX DEBUG created ucp context osss-ucx 0x674a20 [2 mds 8 tls] features 0xe tl bitmap 0xff 0x0
[1653171612.750586] [er-head:875750:0] mpool.c:236 UCX DEBUG mpool rcache_mp: allocated chunk 0x7ffff7f51008 of 151544 bytes with 1052 elements
[1653171612.750636] [er-head:875750:0] mpool.c:236 UCX DEBUG mpool rcache_mp: allocated chunk 0x7ffff7f2c008 of 151544 bytes with 1052 elements
[1653171612.750794] [er-head:875751:0] async.c:231 UCX DEBUG added async handler 0x671520 [id=14 ref 1] ucs_rcache_invalidate_handler() to hash
[1653171612.750813] [er-head:875751:0] async.c:493 UCX DEBUG listening to async event fd 14 events 0x1 mode thread_spinlock
[1653171612.750921] [er-head:875751:0] module.c:280 UCX DEBUG loading modules for ucm
[1653171612.750931] [er-head:875751:0] ib_md.c:1314 UCX DEBUG mlx5_0: using registration cache
[1653171612.750938] [er-head:875751:0] mpool.c:98 UCX DEBUG mpool devx dbrec: align 64, maxelems 4294967295, elemsize 40
[1653171612.751196] [er-head:875751:0] ib_mlx5dv_md.c:1065 UCX DEBUG mlx5_0: opened DEVX md
[1653171612.751201] [er-head:875751:0] ib_md.c:1541 UCX DEBUG mlx5_0: md open by 'uct_ib_mlx5_devx_md_ops' is successful
[1653171612.751407] [er-head:875750:0] mpool.c:236 UCX DEBUG mpool rcache_mp: allocated chunk 0x7ffff7f07008 of 151544 bytes with 1052 elements
[1653171612.751762] [er-head:875750:0] uct_mem.c:293 UCX DEBUG could not allocate memory with any of the provided methods
[1653171612.751767] [er-head:875750:0] uct_mem.c:293 UCX DEBUG could not allocate memory with any of the provided methods
[1653171612.751827] [er-head:875750:0] uct_mem.c:293 UCX DEBUG could not allocate memory with any of the provided methods
[1653171612.755655] [er-head:875751:0] ib_device.c:615 UCX DEBUG mlx5_1: PF sysfs path is '/sys/devices/pci0000:40/0000:40:01.1/0000:41:00.1'
[1653171612.755680] [er-head:875751:0] topo.c:175 UCX DEBUG added sys_dev 1 for bus id 41:00.1
[1653171612.755686] [er-head:875751:0] ib_device.c:646 UCX DEBUG mlx5_1: bdf_name 0000:41:00.1 sys_dev 1
[1653171612.755709] [er-head:875751:0] ib_device.c:670 UCX DEBUG mlx5_1: vendor_id 0x15b3 device_id 4119
[1653171612.756022] [er-head:875751:0] ib_mlx5dv_md.c:710 UCX DEBUG mlx5_1: disable ODP because it's not supported for DEVX QP
[1653171612.756206] [er-head:875751:0] async.c:231 UCX DEBUG added async handler 0x71f7e0 [id=17 ref 1] uct_ib_async_event_handler() to hash
[1653171612.756224] [er-head:875751:0] async.c:493 UCX DEBUG listening to async event fd 17 events 0x1 mode thread_spinlock
[1653171612.756230] [er-head:875751:0] ib_device.c:834 UCX DEBUG initialized device 'mlx5_1' (InfiniBand channel adapter) with 1 ports
[1653171612.756354] [er-head:875751:0] ib_md.c:1592 UCX DEBUG mlx5_1: cuda GPUDirect RDMA is disabled
[1653171612.756362] [er-head:875751:0] ib_md.c:1592 UCX DEBUG mlx5_1: rocm GPUDirect RDMA is disabled
[1653171612.756372] [er-head:875751:0] mpool.c:98 UCX DEBUG mpool rcache_mp: align 8, maxelems 4294967295, elemsize 144
[1653171612.756500] [er-head:875751:0] ib_md.c:1314 UCX DEBUG mlx5_1: using registration cache
[1653171612.756506] [er-head:875751:0] mpool.c:98 UCX DEBUG mpool devx dbrec: align 64, maxelems 4294967295, elemsize 40
[1653171612.756687] [er-head:875751:0] ib_mlx5dv_md.c:1065 UCX DEBUG mlx5_1: opened DEVX md
[1653171612.756692] [er-head:875751:0] ib_md.c:1541 UCX DEBUG mlx5_1: md open by 'uct_ib_mlx5_devx_md_ops' is successful
[1653171612.757893] [er-head:875751:0] ucp_context.c:1370 UCX DEBUG closing md cma because it has no selected transport resources
[1653171612.757953] [er-head:875751:0] mpool.c:98 UCX DEBUG mpool rcache_mp: align 8, maxelems 4294967295, elemsize 144
[1653171612.758057] [er-head:875751:0] ucp_context.c:1370 UCX DEBUG closing md knem because it has no selected transport resources
[1653171612.758079] [er-head:875751:0] pgtable.c:618 UCX DEBUG purge empty page table
[1653171612.758089] [er-head:875751:0] mpool.c:154 UCX DEBUG mpool rcache_mp destroyed
[1653171612.758102] [er-head:875751:0] mm_xpmem.c:90 UCX DEBUG xpmem version: 155653
[1653171612.758141] [er-head:875751:0] ucp_context.c:1370 UCX DEBUG closing md xpmem because it has no selected transport resources
[1653171612.758164] [er-head:875751:0] mpool.c:98 UCX DEBUG mpool rcache_mp: align 8, maxelems 4294967295, elemsize 144
[1653171612.758223] [er-head:875751:0] ucp_context.c:1924 UCX DEBUG created ucp context osss-ucx 0x674a20 [2 mds 8 tls] features 0xe tl bitmap 0xff 0x0
[1653171612.758249] [er-head:875751:0] mpool.c:236 UCX DEBUG mpool rcache_mp: allocated chunk 0x7ffff7f51008 of 151544 bytes with 1052 elements
[1653171612.758305] [er-head:875751:0] mpool.c:236 UCX DEBUG mpool rcache_mp: allocated chunk 0x7ffff7f2c008 of 151544 bytes with 1052 elements
[1653171612.758725] [er-head:875751:0] mpool.c:236 UCX DEBUG mpool rcache_mp: allocated chunk 0x7ffff7f07008 of 151544 bytes with 1052 elements
[1653171612.759088] [er-head:875751:0] uct_mem.c:293 UCX DEBUG could not allocate memory with any of the provided methods
[1653171612.759093] [er-head:875751:0] uct_mem.c:293 UCX DEBUG could not allocate memory with any of the provided methods
[1653171612.759170] [er-head:875751:0] uct_mem.c:293 UCX DEBUG could not allocate memory with any of the provided methods
[1653171612.768121] [er-head:875750:0] ib_iface.c:864 UCX DEBUG using pkey[0] 0xffff on mlx5_0:1/IB
[1653171612.769082] [er-head:875750:0] ib_iface.c:1372 UCX DEBUG created uct_ib_iface_t headroom_ofs 12 payload_ofs 92 hdr_ofs 91 data_sz 8256
[1653171612.769138] [er-head:875750:0] mpool.c:98 UCX DEBUG mpool rc_recv_desc: align 64, maxelems 4294967295, elemsize 8356
[1653171612.769144] [er-head:875750:0] mpool.c:98 UCX DEBUG mpool rc_send_desc: align 64, maxelems 4294967295, elemsize 8320
[1653171612.769171] [er-head:875750:0] mpool.c:98 UCX DEBUG mpool send-ops-mpool: align 64, maxelems 4294967295, elemsize 48
[1653171612.769857] [er-head:875750:0] mpool.c:98 UCX DEBUG mpool pending-ops: align 1, maxelems 4294967295, elemsize 64
[1653171612.769869] [er-head:875750:0] mpool.c:98 UCX DEBUG mpool rc_verbs_short_desc: align 64, maxelems 4294967295, elemsize 192
[1653171612.770269] [er-head:875750:0] ib_iface.c:982 UCX DEBUG iface=0x987ee0: created RC QP 0x89e2 on mlx5_0:1 TX wr:409 sge:5 inl:124 resp:64 RX wr:0 sge:0 resp:64
[1653171612.770779] [er-head:875750:0] ucp_worker.c:1327 UCX DEBUG created interface[0]=0x987ee0 using rc_verbs/mlx5_0:1 on worker 0x684f10
[1653171612.770980] [er-head:875750:0] ib_iface.c:864 UCX DEBUG using pkey[0] 0xffff on mlx5_0:1/IB
[1653171612.771014] [er-head:875750:0] ib_device.c:1465 UCX DEBUG max IB CQE size is 128
[1653171612.771918] [er-head:875750:0] ib_iface.c:1372 UCX DEBUG created uct_ib_iface_t headroom_ofs 12 payload_ofs 92 hdr_ofs 90 data_sz 8256
[1653171612.771936] [er-head:875750:0] mpool.c:98 UCX DEBUG mpool rc_recv_desc: align 64, maxelems 4294967295, elemsize 8356
[1653171612.771941] [er-head:875750:0] mpool.c:98 UCX DEBUG mpool rc_send_desc: align 64, maxelems 4294967295, elemsize 8320
[1653171612.771965] [er-head:875750:0] mpool.c:98 UCX DEBUG mpool send-ops-mpool: align 64, maxelems 4294967295, elemsize 48
[1653171612.772369] [er-head:875750:0] mpool.c:236 UCX DEBUG mpool devx dbrec: allocated chunk 0x721010 of 8176 bytes with 127 elements
[1653171612.772642] [er-head:875750:0] mpool.c:98 UCX DEBUG mpool pending-ops: align 1, maxelems 4294967295, elemsize 64
[1653171612.772812] [er-head:875750:0] ib_mlx5.c:996 UCX DEBUG SL=0 (AR support - no) was selected on mlx5_0:1, SLs with AR support = { <none> }, SLs without AR support = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 }
d_recv_skb: allocated chunk 0x7ffff7e81018 of 544744 bytes with 128 elements
[1653171612.778822] [er-head:875750:0] mpool.c:98 UCX DEBUG mpool ud_tx_skb: align 64, maxelems 4294967295, elemsize 4168
[1653171612.778843] [er-head:875750:0] ud_iface.c:420 UCX DEBUG iface 0xaa6020: adding gid fe80::b8ce:f603:b:fa4 to hash on device mlx5_0 port 1 index 0)
[1653171612.778859] [er-head:875750:0] ud_iface.c:420 UCX DEBUG iface 0xaa6020: adding gid fe80:: to hash on device mlx5_0 port 1 index 1)
[1653171612.778869] [er-head:875750:0] ud_iface.c:420 UCX DEBUG iface 0xaa6020: adding gid fe80:: to hash on device mlx5_0 port 1 index 2)
[1653171612.778879] [er-head:875750:0] ud_iface.c:420 UCX DEBUG iface 0xaa6020: adding gid fe80:: to hash on device mlx5_0 port 1 index 3)
[1653171612.778889] [er-head:875750:0] ud_iface.c:420 UCX DEBUG iface 0xaa6020: adding gid fe80:: to hash on device mlx5_0 port 1 index 4)
[1653171612.778901] [er-head:875750:0] ud_iface.c:420 UCX DEBUG iface 0xaa6020: adding gid fe80:: to hash on device mlx5_0 port 1 index 5)
[1653171612.778911] [er-head:875750:0] ud_iface.c:420 UCX DEBUG iface 0xaa6020: adding gid fe80:: to hash on device mlx5_0 port 1 index 6)
[1653171612.778921] [er-head:875750:0] ud_iface.c:420 UCX DEBUG iface 0xaa6020: adding gid fe80:: to hash on device mlx5_0 port 1 index 7)
[1653171612.778949] [er-head:875751:0] ib_iface.c:1372 UCX DEBUG created uct_ib_iface_t headroom_ofs 12 payload_ofs 92 hdr_ofs 90 data_sz 8256
[1653171612.778957] [er-head:875751:0] mpool.c:98 UCX DEBUG mpool rc_recv_desc: align 64, maxelems 4294967295, elemsize 8356
[1653171612.778961] [er-head:875751:0] mpool.c:98 UCX DEBUG mpool rc_send_desc: align 64, maxelems 4294967295, elemsize 8320
[1653171612.778981] [er-head:875751:0] mpool.c:98 UCX DEBUG mpool send-ops-mpool: align 64, maxelems 4294967295, elemsize 48
[1653171612.779176] [er-head:875750:0] timer_wheel.c:40 UCX DEBUG high res timer created log=23 resolution=2712.431042 usec wanted: 2499.999838 usec
[1653171612.779369] [er-head:875751:0] mpool.c:236 UCX DEBUG mpool devx dbrec: allocated chunk 0x721010 of 8176 bytes with 127 elements
[1653171612.779573] [er-head:875751:0] mpool.c:98 UCX DEBUG mpool pending-ops: align 1, maxelems 4294967295, elemsize 64
[1653171612.779779] [er-head:875751:0] ib_mlx5.c:996 UCX DEBUG SL=0 (AR support - no) was selected on mlx5_0:1, SLs with AR support = { <none> }, SLs without AR support = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 }
BFD: Dwarf Error: Can't find .debug_ranges section.
[1653171612.780005] [er-head:875750:0] async.c:231 UCX DEBUG added async handler 0x9ab700 [id=21 ref 1] uct_ud_iface_async_handler() to hash
[1653171612.780020] [er-head:875750:0] async.c:493 UCX DEBUG listening to async event fd 21 events 0x5 mode thread_spinlock
[1653171612.780040] [er-head:875750:0] ucp_worker.c:1327 UCX DEBUG created interface[2]=0xaa6020 using ud_verbs/mlx5_0:1 on worker 0x684f10
[1653171612.780144] [er-head:875750:0] ib_iface.c:864 UCX DEBUG using pkey[0] 0xffff on mlx5_0:1/IB
[1653171612.780469] [er-head:875751:0] mpool.c:98 UCX DEBUG mpool mlx5_dm_desc: align 64, maxelems 1, elemsize 64
[1653171612.780475] [er-head:875751:0] mpool.c:98 UCX DEBUG mpool rc_mlx5_atomic_desc: align 64, maxelems 4294967295, elemsize 72
[1653171612.780734] [er-head:875750:0] ib_iface.c:1372 UCX DEBUG created uct_ib_iface_t headroom_ofs 12 payload_ofs 92 hdr_ofs 44 data_sz 4096
[1653171612.781282] [er-head:875750:0] ib_iface.c:982 UCX DEBUG iface=0xbe4270: created UD QP 0x8f6e on mlx5_0:1 TX wr:341 sge:6 inl:124 resp:0 RX wr:4096 sge:1 resp:0
[1653171612.781292] [er-head:875750:0] ib_mlx5.c:676 UCX DEBUG tx wq 65536 bytes [bb=64, nwqe=1024] mmio_mode bf_post_mt
BFD: Dwarf Error: Can't find .debug_ranges section.
[1653171612.781623] [er-head:875751:0] async.c:231 UCX DEBUG added async handler 0x683750 [id=20 ref 1] uct_rc_mlx5_devx_iface_event_handler() to hash
[1653171612.781638] [er-head:875751:0] async.c:493 UCX DEBUG listening to async event fd 20 events 0x1 mode thread_spinlock
[1653171612.781661] [er-head:875751:0] ucp_worker.c:1327 UCX DEBUG created interface[1]=0xbfc2c0 using rc_mlx5/mlx5_0:1 on worker 0x684f10
[1653171612.781730] [er-head:875750:0] mpool.c:98 UCX DEBUG mpool ud_recv_skb: align 64, maxelems 4294967295, elemsize 4196
[1653171612.781771] [er-head:875751:0] ib_iface.c:864 UCX DEBUG using pkey[0] 0xffff on mlx5_0:1/IB
[1653171612.782269] [er-head:875750:0] mpool.c:236 UCX DEBUG mpool ud_recv_skb: allocated chunk 0x7ffff00d5018 of 544744 bytes with 128 elements
[1653171612.782275] [er-head:875750:0] mpool.c:98 UCX DEBUG mpool ud_tx_skb: align 64, maxelems 4294967295, elemsize 4168
[1653171612.782297] [er-head:875750:0] ud_iface.c:420 UCX DEBUG iface 0xbe4270: adding gid fe80::b8ce:f603:b:fa4 to hash on device mlx5_0 port 1 index 0)
[1653171612.782312] [er-head:875750:0] ud_iface.c:420 UCX DEBUG iface 0xbe4270: adding gid fe80:: to hash on device mlx5_0 port 1 index 1)
[1653171612.782326] [er-head:875750:0] ud_iface.c:420 UCX DEBUG iface 0xbe4270: adding gid fe80:: to hash on device mlx5_0 port 1 index 2)
[1653171612.782339] [er-head:875750:0] ud_iface.c:420 UCX DEBUG iface 0xbe4270: adding gid fe80:: to hash on device mlx5_0 port 1 index 3)
[1653171612.782353] [er-head:875750:0] ud_iface.c:420 UCX DEBUG iface 0xbe4270: adding gid fe80:: to hash on device mlx5_0 port 1 index 4)
[1653171612.782366] [er-head:875750:0] ud_iface.c:420 UCX DEBUG iface 0xbe4270: adding gid fe80:: to hash on device mlx5_0 port 1 index 5)
[1653171612.782380] [er-head:875750:0] ud_iface.c:420 UCX DEBUG iface 0xbe4270: adding gid fe80:: to hash on device mlx5_0 port 1 index 6)
[1653171612.782399] [er-head:875751:0] ib_iface.c:1372 UCX DEBUG created uct_ib_iface_t headroom_ofs 12 payload_ofs 92 hdr_ofs 44 data_sz 4096
[1653171612.782393] [er-head:875750:0] ud_iface.c:420 UCX DEBUG iface 0xbe4270: adding gid fe80:: to hash on device mlx5_0 port 1 index 7)
[1653171612.782521] [er-head:875750:0] ib_mlx5.c:996 UCX DEBUG SL=0 (AR support - no) was selected on mlx5_0:1, SLs with AR support = { <none> }, SLs without AR support = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 }
[1653171612.782574] [er-head:875750:0] timer_wheel.c:40 UCX DEBUG high res timer created log=23 resolution=2712.431042 usec wanted: 2499.999838 usec
[1653171612.782579] [er-head:875750:0] async.c:231 UCX DEBUG added async handler 0x71d4c0 [id=22 ref 1] uct_ud_iface_async_handler() to hash
[1653171612.782593] [er-head:875750:0] async.c:493 UCX DEBUG listening to async event fd 22 events 0x5 mode thread_spinlock
[1653171612.782610] [er-head:875750:0] ucp_worker.c:1327 UCX DEBUG created interface[3]=0xbe4270 using ud_mlx5/mlx5_0:1 on worker 0x684f10
[1653171612.782737] [er-head:875750:0] ib_iface.c:864 UCX DEBUG using pkey[0] 0xffff on mlx5_1:1/IB
[1653171612.782811] [er-head:875751:0] ib_iface.c:982 UCX DEBUG iface=0xaa6020: created UD QP 0x8f6f on mlx5_0:1 TX wr:341 sge:6 inl:124 resp:0 RX wr:4096 sge:1 resp:0
[1653171612.783222] [er-head:875751:0] mpool.c:98 UCX DEBUG mpool ud_recv_skb: align 64, maxelems 4294967295, elemsize 4196
[1653171612.783463] [er-head:875750:0] ib_iface.c:1372 UCX DEBUG created uct_ib_iface_t headroom_ofs 12 payload_ofs 92 hdr_ofs 91 data_sz 8256
[1653171612.783471] [er-head:875750:0] mpool.c:98 UCX DEBUG mpool rc_recv_desc: align 64, maxelems 4294967295, elemsize 8356
[1653171612.783476] [er-head:875750:0] mpool.c:98 UCX DEBUG mpool rc_send_desc: align 64, maxelems 4294967295, elemsize 8320
[1653171612.783496] [er-head:875750:0] mpool.c:98 UCX DEBUG mpool send-ops-mpool: align 64, maxelems 4294967295, elemsize 48
[1653171612.783749] [er-head:875751:0] mpool.c:236 UCX DEBUG mpool ud_recv_skb: allocated chunk 0x7ffff7e81018 of 544744 bytes with 128 elements
[1653171612.783755] [er-head:875751:0] mpool.c:98 UCX DEBUG mpool ud_tx_skb: align 64, maxelems 4294967295, elemsize 4168
[1653171612.783784] [er-head:875751:0] ud_iface.c:420 UCX DEBUG iface 0xaa6020: adding gid fe80::b8ce:f603:b:fa4 to hash on device mlx5_0 port 1 index 0)
[1653171612.783804] [er-head:875751:0] ud_iface.c:420 UCX DEBUG iface 0xaa6020: adding gid fe80:: to hash on device mlx5_0 port 1 index 1)
[1653171612.783818] [er-head:875751:0] ud_iface.c:420 UCX DEBUG iface 0xaa6020: adding gid fe80:: to hash on device mlx5_0 port 1 index 2)
[1653171612.783832] [er-head:875751:0] ud_iface.c:420 UCX DEBUG iface 0xaa6020: adding gid fe80:: to hash on device mlx5_0 port 1 index 3)
[1653171612.783846] [er-head:875751:0] ud_iface.c:420 UCX DEBUG iface 0xaa6020: adding gid fe80:: to hash on device mlx5_0 port 1 index 4)
[1653171612.783859] [er-head:875751:0] ud_iface.c:420 UCX DEBUG iface 0xaa6020: adding gid fe80:: to hash on device mlx5_0 port 1 index 5)
[1653171612.783873] [er-head:875751:0] ud_iface.c:420 UCX DEBUG iface 0xaa6020: adding gid fe80:: to hash on device mlx5_0 port 1 index 6)
[1653171612.783887] [er-head:875751:0] ud_iface.c:420 UCX DEBUG iface 0xaa6020: adding gid fe80:: to hash on device mlx5_0 port 1 index 7)
[1653171612.783908] [er-head:875750:0] mpool.c:98 UCX DEBUG mpool pending-ops: align 1, maxelems 4294967295, elemsize 64
[1653171612.783913] [er-head:875750:0] mpool.c:98 UCX DEBUG mpool rc_verbs_short_desc: align 64, maxelems 4294967295, elemsize 192
[1653171612.783997] [er-head:875751:0] timer_wheel.c:40 UCX DEBUG high res timer created log=23 resolution=2711.837139 usec wanted: 2499.999755 usec
[1653171612.784174] [er-head:875750:0] ib_iface.c:982 UCX DEBUG iface=0x703010: created RC QP 0xc055 on mlx5_1:1 TX wr:409 sge:5 inl:124 resp:64 RX wr:0 sge:0 resp:64
[1653171612.784600] [er-head:875750:0] ucp_worker.c:1327 UCX DEBUG created interface[4]=0x703010 using rc_verbs/mlx5_1:1 on worker 0x684f10
[1653171612.784724] [er-head:875750:0] ib_iface.c:864 UCX DEBUG using pkey[0] 0xffff on mlx5_1:1/IB
BFD: Dwarf Error: Can't find .debug_ranges section.
[1653171612.784792] [er-head:875751:0] async.c:231 UCX DEBUG added async handler 0x9ab700 [id=21 ref 1] uct_ud_iface_async_handler() to hash
[1653171612.784807] [er-head:875751:0] async.c:493 UCX DEBUG listening to async event fd 21 events 0x5 mode thread_spinlock
[1653171612.784827] [er-head:875751:0] ucp_worker.c:1327 UCX DEBUG created interface[2]=0xaa6020 using ud_verbs/mlx5_0:1 on worker 0x684f10
[1653171612.784932] [er-head:875751:0] ib_iface.c:864 UCX DEBUG using pkey[0] 0xffff on mlx5_0:1/IB
[1653171612.785612] [er-head:875751:0] ib_iface.c:1372 UCX DEBUG created uct_ib_iface_t headroom_ofs 12 payload_ofs 92 hdr_ofs 44 data_sz 4096
[1653171612.785807] [er-head:875750:0] ib_iface.c:1372 UCX DEBUG created uct_ib_iface_t headroom_ofs 12 payload_ofs 92 hdr_ofs 90 data_sz 8256
[1653171612.785815] [er-head:875750:0] mpool.c:98 UCX DEBUG mpool rc_recv_desc: align 64, maxelems 4294967295, elemsize 8356
[1653171612.785819] [er-head:875750:0] mpool.c:98 UCX DEBUG mpool rc_send_desc: align 64, maxelems 4294967295, elemsize 8320
[1653171612.785870] [er-head:875750:0] mpool.c:98 UCX DEBUG mpool send-ops-mpool: align 64, maxelems 4294967295, elemsize 48
[1653171612.786161] [er-head:875751:0] ib_iface.c:982 UCX DEBUG iface=0xbe4270: created UD QP 0x8f70 on mlx5_0:1 TX wr:341 sge:6 inl:124 resp:0 RX wr:4096 sge:1 resp:0
[1653171612.786171] [er-head:875751:0] ib_mlx5.c:676 UCX DEBUG tx wq 65536 bytes [bb=64, nwqe=1024] mmio_mode bf_post_mt
[1653171612.786267] [er-head:875750:0] mpool.c:236 UCX DEBUG mpool devx dbrec: allocated chunk 0xc08010 of 8176 bytes with 127 elements
[1653171612.786458] [er-head:875750:0] mpool.c:98 UCX DEBUG mpool pending-ops: align 1, maxelems 4294967295, elemsize 64
[1653171612.786547] [er-head:875751:0] mpool.c:98 UCX DEBUG mpool ud_recv_skb: align 64, maxelems 4294967295, elemsize 4196
[1653171612.786601] [er-head:875750:0] ib_mlx5.c:996 UCX DEBUG SL=0 (AR support - no) was selected on mlx5_1:1, SLs with AR support = { <none> }, SLs without AR support = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 }
[1653171612.787053] [er-head:875751:0] mpool.c:236 UCX DEBUG mpool ud_recv_skb: allocated chunk 0x7ffff00d5018 of 544744 bytes with 128 elements
[1653171612.787060] [er-head:875751:0] mpool.c:98 UCX DEBUG mpool ud_tx_skb: align 64, maxelems 4294967295, elemsize 4168
[1653171612.787071] [er-head:875750:0] mpool.c:98 UCX DEBUG mpool mlx5_dm_desc: align 64, maxelems 1, elemsize 64
[1653171612.787075] [er-head:875750:0] mpool.c:98 UCX DEBUG mpool rc_mlx5_atomic_desc: align 64, maxelems 4294967295, elemsize 72
[1653171612.787076] [er-head:875751:0] ud_iface.c:420 UCX DEBUG iface 0xbe4270: adding gid fe80::b8ce:f603:b:fa4 to hash on device mlx5_0 port 1 index 0)
[1653171612.787086] [er-head:875750:0] async.c:231 UCX DEBUG added async handler 0x9ad4d0 [id=25 ref 1] uct_rc_mlx5_devx_iface_event_handler() to hash
[1653171612.787089] [er-head:875751:0] ud_iface.c:420 UCX DEBUG iface 0xbe4270: adding gid fe80:: to hash on device mlx5_0 port 1 index 1)
[1653171612.787099] [er-head:875750:0] async.c:493 UCX DEBUG listening to async event fd 25 events 0x1 mode thread_spinlock
[1653171612.787098] [er-head:875751:0] ud_iface.c:420 UCX DEBUG iface 0xbe4270: adding gid fe80:: to hash on device mlx5_0 port 1 index 2)
[1653171612.787108] [er-head:875751:0] ud_iface.c:420 UCX DEBUG iface 0xbe4270: adding gid fe80:: to hash on device mlx5_0 port 1 index 3)
[1653171612.787119] [er-head:875751:0] ud_iface.c:420 UCX DEBUG iface 0xbe4270: adding gid fe80:: to hash on device mlx5_0 port 1 index 4)
[1653171612.787111] [er-head:875750:0] ucp_worker.c:1327 UCX DEBUG created interface[5]=0x6a5d50 using rc_mlx5/mlx5_1:1 on worker 0x684f10
[1653171612.787129] [er-head:875751:0] ud_iface.c:420 UCX DEBUG iface 0xbe4270: adding gid fe80:: to hash on device mlx5_0 port 1 index 5)
[1653171612.787139] [er-head:875751:0] ud_iface.c:420 UCX DEBUG iface 0xbe4270: adding gid fe80:: to hash on device mlx5_0 port 1 index 6)
[1653171612.787149] [er-head:875751:0] ud_iface.c:420 UCX DEBUG iface 0xbe4270: adding gid fe80:: to hash on device mlx5_0 port 1 index 7)
[1653171612.787203] [er-head:875750:0] ib_iface.c:864 UCX DEBUG using pkey[0] 0xffff on mlx5_1:1/IB
[1653171612.787285] [er-head:875751:0] ib_mlx5.c:996 UCX DEBUG SL=0 (AR support - no) was selected on mlx5_0:1, SLs with AR support = { <none> }, SLs without AR support = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 }
[1653171612.787321] [er-head:875751:0] timer_wheel.c:40 UCX DEBUG high res timer created log=23 resolution=2711.837139 usec wanted: 2499.999755 usec
[1653171612.787327] [er-head:875751:0] async.c:231 UCX DEBUG added async handler 0x71d4c0 [id=22 ref 1] uct_ud_iface_async_handler() to hash
[1653171612.787336] [er-head:875751:0] async.c:493 UCX DEBUG listening to async event fd 22 events 0x5 mode thread_spinlock
[1653171612.787349] [er-head:875751:0] ucp_worker.c:1327 UCX DEBUG created interface[3]=0xbe4270 using ud_mlx5/mlx5_0:1 on worker 0x684f10
[1653171612.787454] [er-head:875751:0] ib_iface.c:864 UCX DEBUG using pkey[0] 0xffff on mlx5_1:1/IB
[1653171612.787903] [er-head:875750:0] ib_iface.c:1372 UCX DEBUG created uct_ib_iface_t headroom_ofs 12 payload_ofs 92 hdr_ofs 44 data_sz 4096
[1653171612.788093] [er-head:875751:0] ib_iface.c:1372 UCX DEBUG created uct_ib_iface_t headroom_ofs 12 payload_ofs 92 hdr_ofs 91 data_sz 8256
[1653171612.788102] [er-head:875751:0] mpool.c:98 UCX DEBUG mpool rc_recv_desc: align 64, maxelems 4294967295, elemsize 8356
[1653171612.788106] [er-head:875751:0] mpool.c:98 UCX DEBUG mpool rc_send_desc: align 64, maxelems 4294967295, elemsize 8320
[1653171612.788128] [er-head:875751:0] mpool.c:98 UCX DEBUG mpool send-ops-mpool: align 64, maxelems 4294967295, elemsize 48
[1653171612.788590] [er-head:875750:0] ib_iface.c:982 UCX DEBUG iface=0x7f2050: created UD QP 0xd851 on mlx5_1:1 TX wr:341 sge:6 inl:124 resp:0 RX wr:4096 sge:1 resp:0
[1653171612.788664] [er-head:875751:0] mpool.c:98 UCX DEBUG mpool pending-ops: align 1, maxelems 4294967295, elemsize 64
[1653171612.788669] [er-head:875751:0] mpool.c:98 UCX DEBUG mpool rc_verbs_short_desc: align 64, maxelems 4294967295, elemsize 192
[1653171612.789500] [er-head:875750:0] mpool.c:98 UCX DEBUG mpool ud_recv_skb: align 64, maxelems 4294967295, elemsize 4196
[1653171612.789513] [er-head:875751:0] ib_iface.c:982 UCX DEBUG iface=0x703010: created RC QP 0xc056 on mlx5_1:1 TX wr:409 sge:5 inl:124 resp:64 RX wr:0 sge:0 resp:64
[1653171612.789998] [er-head:875751:0] ucp_worker.c:1327 UCX DEBUG created interface[4]=0x703010 using rc_verbs/mlx5_1:1 on worker 0x684f10
[1653171612.790086] [er-head:875750:0] mpool.c:236 UCX DEBUG mpool ud_recv_skb: allocated chunk 0x7ffff0050018 of 544744 bytes with 128 elements
[1653171612.790092] [er-head:875750:0] mpool.c:98 UCX DEBUG mpool ud_tx_skb: align 64, maxelems 4294967295, elemsize 4168
[1653171612.790109] [er-head:875750:0] ud_iface.c:420 UCX DEBUG iface 0x7f2050: adding gid fe80::b8ce:f603:b:fa5 to hash on device mlx5_1 port 1 index 0)
[1653171612.790118] [er-head:875751:0] ib_iface.c:864 UCX DEBUG using pkey[0] 0xffff on mlx5_1:1/IB
[1653171612.790120] [er-head:875750:0] ud_iface.c:420 UCX DEBUG iface 0[1653171612.794769] [er-head:875750:0] async.c:231 UCX DEBUG added async handler 0x6c1b20 [id=24 ref 1] ucp_worker_iface_async_fd_event() to hash
[1653171612.794775] [er-head:875750:0] async.c:493 UCX DEBUG listening to async event fd 24 events 0x0 mode thread_spinlock
[1653171612.794788] [er-head:875751:0] ud_iface.c:420 UCX DEBUG iface 0x7f2050: adding gid fe80:: to hash on device mlx5_1 port 1 index 4)
[1653171612.794802] [er-head:875751:0] ud_iface.c:420 UCX DEBUG iface 0x7f2050: adding gid fe80:: to hash on device mlx5_1 port 1 index 5)
[1653171612.794816] [er-head:875751:0] ud_iface.c:420 UCX DEBUG iface 0x7f2050: adding gid fe80:: to hash on device mlx5_1 port 1 index 6)
[1653171612.794830] [er-head:875751:0] ud_iface.c:420 UCX DEBUG iface 0x7f2050: adding gid fe80:: to hash on device mlx5_1 port 1 index 7)
[1653171612.794932] [er-head:875751:0] timer_wheel.c:40 UCX DEBUG high res timer created log=23 resolution=2711.837139 usec wanted: 2499.999755 usec
[1653171612.794938] [er-head:875751:0] async.c:231 UCX DEBUG added async handler 0x8ff980 [id=26 ref 1] uct_ud_iface_async_handler() to hash
[1653171612.794951] [er-head:875751:0] async.c:493 UCX DEBUG listening to async event fd 26 events 0x5 mode thread_spinlock
[1653171612.794969] [er-head:875751:0] ucp_worker.c:1327 UCX DEBUG created interface[6]=0x7f2050 using ud_verbs/mlx5_1:1 on worker 0x684f10
[1653171612.795070] [er-head:875751:0] ib_iface.c:864 UCX DEBUG using pkey[0] 0xffff on mlx5_1:1/IB
BFD: Dwarf Error: Can't find .debug_ranges section.
[1653171612.795267] [er-head:875750:0] async.c:231 UCX DEBUG added async handler 0xc07870 [id=28 ref 1] uct_rdmacm_cm_event_handler() to hash
[1653171612.795276] [er-head:875750:0] async.c:493 UCX DEBUG listening to async event fd 28 events 0x1 mode thread_spinlock
[1653171612.795284] [er-head:875750:0] rdmacm_cm.c:979 UCX DEBUG created rdmacm_cm 0xc072b0 with event_channel 0xc07b30 (fd=28)
[1653171612.795302] [er-head:875750:0] tcp_sockcm.c:222 UCX DEBUG created tcp_sockcm 0xf395a0
[1653171612.795318] [er-head:875750:0] mpool.c:98 UCX DEBUG mpool ucp_requests: align 64, maxelems 4294967295, elemsize 280
[1653171612.795322] [er-head:875750:0] mpool.c:98 UCX DEBUG mpool ucp_rkeys: align 64, maxelems 4294967295, elemsize 112
[1653171612.795325] [er-head:875750:0] mpool.c:98 UCX DEBUG mpool ucp_reg_bufs: align 64, maxelems 4294967295, elemsize 8216
[1653171612.795330] [er-head:875750:0] mpool.c:98 UCX DEBUG mpool ucp_am_bufs: align 64, maxelems 4294967295, elemsize 153
[1653171612.795334] [er-head:875750:0] mpool.c:98 UCX DEBUG mpool ucp_am_bufs: align 64, maxelems 4294967295, elemsize 1113
[1653171612.795336] [er-head:875750:0] mpool.c:98 UCX DEBUG mpool ucp_am_bufs: align 64, maxelems 4294967295, elemsize 8344
[1653171612.795340] [er-head:875750:0] mpool_set.c:120 UCX DEBUG mpool_set:ucp_am_bufs, sizes map 0x80000440, largest size 8255, mpools num 3
[1653171612.795360] [er-head:875750:0] ucp_worker.c:1607 UCX DEBUG worker 0x684f10: using device atomics
[1653171612.795382] [er-head:875750:0] parser.c:1916 UCX INFO UCX_* env variables: UCX_TLS=rc UCX_LOG_LEVEL=debug
[1653171612.795791] [er-head:875751:0] ib_iface.c:1372 UCX DEBUG created uct_ib_iface_t headroom_ofs 12 payload_ofs 92 hdr_ofs 44 data_sz 4096
[1653171612.796322] [er-head:875751:0] ib_iface.c:982 UCX DEBUG iface=0xac7060: created UD QP 0xd857 on mlx5_1:1 TX wr:341 sge:6 inl:124 resp:0 RX wr:4096 sge:1 resp:0
[1653171612.796329] [er-head:875751:0] ib_mlx5.c:676 UCX DEBUG tx wq 65536 bytes [bb=64, nwqe=1024] mmio_mode bf_post_mt
[1653171612.796653] [er-head:875751:0] mpool.c:98 UCX DEBUG mpool ud_recv_skb: align 64, maxelems 4294967295, elemsize 4196
[1653171612.797107] [er-head:875751:0] mpool.c:236 UCX DEBUG mpool ud_recv_skb: allocated chunk 0x7fffe2b49018 of 544744 bytes with 128 elements
[1653171612.797113] [er-head:875751:0] mpool.c:98 UCX DEBUG mpool ud_tx_skb: align 64, maxelems 4294967295, elemsize 4168
[1653171612.797134] [er-head:875751:0] ud_iface.c:420 UCX DEBUG iface 0xac7060: adding gid fe80::b8ce:f603:b:fa5 to hash on device mlx5_1 port 1 index 0)
[1653171612.797149] [er-head:875751:0] ud_iface.c:420 UCX DEBUG iface 0xac7060: adding gid fe80:: to hash on device mlx5_1 port 1 index 1)
[1653171612.797163] [er-head:875751:0] ud_iface.c:420 UCX DEBUG iface 0xac7060: adding gid fe80:: to hash on device mlx5_1 port 1 index 2)
[1653171612.797175] [er-head:875751:0] ud_iface.c:420 UCX DEBUG iface 0xac7060: adding gid fe80:: to hash on device mlx5_1 port 1 index 3)
[1653171612.797188] [er-head:875751:0] ud_iface.c:420 UCX DEBUG iface 0xac7060: adding gid fe80:: to hash on device mlx5_1 port 1 index 4)
[1653171612.797202] [er-head:875751:0] ud_iface.c:420 UCX DEBUG iface 0xac7060: adding gid fe80:: to hash on device mlx5_1 port 1 index 5)
[1653171612.797214] [er-head:875751:0] ud_iface.c:420 UCX DEBUG iface 0xac7060: adding gid fe80:: to hash on device mlx5_1 port 1 index 6)
[1653171612.797227] [er-head:875751:0] ud_iface.c:420 UCX DEBUG iface 0xac7060: adding gid fe80:: to hash on device mlx5_1 port 1 index 7)
[1653171612.797361] [er-head:875751:0] ib_mlx5.c:996 UCX DEBUG SL=0 (AR support - no) was selected on mlx5_1:1, SLs with AR support = { <none> }, SLs without AR support = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 }
[1653171612.797394] [er-head:875751:0] timer_wheel.c:40 UCX DEBUG high res timer created log=23 resolution=2711.837139 usec wanted: 2499.999755 usec
[1653171612.797399] [er-head:875751:0] async.c:231 UCX DEBUG added async handler 0x6aecf0 [id=27 ref 1] uct_ud_iface_async_handler() to hash
[1653171612.797410] [er-head:875751:0] async.c:493 UCX DEBUG listening to async event fd 27 events 0x5 mode thread_spinlock
[1653171612.797422] [er-head:875751:0] ucp_worker.c:1327 UCX DEBUG created interface[7]=0xac7060 using ud_mlx5/mlx5_1:1 on worker 0x684f10
[1653171612.797428] [er-head:875751:0] ucp_worker.c:1130 UCX DEBUG selected scalable tl bitmap: 0xff 0x0 (8 tls)
BFD: Dwarf Error: Can't find .debug_ranges section.
[1653171612.798990] [er-head:875751:0] async.c:231 UCX DEBUG added async handler 0x9acd10 [id=18 ref 1] ucp_worker_iface_async_fd_event() to hash
[1653171612.799003] [er-head:875751:0] async.c:493 UCX DEBUG listening to async event fd 18 events 0x0 mode thread_spinlock
[1653171612.799048] [er-head:875751:0] async.c:231 UCX DEBUG added async handler 0x9acd50 [id=19 ref 1] ucp_worker_iface_async_fd_event() to hash
[1653171612.799055] [er-head:875751:0] async.c:493 UCX DEBUG listening to async event fd 19 events 0x0 mode thread_spinlock
[1653171612.799079] [er-head:875751:0] async.c:231 UCX DEBUG added async handler 0x6c1ae0 [id=23 ref 1] ucp_worker_iface_async_fd_event() to hash
[1653171612.799086] [er-head:875751:0] async.c:493 UCX DEBUG listening to async event fd 23 events 0x0 mode thread_spinlock
[1653171612.799097] [er-head:875751:0] async.c:231 UCX DEBUG added async handler 0x6c1b20 [id=24 ref 1] ucp_worker_iface_async_fd_event() to hash
[1653171612.799104] [er-head:875751:0] async.c:493 UCX DEBUG listening to async event fd 24 events 0x0 mode thread_spinlock
BFD: Dwarf Error: Can't find .debug_ranges section.
[1653171612.799532] [er-head:875751:0] async.c:231 UCX DEBUG added async handler 0xc07870 [id=28 ref 1] uct_rdmacm_cm_event_handler() to hash
[1653171612.799541] [er-head:875751:0] async.c:493 UCX DEBUG listening to async event fd 28 events 0x1 mode thread_spinlock
[1653171612.799549] [er-head:875751:0] rdmacm_cm.c:979 UCX DEBUG created rdmacm_cm 0xc072b0 with event_channel 0xc07b30 (fd=28)
[1653171612.799568] [er-head:875751:0] tcp_sockcm.c:222 UCX DEBUG created tcp_sockcm 0xf395a0
[1653171612.799582] [er-head:875751:0] mpool.c:98 UCX DEBUG mpool ucp_requests: align 64, maxelems 4294967295, elemsize 280
[1653171612.799586] [er-head:875751:0] mpool.c:98 UCX DEBUG mpool ucp_rkeys: align 64, maxelems 4294967295, elemsize 112
[1653171612.799589] [er-head:875751:0] mpool.c:98 UCX DEBUG mpool ucp_reg_bufs: align 64, maxelems 4294967295, elemsize 8216
[1653171612.799594] [er-head:875751:0] mpool.c:98 UCX DEBUG mpool ucp_am_bufs: align 64, maxelems 4294967295, elemsize 153
[1653171612.799598] [er-head:875751:0] mpool.c:98 UCX DEBUG mpool ucp_am_bufs: align 64, maxelems 4294967295, elemsize 1113
[1653171612.799602] [er-head:875751:0] mpool.c:98 UCX DEBUG mpool ucp_am_bufs: align 64, maxelems 4294967295, elemsize 8344
[1653171612.799608] [er-head:875751:0] mpool_set.c:120 UCX DEBUG mpool_set:ucp_am_bufs, sizes map 0x80000440, largest size 8255, mpools num 3
[1653171612.799630] [er-head:875751:0] ucp_worker.c:1607 UCX DEBUG worker 0x684f10: using device atomics
[1653171612.799652] [er-head:875751:0] parser.c:1916 UCX INFO UCX_* env variables: UCX_TLS=rc UCX_LOG_LEVEL=debug
[1653171612.800186] [er-head:875750:0] ib_mlx5dv_md.c:202 UCX DEBUG mlx5dv_devx_obj_create(CREATE_MKEY, mode=KSM) failed, syndrome 0: Invalid argument
[1653171612.800184] [er-head:875751:0] ib_mlx5dv_md.c:202 UCX DEBUG mlx5dv_devx_obj_create(CREATE_MKEY, mode=KSM) failed, syndrome 0: Invalid argument
[1653171612.800284] [er-head:875750:0] ib_mlx5dv_md.c:202 UCX DEBUG mlx5dv_devx_obj_create(CREATE_MKEY, mode=KSM) failed, syndrome 0: Invalid argument
[1653171612.800283] [er-head:875751:0] ib_mlx5dv_md.c:202 UCX DEBUG mlx5dv_devx_obj_create(CREATE_MKEY, mode=KSM) failed, syndrome 0: Invalid argument
[1653171612.800520] [er-head:875750:0] ucp_ep.c:354 UCX DEBUG created ep 0x7ffff7e3e000 to <no debug data> from api call
[1653171612.800520] [er-head:875751:0] ucp_ep.c:354 UCX DEBUG created ep 0x7ffff7e3e000 to <no debug data> from api call
[1653171612.800574] [er-head:875750:0] ucp_worker.c:1782 UCX INFO ep_cfg[0]: rma(rc_mlx5/mlx5_0:1) amo(rc_mlx5/mlx5_0:1)
[1653171612.800574] [er-head:875751:0] ucp_worker.c:1782 UCX INFO ep_cfg[0]: rma(rc_mlx5/mlx5_0:1) amo(rc_mlx5/mlx5_0:1)
[1653171612.800589] [er-head:875750:0] wireup.c:1071 UCX DEBUG ep 0x7ffff7e3e000: am_lane 0 wireup_msg_lane 0 cm_lane <none> keepalive_lane <none> reachable_mds 0x3
[1653171612.800596] [er-head:875750:0] wireup.c:1094 UCX DEBUG ep 0x7ffff7e3e000: lane[0]: 1:rc_mlx5/mlx5_0:1.0 md[0] -> addr[1].md[0]/ib/sysdev[255] amo#0 am am_bw#0 wireup
[1653171612.800589] [er-head:875751:0] wireup.c:1071 UCX DEBUG ep 0x7ffff7e3e000: am_lane 0 wireup_msg_lane 0 cm_lane <none> keepalive_lane <none> reachable_mds 0x3
[1653171612.800596] [er-head:875751:0] wireup.c:1094 UCX DEBUG ep 0x7ffff7e3e000: lane[0]: 1:rc_mlx5/mlx5_0:1.0 md[0] -> addr[1].md[0]/ib/sysdev[255] amo#0 am am_bw#0 wireup
[1653171612.801320] [er-head:875750:0] rc_ep.c:165 UCX DEBUG created rc ep 0x990ef0
[1653171612.801320] [er-head:875751:0] rc_ep.c:165 UCX DEBUG created rc ep 0x990ef0
[1653171612.801331] [er-head:875750:0] wireup_ep.c:543 UCX DEBUG ep 0x7ffff7e3e000: wireup_ep 0x7244a0 created next_ep 0x990ef0 to <no debug data> using rc_mlx5/mlx5_0:1
[1653171612.801330] [er-head:875751:0] wireup_ep.c:543 UCX DEBUG ep 0x7ffff7e3e000: wireup_ep 0x7244a0 created next_ep 0x990ef0 to <no debug data> using rc_mlx5/mlx5_0:1
[1653171612.801347] [er-head:875750:0] ib_iface.c:750 UCX DEBUG iface 0xbe4270: ah_attr dlid=1 sl=0 port=1 src_path_bits=0
[1653171612.801347] [er-head:875751:0] ib_iface.c:750 UCX DEBUG iface 0xbe4270: ah_attr dlid=1 sl=0 port=1 src_path_bits=0
[1653171612.801359] [er-head:875750:0] ib_iface.c:750 UCX DEBUG iface 0xbe4270: ah_attr dlid=1 sl=0 port=1 src_path_bits=0
[1653171612.801359] [er-head:875751:0] ib_iface.c:750 UCX DEBUG iface 0xbe4270: ah_attr dlid=1 sl=0 port=1 src_path_bits=0
[1653171612.801375] [er-head:875750:0] ud_ep.c:404 UCX DEBUG created ep ep=0x9a9750 iface=0xbe4270 id=0
[1653171612.801375] [er-head:875751:0] ud_ep.c:404 UCX DEBUG created ep ep=0x9a9750 iface=0xbe4270 id=0
[1653171612.801382] [er-head:875751:0] ud_ep.c:508 UCX DEBUG mlx5_0:1/IB lid 1 qpn 0x8f70 epid 0 ep 0x9a9750 connected to IFACE lid 1 fe80::pkey 0xffff qpn 0x8f6e
[1653171612.801387] [er-head:875751:0] ib_iface.c:750 UCX DEBUG iface 0xbe4270: ah_attr dlid=1 sl=0 port=1 src_path_bits=0
[1653171612.801382] [er-head:875750:0] ud_ep.c:508 UCX DEBUG mlx5_0:1/IB lid 1 qpn 0x8f6e epid 0 ep 0x9a9750 connected to IFACE lid 1 fe80::pkey 0xffff qpn 0x8f6e
[1653171612.801387] [er-head:875750:0] ib_iface.c:750 UCX DEBUG iface 0xbe4270: ah_attr dlid=1 sl=0 port=1 src_path_bits=0
[1653171612.801392] [er-head:875750:0] ib_iface.c:750 UCX DEBUG iface 0xbe4270: ah_attr dlid=1 sl=0 port=1 src_path_bits=0
[1653171612.801393] [er-head:875751:0] ib_iface.c:750 UCX DEBUG iface 0xbe4270: ah_attr dlid=1 sl=0 port=1 src_path_bits=0
[1653171612.802506] [er-head:875750:0] mpool.c:236 UCX DEBUG mpool ud_tx_skb: allocated chunk 0x7fffe2400018 of 6291432 bytes with 1489 elements
[1653171612.802825] [er-head:875751:0] mpool.c:236 UCX DEBUG mpool ud_tx_skb: allocated chunk 0x7fffe2400018 of 6291432 bytes with 1489 elements
BFD: Dwarf Error: Can't find .debug_ranges section.
[1653171612.803508] [er-head:875750:0] async.c:231 UCX DEBUG added async handler 0xfdbb50 [id=1000014 ref 1] uct_ud_iface_timer() to hash
BFD: Dwarf Error: Can't find .debug_ranges section.
[1653171612.803571] [er-head:875750:0] wireup_ep.c:317 UCX DEBUG ep 0x7ffff7e3e000: wireup_ep 0x7244a0 created aux_ep 0x9a9750 to <no debug data> using ud_mlx5/mlx5_0:1
[1653171612.803581] [er-head:875751:0] async.c:231 UCX DEBUG added async handler 0xfdbb50 [id=1000014 ref 1] uct_ud_iface_timer() to hash
[1653171612.803622] [er-head:875751:0] wireup_ep.c:317 UCX DEBUG ep 0x7ffff7e3e000: wireup_ep 0x7244a0 created aux_ep 0x9a9750 to <no debug data> using ud_mlx5/mlx5_0:1
[1653171612.808762] [er-head:875751:0] mpool.c:236 UCX DEBUG mpool rc_recv_desc: allocated chunk 0x7fffd9800018 of 39845864 bytes with 4752 elements
[1653171612.809063] [er-head:875751:0] wireup.c:1442 UCX DEBUG ep 0x7ffff7e3e000: send wireup request (flags=0x40)
[1653171612.809103] [er-head:875751:0] mpool.c:236 UCX DEBUG mpool ucp_rkeys: allocated chunk 0x97d390 of 16472 bytes with 128 elements
[1653171612.809124] [er-head:875751:0] ucp_ep.c:354 UCX DEBUG created ep 0x7ffff7e3e058 to <no debug data> from api call
[1653171612.809165] [er-head:875751:0] wireup.c:1071 UCX DEBUG ep 0x7ffff7e3e058: am_lane 0 wireup_msg_lane 0 cm_lane <none> keepalive_lane <none> reachable_mds 0x3
[1653171612.809172] [er-head:875751:0] wireup.c:1094 UCX DEBUG ep 0x7ffff7e3e058: lane[0]: 1:rc_mlx5/mlx5_0:1.0 md[0] -> addr[1].md[0]/ib/sysdev[255] amo#0 am am_bw#0 wireup
[1653171612.809629] [er-head:875750:0] mpool.c:236 UCX DEBUG mpool rc_recv_desc: allocated chunk 0x7fffd9800018 of 39845864 bytes with 4752 elements
[1653171612.809806] [er-head:875751:0] rc_ep.c:165 UCX DEBUG created rc ep 0x9a9640
[1653171612.809814] [er-head:875751:0] wireup_ep.c:543 UCX DEBUG ep 0x7ffff7e3e058: wireup_ep 0x9a2150 created next_ep 0x9a9640 to <no debug data> using rc_mlx5/mlx5_0:1
[1653171612.809826] [er-head:875751:0] ib_iface.c:750 UCX DEBUG iface 0xbe4270: ah_attr dlid=1 sl=0 port=1 src_path_bits=0
[1653171612.809833] [er-head:875751:0] ib_iface.c:750 UCX DEBUG iface 0xbe4270: ah_attr dlid=1 sl=0 port=1 src_path_bits=0
[1653171612.809840] [er-head:875751:0] ud_ep.c:404 UCX DEBUG created ep ep=0x602ff0 iface=0xbe4270 id=1
[1653171612.809849] [er-head:875751:0] ud_ep.c:508 UCX DEBUG mlx5_0:1/IB lid 1 qpn 0x8f70 epid 1 ep 0x602ff0 connected to IFACE lid 1 fe80::pkey 0xffff qpn 0x8f70
[1653171612.809854] [er-head:875751:0] ib_iface.c:750 UCX DEBUG iface 0xbe4270: ah_attr dlid=1 sl=0 port=1 src_path_bits=0
[1653171612.809859] [er-head:875751:0] ib_iface.c:750 UCX DEBUG iface 0xbe4270: ah_attr dlid=1 sl=0 port=1 src_path_bits=0
[1653171612.809868] [er-head:875751:0] wireup_ep.c:317 UCX DEBUG ep 0x7ffff7e3e058: wireup_ep 0x9a2150 created aux_ep 0x602ff0 to <no debug data> using ud_mlx5/mlx5_0:1
[1653171612.809875] [er-head:875751:0] wireup.c:1442 UCX DEBUG ep 0x7ffff7e3e058: send wireup request (flags=0x200080)
[1653171612.809992] [er-head:875750:0] wireup.c:1442 UCX DEBUG ep 0x7ffff7e3e000: send wireup request (flags=0x200080)
[1653171612.810029] [er-head:875750:a] ib_iface.c:750 UCX DEBUG iface 0xbe4270: ah_attr dlid=1 sl=0 port=1 src_path_bits=0
[1653171612.810041] [er-head:875750:a] ud_ep.c:817 UCX DEBUG simultaneous CREQ ep=0x9a9750(iface=0xbe4270 conn_sn=0 ep_id=0, dest_ep_id=0 rx_psn=1)
[1653171612.809999] [er-head:875751:a] ib_iface.c:750 UCX DEBUG iface 0xbe4270: ah_attr dlid=1 sl=0 port=1 src_path_bits=0
[1653171612.810012] [er-head:875751:a] ud_ep.c:817 UCX DEBUG simultaneous CREQ ep=0x602ff0(iface=0xbe4270 conn_sn=0 ep_id=1, dest_ep_id=1 rx_psn=1)
[1653171612.812946] [er-head:875751:a] mpool.c:236 UCX DEBUG mpool ud_recv_skb: allocated chunk 0x7fffe0e00018 of 20971496 bytes with 4964 elements
[1653171612.813329] [er-head:875750:a] mpool.c:236 UCX DEBUG mpool ud_recv_skb: allocated chunk 0x7fffe0e00018 of 20971496 bytes with 4964 elements
fff7e3e058: send wireup request (flags=0x40)
[1653171612.814394] [er-head:875751:0] mpool.c:236 UCX DEBUG mpool ucp_requests: allocated chunk 0x950334 of 41044 bytes with 128 elements
[1653171612.814399] [er-head:875750:0] mpool.c:236 UCX DEBUG mpool ucp_requests: allocated chunk 0x950334 of 41044 bytes with 128 elements
[1653171612.814425] [er-head:875751:0] flush.c:310 UCX DEBUG flush_worker ep 0x7ffff7e3e000
[1653171612.814432] [er-head:875750:0] ib_iface.c:750 UCX DEBUG iface 0xbfc2c0: ah_attr dlid=1 sl=0 port=1 src_path_bits=0
[1653171612.814436] [er-head:875751:0] flush.c:310 UCX DEBUG flush_worker ep 0x7ffff7e3e058
[1653171612.814448] [er-head:875751:0] wireup_ep.c:471 UCX DEBUG ep 0x7ffff7e3e058: destroy wireup ep 0x9a2150
[1653171612.814456] [er-head:875751:0] ucp_ep.c:1270 UCX DEBUG ep 0x7ffff7e3e058: unprogress iface 0xbe4270 ud_mlx5/mlx5_0:1
[1653171612.814463] [er-head:875751:0] ud_ep.c:1767 UCX DEBUG ep 0x602ff0: disconnect
[1653171612.814654] [er-head:875750:0] rc_mlx5_devx.c:444 UCX DEBUG connected rc devx qp 0x89ea on mlx5_0:1/IB to lid 1(+0) sl 0 remote_qp 0x89ea mtu 4096 timer 18x7 rnr 13x7 rd_atom 16
[1653171612.814670] [er-head:875750:0] flush.c:310 UCX DEBUG flush_worker ep 0x7ffff7e3e000
[1653171612.814682] [er-head:875750:0] flush.c:310 UCX DEBUG flush_worker ep 0x7ffff7e3e058
[1653171612.814688] [er-head:875750:0] wireup_ep.c:471 UCX DEBUG ep 0x7ffff7e3e000: destroy wireup ep 0x7244a0
[1653171612.814692] [er-head:875751:0] ib_iface.c:750 UCX DEBUG iface 0xbfc2c0: ah_attr dlid=1 sl=0 port=1 src_path_bits=0
[1653171612.814693] [er-head:875750:0] ucp_ep.c:1270 UCX DEBUG ep 0x7ffff7e3e000: unprogress iface 0xbe4270 ud_mlx5/mlx5_0:1
[1653171612.814697] [er-head:875750:0] ud_ep.c:1767 UCX DEBUG ep 0x9a9750: disconnect
[1653171612.814911] [er-head:875751:0] rc_mlx5_devx.c:444 UCX DEBUG connected rc devx qp 0x89e7 on mlx5_0:1/IB to lid 1(+0) sl 0 remote_qp 0x89ee mtu 4096 timer 18x7 rnr 13x7 rd_atom 16
[1653171612.814922] [er-head:875750:0] ib_iface.c:750 UCX DEBUG iface 0xbfc2c0: ah_attr dlid=1 sl=0 port=1 src_path_bits=0
[1653171612.815118] [er-head:875750:0] rc_mlx5_devx.c:444 UCX DEBUG connected rc devx qp 0x89ee on mlx5_0:1/IB to lid 1(+0) sl 0 remote_qp 0x89e7 mtu 4096 timer 18x7 rnr 13x7 rd_atom 16
[1653171612.815124] [er-head:875750:0] wireup_ep.c:471 UCX DEBUG ep 0x7ffff7e3e058: destroy wireup ep 0x9a2150
[1653171612.815128] [er-head:875750:0] ucp_ep.c:1270 UCX DEBUG ep 0x7ffff7e3e058: unprogress iface 0xbe4270 ud_mlx5/mlx5_0:1
[1653171612.815135] [er-head:875750:0] async.c:156 UCX DEBUG removed async handler 0xfdbb50 [id=1000014 ref 1] uct_ud_iface_timer() from hash
[1653171612.815139] [er-head:875750:0] async.c:546 UCX DEBUG removing async handler 0xfdbb50 [id=1000014 ref 1] uct_ud_iface_timer()
[1653171612.815150] [er-head:875750:0] async.c:171 UCX DEBUG release async handler 0xfdbb50 [id=1000014 ref 0] uct_ud_iface_timer()
[1653171612.815163] [er-head:875750:0] ud_ep.c:1767 UCX DEBUG ep 0x7fffdc000e10: disconnect
[1653171612.816615] [er-head:875750:0] mpool.c:236 UCX DEBUG mpool rc_send_desc: allocated chunk 0x7fffe0200018 of 10485736 bytes with 1260 elements
[1653171612.816655] [er-head:875751:0] wireup_ep.c:471 UCX DEBUG ep 0x7ffff7e3e000: destroy wireup ep 0x7244a0
[1653171612.816664] [er-head:875751:0] ucp_ep.c:1270 UCX DEBUG ep 0x7ffff7e3e000: unprogress iface 0xbe4270 ud_mlx5/mlx5_0:1
[1653171612.816676] [er-head:875751:0] async.c:156 UCX DEBUG removed async handler 0xfdbb50 [id=1000014 ref 1] uct_ud_iface_timer() from hash
[1653171612.816684] [er-head:875751:0] async.c:546 UCX DEBUG removing async handler 0xfdbb50 [id=1000014 ref 1] uct_ud_iface_timer()
[1653171612.816697] [er-head:875751:0] async.c:171 UCX DEBUG release async handler 0xfdbb50 [id=1000014 ref 0] uct_ud_iface_timer()
[1653171612.816716] [er-head:875751:0] ud_ep.c:1767 UCX DEBUG ep 0x9a9750: disconnect
[1653171612.817172] [er-head:875751:0] mpool.c:236 UCX DEBUG mpool rc_mlx5_atomic_desc: allocated chunk 0x7ffff7e1b018 of 135144 bytes with 1055 elements
<hang>
@tonycurtis can you pls post the hanging process backtrace and the reproducer code (a.out source)?
I'm going to investigate more myself, something weird going on
Looks like a relaxed order issue I'm not handling correctly.
Looks like relaxed-order is disabled on this cluster (admin ran query command as listed here: https://hpcadvisorycouncil.atlassian.net/wiki/spaces/HPCWORKS/pages/1280442391/AMD+2nd+Gen+EPYC+CPU+Tuning+Guide+for+InfiniBand+HPC?preview=/1280442391/1280409615/image-20200206-071308.png#Relaxed-Ordering). However, UCX sees the AMD processor, and enables relaxed-order. Override with UCX_IB_PCI_RELAXED_ORDERING=no
Describe the bug
Was using 1.12.1 on an AMD x86 cluster, then switched to github head, programs working with IB transports (rc, dc) now hang. Single node runs with e.g. xpmem still work.
Steps to Reproduce
ucx_info -v
)Setup and versions
cat /etc/issue
orcat /etc/redhat-release
+uname -a
ofed_info -s
ibstat
oribv_devinfo -vv
command