openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org
Other
1.16k stars 427 forks source link

AMD x86 cluster: git HEAD hanging now but 1.12.1 working #8216

Open tonycurtis opened 2 years ago

tonycurtis commented 2 years ago

Describe the bug

Was using 1.12.1 on an AMD x86 cluster, then switched to github head, programs working with IB transports (rc, dc) now hang. Single node runs with e.g. xpmem still work.

Steps to Reproduce

# Version 1.14.0
# Git branch 'master', revision 5f611a6
# Configured with: --prefix=/home/arcurtis/opt/x86_64/ucx/git --enable-mt --enable-cma --enable-logging --enable-frame-pointer --enable-optimizations --with-knem=/opt/knem-1.1.4.90mlnx1 --with-xpmem

Setup and versions

er-head[1](~) cat /etc/redhat-release
Red Hat Enterprise Linux release 8.4 (Ootpa)
er-head[1](~) uname -a
Linux er-head.usrc 4.18.0-348.12.2.el8_5.x86_64 #1 SMP Mon Jan 17 07:06:06 EST 2022 x86_64 x86_64 x86_64 GNU/Linux
MLNX_OFED_LINUX-5.5-1.0.3.2:
CA 'mlx5_0'
        CA type: MT4119
        Number of ports: 1
        Firmware version: 16.32.1010
        Hardware version: 0
        Node GUID: 0xb8cef603000b0fa4
        System image GUID: 0xb8cef603000b0fa4
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 100
                Base lid: 1
                LMC: 0
                SM lid: 1
                Capability mask: 0x2651e84a
                Port GUID: 0xb8cef603000b0fa4
                Link layer: InfiniBand
CA 'mlx5_1'
        CA type: MT4119
        Number of ports: 1
        Firmware version: 16.32.1010
        Hardware version: 0
        Node GUID: 0xb8cef603000b0fa5
        System image GUID: 0xb8cef603000b0fa4
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 100
                Base lid: 15
                LMC: 0
                SM lid: 1
                Capability mask: 0x2651e848
                Port GUID: 0xb8cef603000b0fa5
                Link layer: InfiniBand
yosefe commented 2 years ago

@tonycurtis can you pls prove more details on the hang including command line output and backtrace?

tonycurtis commented 2 years ago

Here's the UCX_LOG_LEVEL=info output:

[1653170535.008386] [er02:593424:0]     ucp_context.c:1855 UCX  INFO  Version 1.14.0 (loaded from /home/arcurtis/opt/x86_64/ucx/git/lib/libucp.so.0)
[1653170535.009223] [er04:601925:0]     ucp_context.c:1855 UCX  INFO  Version 1.14.0 (loaded from /home/arcurtis/opt/x86_64/ucx/git/lib/libucp.so.0)
[1653170535.020528] [er01:595504:0]     ucp_context.c:1855 UCX  INFO  Version 1.14.0 (loaded from /home/arcurtis/opt/x86_64/ucx/git/lib/libucp.so.0)
[1653170535.170537] [er04:601925:0]          parser.c:1916 UCX  INFO  UCX_* env variable: UCX_LOG_LEVEL=info
[1653170535.185026] [er02:593424:0]          parser.c:1916 UCX  INFO  UCX_* env variable: UCX_LOG_LEVEL=info
[1653170535.195224] [er01:595504:0]          parser.c:1916 UCX  INFO  UCX_* env variable: UCX_LOG_LEVEL=info
[1653170536.392986] [er03:589768:0]     ucp_context.c:1855 UCX  INFO  Version 1.14.0 (loaded from /home/arcurtis/opt/x86_64/ucx/git/lib/libucp.so.0)
[1653170536.546282] [er03:589768:0]          parser.c:1916 UCX  INFO  UCX_* env variable: UCX_LOG_LEVEL=info
[1653170536.552579] [er02:593424:0]      ucp_worker.c:1782 UCX  INFO    ep_cfg[0]: rma(rc_mlx5/mlx5_0:1) amo(rc_mlx5/mlx5_0:1)
[1653170536.553352] [er04:601925:0]      ucp_worker.c:1782 UCX  INFO    ep_cfg[0]: rma(rc_mlx5/mlx5_0:1) amo(rc_mlx5/mlx5_0:1)
[1653170536.553669] [er03:589768:0]      ucp_worker.c:1782 UCX  INFO    ep_cfg[0]: rma(rc_mlx5/mlx5_0:1) amo(rc_mlx5/mlx5_0:1)
[1653170536.554232] [er01:595504:0]      ucp_worker.c:1782 UCX  INFO    ep_cfg[0]: rma(xpmem/memory) amo(rc_mlx5/mlx5_0:1)
[1653170536.563677] [er03:589768:0]      ucp_worker.c:1782 UCX  INFO    ep_cfg[1]: rma(xpmem/memory) amo(rc_mlx5/mlx5_0:1)
[1653170536.568505] [er02:593424:0]      ucp_worker.c:1782 UCX  INFO    ep_cfg[1]: rma(xpmem/memory) amo(rc_mlx5/mlx5_0:1)
[1653170536.569692] [er01:595504:0]      ucp_worker.c:1782 UCX  INFO    ep_cfg[1]: rma(rc_mlx5/mlx5_0:1) amo(rc_mlx5/mlx5_0:1)
[1653170536.571107] [er04:601925:0]      ucp_worker.c:1782 UCX  INFO    ep_cfg[1]: rma(xpmem/memory) amo(rc_mlx5/mlx5_0:1)
<and then nothing>

No problem with 1 PE/rank, but as soon as 2 with rc/dc in the mix, hangs

tonycurtis commented 2 years ago

Here's the UCX_LOG_LEVEL=debug output for 2 PEs:

$ UCX_TLS=rc oshrun -n 2 ./a.out

[1653171612.558177] [er-head:875750:0]           debug.c:1146 UCX  DEBUG using signal stack 0x7ffff7fbb000 size 141824
[1653171612.558821] [er-head:875751:0]           debug.c:1146 UCX  DEBUG using signal stack 0x7ffff7fbb000 size 141824
[1653171612.571083] [er-head:875750:0]             cpu.c:338  UCX  DEBUG measured tsc frequency 3092.653 MHz after 1.00 ms
[1653171612.571106] [er-head:875750:0]            init.c:118  UCX  DEBUG /home/arcurtis/opt/x86_64/ucx/git/lib/libucs.so.0 loaded at 0x7ffff5af7000
[1653171612.571124] [er-head:875750:0]            init.c:120  UCX  DEBUG cmd line: ./a.out
[1653171612.571138] [er-head:875750:0]          module.c:72   UCX  DEBUG ucs library path: /home/arcurtis/opt/x86_64/ucx/git/lib/libucs.so.0
[1653171612.571146] [er-head:875750:0]          module.c:280  UCX  DEBUG loading modules for ucs
[1653171612.572388] [er-head:875751:0]             cpu.c:338  UCX  DEBUG measured tsc frequency 3093.330 MHz after 0.33 ms
[1653171612.572408] [er-head:875751:0]            init.c:118  UCX  DEBUG /home/arcurtis/opt/x86_64/ucx/git/lib/libucs.so.0 loaded at 0x7ffff5af7000
[1653171612.572422] [er-head:875751:0]            init.c:120  UCX  DEBUG cmd line: ./a.out
[1653171612.572433] [er-head:875751:0]          module.c:72   UCX  DEBUG ucs library path: /home/arcurtis/opt/x86_64/ucx/git/lib/libucs.so.0
[1653171612.572439] [er-head:875751:0]          module.c:280  UCX  DEBUG loading modules for ucs
[1653171612.580809] [er-head:875750:0]            time.c:22   UCX  DEBUG arch clock frequency: 3092653000.00 Hz
[1653171612.580812] [er-head:875751:0]            time.c:22   UCX  DEBUG arch clock frequency: 3093330303.03 Hz
[1653171612.580849] [er-head:875750:0]     ucp_context.c:1855 UCX  INFO  Version 1.14.0 (loaded from /home/arcurtis/opt/x86_64/ucx/git/lib/libucp.so.0)
[1653171612.580857] [er-head:875750:0]     ucp_context.c:1636 UCX  DEBUG estimated number of endpoints is 2
[1653171612.580861] [er-head:875750:0]     ucp_context.c:1643 UCX  DEBUG estimated number of endpoints per node is 2
[1653171612.580865] [er-head:875750:0]     ucp_context.c:1650 UCX  DEBUG estimated bcopy bandwidth is 5251268608.000000
[1653171612.580848] [er-head:875751:0]     ucp_context.c:1855 UCX  INFO  Version 1.14.0 (loaded from /home/arcurtis/opt/x86_64/ucx/git/lib/libucp.so.0)
[1653171612.580856] [er-head:875751:0]     ucp_context.c:1636 UCX  DEBUG estimated number of endpoints is 2
[1653171612.580859] [er-head:875751:0]     ucp_context.c:1643 UCX  DEBUG estimated number of endpoints per node is 2
[1653171612.580865] [er-head:875751:0]     ucp_context.c:1650 UCX  DEBUG estimated bcopy bandwidth is 5251268608.000000
[1653171612.580875] [er-head:875750:0]     ucp_context.c:1704 UCX  DEBUG allocation method[0] is md 'sysv'
[1653171612.580880] [er-head:875750:0]     ucp_context.c:1704 UCX  DEBUG allocation method[1] is md 'posix'
[1653171612.580886] [er-head:875750:0]     ucp_context.c:1716 UCX  DEBUG allocation method[2] is 'huge'
[1653171612.580890] [er-head:875750:0]     ucp_context.c:1716 UCX  DEBUG allocation method[3] is 'thp'
[1653171612.580893] [er-head:875750:0]     ucp_context.c:1704 UCX  DEBUG allocation method[4] is md '*'
[1653171612.580875] [er-head:875751:0]     ucp_context.c:1704 UCX  DEBUG allocation method[0] is md 'sysv'
[1653171612.580878] [er-head:875751:0]     ucp_context.c:1704 UCX  DEBUG allocation method[1] is md 'posix'
[1653171612.580885] [er-head:875751:0]     ucp_context.c:1716 UCX  DEBUG allocation method[2] is 'huge'
[1653171612.580889] [er-head:875751:0]     ucp_context.c:1716 UCX  DEBUG allocation method[3] is 'thp'
[1653171612.580891] [er-head:875751:0]     ucp_context.c:1704 UCX  DEBUG allocation method[4] is md '*'
[1653171612.580895] [er-head:875751:0]     ucp_context.c:1716 UCX  DEBUG allocat modules for uct_ib
[1653171612.583846] [er-head:875750:0]        mm_xpmem.c:90   UCX  DEBUG xpmem version: 155653
[1653171612.583861] [er-head:875751:0]        mm_xpmem.c:90   UCX  DEBUG xpmem version: 155653
[1653171612.583901] [er-head:875750:0]     ucp_context.c:1370 UCX  DEBUG closing md self because it has no selected transport resources
[1653171612.583903] [er-head:875751:0]     ucp_context.c:1370 UCX  DEBUG closing md self because it has no selected transport resources
[1653171612.584355] [er-head:875750:0]     ucp_context.c:1370 UCX  DEBUG closing md tcp because it has no selected transport resources
[1653171612.584355] [er-head:875751:0]     ucp_context.c:1370 UCX  DEBUG closing md tcp because it has no selected transport resources
[1653171612.584385] [er-head:875751:0]     ucp_context.c:1370 UCX  DEBUG closing md sysv because it has no selected transport resources
[1653171612.584392] [er-head:875750:0]     ucp_context.c:1370 UCX  DEBUG closing md sysv because it has no selected transport resources
[1653171612.584433] [er-head:875751:0]     ucp_context.c:1370 UCX  DEBUG closing md posix because it has no selected transport resources
[1653171612.584441] [er-head:875750:0]     ucp_context.c:1370 UCX  DEBUG closing md posix because it has no selected transport resources
[1653171612.590555] [er-head:875750:0]       ib_device.c:615  UCX  DEBUG mlx5_0: PF sysfs path is '/sys/devices/pci0000:40/0000:40:01.1/0000:41:00.0'
[1653171612.590569] [er-head:875750:0]            topo.c:175  UCX  DEBUG added sys_dev 0 for bus id 41:00.0
[1653171612.590576] [er-head:875750:0]       ib_device.c:646  UCX  DEBUG mlx5_0: bdf_name 0000:41:00.0 sys_dev 0
[1653171612.590599] [er-head:875750:0]       ib_device.c:670  UCX  DEBUG mlx5_0: vendor_id 0x15b3 device_id 4119
[1653171612.590958] [er-head:875750:0]    ib_mlx5dv_md.c:710  UCX  DEBUG mlx5_0: disable ODP because it's not supported for DEVX QP
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
[1653171612.593199] [er-head:875750:0]           async.c:231  UCX  DEBUG added async handler 0x670020 [id=10 ref 1] uct_ib_async_event_handler() to hash
[1653171612.593294] [er-head:875750:0]           async.c:493  UCX  DEBUG listening to async event fd 10 events 0x1 mode thread_spinlock
[1653171612.593307] [er-head:875750:0]       ib_device.c:834  UCX  DEBUG initialized device 'mlx5_0' (InfiniBand channel adapter) with 1 ports
[1653171612.593510] [er-head:875750:0]           ib_md.c:1592 UCX  DEBUG mlx5_0: cuda GPUDirect RDMA is disabled
[1653171612.593517] [er-head:875750:0]           ib_md.c:1592 UCX  DEBUG mlx5_0: rocm GPUDirect RDMA is disabled
[1653171612.593535] [er-head:875750:0]           mpool.c:98   UCX  DEBUG mpool rcache_mp: align 8, maxelems 4294967295, elemsize 144
[1653171612.599511] [er-head:875751:0]       ib_device.c:615  UCX  DEBUG mlx5_0: PF sysfs path is '/sys/devices/pci0000:40/0000:40:01.1/0000:41:00.0'
[1653171612.599535] [er-head:875751:0]            topo.c:175  UCX  DEBUG added sys_dev 0 for bus id 41:00.0
[1653171612.599540] [er-head:875751:0]       ib_device.c:646  UCX  DEBUG mlx5_0: bdf_name 0000:41:00.0 sys_dev 0
[1653171612.599561] [er-head:875751:0]       ib_device.c:670  UCX  DEBUG mlx5_0: vendor_id 0x15b3 device_id 4119
[1653171612.599871] [er-head:875751:0]    ib_mlx5dv_md.c:710  UCX  DEBUG mlx5_0: disable ODP because it's not supported for DEVX QP
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
[1653171612.602034] [er-head:875751:0]           async.c:231  UCX  DEBUG added async handler 0x670020 [id=10 ref 1] uct_ib_async_event_handler() to hash
[1653171612.602137] [er-head:875751:0]           async.c:493  UCX  DEBUG listening to async event fd 10 events 0x1 mode thread_spinlock
[1653171612.602149] [er-head:875751:0]       ib_device.c:834  UCX  DEBUG initialized device 'mlx5_0' (InfiniBand channel adapter) with 1 ports
[1653171612.602271] [er-head:875751:0]           ib_md.c:1592 UCX  DEBUG mlx5_0: cuda GPUDirect RDMA is disabled
[1653171612.602278] [er-head:875751:0]           ib_md.c:1592 UCX  DEBUG mlx5_0: rocm GPUDirect RDMA is disabled
[1653171612.602293] [er-head:875751:0]           mpool.c:98   UCX  DEBUG mpool rcache_mp: align 8, maxelems 4294967295, elemsize 144
[1653171612.742054] [er-head:875750:0]           async.c:231  UCX  DEBUG added async handler 0x671520 [id=14 ref 1] ucs_rcache_invalidate_handler() to hash
[1653171612.742082] [er-head:875750:0]           async.c:493  UCX  DEBUG listening to async event fd 14 events 0x1 mode thread_spinlock
[1653171612.742224] [er-head:875750:0]          module.c:280  UCX  DEBUG loading modules for ucm
[1653171612.742239] [er-head:875750:0]           ib_md.c:1314 UCX  DEBUG mlx5_0: using registration cache
[1653171612.742247] [er-head:875750:0]           mpool.c:98   UCX  DEBUG mpool devx dbrec: align 64, maxelems 4294967295, elemsize 40
[1653171612.742545] [er-head:875750:0]    ib_mlx5dv_md.c:1065 UCX  DEBUG mlx5_0: opened DEVX md
[1653171612.742550] [er-head:875750:0]           ib_md.c:1541 UCX  DEBUG mlx5_0: md open by 'uct_ib_mlx5_devx_md_ops' is successful
[1653171612.748182] [er-head:875750:0]       ib_device.c:615  UCX  DEBUG mlx5_1: PF sysfs path is '/sys/devices/pci0000:40/0000:40:01.1/0000:41:00.1'
[1653171612.748192] [er-head:875750:0]            topo.c:175  UCX  DEBUG added sys_dev 1 for bus id 41:00.1
[1653171612.748198] [er-head:875750:0]       ib_device.c:646  UCX  DEBUG mlx5_1: bdf_name 0000:41:00.1 sys_dev 1
[1653171612.748218] [er-head:875750:0]       ib_device.c:670  UCX  DEBUG mlx5_1: vendor_id 0x15b3 device_id 4119
[1653171612.748543] [er-head:875750:0]    ib_mlx5dv_md.c:710  UCX  DEBUG mlx5_1: disable ODP because it's not supported for DEVX QP
[1653171612.748722] [er-head:875750:0]           async.c:231  UCX  DEBUG added async handler 0x71f7e0 [id=17 ref 1] uct_ib_async_event_handler() to hash
[1653171612.748732] [er-head:875750:0]           async.c:493  UCX  DEBUG listening to async event fd 17 events 0x1 mode thread_spinlock
[1653171612.748738] [er-head:875750:0]       ib_device.c:834  UCX  DEBUG initialized device 'mlx5_1' (InfiniBand channel adapter) with 1 ports
[1653171612.748878] [er-head:875750:0]           ib_md.c:1592 UCX  DEBUG mlx5_1: cuda GPUDirect RDMA is disabled
[1653171612.748884] [er-head:875750:0]           ib_md.c:1592 UCX  DEBUG mlx5_1: rocm GPUDirect RDMA is disabled
[1653171612.748890] [er-head:875750:0]           mpool.c:98   UCX  DEBUG mpool rcache_mp: align 8, maxelems 4294967295, elemsize 144
[1653171612.748965] [er-head:875750:0]           ib_md.c:1314 UCX  DEBUG mlx5_1: using registration cache
[1653171612.748970] [er-head:875750:0]           mpool.c:98   UCX  DEBUG mpool devx dbrec: align 64, maxelems 4294967295, elemsize 40
[1653171612.749151] [er-head:875750:0]    ib_mlx5dv_md.c:1065 UCX  DEBUG mlx5_1: opened DEVX md
[1653171612.749155] [er-head:875750:0]           ib_md.c:1541 UCX  DEBUG mlx5_1: md open by 'uct_ib_mlx5_devx_md_ops' is successful
[1653171612.750308] [er-head:875750:0]     ucp_context.c:1370 UCX  DEBUG closing md cma because it has no selected transport resources
[1653171612.750345] [er-head:875750:0]           mpool.c:98   UCX  DEBUG mpool rcache_mp: align 8, maxelems 4294967295, elemsize 144
[1653171612.750422] [er-head:875750:0]     ucp_context.c:1370 UCX  DEBUG closing md knem because it has no selected transport resources
[1653171612.750441] [er-head:875750:0]         pgtable.c:618  UCX  DEBUG purge empty page table
[1653171612.750447] [er-head:875750:0]           mpool.c:154  UCX  DEBUG mpool rcache_mp destroyed
[1653171612.750459] [er-head:875750:0]        mm_xpmem.c:90   UCX  DEBUG xpmem version: 155653
[1653171612.750491] [er-head:875750:0]     ucp_context.c:1370 UCX  DEBUG closing md xpmem because it has no selected transport resources
[1653171612.750511] [er-head:875750:0]           mpool.c:98   UCX  DEBUG mpool rcache_mp: align 8, maxelems 4294967295, elemsize 144
[1653171612.750566] [er-head:875750:0]     ucp_context.c:1924 UCX  DEBUG created ucp context osss-ucx 0x674a20 [2 mds 8 tls] features 0xe tl bitmap 0xff 0x0
[1653171612.750586] [er-head:875750:0]           mpool.c:236  UCX  DEBUG mpool rcache_mp: allocated chunk 0x7ffff7f51008 of 151544 bytes with 1052 elements
[1653171612.750636] [er-head:875750:0]           mpool.c:236  UCX  DEBUG mpool rcache_mp: allocated chunk 0x7ffff7f2c008 of 151544 bytes with 1052 elements
[1653171612.750794] [er-head:875751:0]           async.c:231  UCX  DEBUG added async handler 0x671520 [id=14 ref 1] ucs_rcache_invalidate_handler() to hash
[1653171612.750813] [er-head:875751:0]           async.c:493  UCX  DEBUG listening to async event fd 14 events 0x1 mode thread_spinlock
[1653171612.750921] [er-head:875751:0]          module.c:280  UCX  DEBUG loading modules for ucm
[1653171612.750931] [er-head:875751:0]           ib_md.c:1314 UCX  DEBUG mlx5_0: using registration cache
[1653171612.750938] [er-head:875751:0]           mpool.c:98   UCX  DEBUG mpool devx dbrec: align 64, maxelems 4294967295, elemsize 40
[1653171612.751196] [er-head:875751:0]    ib_mlx5dv_md.c:1065 UCX  DEBUG mlx5_0: opened DEVX md
[1653171612.751201] [er-head:875751:0]           ib_md.c:1541 UCX  DEBUG mlx5_0: md open by 'uct_ib_mlx5_devx_md_ops' is successful
[1653171612.751407] [er-head:875750:0]           mpool.c:236  UCX  DEBUG mpool rcache_mp: allocated chunk 0x7ffff7f07008 of 151544 bytes with 1052 elements
[1653171612.751762] [er-head:875750:0]         uct_mem.c:293  UCX  DEBUG   could not allocate memory with any of the provided methods
[1653171612.751767] [er-head:875750:0]         uct_mem.c:293  UCX  DEBUG   could not allocate memory with any of the provided methods
[1653171612.751827] [er-head:875750:0]         uct_mem.c:293  UCX  DEBUG   could not allocate memory with any of the provided methods
[1653171612.755655] [er-head:875751:0]       ib_device.c:615  UCX  DEBUG mlx5_1: PF sysfs path is '/sys/devices/pci0000:40/0000:40:01.1/0000:41:00.1'
[1653171612.755680] [er-head:875751:0]            topo.c:175  UCX  DEBUG added sys_dev 1 for bus id 41:00.1
[1653171612.755686] [er-head:875751:0]       ib_device.c:646  UCX  DEBUG mlx5_1: bdf_name 0000:41:00.1 sys_dev 1
[1653171612.755709] [er-head:875751:0]       ib_device.c:670  UCX  DEBUG mlx5_1: vendor_id 0x15b3 device_id 4119
[1653171612.756022] [er-head:875751:0]    ib_mlx5dv_md.c:710  UCX  DEBUG mlx5_1: disable ODP because it's not supported for DEVX QP
[1653171612.756206] [er-head:875751:0]           async.c:231  UCX  DEBUG added async handler 0x71f7e0 [id=17 ref 1] uct_ib_async_event_handler() to hash
[1653171612.756224] [er-head:875751:0]           async.c:493  UCX  DEBUG listening to async event fd 17 events 0x1 mode thread_spinlock
[1653171612.756230] [er-head:875751:0]       ib_device.c:834  UCX  DEBUG initialized device 'mlx5_1' (InfiniBand channel adapter) with 1 ports
[1653171612.756354] [er-head:875751:0]           ib_md.c:1592 UCX  DEBUG mlx5_1: cuda GPUDirect RDMA is disabled
[1653171612.756362] [er-head:875751:0]           ib_md.c:1592 UCX  DEBUG mlx5_1: rocm GPUDirect RDMA is disabled
[1653171612.756372] [er-head:875751:0]           mpool.c:98   UCX  DEBUG mpool rcache_mp: align 8, maxelems 4294967295, elemsize 144
[1653171612.756500] [er-head:875751:0]           ib_md.c:1314 UCX  DEBUG mlx5_1: using registration cache
[1653171612.756506] [er-head:875751:0]           mpool.c:98   UCX  DEBUG mpool devx dbrec: align 64, maxelems 4294967295, elemsize 40
[1653171612.756687] [er-head:875751:0]    ib_mlx5dv_md.c:1065 UCX  DEBUG mlx5_1: opened DEVX md
[1653171612.756692] [er-head:875751:0]           ib_md.c:1541 UCX  DEBUG mlx5_1: md open by 'uct_ib_mlx5_devx_md_ops' is successful
[1653171612.757893] [er-head:875751:0]     ucp_context.c:1370 UCX  DEBUG closing md cma because it has no selected transport resources
[1653171612.757953] [er-head:875751:0]           mpool.c:98   UCX  DEBUG mpool rcache_mp: align 8, maxelems 4294967295, elemsize 144
[1653171612.758057] [er-head:875751:0]     ucp_context.c:1370 UCX  DEBUG closing md knem because it has no selected transport resources
[1653171612.758079] [er-head:875751:0]         pgtable.c:618  UCX  DEBUG purge empty page table
[1653171612.758089] [er-head:875751:0]           mpool.c:154  UCX  DEBUG mpool rcache_mp destroyed
[1653171612.758102] [er-head:875751:0]        mm_xpmem.c:90   UCX  DEBUG xpmem version: 155653
[1653171612.758141] [er-head:875751:0]     ucp_context.c:1370 UCX  DEBUG closing md xpmem because it has no selected transport resources
[1653171612.758164] [er-head:875751:0]           mpool.c:98   UCX  DEBUG mpool rcache_mp: align 8, maxelems 4294967295, elemsize 144
[1653171612.758223] [er-head:875751:0]     ucp_context.c:1924 UCX  DEBUG created ucp context osss-ucx 0x674a20 [2 mds 8 tls] features 0xe tl bitmap 0xff 0x0
[1653171612.758249] [er-head:875751:0]           mpool.c:236  UCX  DEBUG mpool rcache_mp: allocated chunk 0x7ffff7f51008 of 151544 bytes with 1052 elements
[1653171612.758305] [er-head:875751:0]           mpool.c:236  UCX  DEBUG mpool rcache_mp: allocated chunk 0x7ffff7f2c008 of 151544 bytes with 1052 elements
[1653171612.758725] [er-head:875751:0]           mpool.c:236  UCX  DEBUG mpool rcache_mp: allocated chunk 0x7ffff7f07008 of 151544 bytes with 1052 elements
[1653171612.759088] [er-head:875751:0]         uct_mem.c:293  UCX  DEBUG   could not allocate memory with any of the provided methods
[1653171612.759093] [er-head:875751:0]         uct_mem.c:293  UCX  DEBUG   could not allocate memory with any of the provided methods
[1653171612.759170] [er-head:875751:0]         uct_mem.c:293  UCX  DEBUG   could not allocate memory with any of the provided methods
[1653171612.768121] [er-head:875750:0]        ib_iface.c:864  UCX  DEBUG using pkey[0] 0xffff on mlx5_0:1/IB
[1653171612.769082] [er-head:875750:0]        ib_iface.c:1372 UCX  DEBUG created uct_ib_iface_t headroom_ofs 12 payload_ofs 92 hdr_ofs 91 data_sz 8256
[1653171612.769138] [er-head:875750:0]           mpool.c:98   UCX  DEBUG mpool rc_recv_desc: align 64, maxelems 4294967295, elemsize 8356
[1653171612.769144] [er-head:875750:0]           mpool.c:98   UCX  DEBUG mpool rc_send_desc: align 64, maxelems 4294967295, elemsize 8320
[1653171612.769171] [er-head:875750:0]           mpool.c:98   UCX  DEBUG mpool send-ops-mpool: align 64, maxelems 4294967295, elemsize 48
[1653171612.769857] [er-head:875750:0]           mpool.c:98   UCX  DEBUG mpool pending-ops: align 1, maxelems 4294967295, elemsize 64
[1653171612.769869] [er-head:875750:0]           mpool.c:98   UCX  DEBUG mpool rc_verbs_short_desc: align 64, maxelems 4294967295, elemsize 192
[1653171612.770269] [er-head:875750:0]        ib_iface.c:982  UCX  DEBUG iface=0x987ee0: created RC QP 0x89e2 on mlx5_0:1 TX wr:409 sge:5 inl:124 resp:64 RX wr:0 sge:0 resp:64
[1653171612.770779] [er-head:875750:0]      ucp_worker.c:1327 UCX  DEBUG created interface[0]=0x987ee0 using rc_verbs/mlx5_0:1 on worker 0x684f10
[1653171612.770980] [er-head:875750:0]        ib_iface.c:864  UCX  DEBUG using pkey[0] 0xffff on mlx5_0:1/IB
[1653171612.771014] [er-head:875750:0]       ib_device.c:1465 UCX  DEBUG max IB CQE size is 128
[1653171612.771918] [er-head:875750:0]        ib_iface.c:1372 UCX  DEBUG created uct_ib_iface_t headroom_ofs 12 payload_ofs 92 hdr_ofs 90 data_sz 8256
[1653171612.771936] [er-head:875750:0]           mpool.c:98   UCX  DEBUG mpool rc_recv_desc: align 64, maxelems 4294967295, elemsize 8356
[1653171612.771941] [er-head:875750:0]           mpool.c:98   UCX  DEBUG mpool rc_send_desc: align 64, maxelems 4294967295, elemsize 8320
[1653171612.771965] [er-head:875750:0]           mpool.c:98   UCX  DEBUG mpool send-ops-mpool: align 64, maxelems 4294967295, elemsize 48
[1653171612.772369] [er-head:875750:0]           mpool.c:236  UCX  DEBUG mpool devx dbrec: allocated chunk 0x721010 of 8176 bytes with 127 elements
[1653171612.772642] [er-head:875750:0]           mpool.c:98   UCX  DEBUG mpool pending-ops: align 1, maxelems 4294967295, elemsize 64
[1653171612.772812] [er-head:875750:0]         ib_mlx5.c:996  UCX  DEBUG SL=0 (AR support - no) was selected on mlx5_0:1, SLs with AR support = { <none> }, SLs without AR support = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 }
d_recv_skb: allocated chunk 0x7ffff7e81018 of 544744 bytes with 128 elements
[1653171612.778822] [er-head:875750:0]           mpool.c:98   UCX  DEBUG mpool ud_tx_skb: align 64, maxelems 4294967295, elemsize 4168
[1653171612.778843] [er-head:875750:0]        ud_iface.c:420  UCX  DEBUG iface 0xaa6020: adding gid fe80::b8ce:f603:b:fa4 to hash on device mlx5_0 port 1 index 0)
[1653171612.778859] [er-head:875750:0]        ud_iface.c:420  UCX  DEBUG iface 0xaa6020: adding gid fe80:: to hash on device mlx5_0 port 1 index 1)
[1653171612.778869] [er-head:875750:0]        ud_iface.c:420  UCX  DEBUG iface 0xaa6020: adding gid fe80:: to hash on device mlx5_0 port 1 index 2)
[1653171612.778879] [er-head:875750:0]        ud_iface.c:420  UCX  DEBUG iface 0xaa6020: adding gid fe80:: to hash on device mlx5_0 port 1 index 3)
[1653171612.778889] [er-head:875750:0]        ud_iface.c:420  UCX  DEBUG iface 0xaa6020: adding gid fe80:: to hash on device mlx5_0 port 1 index 4)
[1653171612.778901] [er-head:875750:0]        ud_iface.c:420  UCX  DEBUG iface 0xaa6020: adding gid fe80:: to hash on device mlx5_0 port 1 index 5)
[1653171612.778911] [er-head:875750:0]        ud_iface.c:420  UCX  DEBUG iface 0xaa6020: adding gid fe80:: to hash on device mlx5_0 port 1 index 6)
[1653171612.778921] [er-head:875750:0]        ud_iface.c:420  UCX  DEBUG iface 0xaa6020: adding gid fe80:: to hash on device mlx5_0 port 1 index 7)
[1653171612.778949] [er-head:875751:0]        ib_iface.c:1372 UCX  DEBUG created uct_ib_iface_t headroom_ofs 12 payload_ofs 92 hdr_ofs 90 data_sz 8256
[1653171612.778957] [er-head:875751:0]           mpool.c:98   UCX  DEBUG mpool rc_recv_desc: align 64, maxelems 4294967295, elemsize 8356
[1653171612.778961] [er-head:875751:0]           mpool.c:98   UCX  DEBUG mpool rc_send_desc: align 64, maxelems 4294967295, elemsize 8320
[1653171612.778981] [er-head:875751:0]           mpool.c:98   UCX  DEBUG mpool send-ops-mpool: align 64, maxelems 4294967295, elemsize 48
[1653171612.779176] [er-head:875750:0]     timer_wheel.c:40   UCX  DEBUG high res timer created log=23 resolution=2712.431042 usec wanted: 2499.999838 usec
[1653171612.779369] [er-head:875751:0]           mpool.c:236  UCX  DEBUG mpool devx dbrec: allocated chunk 0x721010 of 8176 bytes with 127 elements
[1653171612.779573] [er-head:875751:0]           mpool.c:98   UCX  DEBUG mpool pending-ops: align 1, maxelems 4294967295, elemsize 64
[1653171612.779779] [er-head:875751:0]         ib_mlx5.c:996  UCX  DEBUG SL=0 (AR support - no) was selected on mlx5_0:1, SLs with AR support = { <none> }, SLs without AR support = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 }
BFD: Dwarf Error: Can't find .debug_ranges section.
[1653171612.780005] [er-head:875750:0]           async.c:231  UCX  DEBUG added async handler 0x9ab700 [id=21 ref 1] uct_ud_iface_async_handler() to hash
[1653171612.780020] [er-head:875750:0]           async.c:493  UCX  DEBUG listening to async event fd 21 events 0x5 mode thread_spinlock
[1653171612.780040] [er-head:875750:0]      ucp_worker.c:1327 UCX  DEBUG created interface[2]=0xaa6020 using ud_verbs/mlx5_0:1 on worker 0x684f10
[1653171612.780144] [er-head:875750:0]        ib_iface.c:864  UCX  DEBUG using pkey[0] 0xffff on mlx5_0:1/IB
[1653171612.780469] [er-head:875751:0]           mpool.c:98   UCX  DEBUG mpool mlx5_dm_desc: align 64, maxelems 1, elemsize 64
[1653171612.780475] [er-head:875751:0]           mpool.c:98   UCX  DEBUG mpool rc_mlx5_atomic_desc: align 64, maxelems 4294967295, elemsize 72
[1653171612.780734] [er-head:875750:0]        ib_iface.c:1372 UCX  DEBUG created uct_ib_iface_t headroom_ofs 12 payload_ofs 92 hdr_ofs 44 data_sz 4096
[1653171612.781282] [er-head:875750:0]        ib_iface.c:982  UCX  DEBUG iface=0xbe4270: created UD QP 0x8f6e on mlx5_0:1 TX wr:341 sge:6 inl:124 resp:0 RX wr:4096 sge:1 resp:0
[1653171612.781292] [er-head:875750:0]         ib_mlx5.c:676  UCX  DEBUG tx wq 65536 bytes [bb=64, nwqe=1024] mmio_mode bf_post_mt
BFD: Dwarf Error: Can't find .debug_ranges section.
[1653171612.781623] [er-head:875751:0]           async.c:231  UCX  DEBUG added async handler 0x683750 [id=20 ref 1] uct_rc_mlx5_devx_iface_event_handler() to hash
[1653171612.781638] [er-head:875751:0]           async.c:493  UCX  DEBUG listening to async event fd 20 events 0x1 mode thread_spinlock
[1653171612.781661] [er-head:875751:0]      ucp_worker.c:1327 UCX  DEBUG created interface[1]=0xbfc2c0 using rc_mlx5/mlx5_0:1 on worker 0x684f10
[1653171612.781730] [er-head:875750:0]           mpool.c:98   UCX  DEBUG mpool ud_recv_skb: align 64, maxelems 4294967295, elemsize 4196
[1653171612.781771] [er-head:875751:0]        ib_iface.c:864  UCX  DEBUG using pkey[0] 0xffff on mlx5_0:1/IB
[1653171612.782269] [er-head:875750:0]           mpool.c:236  UCX  DEBUG mpool ud_recv_skb: allocated chunk 0x7ffff00d5018 of 544744 bytes with 128 elements
[1653171612.782275] [er-head:875750:0]           mpool.c:98   UCX  DEBUG mpool ud_tx_skb: align 64, maxelems 4294967295, elemsize 4168
[1653171612.782297] [er-head:875750:0]        ud_iface.c:420  UCX  DEBUG iface 0xbe4270: adding gid fe80::b8ce:f603:b:fa4 to hash on device mlx5_0 port 1 index 0)
[1653171612.782312] [er-head:875750:0]        ud_iface.c:420  UCX  DEBUG iface 0xbe4270: adding gid fe80:: to hash on device mlx5_0 port 1 index 1)
[1653171612.782326] [er-head:875750:0]        ud_iface.c:420  UCX  DEBUG iface 0xbe4270: adding gid fe80:: to hash on device mlx5_0 port 1 index 2)
[1653171612.782339] [er-head:875750:0]        ud_iface.c:420  UCX  DEBUG iface 0xbe4270: adding gid fe80:: to hash on device mlx5_0 port 1 index 3)
[1653171612.782353] [er-head:875750:0]        ud_iface.c:420  UCX  DEBUG iface 0xbe4270: adding gid fe80:: to hash on device mlx5_0 port 1 index 4)
[1653171612.782366] [er-head:875750:0]        ud_iface.c:420  UCX  DEBUG iface 0xbe4270: adding gid fe80:: to hash on device mlx5_0 port 1 index 5)
[1653171612.782380] [er-head:875750:0]        ud_iface.c:420  UCX  DEBUG iface 0xbe4270: adding gid fe80:: to hash on device mlx5_0 port 1 index 6)
[1653171612.782399] [er-head:875751:0]        ib_iface.c:1372 UCX  DEBUG created uct_ib_iface_t headroom_ofs 12 payload_ofs 92 hdr_ofs 44 data_sz 4096
[1653171612.782393] [er-head:875750:0]        ud_iface.c:420  UCX  DEBUG iface 0xbe4270: adding gid fe80:: to hash on device mlx5_0 port 1 index 7)
[1653171612.782521] [er-head:875750:0]         ib_mlx5.c:996  UCX  DEBUG SL=0 (AR support - no) was selected on mlx5_0:1, SLs with AR support = { <none> }, SLs without AR support = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 }
[1653171612.782574] [er-head:875750:0]     timer_wheel.c:40   UCX  DEBUG high res timer created log=23 resolution=2712.431042 usec wanted: 2499.999838 usec
[1653171612.782579] [er-head:875750:0]           async.c:231  UCX  DEBUG added async handler 0x71d4c0 [id=22 ref 1] uct_ud_iface_async_handler() to hash
[1653171612.782593] [er-head:875750:0]           async.c:493  UCX  DEBUG listening to async event fd 22 events 0x5 mode thread_spinlock
[1653171612.782610] [er-head:875750:0]      ucp_worker.c:1327 UCX  DEBUG created interface[3]=0xbe4270 using ud_mlx5/mlx5_0:1 on worker 0x684f10
[1653171612.782737] [er-head:875750:0]        ib_iface.c:864  UCX  DEBUG using pkey[0] 0xffff on mlx5_1:1/IB
[1653171612.782811] [er-head:875751:0]        ib_iface.c:982  UCX  DEBUG iface=0xaa6020: created UD QP 0x8f6f on mlx5_0:1 TX wr:341 sge:6 inl:124 resp:0 RX wr:4096 sge:1 resp:0
[1653171612.783222] [er-head:875751:0]           mpool.c:98   UCX  DEBUG mpool ud_recv_skb: align 64, maxelems 4294967295, elemsize 4196
[1653171612.783463] [er-head:875750:0]        ib_iface.c:1372 UCX  DEBUG created uct_ib_iface_t headroom_ofs 12 payload_ofs 92 hdr_ofs 91 data_sz 8256
[1653171612.783471] [er-head:875750:0]           mpool.c:98   UCX  DEBUG mpool rc_recv_desc: align 64, maxelems 4294967295, elemsize 8356
[1653171612.783476] [er-head:875750:0]           mpool.c:98   UCX  DEBUG mpool rc_send_desc: align 64, maxelems 4294967295, elemsize 8320
[1653171612.783496] [er-head:875750:0]           mpool.c:98   UCX  DEBUG mpool send-ops-mpool: align 64, maxelems 4294967295, elemsize 48
[1653171612.783749] [er-head:875751:0]           mpool.c:236  UCX  DEBUG mpool ud_recv_skb: allocated chunk 0x7ffff7e81018 of 544744 bytes with 128 elements
[1653171612.783755] [er-head:875751:0]           mpool.c:98   UCX  DEBUG mpool ud_tx_skb: align 64, maxelems 4294967295, elemsize 4168
[1653171612.783784] [er-head:875751:0]        ud_iface.c:420  UCX  DEBUG iface 0xaa6020: adding gid fe80::b8ce:f603:b:fa4 to hash on device mlx5_0 port 1 index 0)
[1653171612.783804] [er-head:875751:0]        ud_iface.c:420  UCX  DEBUG iface 0xaa6020: adding gid fe80:: to hash on device mlx5_0 port 1 index 1)
[1653171612.783818] [er-head:875751:0]        ud_iface.c:420  UCX  DEBUG iface 0xaa6020: adding gid fe80:: to hash on device mlx5_0 port 1 index 2)
[1653171612.783832] [er-head:875751:0]        ud_iface.c:420  UCX  DEBUG iface 0xaa6020: adding gid fe80:: to hash on device mlx5_0 port 1 index 3)
[1653171612.783846] [er-head:875751:0]        ud_iface.c:420  UCX  DEBUG iface 0xaa6020: adding gid fe80:: to hash on device mlx5_0 port 1 index 4)
[1653171612.783859] [er-head:875751:0]        ud_iface.c:420  UCX  DEBUG iface 0xaa6020: adding gid fe80:: to hash on device mlx5_0 port 1 index 5)
[1653171612.783873] [er-head:875751:0]        ud_iface.c:420  UCX  DEBUG iface 0xaa6020: adding gid fe80:: to hash on device mlx5_0 port 1 index 6)
[1653171612.783887] [er-head:875751:0]        ud_iface.c:420  UCX  DEBUG iface 0xaa6020: adding gid fe80:: to hash on device mlx5_0 port 1 index 7)
[1653171612.783908] [er-head:875750:0]           mpool.c:98   UCX  DEBUG mpool pending-ops: align 1, maxelems 4294967295, elemsize 64
[1653171612.783913] [er-head:875750:0]           mpool.c:98   UCX  DEBUG mpool rc_verbs_short_desc: align 64, maxelems 4294967295, elemsize 192
[1653171612.783997] [er-head:875751:0]     timer_wheel.c:40   UCX  DEBUG high res timer created log=23 resolution=2711.837139 usec wanted: 2499.999755 usec
[1653171612.784174] [er-head:875750:0]        ib_iface.c:982  UCX  DEBUG iface=0x703010: created RC QP 0xc055 on mlx5_1:1 TX wr:409 sge:5 inl:124 resp:64 RX wr:0 sge:0 resp:64
[1653171612.784600] [er-head:875750:0]      ucp_worker.c:1327 UCX  DEBUG created interface[4]=0x703010 using rc_verbs/mlx5_1:1 on worker 0x684f10
[1653171612.784724] [er-head:875750:0]        ib_iface.c:864  UCX  DEBUG using pkey[0] 0xffff on mlx5_1:1/IB
BFD: Dwarf Error: Can't find .debug_ranges section.
[1653171612.784792] [er-head:875751:0]           async.c:231  UCX  DEBUG added async handler 0x9ab700 [id=21 ref 1] uct_ud_iface_async_handler() to hash
[1653171612.784807] [er-head:875751:0]           async.c:493  UCX  DEBUG listening to async event fd 21 events 0x5 mode thread_spinlock
[1653171612.784827] [er-head:875751:0]      ucp_worker.c:1327 UCX  DEBUG created interface[2]=0xaa6020 using ud_verbs/mlx5_0:1 on worker 0x684f10
[1653171612.784932] [er-head:875751:0]        ib_iface.c:864  UCX  DEBUG using pkey[0] 0xffff on mlx5_0:1/IB
[1653171612.785612] [er-head:875751:0]        ib_iface.c:1372 UCX  DEBUG created uct_ib_iface_t headroom_ofs 12 payload_ofs 92 hdr_ofs 44 data_sz 4096
[1653171612.785807] [er-head:875750:0]        ib_iface.c:1372 UCX  DEBUG created uct_ib_iface_t headroom_ofs 12 payload_ofs 92 hdr_ofs 90 data_sz 8256
[1653171612.785815] [er-head:875750:0]           mpool.c:98   UCX  DEBUG mpool rc_recv_desc: align 64, maxelems 4294967295, elemsize 8356
[1653171612.785819] [er-head:875750:0]           mpool.c:98   UCX  DEBUG mpool rc_send_desc: align 64, maxelems 4294967295, elemsize 8320
[1653171612.785870] [er-head:875750:0]           mpool.c:98   UCX  DEBUG mpool send-ops-mpool: align 64, maxelems 4294967295, elemsize 48
[1653171612.786161] [er-head:875751:0]        ib_iface.c:982  UCX  DEBUG iface=0xbe4270: created UD QP 0x8f70 on mlx5_0:1 TX wr:341 sge:6 inl:124 resp:0 RX wr:4096 sge:1 resp:0
[1653171612.786171] [er-head:875751:0]         ib_mlx5.c:676  UCX  DEBUG tx wq 65536 bytes [bb=64, nwqe=1024] mmio_mode bf_post_mt
[1653171612.786267] [er-head:875750:0]           mpool.c:236  UCX  DEBUG mpool devx dbrec: allocated chunk 0xc08010 of 8176 bytes with 127 elements
[1653171612.786458] [er-head:875750:0]           mpool.c:98   UCX  DEBUG mpool pending-ops: align 1, maxelems 4294967295, elemsize 64
[1653171612.786547] [er-head:875751:0]           mpool.c:98   UCX  DEBUG mpool ud_recv_skb: align 64, maxelems 4294967295, elemsize 4196
[1653171612.786601] [er-head:875750:0]         ib_mlx5.c:996  UCX  DEBUG SL=0 (AR support - no) was selected on mlx5_1:1, SLs with AR support = { <none> }, SLs without AR support = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 }
[1653171612.787053] [er-head:875751:0]           mpool.c:236  UCX  DEBUG mpool ud_recv_skb: allocated chunk 0x7ffff00d5018 of 544744 bytes with 128 elements
[1653171612.787060] [er-head:875751:0]           mpool.c:98   UCX  DEBUG mpool ud_tx_skb: align 64, maxelems 4294967295, elemsize 4168
[1653171612.787071] [er-head:875750:0]           mpool.c:98   UCX  DEBUG mpool mlx5_dm_desc: align 64, maxelems 1, elemsize 64
[1653171612.787075] [er-head:875750:0]           mpool.c:98   UCX  DEBUG mpool rc_mlx5_atomic_desc: align 64, maxelems 4294967295, elemsize 72
[1653171612.787076] [er-head:875751:0]        ud_iface.c:420  UCX  DEBUG iface 0xbe4270: adding gid fe80::b8ce:f603:b:fa4 to hash on device mlx5_0 port 1 index 0)
[1653171612.787086] [er-head:875750:0]           async.c:231  UCX  DEBUG added async handler 0x9ad4d0 [id=25 ref 1] uct_rc_mlx5_devx_iface_event_handler() to hash
[1653171612.787089] [er-head:875751:0]        ud_iface.c:420  UCX  DEBUG iface 0xbe4270: adding gid fe80:: to hash on device mlx5_0 port 1 index 1)
[1653171612.787099] [er-head:875750:0]           async.c:493  UCX  DEBUG listening to async event fd 25 events 0x1 mode thread_spinlock
[1653171612.787098] [er-head:875751:0]        ud_iface.c:420  UCX  DEBUG iface 0xbe4270: adding gid fe80:: to hash on device mlx5_0 port 1 index 2)
[1653171612.787108] [er-head:875751:0]        ud_iface.c:420  UCX  DEBUG iface 0xbe4270: adding gid fe80:: to hash on device mlx5_0 port 1 index 3)
[1653171612.787119] [er-head:875751:0]        ud_iface.c:420  UCX  DEBUG iface 0xbe4270: adding gid fe80:: to hash on device mlx5_0 port 1 index 4)
[1653171612.787111] [er-head:875750:0]      ucp_worker.c:1327 UCX  DEBUG created interface[5]=0x6a5d50 using rc_mlx5/mlx5_1:1 on worker 0x684f10
[1653171612.787129] [er-head:875751:0]        ud_iface.c:420  UCX  DEBUG iface 0xbe4270: adding gid fe80:: to hash on device mlx5_0 port 1 index 5)
[1653171612.787139] [er-head:875751:0]        ud_iface.c:420  UCX  DEBUG iface 0xbe4270: adding gid fe80:: to hash on device mlx5_0 port 1 index 6)
[1653171612.787149] [er-head:875751:0]        ud_iface.c:420  UCX  DEBUG iface 0xbe4270: adding gid fe80:: to hash on device mlx5_0 port 1 index 7)
[1653171612.787203] [er-head:875750:0]        ib_iface.c:864  UCX  DEBUG using pkey[0] 0xffff on mlx5_1:1/IB
[1653171612.787285] [er-head:875751:0]         ib_mlx5.c:996  UCX  DEBUG SL=0 (AR support - no) was selected on mlx5_0:1, SLs with AR support = { <none> }, SLs without AR support = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 }
[1653171612.787321] [er-head:875751:0]     timer_wheel.c:40   UCX  DEBUG high res timer created log=23 resolution=2711.837139 usec wanted: 2499.999755 usec
[1653171612.787327] [er-head:875751:0]           async.c:231  UCX  DEBUG added async handler 0x71d4c0 [id=22 ref 1] uct_ud_iface_async_handler() to hash
[1653171612.787336] [er-head:875751:0]           async.c:493  UCX  DEBUG listening to async event fd 22 events 0x5 mode thread_spinlock
[1653171612.787349] [er-head:875751:0]      ucp_worker.c:1327 UCX  DEBUG created interface[3]=0xbe4270 using ud_mlx5/mlx5_0:1 on worker 0x684f10
[1653171612.787454] [er-head:875751:0]        ib_iface.c:864  UCX  DEBUG using pkey[0] 0xffff on mlx5_1:1/IB
[1653171612.787903] [er-head:875750:0]        ib_iface.c:1372 UCX  DEBUG created uct_ib_iface_t headroom_ofs 12 payload_ofs 92 hdr_ofs 44 data_sz 4096
[1653171612.788093] [er-head:875751:0]        ib_iface.c:1372 UCX  DEBUG created uct_ib_iface_t headroom_ofs 12 payload_ofs 92 hdr_ofs 91 data_sz 8256
[1653171612.788102] [er-head:875751:0]           mpool.c:98   UCX  DEBUG mpool rc_recv_desc: align 64, maxelems 4294967295, elemsize 8356
[1653171612.788106] [er-head:875751:0]           mpool.c:98   UCX  DEBUG mpool rc_send_desc: align 64, maxelems 4294967295, elemsize 8320
[1653171612.788128] [er-head:875751:0]           mpool.c:98   UCX  DEBUG mpool send-ops-mpool: align 64, maxelems 4294967295, elemsize 48
[1653171612.788590] [er-head:875750:0]        ib_iface.c:982  UCX  DEBUG iface=0x7f2050: created UD QP 0xd851 on mlx5_1:1 TX wr:341 sge:6 inl:124 resp:0 RX wr:4096 sge:1 resp:0
[1653171612.788664] [er-head:875751:0]           mpool.c:98   UCX  DEBUG mpool pending-ops: align 1, maxelems 4294967295, elemsize 64
[1653171612.788669] [er-head:875751:0]           mpool.c:98   UCX  DEBUG mpool rc_verbs_short_desc: align 64, maxelems 4294967295, elemsize 192
[1653171612.789500] [er-head:875750:0]           mpool.c:98   UCX  DEBUG mpool ud_recv_skb: align 64, maxelems 4294967295, elemsize 4196
[1653171612.789513] [er-head:875751:0]        ib_iface.c:982  UCX  DEBUG iface=0x703010: created RC QP 0xc056 on mlx5_1:1 TX wr:409 sge:5 inl:124 resp:64 RX wr:0 sge:0 resp:64
[1653171612.789998] [er-head:875751:0]      ucp_worker.c:1327 UCX  DEBUG created interface[4]=0x703010 using rc_verbs/mlx5_1:1 on worker 0x684f10
[1653171612.790086] [er-head:875750:0]           mpool.c:236  UCX  DEBUG mpool ud_recv_skb: allocated chunk 0x7ffff0050018 of 544744 bytes with 128 elements
[1653171612.790092] [er-head:875750:0]           mpool.c:98   UCX  DEBUG mpool ud_tx_skb: align 64, maxelems 4294967295, elemsize 4168
[1653171612.790109] [er-head:875750:0]        ud_iface.c:420  UCX  DEBUG iface 0x7f2050: adding gid fe80::b8ce:f603:b:fa5 to hash on device mlx5_1 port 1 index 0)
[1653171612.790118] [er-head:875751:0]        ib_iface.c:864  UCX  DEBUG using pkey[0] 0xffff on mlx5_1:1/IB
[1653171612.790120] [er-head:875750:0]        ud_iface.c:420  UCX  DEBUG iface 0[1653171612.794769] [er-head:875750:0]           async.c:231  UCX  DEBUG added async handler 0x6c1b20 [id=24 ref 1] ucp_worker_iface_async_fd_event() to hash
[1653171612.794775] [er-head:875750:0]           async.c:493  UCX  DEBUG listening to async event fd 24 events 0x0 mode thread_spinlock
[1653171612.794788] [er-head:875751:0]        ud_iface.c:420  UCX  DEBUG iface 0x7f2050: adding gid fe80:: to hash on device mlx5_1 port 1 index 4)
[1653171612.794802] [er-head:875751:0]        ud_iface.c:420  UCX  DEBUG iface 0x7f2050: adding gid fe80:: to hash on device mlx5_1 port 1 index 5)
[1653171612.794816] [er-head:875751:0]        ud_iface.c:420  UCX  DEBUG iface 0x7f2050: adding gid fe80:: to hash on device mlx5_1 port 1 index 6)
[1653171612.794830] [er-head:875751:0]        ud_iface.c:420  UCX  DEBUG iface 0x7f2050: adding gid fe80:: to hash on device mlx5_1 port 1 index 7)
[1653171612.794932] [er-head:875751:0]     timer_wheel.c:40   UCX  DEBUG high res timer created log=23 resolution=2711.837139 usec wanted: 2499.999755 usec
[1653171612.794938] [er-head:875751:0]           async.c:231  UCX  DEBUG added async handler 0x8ff980 [id=26 ref 1] uct_ud_iface_async_handler() to hash
[1653171612.794951] [er-head:875751:0]           async.c:493  UCX  DEBUG listening to async event fd 26 events 0x5 mode thread_spinlock
[1653171612.794969] [er-head:875751:0]      ucp_worker.c:1327 UCX  DEBUG created interface[6]=0x7f2050 using ud_verbs/mlx5_1:1 on worker 0x684f10
[1653171612.795070] [er-head:875751:0]        ib_iface.c:864  UCX  DEBUG using pkey[0] 0xffff on mlx5_1:1/IB
BFD: Dwarf Error: Can't find .debug_ranges section.
[1653171612.795267] [er-head:875750:0]           async.c:231  UCX  DEBUG added async handler 0xc07870 [id=28 ref 1] uct_rdmacm_cm_event_handler() to hash
[1653171612.795276] [er-head:875750:0]           async.c:493  UCX  DEBUG listening to async event fd 28 events 0x1 mode thread_spinlock
[1653171612.795284] [er-head:875750:0]       rdmacm_cm.c:979  UCX  DEBUG created rdmacm_cm 0xc072b0 with event_channel 0xc07b30 (fd=28)
[1653171612.795302] [er-head:875750:0]      tcp_sockcm.c:222  UCX  DEBUG created tcp_sockcm 0xf395a0
[1653171612.795318] [er-head:875750:0]           mpool.c:98   UCX  DEBUG mpool ucp_requests: align 64, maxelems 4294967295, elemsize 280
[1653171612.795322] [er-head:875750:0]           mpool.c:98   UCX  DEBUG mpool ucp_rkeys: align 64, maxelems 4294967295, elemsize 112
[1653171612.795325] [er-head:875750:0]           mpool.c:98   UCX  DEBUG mpool ucp_reg_bufs: align 64, maxelems 4294967295, elemsize 8216
[1653171612.795330] [er-head:875750:0]           mpool.c:98   UCX  DEBUG mpool ucp_am_bufs: align 64, maxelems 4294967295, elemsize 153
[1653171612.795334] [er-head:875750:0]           mpool.c:98   UCX  DEBUG mpool ucp_am_bufs: align 64, maxelems 4294967295, elemsize 1113
[1653171612.795336] [er-head:875750:0]           mpool.c:98   UCX  DEBUG mpool ucp_am_bufs: align 64, maxelems 4294967295, elemsize 8344
[1653171612.795340] [er-head:875750:0]       mpool_set.c:120  UCX  DEBUG mpool_set:ucp_am_bufs, sizes map 0x80000440, largest size 8255, mpools num 3
[1653171612.795360] [er-head:875750:0]      ucp_worker.c:1607 UCX  DEBUG worker 0x684f10: using device atomics
[1653171612.795382] [er-head:875750:0]          parser.c:1916 UCX  INFO  UCX_* env variables: UCX_TLS=rc UCX_LOG_LEVEL=debug
[1653171612.795791] [er-head:875751:0]        ib_iface.c:1372 UCX  DEBUG created uct_ib_iface_t headroom_ofs 12 payload_ofs 92 hdr_ofs 44 data_sz 4096
[1653171612.796322] [er-head:875751:0]        ib_iface.c:982  UCX  DEBUG iface=0xac7060: created UD QP 0xd857 on mlx5_1:1 TX wr:341 sge:6 inl:124 resp:0 RX wr:4096 sge:1 resp:0
[1653171612.796329] [er-head:875751:0]         ib_mlx5.c:676  UCX  DEBUG tx wq 65536 bytes [bb=64, nwqe=1024] mmio_mode bf_post_mt
[1653171612.796653] [er-head:875751:0]           mpool.c:98   UCX  DEBUG mpool ud_recv_skb: align 64, maxelems 4294967295, elemsize 4196
[1653171612.797107] [er-head:875751:0]           mpool.c:236  UCX  DEBUG mpool ud_recv_skb: allocated chunk 0x7fffe2b49018 of 544744 bytes with 128 elements
[1653171612.797113] [er-head:875751:0]           mpool.c:98   UCX  DEBUG mpool ud_tx_skb: align 64, maxelems 4294967295, elemsize 4168
[1653171612.797134] [er-head:875751:0]        ud_iface.c:420  UCX  DEBUG iface 0xac7060: adding gid fe80::b8ce:f603:b:fa5 to hash on device mlx5_1 port 1 index 0)
[1653171612.797149] [er-head:875751:0]        ud_iface.c:420  UCX  DEBUG iface 0xac7060: adding gid fe80:: to hash on device mlx5_1 port 1 index 1)
[1653171612.797163] [er-head:875751:0]        ud_iface.c:420  UCX  DEBUG iface 0xac7060: adding gid fe80:: to hash on device mlx5_1 port 1 index 2)
[1653171612.797175] [er-head:875751:0]        ud_iface.c:420  UCX  DEBUG iface 0xac7060: adding gid fe80:: to hash on device mlx5_1 port 1 index 3)
[1653171612.797188] [er-head:875751:0]        ud_iface.c:420  UCX  DEBUG iface 0xac7060: adding gid fe80:: to hash on device mlx5_1 port 1 index 4)
[1653171612.797202] [er-head:875751:0]        ud_iface.c:420  UCX  DEBUG iface 0xac7060: adding gid fe80:: to hash on device mlx5_1 port 1 index 5)
[1653171612.797214] [er-head:875751:0]        ud_iface.c:420  UCX  DEBUG iface 0xac7060: adding gid fe80:: to hash on device mlx5_1 port 1 index 6)
[1653171612.797227] [er-head:875751:0]        ud_iface.c:420  UCX  DEBUG iface 0xac7060: adding gid fe80:: to hash on device mlx5_1 port 1 index 7)
[1653171612.797361] [er-head:875751:0]         ib_mlx5.c:996  UCX  DEBUG SL=0 (AR support - no) was selected on mlx5_1:1, SLs with AR support = { <none> }, SLs without AR support = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 }
[1653171612.797394] [er-head:875751:0]     timer_wheel.c:40   UCX  DEBUG high res timer created log=23 resolution=2711.837139 usec wanted: 2499.999755 usec
[1653171612.797399] [er-head:875751:0]           async.c:231  UCX  DEBUG added async handler 0x6aecf0 [id=27 ref 1] uct_ud_iface_async_handler() to hash
[1653171612.797410] [er-head:875751:0]           async.c:493  UCX  DEBUG listening to async event fd 27 events 0x5 mode thread_spinlock
[1653171612.797422] [er-head:875751:0]      ucp_worker.c:1327 UCX  DEBUG created interface[7]=0xac7060 using ud_mlx5/mlx5_1:1 on worker 0x684f10
[1653171612.797428] [er-head:875751:0]      ucp_worker.c:1130 UCX  DEBUG selected scalable tl bitmap: 0xff 0x0 (8 tls)
BFD: Dwarf Error: Can't find .debug_ranges section.
[1653171612.798990] [er-head:875751:0]           async.c:231  UCX  DEBUG added async handler 0x9acd10 [id=18 ref 1] ucp_worker_iface_async_fd_event() to hash
[1653171612.799003] [er-head:875751:0]           async.c:493  UCX  DEBUG listening to async event fd 18 events 0x0 mode thread_spinlock
[1653171612.799048] [er-head:875751:0]           async.c:231  UCX  DEBUG added async handler 0x9acd50 [id=19 ref 1] ucp_worker_iface_async_fd_event() to hash
[1653171612.799055] [er-head:875751:0]           async.c:493  UCX  DEBUG listening to async event fd 19 events 0x0 mode thread_spinlock
[1653171612.799079] [er-head:875751:0]           async.c:231  UCX  DEBUG added async handler 0x6c1ae0 [id=23 ref 1] ucp_worker_iface_async_fd_event() to hash
[1653171612.799086] [er-head:875751:0]           async.c:493  UCX  DEBUG listening to async event fd 23 events 0x0 mode thread_spinlock
[1653171612.799097] [er-head:875751:0]           async.c:231  UCX  DEBUG added async handler 0x6c1b20 [id=24 ref 1] ucp_worker_iface_async_fd_event() to hash
[1653171612.799104] [er-head:875751:0]           async.c:493  UCX  DEBUG listening to async event fd 24 events 0x0 mode thread_spinlock
BFD: Dwarf Error: Can't find .debug_ranges section.
[1653171612.799532] [er-head:875751:0]           async.c:231  UCX  DEBUG added async handler 0xc07870 [id=28 ref 1] uct_rdmacm_cm_event_handler() to hash
[1653171612.799541] [er-head:875751:0]           async.c:493  UCX  DEBUG listening to async event fd 28 events 0x1 mode thread_spinlock
[1653171612.799549] [er-head:875751:0]       rdmacm_cm.c:979  UCX  DEBUG created rdmacm_cm 0xc072b0 with event_channel 0xc07b30 (fd=28)
[1653171612.799568] [er-head:875751:0]      tcp_sockcm.c:222  UCX  DEBUG created tcp_sockcm 0xf395a0
[1653171612.799582] [er-head:875751:0]           mpool.c:98   UCX  DEBUG mpool ucp_requests: align 64, maxelems 4294967295, elemsize 280
[1653171612.799586] [er-head:875751:0]           mpool.c:98   UCX  DEBUG mpool ucp_rkeys: align 64, maxelems 4294967295, elemsize 112
[1653171612.799589] [er-head:875751:0]           mpool.c:98   UCX  DEBUG mpool ucp_reg_bufs: align 64, maxelems 4294967295, elemsize 8216
[1653171612.799594] [er-head:875751:0]           mpool.c:98   UCX  DEBUG mpool ucp_am_bufs: align 64, maxelems 4294967295, elemsize 153
[1653171612.799598] [er-head:875751:0]           mpool.c:98   UCX  DEBUG mpool ucp_am_bufs: align 64, maxelems 4294967295, elemsize 1113
[1653171612.799602] [er-head:875751:0]           mpool.c:98   UCX  DEBUG mpool ucp_am_bufs: align 64, maxelems 4294967295, elemsize 8344
[1653171612.799608] [er-head:875751:0]       mpool_set.c:120  UCX  DEBUG mpool_set:ucp_am_bufs, sizes map 0x80000440, largest size 8255, mpools num 3
[1653171612.799630] [er-head:875751:0]      ucp_worker.c:1607 UCX  DEBUG worker 0x684f10: using device atomics
[1653171612.799652] [er-head:875751:0]          parser.c:1916 UCX  INFO  UCX_* env variables: UCX_TLS=rc UCX_LOG_LEVEL=debug
[1653171612.800186] [er-head:875750:0]    ib_mlx5dv_md.c:202  UCX  DEBUG   mlx5dv_devx_obj_create(CREATE_MKEY, mode=KSM) failed, syndrome 0: Invalid argument
[1653171612.800184] [er-head:875751:0]    ib_mlx5dv_md.c:202  UCX  DEBUG   mlx5dv_devx_obj_create(CREATE_MKEY, mode=KSM) failed, syndrome 0: Invalid argument
[1653171612.800284] [er-head:875750:0]    ib_mlx5dv_md.c:202  UCX  DEBUG   mlx5dv_devx_obj_create(CREATE_MKEY, mode=KSM) failed, syndrome 0: Invalid argument
[1653171612.800283] [er-head:875751:0]    ib_mlx5dv_md.c:202  UCX  DEBUG   mlx5dv_devx_obj_create(CREATE_MKEY, mode=KSM) failed, syndrome 0: Invalid argument
[1653171612.800520] [er-head:875750:0]          ucp_ep.c:354  UCX  DEBUG created ep 0x7ffff7e3e000 to <no debug data> from api call
[1653171612.800520] [er-head:875751:0]          ucp_ep.c:354  UCX  DEBUG created ep 0x7ffff7e3e000 to <no debug data> from api call
[1653171612.800574] [er-head:875750:0]      ucp_worker.c:1782 UCX  INFO    ep_cfg[0]: rma(rc_mlx5/mlx5_0:1) amo(rc_mlx5/mlx5_0:1)
[1653171612.800574] [er-head:875751:0]      ucp_worker.c:1782 UCX  INFO    ep_cfg[0]: rma(rc_mlx5/mlx5_0:1) amo(rc_mlx5/mlx5_0:1)
[1653171612.800589] [er-head:875750:0]          wireup.c:1071 UCX  DEBUG   ep 0x7ffff7e3e000: am_lane 0 wireup_msg_lane 0 cm_lane <none> keepalive_lane <none> reachable_mds 0x3
[1653171612.800596] [er-head:875750:0]          wireup.c:1094 UCX  DEBUG   ep 0x7ffff7e3e000: lane[0]:  1:rc_mlx5/mlx5_0:1.0 md[0]      -> addr[1].md[0]/ib/sysdev[255] amo#0 am am_bw#0 wireup
[1653171612.800589] [er-head:875751:0]          wireup.c:1071 UCX  DEBUG   ep 0x7ffff7e3e000: am_lane 0 wireup_msg_lane 0 cm_lane <none> keepalive_lane <none> reachable_mds 0x3
[1653171612.800596] [er-head:875751:0]          wireup.c:1094 UCX  DEBUG   ep 0x7ffff7e3e000: lane[0]:  1:rc_mlx5/mlx5_0:1.0 md[0]      -> addr[1].md[0]/ib/sysdev[255] amo#0 am am_bw#0 wireup
[1653171612.801320] [er-head:875750:0]           rc_ep.c:165  UCX  DEBUG   created rc ep 0x990ef0
[1653171612.801320] [er-head:875751:0]           rc_ep.c:165  UCX  DEBUG   created rc ep 0x990ef0
[1653171612.801331] [er-head:875750:0]       wireup_ep.c:543  UCX  DEBUG   ep 0x7ffff7e3e000: wireup_ep 0x7244a0 created next_ep 0x990ef0 to <no debug data> using rc_mlx5/mlx5_0:1
[1653171612.801330] [er-head:875751:0]       wireup_ep.c:543  UCX  DEBUG   ep 0x7ffff7e3e000: wireup_ep 0x7244a0 created next_ep 0x990ef0 to <no debug data> using rc_mlx5/mlx5_0:1
[1653171612.801347] [er-head:875750:0]        ib_iface.c:750  UCX  DEBUG   iface 0xbe4270: ah_attr dlid=1 sl=0 port=1 src_path_bits=0
[1653171612.801347] [er-head:875751:0]        ib_iface.c:750  UCX  DEBUG   iface 0xbe4270: ah_attr dlid=1 sl=0 port=1 src_path_bits=0
[1653171612.801359] [er-head:875750:0]        ib_iface.c:750  UCX  DEBUG   iface 0xbe4270: ah_attr dlid=1 sl=0 port=1 src_path_bits=0
[1653171612.801359] [er-head:875751:0]        ib_iface.c:750  UCX  DEBUG   iface 0xbe4270: ah_attr dlid=1 sl=0 port=1 src_path_bits=0
[1653171612.801375] [er-head:875750:0]           ud_ep.c:404  UCX  DEBUG   created ep ep=0x9a9750 iface=0xbe4270 id=0
[1653171612.801375] [er-head:875751:0]           ud_ep.c:404  UCX  DEBUG   created ep ep=0x9a9750 iface=0xbe4270 id=0
[1653171612.801382] [er-head:875751:0]           ud_ep.c:508  UCX  DEBUG   mlx5_0:1/IB lid 1 qpn 0x8f70 epid 0 ep 0x9a9750 connected to IFACE lid 1 fe80::pkey 0xffff  qpn 0x8f6e
[1653171612.801387] [er-head:875751:0]        ib_iface.c:750  UCX  DEBUG   iface 0xbe4270: ah_attr dlid=1 sl=0 port=1 src_path_bits=0
[1653171612.801382] [er-head:875750:0]           ud_ep.c:508  UCX  DEBUG   mlx5_0:1/IB lid 1 qpn 0x8f6e epid 0 ep 0x9a9750 connected to IFACE lid 1 fe80::pkey 0xffff  qpn 0x8f6e
[1653171612.801387] [er-head:875750:0]        ib_iface.c:750  UCX  DEBUG   iface 0xbe4270: ah_attr dlid=1 sl=0 port=1 src_path_bits=0
[1653171612.801392] [er-head:875750:0]        ib_iface.c:750  UCX  DEBUG   iface 0xbe4270: ah_attr dlid=1 sl=0 port=1 src_path_bits=0
[1653171612.801393] [er-head:875751:0]        ib_iface.c:750  UCX  DEBUG   iface 0xbe4270: ah_attr dlid=1 sl=0 port=1 src_path_bits=0
[1653171612.802506] [er-head:875750:0]           mpool.c:236  UCX  DEBUG   mpool ud_tx_skb: allocated chunk 0x7fffe2400018 of 6291432 bytes with 1489 elements
[1653171612.802825] [er-head:875751:0]           mpool.c:236  UCX  DEBUG   mpool ud_tx_skb: allocated chunk 0x7fffe2400018 of 6291432 bytes with 1489 elements
BFD: Dwarf Error: Can't find .debug_ranges section.
[1653171612.803508] [er-head:875750:0]           async.c:231  UCX  DEBUG   added async handler 0xfdbb50 [id=1000014 ref 1] uct_ud_iface_timer() to hash
BFD: Dwarf Error: Can't find .debug_ranges section.
[1653171612.803571] [er-head:875750:0]       wireup_ep.c:317  UCX  DEBUG   ep 0x7ffff7e3e000: wireup_ep 0x7244a0 created aux_ep 0x9a9750 to <no debug data> using ud_mlx5/mlx5_0:1
[1653171612.803581] [er-head:875751:0]           async.c:231  UCX  DEBUG   added async handler 0xfdbb50 [id=1000014 ref 1] uct_ud_iface_timer() to hash
[1653171612.803622] [er-head:875751:0]       wireup_ep.c:317  UCX  DEBUG   ep 0x7ffff7e3e000: wireup_ep 0x7244a0 created aux_ep 0x9a9750 to <no debug data> using ud_mlx5/mlx5_0:1
[1653171612.808762] [er-head:875751:0]           mpool.c:236  UCX  DEBUG   mpool rc_recv_desc: allocated chunk 0x7fffd9800018 of 39845864 bytes with 4752 elements
[1653171612.809063] [er-head:875751:0]          wireup.c:1442 UCX  DEBUG ep 0x7ffff7e3e000: send wireup request (flags=0x40)
[1653171612.809103] [er-head:875751:0]           mpool.c:236  UCX  DEBUG   mpool ucp_rkeys: allocated chunk 0x97d390 of 16472 bytes with 128 elements
[1653171612.809124] [er-head:875751:0]          ucp_ep.c:354  UCX  DEBUG created ep 0x7ffff7e3e058 to <no debug data> from api call
[1653171612.809165] [er-head:875751:0]          wireup.c:1071 UCX  DEBUG   ep 0x7ffff7e3e058: am_lane 0 wireup_msg_lane 0 cm_lane <none> keepalive_lane <none> reachable_mds 0x3
[1653171612.809172] [er-head:875751:0]          wireup.c:1094 UCX  DEBUG   ep 0x7ffff7e3e058: lane[0]:  1:rc_mlx5/mlx5_0:1.0 md[0]      -> addr[1].md[0]/ib/sysdev[255] amo#0 am am_bw#0 wireup
[1653171612.809629] [er-head:875750:0]           mpool.c:236  UCX  DEBUG   mpool rc_recv_desc: allocated chunk 0x7fffd9800018 of 39845864 bytes with 4752 elements
[1653171612.809806] [er-head:875751:0]           rc_ep.c:165  UCX  DEBUG   created rc ep 0x9a9640
[1653171612.809814] [er-head:875751:0]       wireup_ep.c:543  UCX  DEBUG   ep 0x7ffff7e3e058: wireup_ep 0x9a2150 created next_ep 0x9a9640 to <no debug data> using rc_mlx5/mlx5_0:1
[1653171612.809826] [er-head:875751:0]        ib_iface.c:750  UCX  DEBUG   iface 0xbe4270: ah_attr dlid=1 sl=0 port=1 src_path_bits=0
[1653171612.809833] [er-head:875751:0]        ib_iface.c:750  UCX  DEBUG   iface 0xbe4270: ah_attr dlid=1 sl=0 port=1 src_path_bits=0
[1653171612.809840] [er-head:875751:0]           ud_ep.c:404  UCX  DEBUG   created ep ep=0x602ff0 iface=0xbe4270 id=1
[1653171612.809849] [er-head:875751:0]           ud_ep.c:508  UCX  DEBUG   mlx5_0:1/IB lid 1 qpn 0x8f70 epid 1 ep 0x602ff0 connected to IFACE lid 1 fe80::pkey 0xffff  qpn 0x8f70
[1653171612.809854] [er-head:875751:0]        ib_iface.c:750  UCX  DEBUG   iface 0xbe4270: ah_attr dlid=1 sl=0 port=1 src_path_bits=0
[1653171612.809859] [er-head:875751:0]        ib_iface.c:750  UCX  DEBUG   iface 0xbe4270: ah_attr dlid=1 sl=0 port=1 src_path_bits=0
[1653171612.809868] [er-head:875751:0]       wireup_ep.c:317  UCX  DEBUG   ep 0x7ffff7e3e058: wireup_ep 0x9a2150 created aux_ep 0x602ff0 to <no debug data> using ud_mlx5/mlx5_0:1
[1653171612.809875] [er-head:875751:0]          wireup.c:1442 UCX  DEBUG ep 0x7ffff7e3e058: send wireup request (flags=0x200080)
[1653171612.809992] [er-head:875750:0]          wireup.c:1442 UCX  DEBUG ep 0x7ffff7e3e000: send wireup request (flags=0x200080)
[1653171612.810029] [er-head:875750:a]        ib_iface.c:750  UCX  DEBUG iface 0xbe4270: ah_attr dlid=1 sl=0 port=1 src_path_bits=0
[1653171612.810041] [er-head:875750:a]           ud_ep.c:817  UCX  DEBUG simultaneous CREQ ep=0x9a9750(iface=0xbe4270 conn_sn=0 ep_id=0, dest_ep_id=0 rx_psn=1)
[1653171612.809999] [er-head:875751:a]        ib_iface.c:750  UCX  DEBUG iface 0xbe4270: ah_attr dlid=1 sl=0 port=1 src_path_bits=0
[1653171612.810012] [er-head:875751:a]           ud_ep.c:817  UCX  DEBUG simultaneous CREQ ep=0x602ff0(iface=0xbe4270 conn_sn=0 ep_id=1, dest_ep_id=1 rx_psn=1)
[1653171612.812946] [er-head:875751:a]           mpool.c:236  UCX  DEBUG mpool ud_recv_skb: allocated chunk 0x7fffe0e00018 of 20971496 bytes with 4964 elements
[1653171612.813329] [er-head:875750:a]           mpool.c:236  UCX  DEBUG mpool ud_recv_skb: allocated chunk 0x7fffe0e00018 of 20971496 bytes with 4964 elements
fff7e3e058: send wireup request (flags=0x40)
[1653171612.814394] [er-head:875751:0]           mpool.c:236  UCX  DEBUG mpool ucp_requests: allocated chunk 0x950334 of 41044 bytes with 128 elements
[1653171612.814399] [er-head:875750:0]           mpool.c:236  UCX  DEBUG mpool ucp_requests: allocated chunk 0x950334 of 41044 bytes with 128 elements
[1653171612.814425] [er-head:875751:0]           flush.c:310  UCX  DEBUG flush_worker ep 0x7ffff7e3e000
[1653171612.814432] [er-head:875750:0]        ib_iface.c:750  UCX  DEBUG   iface 0xbfc2c0: ah_attr dlid=1 sl=0 port=1 src_path_bits=0
[1653171612.814436] [er-head:875751:0]           flush.c:310  UCX  DEBUG flush_worker ep 0x7ffff7e3e058
[1653171612.814448] [er-head:875751:0]       wireup_ep.c:471  UCX  DEBUG ep 0x7ffff7e3e058: destroy wireup ep 0x9a2150
[1653171612.814456] [er-head:875751:0]          ucp_ep.c:1270 UCX  DEBUG ep 0x7ffff7e3e058: unprogress iface 0xbe4270 ud_mlx5/mlx5_0:1
[1653171612.814463] [er-head:875751:0]           ud_ep.c:1767 UCX  DEBUG ep 0x602ff0: disconnect
[1653171612.814654] [er-head:875750:0]    rc_mlx5_devx.c:444  UCX  DEBUG   connected rc devx qp 0x89ea on mlx5_0:1/IB to lid 1(+0) sl 0 remote_qp 0x89ea mtu 4096 timer 18x7 rnr 13x7 rd_atom 16
[1653171612.814670] [er-head:875750:0]           flush.c:310  UCX  DEBUG flush_worker ep 0x7ffff7e3e000
[1653171612.814682] [er-head:875750:0]           flush.c:310  UCX  DEBUG flush_worker ep 0x7ffff7e3e058
[1653171612.814688] [er-head:875750:0]       wireup_ep.c:471  UCX  DEBUG ep 0x7ffff7e3e000: destroy wireup ep 0x7244a0
[1653171612.814692] [er-head:875751:0]        ib_iface.c:750  UCX  DEBUG   iface 0xbfc2c0: ah_attr dlid=1 sl=0 port=1 src_path_bits=0
[1653171612.814693] [er-head:875750:0]          ucp_ep.c:1270 UCX  DEBUG ep 0x7ffff7e3e000: unprogress iface 0xbe4270 ud_mlx5/mlx5_0:1
[1653171612.814697] [er-head:875750:0]           ud_ep.c:1767 UCX  DEBUG ep 0x9a9750: disconnect
[1653171612.814911] [er-head:875751:0]    rc_mlx5_devx.c:444  UCX  DEBUG   connected rc devx qp 0x89e7 on mlx5_0:1/IB to lid 1(+0) sl 0 remote_qp 0x89ee mtu 4096 timer 18x7 rnr 13x7 rd_atom 16
[1653171612.814922] [er-head:875750:0]        ib_iface.c:750  UCX  DEBUG   iface 0xbfc2c0: ah_attr dlid=1 sl=0 port=1 src_path_bits=0
[1653171612.815118] [er-head:875750:0]    rc_mlx5_devx.c:444  UCX  DEBUG   connected rc devx qp 0x89ee on mlx5_0:1/IB to lid 1(+0) sl 0 remote_qp 0x89e7 mtu 4096 timer 18x7 rnr 13x7 rd_atom 16
[1653171612.815124] [er-head:875750:0]       wireup_ep.c:471  UCX  DEBUG ep 0x7ffff7e3e058: destroy wireup ep 0x9a2150
[1653171612.815128] [er-head:875750:0]          ucp_ep.c:1270 UCX  DEBUG ep 0x7ffff7e3e058: unprogress iface 0xbe4270 ud_mlx5/mlx5_0:1
[1653171612.815135] [er-head:875750:0]           async.c:156  UCX  DEBUG removed async handler 0xfdbb50 [id=1000014 ref 1] uct_ud_iface_timer() from hash
[1653171612.815139] [er-head:875750:0]           async.c:546  UCX  DEBUG removing async handler 0xfdbb50 [id=1000014 ref 1] uct_ud_iface_timer()
[1653171612.815150] [er-head:875750:0]           async.c:171  UCX  DEBUG release async handler 0xfdbb50 [id=1000014 ref 0] uct_ud_iface_timer()
[1653171612.815163] [er-head:875750:0]           ud_ep.c:1767 UCX  DEBUG ep 0x7fffdc000e10: disconnect
[1653171612.816615] [er-head:875750:0]           mpool.c:236  UCX  DEBUG mpool rc_send_desc: allocated chunk 0x7fffe0200018 of 10485736 bytes with 1260 elements
[1653171612.816655] [er-head:875751:0]       wireup_ep.c:471  UCX  DEBUG ep 0x7ffff7e3e000: destroy wireup ep 0x7244a0
[1653171612.816664] [er-head:875751:0]          ucp_ep.c:1270 UCX  DEBUG ep 0x7ffff7e3e000: unprogress iface 0xbe4270 ud_mlx5/mlx5_0:1
[1653171612.816676] [er-head:875751:0]           async.c:156  UCX  DEBUG removed async handler 0xfdbb50 [id=1000014 ref 1] uct_ud_iface_timer() from hash
[1653171612.816684] [er-head:875751:0]           async.c:546  UCX  DEBUG removing async handler 0xfdbb50 [id=1000014 ref 1] uct_ud_iface_timer()
[1653171612.816697] [er-head:875751:0]           async.c:171  UCX  DEBUG release async handler 0xfdbb50 [id=1000014 ref 0] uct_ud_iface_timer()
[1653171612.816716] [er-head:875751:0]           ud_ep.c:1767 UCX  DEBUG ep 0x9a9750: disconnect
[1653171612.817172] [er-head:875751:0]           mpool.c:236  UCX  DEBUG mpool rc_mlx5_atomic_desc: allocated chunk 0x7ffff7e1b018 of 135144 bytes with 1055 elements
<hang>
yosefe commented 2 years ago

@tonycurtis can you pls post the hanging process backtrace and the reproducer code (a.out source)?

tonycurtis commented 2 years ago

I'm going to investigate more myself, something weird going on

tonycurtis commented 2 years ago

Looks like a relaxed order issue I'm not handling correctly.

tonycurtis commented 2 years ago

Looks like relaxed-order is disabled on this cluster (admin ran query command as listed here: https://hpcadvisorycouncil.atlassian.net/wiki/spaces/HPCWORKS/pages/1280442391/AMD+2nd+Gen+EPYC+CPU+Tuning+Guide+for+InfiniBand+HPC?preview=/1280442391/1280409615/image-20200206-071308.png#Relaxed-Ordering). However, UCX sees the AMD processor, and enables relaxed-order. Override with UCX_IB_PCI_RELAXED_ORDERING=no