Closed Honggang-LI closed 5 days ago
Currently SRQ support required for RC transport but not for UD transport. Can you try forcing UCX_TLS=ud?
[root@m201 ~]# UCX_LOG_LEVEL=TRACE UCX_TLS=ud UCX_NET_DEVICES=mlx5_1:1 timeout 5m ucx_perftest -t tag_bw
[1710151836.487419] [m201:4904 :0] debug.c:1146 UCX DEBUG using signal stack 0x7fb3e588a000 size 141824
[1710151836.500448] [m201:4904 :0] init.c:118 UCX DEBUG /lib64/libucs.so.0 loaded at 0x7fb3e5a18000
[1710151836.500487] [m201:4904 :0] init.c:120 UCX DEBUG cmd line: ucx_perftest -t tag_bw
[1710151836.500501] [m201:4904 :0] module.c:72 UCX DEBUG ucs library path: /lib64/libucs.so.0
[1710151836.500507] [m201:4904 :0] module.c:280 UCX DEBUG loading modules for ucs
[1710151836.500544] [m201:4904 :0] module.c:280 UCX DEBUG loading modules for ucx_perftest
[1710151836.500574] [m201:4904 :0] perftest.c:921 UCX WARN CPU affinity is not set (bound to 8 cpus). Performance may be impacted.
Waiting for connection...
Accepted connection from 10.2.0.51:56880
+----------------------------------------------------------------------------------------------------------+
| API: protocol layer |
| Test: tag match bandwidth |
| Data layout: (automatic) |
| Send memory: host |
| Recv memory: host |
| Message size: 8 |
+----------------------------------------------------------------------------------------------------------+
[1710151844.422088] [m201:4904 :0] libperf.c:213 UCX DEBUG set send allocator by send mem type host
[1710151844.422093] [m201:4904 :0] libperf.c:217 UCX DEBUG set recv allocator by recv mem type host
[1710151844.422104] [m201:4904 :0] time.c:22 UCX DEBUG arch clock frequency: 3500000000.00 Hz
[1710151844.422156] [m201:4904 :0] ucp_context.c:1969 UCX INFO Version 1.14.1 (loaded from /lib64/libucp.so.0)
[1710151844.422165] [m201:4904 :0] ucp_context.c:1739 UCX DEBUG estimated number of endpoints is 1
[1710151844.422168] [m201:4904 :0] ucp_context.c:1746 UCX DEBUG estimated number of endpoints per node is 1
[1710151844.422173] [m201:4904 :0] ucp_context.c:1753 UCX DEBUG estimated bcopy bandwidth is 6081740800.000000
[1710151844.422180] [m201:4904 :0] ucp_context.c:1813 UCX DEBUG allocation method[0] is md 'sysv'
[1710151844.422184] [m201:4904 :0] ucp_context.c:1813 UCX DEBUG allocation method[1] is md 'posix'
[1710151844.422190] [m201:4904 :0] ucp_context.c:1825 UCX DEBUG allocation method[2] is 'huge'
[1710151844.422193] [m201:4904 :0] ucp_context.c:1825 UCX DEBUG allocation method[3] is 'thp'
[1710151844.422196] [m201:4904 :0] ucp_context.c:1813 UCX DEBUG allocation method[4] is md '*'
[1710151844.422199] [m201:4904 :0] ucp_context.c:1825 UCX DEBUG allocation method[5] is 'mmap'
[1710151844.422201] [m201:4904 :0] ucp_context.c:1825 UCX DEBUG allocation method[6] is 'heap'
[1710151844.422223] [m201:4904 :0] module.c:280 UCX DEBUG loading modules for uct
[1710151844.423446] [m201:4904 :0] module.c:280 UCX DEBUG loading modules for uct_ib
[1710151844.423626] [m201:4904 :0] ucp_context.c:1445 UCX DEBUG closing md self because it has no selected transport resources
[1710151844.423999] [m201:4904 :0] uct_iface.c:905 UCX DEBUG enp8s0: PF sysfs path is '/sys/devices/pci0000:00/0000:00:1c.4/0000:08:00.0'
[1710151844.424008] [m201:4904 :0] topo.c:176 UCX DEBUG added sys_dev 0 for bus id 08:00.0
[1710151844.424024] [m201:4904 :0] topo.c:359 UCX DEBUG enp8s0: bdf_name 0000:08:00.0 sys_dev 0
[1710151844.424873] [m201:4904 :0] uct_iface.c:905 UCX DEBUG mlx5_roce_p2: PF sysfs path is '/sys/devices/pci0000:00/0000:00:02.2/0000:05:00.1'
[1710151844.424881] [m201:4904 :0] topo.c:176 UCX DEBUG added sys_dev 1 for bus id 05:00.1
[1710151844.424897] [m201:4904 :0] topo.c:359 UCX DEBUG mlx5_roce_p2: bdf_name 0000:05:00.1 sys_dev 1
[1710151844.425204] [m201:4904 :0] uct_iface.c:905 UCX DEBUG enp9s0: PF sysfs path is '/sys/devices/pci0000:00/0000:00:1c.5/0000:09:00.0'
[1710151844.425211] [m201:4904 :0] topo.c:176 UCX DEBUG added sys_dev 2 for bus id 09:00.0
[1710151844.425215] [m201:4904 :0] topo.c:359 UCX DEBUG enp9s0: bdf_name 0000:09:00.0 sys_dev 2
[1710151844.425509] [m201:4904 :0] uct_iface.c:901 UCX DEBUG lo: sysfs path undetected
[1710151844.425515] [m201:4904 :0] topo.c:363 UCX DEBUG lo: system device unknown
[1710151844.425548] [m201:4904 :0] ucp_context.c:1445 UCX DEBUG closing md tcp because it has no selected transport resources
[1710151844.425575] [m201:4904 :0] ucp_context.c:1445 UCX DEBUG closing md sysv because it has no selected transport resources
[1710151844.425637] [m201:4904 :0] ucp_context.c:1445 UCX DEBUG closing md posix because it has no selected transport resources
[1710151844.440124] [m201:4904 :0] uct_iface.c:905 UCX DEBUG mlx5_0: PF sysfs path is '/sys/devices/pci0000:00/0000:00:02.2/0000:05:00.0'
[1710151844.440135] [m201:4904 :0] topo.c:176 UCX DEBUG added sys_dev 3 for bus id 05:00.0
[1710151844.440152] [m201:4904 :0] topo.c:359 UCX DEBUG mlx5_0: bdf_name 0000:05:00.0 sys_dev 3
[1710151844.440175] [m201:4904 :0] ib_device.c:495 UCX DEBUG mlx5_0: vendor_id 0x15b3 device_id 4117
[1710151844.440518] [m201:4904 :0] ib_mlx5dv_md.c:916 UCX DEBUG mlx5dv_devx_general_cmd(QUERY_HCA_CAP) failed: Remote I/O error
[1710151844.440524] [m201:4904 :0] ib_mlx5dv_md.c:1138 UCX DEBUG mlx5_0: crossing_vhca_mkey is not supported
[1710151844.440528] [m201:4904 :0] ib_mlx5dv_md.c:728 UCX DEBUG mlx5_0: disable ODP because it's not supported for DEVX QP
[1710151844.440737] [m201:4904 :0] async.c:231 UCX DEBUG added async handler 0x55c3ad548d90 [id=5 ref 1] ???() to hash
[1710151844.440807] [m201:4904 :0] async.c:493 UCX DEBUG listening to async event fd 5 events 0x1 mode thread_spinlock
[1710151844.440813] [m201:4904 :0] ib_device.c:593 UCX DEBUG initialized device 'mlx5_0' (InfiniBand channel adapter) with 1 ports
[1710151844.440824] [m201:4904 :0] ib_md.c:1788 UCX DEBUG mlx5_0: cuda GPUDirect RDMA is disabled
[1710151844.440831] [m201:4904 :0] ib_md.c:1788 UCX DEBUG mlx5_0: rocm GPUDirect RDMA is disabled
[1710151844.440842] [m201:4904 :0] ib_md.c:1815 UCX DEBUG mlx5_0: dmabuf is supported
[1710151844.440848] [m201:4904 :0] mpool.c:136 UCX DEBUG mpool rcache_mp: align 8, maxelems 4294967295, elemsize 144
[1710151844.440880] [m201:4904 :0] async.c:231 UCX DEBUG added async handler 0x55c3ad548170 [id=9 ref 1] ???() to hash
[1710151844.440889] [m201:4904 :0] async.c:493 UCX DEBUG listening to async event fd 9 events 0x1 mode thread_spinlock
[1710151844.441020] [m201:4904 :0] module.c:280 UCX DEBUG loading modules for ucm
[1710151844.441027] [m201:4904 :0] ib_md.c:1500 UCX DEBUG mlx5_0: using registration cache
[1710151844.441031] [m201:4904 :0] mpool.c:136 UCX DEBUG mpool devx dbrec: align 64, maxelems 4294967295, elemsize 40
[1710151844.441317] [m201:4904 :0] ib_mlx5dv_md.c:1219 UCX DEBUG mlx5_0: opened DEVX md
[1710151844.441500] [m201:4904 :0] ib_mlx5dv_md.c:884 UCX DEBUG created indirect rkey 0x17fe00 for remote flush
[1710151844.441507] [m201:4904 :0] ib_md.c:1727 UCX DEBUG mlx5_0: md open by 'uct_ib_mlx5_devx_md_ops' is successful
[1710151844.441536] [m201:4904 :0] ib_device.c:1061 UCX DEBUG no compatible IB ports found for flags 0xc4
[1710151844.441542] [m201:4904 :0] uct_md.c:129 UCX DEBUG failed to query dc_mlx5 resources: No such device
[1710151844.442563] [m201:4904 :0] ib_device.c:1061 UCX DEBUG no compatible IB ports found for flags 0x0
[1710151844.442570] [m201:4904 :0] uct_md.c:129 UCX DEBUG failed to query rc_verbs resources: No such device
[1710151844.442576] [m201:4904 :0] ib_device.c:1061 UCX DEBUG no compatible IB ports found for flags 0x4
[1710151844.442592] [m201:4904 :0] uct_md.c:129 UCX DEBUG failed to query rc_mlx5 resources: No such device
[1710151844.442598] [m201:4904 :0] ib_device.c:1061 UCX DEBUG no compatible IB ports found for flags 0x0
[1710151844.442602] [m201:4904 :0] uct_md.c:129 UCX DEBUG failed to query ud_verbs resources: No such device
[1710151844.442608] [m201:4904 :0] ib_device.c:1061 UCX DEBUG no compatible IB ports found for flags 0x4
[1710151844.442613] [m201:4904 :0] uct_md.c:129 UCX DEBUG failed to query ud_mlx5 resources: No such device
[1710151844.442619] [m201:4904 :0] ucp_context.c:1003 UCX DEBUG No tl resources found for md mlx5_0
[1710151844.442624] [m201:4904 :0] ucp_context.c:1445 UCX DEBUG closing md mlx5_0 because it has no selected transport resources
[1710151844.443026] [m201:4904 :0] mpool.c:194 UCX DEBUG mpool devx dbrec destroyed
[1710151844.443040] [m201:4904 :0] async.c:156 UCX DEBUG removed async handler 0x55c3ad548170 [id=9 ref 1] ???() from hash
[1710151844.443046] [m201:4904 :0] async.c:546 UCX DEBUG removing async handler 0x55c3ad548170 [id=9 ref 1] ???()
[1710151844.443054] [m201:4904 :0] async.c:171 UCX DEBUG release async handler 0x55c3ad548170 [id=9 ref 0] ???()
[1710151844.443064] [m201:4904 :0] pgtable.c:618 UCX DEBUG purge empty page table
[1710151844.443069] [m201:4904 :0] mpool.c:194 UCX DEBUG mpool rcache_mp destroyed
[1710151844.443075] [m201:4904 :0] ib_device.c:614 UCX DEBUG destroying ib device mlx5_0
[1710151844.443081] [m201:4904 :0] async.c:156 UCX DEBUG removed async handler 0x55c3ad548d90 [id=5 ref 1] ???() from hash
[1710151844.443086] [m201:4904 :0] async.c:546 UCX DEBUG removing async handler 0x55c3ad548d90 [id=5 ref 1] ???()
[1710151844.443134] [m201:4904 :0] async.c:171 UCX DEBUG release async handler 0x55c3ad548d90 [id=5 ref 0] ???()
[1710151844.445716] [m201:4904 :0] uct_iface.c:905 UCX DEBUG mlx5_1: PF sysfs path is '/sys/devices/pci0000:00/0000:00:02.2/0000:05:00.1'
[1710151844.445727] [m201:4904 :0] topo.c:359 UCX DEBUG mlx5_1: bdf_name 0000:05:00.1 sys_dev 1
[1710151844.445764] [m201:4904 :0] ib_device.c:495 UCX DEBUG mlx5_1: vendor_id 0x15b3 device_id 4117
[1710151844.446131] [m201:4904 :0] ib_mlx5dv_md.c:916 UCX DEBUG mlx5dv_devx_general_cmd(QUERY_HCA_CAP) failed: Remote I/O error
[1710151844.446137] [m201:4904 :0] ib_mlx5dv_md.c:1138 UCX DEBUG mlx5_1: crossing_vhca_mkey is not supported
[1710151844.446143] [m201:4904 :0] ib_mlx5dv_md.c:728 UCX DEBUG mlx5_1: disable ODP because it's not supported for DEVX QP
[1710151844.446374] [m201:4904 :0] async.c:231 UCX DEBUG added async handler 0x55c3ad54b170 [id=5 ref 1] ???() to hash
[1710151844.446433] [m201:4904 :0] async.c:493 UCX DEBUG listening to async event fd 5 events 0x1 mode thread_spinlock
[1710151844.446439] [m201:4904 :0] ib_device.c:593 UCX DEBUG initialized device 'mlx5_1' (InfiniBand channel adapter) with 1 ports
[1710151844.446451] [m201:4904 :0] ib_md.c:1788 UCX DEBUG mlx5_1: cuda GPUDirect RDMA is disabled
[1710151844.446460] [m201:4904 :0] ib_md.c:1788 UCX DEBUG mlx5_1: rocm GPUDirect RDMA is disabled
[1710151844.446471] [m201:4904 :0] ib_md.c:1815 UCX DEBUG mlx5_1: dmabuf is supported
[1710151844.446478] [m201:4904 :0] mpool.c:136 UCX DEBUG mpool rcache_mp: align 8, maxelems 4294967295, elemsize 144
[1710151844.446491] [m201:4904 :0] async.c:231 UCX DEBUG added async handler 0x55c3ad54b830 [id=9 ref 1] ???() to hash
[1710151844.446500] [m201:4904 :0] async.c:493 UCX DEBUG listening to async event fd 9 events 0x1 mode thread_spinlock
[1710151844.446598] [m201:4904 :0] ib_md.c:1500 UCX DEBUG mlx5_1: using registration cache
[1710151844.446605] [m201:4904 :0] mpool.c:136 UCX DEBUG mpool devx dbrec: align 64, maxelems 4294967295, elemsize 40
[1710151844.446872] [m201:4904 :0] ib_mlx5dv_md.c:1219 UCX DEBUG mlx5_1: opened DEVX md
[1710151844.447046] [m201:4904 :0] ib_mlx5dv_md.c:884 UCX DEBUG created indirect rkey 0x1fee00 for remote flush
[1710151844.447051] [m201:4904 :0] ib_md.c:1727 UCX DEBUG mlx5_1: md open by 'uct_ib_mlx5_devx_md_ops' is successful
[1710151844.447076] [m201:4904 :0] ib_device.c:1061 UCX DEBUG no compatible IB ports found for flags 0xc4
[1710151844.447081] [m201:4904 :0] uct_md.c:129 UCX DEBUG failed to query dc_mlx5 resources: No such device
[1710151844.448070] [m201:4904 :0] ucp_context.c:1445 UCX DEBUG closing md cma because it has no selected transport resources
[1710151844.448081] [m201:4904 :0] ucp_context.c:1478 UCX DEBUG register host memory on: mlx5_1
[1710151844.448097] [m201:4904 :0] ucp_context.c:1478 UCX DEBUG register cuda memory on:
[1710151844.448103] [m201:4904 :0] ucp_context.c:1478 UCX DEBUG register cuda-managed memory on:
[1710151844.448108] [m201:4904 :0] ucp_context.c:1478 UCX DEBUG register rocm memory on:
[1710151844.448113] [m201:4904 :0] ucp_context.c:1478 UCX DEBUG register rocm-managed memory on:
[1710151844.448126] [m201:4904 :0] mpool.c:136 UCX DEBUG mpool rcache_mp: align 8, maxelems 4294967295, elemsize 144
[1710151844.448212] [m201:4904 :0] ucp_context.c:2038 UCX DEBUG created ucp context 0x55c3ad541000 0x55c3ad541000 [1 mds 2 tls] features 0x1 tl bitmap 0x3 0x0
[1710151844.448234] [m201:4904 :0] uct_mem.c:293 UCX DEBUG could not allocate memory with any of the provided methods
[1710151844.448240] [m201:4904 :0] uct_mem.c:293 UCX DEBUG could not allocate memory with any of the provided methods
[1710151844.448282] [m201:4904 :0] uct_mem.c:293 UCX DEBUG could not allocate memory with any of the provided methods
[1710151844.448300] [m201:4904 :0] uct_mem.c:293 UCX DEBUG could not allocate memory with any of the provided methods
[1710151844.448307] [m201:4904 :0] uct_mem.c:293 UCX DEBUG could not allocate memory with any of the provided methods
[1710151844.448354] [m201:4904 :0] uct_mem.c:293 UCX DEBUG could not allocate memory with any of the provided methods
[1710151844.448361] [m201:4904 :0] uct_mem.c:293 UCX DEBUG could not allocate memory with any of the provided methods
[1710151844.448366] [m201:4904 :0] uct_mem.c:293 UCX DEBUG could not allocate memory with any of the provided methods
[1710151844.448383] [m201:4904 :0] uct_mem.c:293 UCX DEBUG could not allocate memory with any of the provided methods
[1710151844.448390] [m201:4904 :0] uct_mem.c:293 UCX DEBUG could not allocate memory with any of the provided methods
[1710151844.448615] [m201:4904 :0] ib_iface.c:869 UCX DEBUG using pkey[0] 0xffff on mlx5_1:1/RoCE
[1710151844.448696] [m201:4904 :0] ib_device.c:924 UCX DEBUG mlx5_1:1 using gid_index 3
[1710151844.449351] [m201:4904 :0] ib_iface.c:1397 UCX DEBUG created uct_ib_iface_t headroom_ofs 12 payload_ofs 92 hdr_ofs 44 data_sz 1064
[1710151844.449686] [m201:4904 :0] ib_iface.c:987 UCX DEBUG iface=0x55c3ad59d5f0: created UD QP 0x922 on mlx5_1:1 TX wr:341 sge:6 inl:124 resp:0 RX wr:4096 sge:1 resp:0
[1710151844.450006] [m201:4904 :0] mpool.c:136 UCX DEBUG mpool ud_recv_skb: align 64, maxelems 4294967295, elemsize 1116
[1710151844.450110] [m201:4904 :0] mpool.c:281 UCX DEBUG mpool ud_recv_skb: allocated chunk 0x7fb3e4b09018 of 151528 bytes with 131 elements
[1710151844.450120] [m201:4904 :0] mpool.c:136 UCX DEBUG mpool ud_tx_skb: align 64, maxelems 4294967295, elemsize 1136
[1710151844.450146] [m201:4904 :0] ud_iface.c:380 UCX DEBUG iface 0x55c3ad59d5f0: adding gid fe80::bace:f6ff:fe72:4be9 to hash on device mlx5_1 port 1 index 0)
[1710151844.450168] [m201:4904 :0] ud_iface.c:380 UCX DEBUG iface 0x55c3ad59d5f0: adding gid fe80::bace:f6ff:fe72:4be9 to hash on device mlx5_1 port 1 index 1)
[1710151844.450189] [m201:4904 :0] ud_iface.c:380 UCX DEBUG iface 0x55c3ad59d5f0: adding gid ::ffff:10.2.0.201 to hash on device mlx5_1 port 1 index 2)
[1710151844.450209] [m201:4904 :0] ud_iface.c:380 UCX DEBUG iface 0x55c3ad59d5f0: adding gid ::ffff:10.2.0.201 to hash on device mlx5_1 port 1 index 3)
[1710151844.450241] [m201:4904 :0] ud_iface.c:380 UCX DEBUG iface 0x55c3ad59d5f0: adding gid fe80::6531:5c5f:9f1c:8cf0 to hash on device mlx5_1 port 1 index 4)
[1710151844.450264] [m201:4904 :0] ud_iface.c:380 UCX DEBUG iface 0x55c3ad59d5f0: adding gid fe80::6531:5c5f:9f1c:8cf0 to hash on device mlx5_1 port 1 index 5)
[1710151844.452833] [m201:4904 :0] timer_wheel.c:40 UCX DEBUG high res timer created log=24 resolution=4793.490286 usec wanted: 2500.000000 usec
[1710151844.452847] [m201:4904 :0] async.c:231 UCX DEBUG added async handler 0x55c3ad54efd0 [id=11 ref 1] ???() to hash
[1710151844.452870] [m201:4904 :0] async.c:493 UCX DEBUG listening to async event fd 11 events 0x5 mode thread_spinlock
[1710151844.452888] [m201:4904 :0] ucp_worker.c:1328 UCX DEBUG created interface[0]=0x55c3ad59d5f0 using ud_verbs/mlx5_1:1 on worker 0x55c3ad5a2480
[1710151844.455426] [m201:4904 :0] ib_iface.c:869 UCX DEBUG using pkey[0] 0xffff on mlx5_1:1/RoCE
[1710151844.455512] [m201:4904 :0] ib_device.c:924 UCX DEBUG mlx5_1:1 using gid_index 3
[1710151844.455540] [m201:4904 :0] ib_device.c:1248 UCX DEBUG max IB CQE size is 128
[1710151844.456222] [m201:4904 :0] ib_iface.c:1397 UCX DEBUG created uct_ib_iface_t headroom_ofs 12 payload_ofs 92 hdr_ofs 44 data_sz 1064
[1710151844.456523] [m201:4904 :0] ib_iface.c:987 UCX DEBUG iface=0x55c3ad5c0030: created UD QP 0x924 on mlx5_1:1 TX wr:256 sge:2 inl:60 resp:0 RX wr:4096 sge:1 resp:0
[1710151844.456532] [m201:4904 :0] ib_mlx5.c:706 UCX DEBUG tx wq 32768 bytes [bb=64, nwqe=512] mmio_mode bf_post
[1710151844.456813] [m201:4904 :0] mpool.c:136 UCX DEBUG mpool ud_recv_skb: align 64, maxelems 4294967295, elemsize 1116
[1710151844.456919] [m201:4904 :0] mpool.c:281 UCX DEBUG mpool ud_recv_skb: allocated chunk 0x7fb3e4a62018 of 151528 bytes with 131 elements
[1710151844.456930] [m201:4904 :0] mpool.c:136 UCX DEBUG mpool ud_tx_skb: align 64, maxelems 4294967295, elemsize 1136
[1710151844.456954] [m201:4904 :0] ud_iface.c:380 UCX DEBUG iface 0x55c3ad5c0030: adding gid fe80::bace:f6ff:fe72:4be9 to hash on device mlx5_1 port 1 index 0)
[1710151844.456976] [m201:4904 :0] ud_iface.c:380 UCX DEBUG iface 0x55c3ad5c0030: adding gid fe80::bace:f6ff:fe72:4be9 to hash on device mlx5_1 port 1 index 1)
[1710151844.456997] [m201:4904 :0] ud_iface.c:380 UCX DEBUG iface 0x55c3ad5c0030: adding gid ::ffff:10.2.0.201 to hash on device mlx5_1 port 1 index 2)
[1710151844.457013] [m201:4904 :0] ud_iface.c:380 UCX DEBUG iface 0x55c3ad5c0030: adding gid ::ffff:10.2.0.201 to hash on device mlx5_1 port 1 index 3)
[1710151844.457029] [m201:4904 :0] ud_iface.c:380 UCX DEBUG iface 0x55c3ad5c0030: adding gid fe80::6531:5c5f:9f1c:8cf0 to hash on device mlx5_1 port 1 index 4)
[1710151844.457043] [m201:4904 :0] ud_iface.c:380 UCX DEBUG iface 0x55c3ad5c0030: adding gid fe80::6531:5c5f:9f1c:8cf0 to hash on device mlx5_1 port 1 index 5)
[1710151844.459392] [m201:4904 :0] ib_mlx5.c:1028 UCX DEBUG SL=0 (AR support - no) was selected on mlx5_1:1, SLs with AR support = { <none> }, SLs without AR support = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 }
[1710151844.459409] [m201:4904 :0] timer_wheel.c:40 UCX DEBUG high res timer created log=24 resolution=4793.490286 usec wanted: 2500.000000 usec
[1710151844.459421] [m201:4904 :0] async.c:231 UCX DEBUG added async handler 0x55c3ad598d20 [id=12 ref 1] ???() to hash
[1710151844.459429] [m201:4904 :0] async.c:493 UCX DEBUG listening to async event fd 12 events 0x5 mode thread_spinlock
[1710151844.459440] [m201:4904 :0] ucp_worker.c:1328 UCX DEBUG created interface[1]=0x55c3ad5c0030 using ud_mlx5/mlx5_1:1 on worker 0x55c3ad5a2480
[1710151844.459445] [m201:4904 :0] ucp_worker.c:1131 UCX DEBUG selected scalable tl bitmap: 0x3 0x0 (2 tls)
[1710151844.459644] [m201:4904 :0] async.c:231 UCX DEBUG added async handler 0x55c3ad5f0fa0 [id=13 ref 1] ???() to hash
[1710151844.459655] [m201:4904 :0] async.c:493 UCX DEBUG listening to async event fd 13 events 0x1 mode thread_spinlock
[1710151844.459659] [m201:4904 :0] rdmacm_cm.c:974 UCX DEBUG created rdmacm_cm 0x55c3ad5f08f0 with event_channel 0x55c3ad5515d0 (fd=13)
[1710151844.459669] [m201:4904 :0] tcp_sockcm.c:223 UCX DEBUG created tcp_sockcm 0x55c3ad60f7f0
[1710151844.459675] [m201:4904 :0] mpool.c:136 UCX DEBUG mpool ucp_requests: align 64, maxelems 4294967295, elemsize 280
[1710151844.459679] [m201:4904 :0] mpool.c:136 UCX DEBUG mpool ucp_rkeys: align 64, maxelems 4294967295, elemsize 104
[1710151844.459682] [m201:4904 :0] mpool.c:136 UCX DEBUG mpool ucp_reg_bufs: align 64, maxelems 4294967295, elemsize 8216
[1710151844.459687] [m201:4904 :0] mpool.c:136 UCX DEBUG mpool ucp_am_bufs: align 64, maxelems 4294967295, elemsize 153
[1710151844.459690] [m201:4904 :0] mpool.c:136 UCX DEBUG mpool ucp_am_bufs: align 64, maxelems 4294967295, elemsize 1105
[1710151844.459694] [m201:4904 :0] mpool_set.c:129 UCX DEBUG mpool_set:ucp_am_bufs, sizes map 0x80000040, largest size 1016, mpools num 2
[1710151844.459730] [m201:4904 :0] parser.c:1998 UCX INFO UCX_* env variables: UCX_TLS=ud UCX_NET_DEVICES=mlx5_1:1 UCX_LOG_LEVEL=TRACE
[1710151844.459868] [m201:4904 :0] ucp_ep.c:362 UCX DEBUG created ep 0x7fb3e4a42000 to <no debug data> from api call
[1710151844.459908] [m201:4904 :0] ucp_worker.c:1783 UCX INFO ep_cfg[0]: tag(ud_verbs/mlx5_1:1)
[1710151844.459927] [m201:4904 :0] wireup.c:1192 UCX DEBUG ep 0x7fb3e4a42000: am_lane 0 wireup_msg_lane 0 cm_lane <none> keepalive_lane <none> reachable_mds 0x1
[1710151844.459934] [m201:4904 :0] wireup.c:1215 UCX DEBUG ep 0x7fb3e4a42000: lane[0]: 0:ud_verbs/mlx5_1:1.0 md[0] -> addr[0].md[0]/ib/sysdev[255] am am_bw#0 wireup
[1710151844.459942] [m201:4904 :0] ib_iface.c:755 UCX DEBUG iface 0x55c3ad59d5f0: ah_attr dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:10.2.0.51 sgid_index=3 traffic_class=106
[1710151844.459966] [m201:4904 :0] ib_iface.c:755 UCX DEBUG iface 0x55c3ad59d5f0: ah_attr dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:10.2.0.51 sgid_index=3 traffic_class=106
[1710151844.459974] [m201:4904 :0] ud_ep.c:406 UCX DEBUG created ep ep=0x55c3ad550e10 iface=0x55c3ad59d5f0 id=0
[1710151844.459980] [m201:4904 :0] ud_ep.c:510 UCX DEBUG mlx5_1:1/RoCE lid 0 qpn 0x922 epid 0 ep 0x55c3ad550e10 connected to IFACE ::ffff:10.2.0.51mtu 1024 pkey 0xffff qpn 0x14
[1710151844.459985] [m201:4904 :0] ib_iface.c:755 UCX DEBUG iface 0x55c3ad59d5f0: ah_attr dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:10.2.0.51 sgid_index=3 traffic_class=106
[1710151844.459990] [m201:4904 :0] ib_iface.c:755 UCX DEBUG iface 0x55c3ad59d5f0: ah_attr dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:10.2.0.51 sgid_index=3 traffic_class=106
[1710151844.460265] [m201:4904 :0] mpool.c:281 UCX DEBUG mpool ud_tx_skb: allocated chunk 0x7fb3e4800018 of 2097128 bytes with 1820 elements
[1710151844.460322] [m201:4904 :0] async.c:231 UCX DEBUG added async handler 0x55c3ad598f70 [id=1000007 ref 1] ???() to hash
[1710151844.461249] [m201:4904 :a] ib_iface.c:755 UCX DEBUG iface 0x55c3ad59d5f0: ah_attr dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:10.2.0.51 sgid_index=3 traffic_class=106
[1710151844.461280] [m201:4904 :a] ud_ep.c:821 UCX DEBUG simultaneous CREQ ep=0x55c3ad550e10(iface=0x55c3ad59d5f0 conn_sn=0 ep_id=0, dest_ep_id=0 rx_psn=1)
[1710151844.461973] [m201:4904 :a] mpool.c:281 UCX DEBUG mpool ud_recv_skb: allocated chunk 0x7fb3df800018 of 6291432 bytes with 5461 elements
[1710151844.498307] [m201:4904 :a] ib_iface.c:755 UCX DEBUG iface 0x55c3ad59d5f0: ah_attr dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:10.2.0.51 sgid_index=3 traffic_class=106
[1710151844.502378] [m201:4904 :0] mpool.c:281 UCX DEBUG mpool ucp_requests: allocated chunk 0x55c3ad634214 of 41044 bytes with 128 elements
[1710151844.502395] [m201:4904 :0] flush.c:377 UCX DEBUG flush_worker ep 0x7fb3e4a42000
[1710151844.502404] [m201:4904 :0] ud_ep.c:93 UCX DEBUG ep: 0x55c3ad550e10 ca drop@cwnd = 3 in flight: 1
[1710151844.502408] [m201:4904 :0] ud_ep.c:1427 UCX DEBUG ep(0x55c3ad550e10): resending rt_psn 2 rt_max_psn 2 acked_psn 1 max_psn 3 ack_req 1
[1710151844.502412] [m201:4904 :0] ud_ep.c:1433 UCX DEBUG ep(0x55c3ad550e10): resending completed
[1710151844.537652] [m201:4904 :0] ib_iface.c:755 UCX DEBUG iface 0x55c3ad59d5f0: ah_attr dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:10.2.0.51 sgid_index=3 traffic_class=106
[1710151844.607868] [m201:4904 :0] ud_ep.c:93 UCX DEBUG ep: 0x55c3ad550e10 ca drop@cwnd = 2 in flight: 1
[1710151844.607873] [m201:4904 :0] ud_ep.c:1427 UCX DEBUG ep(0x55c3ad550e10): resending rt_psn 2 rt_max_psn 2 acked_psn 1 max_psn 3 ack_req 1
[1710151844.607875] [m201:4904 :0] ud_ep.c:1433 UCX DEBUG ep(0x55c3ad550e10): resending completed
[1710151844.616596] [m201:4904 :0] ib_iface.c:755 UCX DEBUG iface 0x55c3ad59d5f0: ah_attr dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:10.2.0.51 sgid_index=3 traffic_class=106
[1710151844.766065] [m201:4904 :0] ud_ep.c:93 UCX DEBUG ep: 0x55c3ad550e10 ca drop@cwnd = 2 in flight: 1
[1710151844.766071] [m201:4904 :0] ud_ep.c:1427 UCX DEBUG ep(0x55c3ad550e10): resending rt_psn 2 rt_max_psn 2 acked_psn 1 max_psn 3 ack_req 1
[1710151844.766074] [m201:4904 :0] ud_ep.c:1433 UCX DEBUG ep(0x55c3ad550e10): resending completed
[1710151844.774512] [m201:4904 :0] ib_iface.c:755 UCX DEBUG iface 0x55c3ad59d5f0: ah_attr dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:10.2.0.51 sgid_index=3 traffic_class=106
[1710151845.082456] [m201:4904 :0] ud_ep.c:93 UCX DEBUG ep: 0x55c3ad550e10 ca drop@cwnd = 2 in flight: 1
[1710151845.082462] [m201:4904 :0] ud_ep.c:1427 UCX DEBUG ep(0x55c3ad550e10): resending rt_psn 2 rt_max_psn 2 acked_psn 1 max_psn 3 ack_req 1
[1710151845.082464] [m201:4904 :0] ud_ep.c:1433 UCX DEBUG ep(0x55c3ad550e10): resending completed
[1710151845.090349] [m201:4904 :0] ib_iface.c:755 UCX DEBUG iface 0x55c3ad59d5f0: ah_attr dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:10.2.0.51 sgid_index=3 traffic_class=106
[1710151845.720036] [m201:4904 :0] ud_ep.c:93 UCX DEBUG ep: 0x55c3ad550e10 ca drop@cwnd = 2 in flight: 1
[1710151845.720042] [m201:4904 :0] ud_ep.c:1427 UCX DEBUG ep(0x55c3ad550e10): resending rt_psn 2 rt_max_psn 2 acked_psn 1 max_psn 3 ack_req 1
[1710151845.720045] [m201:4904 :0] ud_ep.c:1433 UCX DEBUG ep(0x55c3ad550e10): resending completed
[1710151845.726979] [m201:4904 :0] ib_iface.c:755 UCX DEBUG iface 0x55c3ad59d5f0: ah_attr dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:10.2.0.51 sgid_index=3 traffic_class=106
[1710151846.999981] [m201:4904 :0] ud_ep.c:93 UCX DEBUG ep: 0x55c3ad550e10 ca drop@cwnd = 2 in flight: 1
[1710151846.999988] [m201:4904 :0] ud_ep.c:1427 UCX DEBUG ep(0x55c3ad550e10): resending rt_psn 2 rt_max_psn 2 acked_psn 1 max_psn 3 ack_req 1
[1710151847.000003] [m201:4904 :0] ud_ep.c:1433 UCX DEBUG ep(0x55c3ad550e10): resending completed
[1710151847.005112] [m201:4904 :0] ib_iface.c:755 UCX DEBUG iface 0x55c3ad59d5f0: ah_attr dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:10.2.0.51 sgid_index=3 traffic_class=106
[1710151849.559874] [m201:4904 :0] ud_ep.c:93 UCX DEBUG ep: 0x55c3ad550e10 ca drop@cwnd = 2 in flight: 1
[1710151849.559882] [m201:4904 :0] ud_ep.c:1427 UCX DEBUG ep(0x55c3ad550e10): resending rt_psn 2 rt_max_psn 2 acked_psn 1 max_psn 3 ack_req 1
[1710151849.559897] [m201:4904 :0] ud_ep.c:1433 UCX DEBUG ep(0x55c3ad550e10): resending completed
[1710151849.561416] [m201:4904 :0] ib_iface.c:755 UCX DEBUG iface 0x55c3ad59d5f0: ah_attr dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:10.2.0.51 sgid_index=3 traffic_class=106
[1710151854.463944] [m201:4904 :0] ud_ep.c:93 UCX DEBUG ep: 0x55c3ad550e10 ca drop@cwnd = 2 in flight: 1
[1710151854.463951] [m201:4904 :0] ud_ep.c:1427 UCX DEBUG ep(0x55c3ad550e10): resending rt_psn 2 rt_max_psn 2 acked_psn 1 max_psn 3 ack_req 1
[1710151854.463954] [m201:4904 :0] ud_ep.c:1433 UCX DEBUG ep(0x55c3ad550e10): resending completed
[1710151854.609858] [m201:4904 :0] ib_iface.c:755 UCX DEBUG iface 0x55c3ad59d5f0: ah_attr dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:10.2.0.51 sgid_index=3 traffic_class=106
[1710151859.368026] [m201:4904 :0] ud_ep.c:93 UCX DEBUG ep: 0x55c3ad550e10 ca drop@cwnd = 2 in flight: 1
[1710151859.368033] [m201:4904 :0] ud_ep.c:1427 UCX DEBUG ep(0x55c3ad550e10): resending rt_psn 2 rt_max_psn 2 acked_psn 1 max_psn 3 ack_req 1
[1710151859.368048] [m201:4904 :0] ud_ep.c:1433 UCX DEBUG ep(0x55c3ad550e10): resending completed
[1710151859.658278] [m201:4904 :0] ib_iface.c:755 UCX DEBUG iface 0x55c3ad59d5f0: ah_attr dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:10.2.0.51 sgid_index=3 traffic_class=106
[1710151864.272083] [m201:4904 :0] ud_ep.c:93 UCX DEBUG ep: 0x55c3ad550e10 ca drop@cwnd = 2 in flight: 1
[1710151864.272090] [m201:4904 :0] ud_ep.c:1427 UCX DEBUG ep(0x55c3ad550e10): resending rt_psn 2 rt_max_psn 2 acked_psn 1 max_psn 3 ack_req 1
[1710151864.272106] [m201:4904 :0] ud_ep.c:1433 UCX DEBUG ep(0x55c3ad550e10): resending completed
[1710151864.706729] [m201:4904 :0] ib_iface.c:755 UCX DEBUG iface 0x55c3ad59d5f0: ah_attr dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:10.2.0.51 sgid_index=3 traffic_class=106
[1710151869.176178] [m201:4904 :0] ud_ep.c:93 UCX DEBUG ep: 0x55c3ad550e10 ca drop@cwnd = 2 in flight: 1
[1710151869.176186] [m201:4904 :0] ud_ep.c:1427 UCX DEBUG ep(0x55c3ad550e10): resending rt_psn 2 rt_max_psn 2 acked_psn 1 max_psn 3 ack_req 1
[1710151869.176189] [m201:4904 :0] ud_ep.c:1433 UCX DEBUG ep(0x55c3ad550e10): resending completed
[1710151869.755174] [m201:4904 :0] ib_iface.c:755 UCX DEBUG iface 0x55c3ad59d5f0: ah_attr dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:10.2.0.51 sgid_index=3 traffic_class=106
[1710151874.080282] [m201:4904 :0] ud_ep.c:93 UCX DEBUG ep: 0x55c3ad550e10 ca drop@cwnd = 2 in flight: 1
[1710151874.080289] [m201:4904 :0] ud_ep.c:1427 UCX DEBUG ep(0x55c3ad550e10): resending rt_psn 2 rt_max_psn 2 acked_psn 1 max_psn 3 ack_req 1
[1710151874.080292] [m201:4904 :0] ud_ep.c:1433 UCX DEBUG ep(0x55c3ad550e10): resending completed
[1710151874.803599] [m201:4904 :0] ib_iface.c:755 UCX DEBUG iface 0x55c3ad59d5f0: ah_attr dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:10.2.0.51 sgid_index=3 traffic_class=106
[1710151878.984366] [m201:4904 :0] ud_ep.c:93 UCX DEBUG ep: 0x55c3ad550e10 ca drop@cwnd = 2 in flight: 1
[1710151878.984376] [m201:4904 :0] ud_ep.c:375 UCX DEBUG ep 0x55c3ad550e10: timeout of 34.52 sec, config::peer_timeout - 30.00 sec
[1710151878.984392] [m201:4904 :0] ud_ep.c:1427 UCX DEBUG ep(0x55c3ad550e10): resending rt_psn 2 rt_max_psn 2 acked_psn 1 max_psn 3 ack_req 1
[1710151878.984395] [m201:4904 :0] ud_ep.c:1433 UCX DEBUG ep(0x55c3ad550e10): resending completed
[1710151878.984401] [m201:4904 :0] ucp_worker.c:530 UCX DEBUG worker 0x55c3ad5a2480: error handler called for UCT EP 0x55c3ad550e10: Endpoint timeout
[1710151878.984406] [m201:4904 :0] ucp_ep.c:1410 UCX DEBUG ep 0x7fb3e4a42000: set_ep_failed status Endpoint timeout on lane[0]=0x55c3ad550e10
[1710151878.984410] [m201:4904 :0] ucp_ep.c:1373 UCX DEBUG ep 0x7fb3e4a42000: discarding lanes
[1710151878.984414] [m201:4904 :0] ucp_ep.c:1381 UCX DEBUG ep 0x7fb3e4a42000: discard uct_ep[0]=0x55c3ad550e10
[1710151878.984420] [m201:4904 :0] ucp_ep.c:1449 UCX DIAG ep 0x7fb3e4a42000: error 'Endpoint timeout' on ud_verbs/mlx5_1:1 will not be handled since no error callback is installed
[m201:4904 :0:4904] ud_ep.c:278 Fatal: UD endpoint 0x55c3ad550e10 to <no debug data>: unhandled timeout error
==== backtrace (tid: 4904) ====
0 /lib64/libucs.so.0(ucs_handle_error+0x2e4) [0x7fb3e5a40d74]
1 /lib64/libucs.so.0(ucs_fatal_error_message+0xbc) [0x7fb3e5a40f2c]
2 /lib64/libucs.so.0(ucs_fatal_error_format+0x111) [0x7fb3e5a41051]
3 /lib64/ucx/libuct_ib.so.0(+0x57440) [0x7fb3e55cd440]
4 /lib64/libucs.so.0(+0x18917) [0x7fb3e5a30917]
5 /lib64/libucp.so.0(ucp_worker_progress+0x2a) [0x7fb3e5af5f6a]
6 /lib64/libucp.so.0(ucp_worker_flush+0x48) [0x7fb3e5b13d08]
7 ucx_perftest(+0xde16) [0x55c3acad9e16]
8 ucx_perftest(+0x8bce) [0x55c3acad4bce]
9 ucx_perftest(+0x5aa4) [0x55c3acad1aa4]
10 /lib64/libc.so.6(+0x3feb0) [0x7fb3e563feb0]
11 /lib64/libc.so.6(__libc_start_main+0x80) [0x7fb3e563ff60]
12 ucx_perftest(+0x63b5) [0x55c3acad23b5]
=================================
timeout: the monitored command dumped core
Aborted
[root@m201 ~]#
[root@m051 ~]# UCX_LOG_LEVEL=TRACE UCX_TLS=ud UCX_NET_DEVICES=irdma1:1 timeout 5m ucx_perftest -t tag_bw 10.2.0.201
[1710151844.415295] [m051:6032 :0] debug.c:1146 UCX DEBUG using signal stack 0x7f59f82ae000 size 141824
[1710151844.427592] [m051:6032 :0] init.c:118 UCX DEBUG /lib64/libucs.so.0 loaded at 0x7f59f843c000
[1710151844.427654] [m051:6032 :0] init.c:120 UCX DEBUG cmd line: ucx_perftest -t tag_bw 10.2.0.201
[1710151844.427675] [m051:6032 :0] module.c:72 UCX DEBUG ucs library path: /lib64/libucs.so.0
[1710151844.427684] [m051:6032 :0] module.c:280 UCX DEBUG loading modules for ucs
[1710151844.427717] [m051:6032 :0] module.c:280 UCX DEBUG loading modules for ucx_perftest
[1710151844.427773] [m051:6032 :0] perftest.c:921 UCX WARN CPU affinity is not set (bound to 40 cpus). Performance may be impacted.
+--------------+--------------+------------------------------+---------------------+-----------------------+
| | | overhead (usec) | bandwidth (MB/s) | message rate (msg/s) |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
| Stage | # iterations | 50.0%ile | average | overall | average | overall | average | overall |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
[1710151844.428402] [m051:6032 :0] libperf.c:213 UCX DEBUG set send allocator by send mem type host
[1710151844.428422] [m051:6032 :0] libperf.c:217 UCX DEBUG set recv allocator by recv mem type host
[1710151844.428434] [m051:6032 :0] time.c:22 UCX DEBUG arch clock frequency: 1700000000.00 Hz
[1710151844.428526] [m051:6032 :0] ucp_context.c:1969 UCX INFO Version 1.14.1 (loaded from /lib64/libucp.so.0)
[1710151844.428539] [m051:6032 :0] ucp_context.c:1739 UCX DEBUG estimated number of endpoints is 1
[1710151844.428545] [m051:6032 :0] ucp_context.c:1746 UCX DEBUG estimated number of endpoints per node is 1
[1710151844.428554] [m051:6032 :0] ucp_context.c:1753 UCX DEBUG estimated bcopy bandwidth is 6081740800.000000
[1710151844.428567] [m051:6032 :0] ucp_context.c:1813 UCX DEBUG allocation method[0] is md 'sysv'
[1710151844.428574] [m051:6032 :0] ucp_context.c:1813 UCX DEBUG allocation method[1] is md 'posix'
[1710151844.428583] [m051:6032 :0] ucp_context.c:1825 UCX DEBUG allocation method[2] is 'huge'
[1710151844.428589] [m051:6032 :0] ucp_context.c:1825 UCX DEBUG allocation method[3] is 'thp'
[1710151844.428595] [m051:6032 :0] ucp_context.c:1813 UCX DEBUG allocation method[4] is md '*'
[1710151844.428601] [m051:6032 :0] ucp_context.c:1825 UCX DEBUG allocation method[5] is 'mmap'
[1710151844.428607] [m051:6032 :0] ucp_context.c:1825 UCX DEBUG allocation method[6] is 'heap'
[1710151844.428618] [m051:6032 :0] module.c:280 UCX DEBUG loading modules for uct
[1710151844.430623] [m051:6032 :0] module.c:280 UCX DEBUG loading modules for uct_ib
[1710151844.432401] [m051:6032 :0] ucp_context.c:1445 UCX DEBUG closing md self because it has no selected transport resources
[1710151844.432902] [m051:6032 :0] uct_iface.c:905 UCX DEBUG e810_roce_p2: PF sysfs path is '/sys/devices/pci0000:00/0000:00:02.0/0000:06:00.1'
[1710151844.432933] [m051:6032 :0] topo.c:176 UCX DEBUG added sys_dev 0 for bus id 06:00.1
[1710151844.432940] [m051:6032 :0] topo.c:359 UCX DEBUG e810_roce_p2: bdf_name 0000:06:00.1 sys_dev 0
[1710151844.433307] [m051:6032 :0] uct_iface.c:905 UCX DEBUG enp4s0f0: PF sysfs path is '/sys/devices/pci0000:00/0000:00:01.1/0000:04:00.0'
[1710151844.433321] [m051:6032 :0] topo.c:176 UCX DEBUG added sys_dev 1 for bus id 04:00.0
[1710151844.433327] [m051:6032 :0] topo.c:359 UCX DEBUG enp4s0f0: bdf_name 0000:04:00.0 sys_dev 1
[1710151844.433709] [m051:6032 :0] uct_iface.c:905 UCX DEBUG enp4s0f1: PF sysfs path is '/sys/devices/pci0000:00/0000:00:01.1/0000:04:00.1'
[1710151844.433723] [m051:6032 :0] topo.c:176 UCX DEBUG added sys_dev 2 for bus id 04:00.1
[1710151844.433745] [m051:6032 :0] topo.c:359 UCX DEBUG enp4s0f1: bdf_name 0000:04:00.1 sys_dev 2
[1710151844.434097] [m051:6032 :0] uct_iface.c:901 UCX DEBUG lo: sysfs path undetected
[1710151844.434107] [m051:6032 :0] topo.c:363 UCX DEBUG lo: system device unknown
[1710151844.434751] [m051:6032 :0] ucp_context.c:1445 UCX DEBUG closing md tcp because it has no selected transport resources
[1710151844.434815] [m051:6032 :0] ucp_context.c:1445 UCX DEBUG closing md sysv because it has no selected transport resources
[1710151844.434925] [m051:6032 :0] ucp_context.c:1445 UCX DEBUG closing md posix because it has no selected transport resources
[1710151844.435234] [m051:6032 :0] ib_mlx5dv_md.c:1255 UCX DEBUG irdma0: DEVX is not supported
[1710151844.435245] [m051:6032 :0] ib_md.c:1734 UCX DEBUG irdma0: md open by 'uct_ib_mlx5_devx_md_ops' failed, trying next
[1710151844.435251] [m051:6032 :0] ib_md.c:1734 UCX DEBUG irdma0: md open by 'uct_ib_mlx5_md_ops' failed, trying next
[1710151844.435744] [m051:6032 :0] uct_iface.c:905 UCX DEBUG irdma0: PF sysfs path is '/sys/devices/pci0000:00/0000:00:02.0/0000:06:00.0'
[1710151844.435773] [m051:6032 :0] topo.c:176 UCX DEBUG added sys_dev 3 for bus id 06:00.0
[1710151844.435780] [m051:6032 :0] topo.c:359 UCX DEBUG irdma0: bdf_name 0000:06:00.0 sys_dev 3
[1710151844.435816] [m051:6032 :0] ib_device.c:495 UCX DEBUG irdma0: vendor_id 0x8086 device_id 5522
[1710151844.435935] [m051:6032 :0] async.c:231 UCX DEBUG added async handler 0x5617c450c4d0 [id=5 ref 1] ???() to hash
[1710151844.436067] [m051:6032 :0] async.c:493 UCX DEBUG listening to async event fd 5 events 0x1 mode thread_spinlock
[1710151844.436078] [m051:6032 :0] ib_device.c:593 UCX DEBUG initialized device 'irdma0' (InfiniBand channel adapter) with 1 ports
[1710151844.436093] [m051:6032 :0] ib_md.c:1788 UCX DEBUG irdma0: cuda GPUDirect RDMA is disabled
[1710151844.436106] [m051:6032 :0] ib_md.c:1788 UCX DEBUG irdma0: rocm GPUDirect RDMA is disabled
[1710151844.436125] [m051:6032 :0] ib_md.c:1815 UCX DEBUG irdma0: dmabuf is supported
[1710151844.436136] [m051:6032 :0] mpool.c:136 UCX DEBUG mpool rcache_mp: align 8, maxelems 4294967295, elemsize 144
[1710151844.436172] [m051:6032 :0] async.c:231 UCX DEBUG added async handler 0x5617c4514e40 [id=9 ref 1] ???() to hash
[1710151844.436186] [m051:6032 :0] async.c:493 UCX DEBUG listening to async event fd 9 events 0x1 mode thread_spinlock
[1710151844.436408] [m051:6032 :0] module.c:280 UCX DEBUG loading modules for ucm
[1710151844.436421] [m051:6032 :0] ib_md.c:1500 UCX DEBUG irdma0: using registration cache
[1710151844.436740] [m051:6032 :0] ib_md.c:1727 UCX DEBUG irdma0: md open by 'uct_ib_verbs_md_ops' is successful
[1710151844.436792] [m051:6032 :0] uct_md.c:129 UCX DEBUG failed to query dc_mlx5 resources: No such device
[1710151844.437200] [m051:6032 :0] ib_device.c:1061 UCX DEBUG no compatible IB ports found for flags 0x0
[1710151844.437211] [m051:6032 :0] uct_md.c:129 UCX DEBUG failed to query rc_verbs resources: No such device
[1710151844.437217] [m051:6032 :0] uct_md.c:129 UCX DEBUG failed to query rc_mlx5 resources: No such device
[1710151844.437238] [m051:6032 :0] ib_device.c:1061 UCX DEBUG no compatible IB ports found for flags 0x0
[1710151844.437244] [m051:6032 :0] uct_md.c:129 UCX DEBUG failed to query ud_verbs resources: No such device
[1710151844.437251] [m051:6032 :0] uct_md.c:129 UCX DEBUG failed to query ud_mlx5 resources: No such device
[1710151844.437257] [m051:6032 :0] ucp_context.c:1003 UCX DEBUG No tl resources found for md irdma0
[1710151844.437263] [m051:6032 :0] ucp_context.c:1445 UCX DEBUG closing md irdma0 because it has no selected transport resources
[1710151844.437287] [m051:6032 :0] async.c:156 UCX DEBUG removed async handler 0x5617c4514e40 [id=9 ref 1] ???() from hash
[1710151844.437295] [m051:6032 :0] async.c:546 UCX DEBUG removing async handler 0x5617c4514e40 [id=9 ref 1] ???()
[1710151844.437305] [m051:6032 :0] async.c:171 UCX DEBUG release async handler 0x5617c4514e40 [id=9 ref 0] ???()
[1710151844.437320] [m051:6032 :0] pgtable.c:618 UCX DEBUG purge empty page table
[1710151844.437327] [m051:6032 :0] mpool.c:194 UCX DEBUG mpool rcache_mp destroyed
[1710151844.437335] [m051:6032 :0] ib_device.c:614 UCX DEBUG destroying ib device irdma0
[1710151844.437342] [m051:6032 :0] async.c:156 UCX DEBUG removed async handler 0x5617c450c4d0 [id=5 ref 1] ???() from hash
[1710151844.437349] [m051:6032 :0] async.c:546 UCX DEBUG removing async handler 0x5617c450c4d0 [id=5 ref 1] ???()
[1710151844.437466] [m051:6032 :0] async.c:171 UCX DEBUG release async handler 0x5617c450c4d0 [id=5 ref 0] ???()
[1710151844.437648] [m051:6032 :0] ib_mlx5dv_md.c:1255 UCX DEBUG irdma1: DEVX is not supported
[1710151844.437659] [m051:6032 :0] ib_md.c:1734 UCX DEBUG irdma1: md open by 'uct_ib_mlx5_devx_md_ops' failed, trying next
[1710151844.437665] [m051:6032 :0] ib_md.c:1734 UCX DEBUG irdma1: md open by 'uct_ib_mlx5_md_ops' failed, trying next
[1710151844.438119] [m051:6032 :0] uct_iface.c:905 UCX DEBUG irdma1: PF sysfs path is '/sys/devices/pci0000:00/0000:00:02.0/0000:06:00.1'
[1710151844.438147] [m051:6032 :0] topo.c:359 UCX DEBUG irdma1: bdf_name 0000:06:00.1 sys_dev 0
[1710151844.438183] [m051:6032 :0] ib_device.c:495 UCX DEBUG irdma1: vendor_id 0x8086 device_id 5522
[1710151844.438286] [m051:6032 :0] async.c:231 UCX DEBUG added async handler 0x5617c45150d0 [id=5 ref 1] ???() to hash
[1710151844.438399] [m051:6032 :0] async.c:493 UCX DEBUG listening to async event fd 5 events 0x1 mode thread_spinlock
[1710151844.438410] [m051:6032 :0] ib_device.c:593 UCX DEBUG initialized device 'irdma1' (InfiniBand channel adapter) with 1 ports
[1710151844.438424] [m051:6032 :0] ib_md.c:1788 UCX DEBUG irdma1: cuda GPUDirect RDMA is disabled
[1710151844.438436] [m051:6032 :0] ib_md.c:1788 UCX DEBUG irdma1: rocm GPUDirect RDMA is disabled
[1710151844.438453] [m051:6032 :0] ib_md.c:1815 UCX DEBUG irdma1: dmabuf is supported
[1710151844.438463] [m051:6032 :0] mpool.c:136 UCX DEBUG mpool rcache_mp: align 8, maxelems 4294967295, elemsize 144
[1710151844.438483] [m051:6032 :0] async.c:231 UCX DEBUG added async handler 0x5617c4510050 [id=9 ref 1] ???() to hash
[1710151844.438496] [m051:6032 :0] async.c:493 UCX DEBUG listening to async event fd 9 events 0x1 mode thread_spinlock
[1710151844.438641] [m051:6032 :0] ib_md.c:1500 UCX DEBUG irdma1: using registration cache
[1710151844.439004] [m051:6032 :0] ib_md.c:1727 UCX DEBUG irdma1: md open by 'uct_ib_verbs_md_ops' is successful
[1710151844.439058] [m051:6032 :0] uct_md.c:129 UCX DEBUG failed to query dc_mlx5 resources: No such device
[1710151844.439371] [m051:6032 :0] uct_md.c:129 UCX DEBUG failed to query rc_mlx5 resources: No such device
[1710151844.439412] [m051:6032 :0] uct_md.c:129 UCX DEBUG failed to query ud_mlx5 resources: No such device
[1710151844.439489] [m051:6032 :0] ucp_context.c:1445 UCX DEBUG closing md cma because it has no selected transport resources
[1710151844.439503] [m051:6032 :0] ucp_context.c:1478 UCX DEBUG register host memory on: irdma1
[1710151844.439510] [m051:6032 :0] ucp_context.c:1478 UCX DEBUG register cuda memory on:
[1710151844.439516] [m051:6032 :0] ucp_context.c:1478 UCX DEBUG register cuda-managed memory on:
[1710151844.439521] [m051:6032 :0] ucp_context.c:1478 UCX DEBUG register rocm memory on:
[1710151844.439527] [m051:6032 :0] ucp_context.c:1478 UCX DEBUG register rocm-managed memory on:
[1710151844.439550] [m051:6032 :0] mpool.c:136 UCX DEBUG mpool rcache_mp: align 8, maxelems 4294967295, elemsize 144
[1710151844.439705] [m051:6032 :0] ucp_context.c:2038 UCX DEBUG created ucp context 0x5617c44ff010 0x5617c44ff010 [1 mds 1 tls] features 0x1 tl bitmap 0x1 0x0
[1710151844.439722] [m051:6032 :0] uct_mem.c:293 UCX DEBUG could not allocate memory with any of the provided methods
[1710151844.439729] [m051:6032 :0] uct_mem.c:293 UCX DEBUG could not allocate memory with any of the provided methods
[1710151844.439786] [m051:6032 :0] uct_mem.c:293 UCX DEBUG could not allocate memory with any of the provided methods
[1710151844.439810] [m051:6032 :0] uct_mem.c:293 UCX DEBUG could not allocate memory with any of the provided methods
[1710151844.439820] [m051:6032 :0] uct_mem.c:293 UCX DEBUG could not allocate memory with any of the provided methods
[1710151844.439926] [m051:6032 :0] uct_mem.c:293 UCX DEBUG could not allocate memory with any of the provided methods
[1710151844.439936] [m051:6032 :0] uct_mem.c:293 UCX DEBUG could not allocate memory with any of the provided methods
[1710151844.439943] [m051:6032 :0] uct_mem.c:293 UCX DEBUG could not allocate memory with any of the provided methods
[1710151844.439967] [m051:6032 :0] uct_mem.c:293 UCX DEBUG could not allocate memory with any of the provided methods
[1710151844.439977] [m051:6032 :0] uct_mem.c:293 UCX DEBUG could not allocate memory with any of the provided methods
[1710151844.440442] [m051:6032 :0] ib_iface.c:869 UCX DEBUG using pkey[0] 0xffff on irdma1:1/RoCE
[1710151844.440643] [m051:6032 :0] ib_device.c:924 UCX DEBUG irdma1:1 using gid_index 1
[1710151844.440710] [m051:6032 :0] ib_device.c:1295 UCX DEBUG RoCE LAG level on irdma1:1 (e810_roce_p2) is 1
[1710151844.441241] [m051:6032 :0] ib_iface.c:1397 UCX DEBUG created uct_ib_iface_t headroom_ofs 12 payload_ofs 92 hdr_ofs 44 data_sz 1064
[1710151844.441816] [m051:6032 :0] ib_iface.c:987 UCX DEBUG iface=0x5617c4512bf0: created UD QP 0x14 on irdma1:1 TX wr:447 sge:6 inl:64 resp:0 RX wr:8191 sge:1 resp:0
[1710151844.441946] [m051:6032 :0] mpool.c:136 UCX DEBUG mpool ud_recv_skb: align 64, maxelems 4294967295, elemsize 1116
[1710151844.442092] [m051:6032 :0] mpool.c:281 UCX DEBUG mpool ud_recv_skb: allocated chunk 0x7f59f74b0018 of 151528 bytes with 131 elements
[1710151844.442108] [m051:6032 :0] mpool.c:136 UCX DEBUG mpool ud_tx_skb: align 64, maxelems 4294967295, elemsize 1136
[1710151844.442150] [m051:6032 :0] ud_iface.c:380 UCX DEBUG iface 0x5617c4512bf0: adding gid fe80::b696:91ff:feb3:5ca9 to hash on device irdma1 port 1 index 0)
[1710151844.442184] [m051:6032 :0] ud_iface.c:380 UCX DEBUG iface 0x5617c4512bf0: adding gid ::ffff:10.2.0.51 to hash on device irdma1 port 1 index 1)
[1710151844.442214] [m051:6032 :0] ud_iface.c:380 UCX DEBUG iface 0x5617c4512bf0: adding gid fe80::478c:3987:7ddf:ac03 to hash on device irdma1 port 1 index 2)
[1710151844.442817] [m051:6032 :0] timer_wheel.c:40 UCX DEBUG high res timer created log=23 resolution=4934.475294 usec wanted: 2500.000000 usec
[1710151844.442837] [m051:6032 :0] async.c:231 UCX DEBUG added async handler 0x5617c45152c0 [id=11 ref 1] ???() to hash
[1710151844.442867] [m051:6032 :0] async.c:493 UCX DEBUG listening to async event fd 11 events 0x5 mode thread_spinlock
[1710151844.443009] [m051:6032 :0] ib_device.c:1295 UCX DEBUG RoCE LAG level on irdma1:1 (e810_roce_p2) is 1
[1710151844.443020] [m051:6032 :0] ucp_worker.c:1328 UCX DEBUG created interface[0]=0x5617c4512bf0 using ud_verbs/irdma1:1 on worker 0x5617c45312f0
[1710151844.443028] [m051:6032 :0] ucp_worker.c:1131 UCX DEBUG selected scalable tl bitmap: 0x1 0x0 (1 tls)
[1710151844.443077] [m051:6032 :0] ib_device.c:1295 UCX DEBUG RoCE LAG level on irdma1:1 (e810_roce_p2) is 1
[1710151844.443121] [m051:6032 :0] ib_device.c:1295 UCX DEBUG RoCE LAG level on irdma1:1 (e810_roce_p2) is 1
[1710151844.443157] [m051:6032 :0] ib_device.c:1295 UCX DEBUG RoCE LAG level on irdma1:1 (e810_roce_p2) is 1
[1710151844.443194] [m051:6032 :0] ib_device.c:1295 UCX DEBUG RoCE LAG level on irdma1:1 (e810_roce_p2) is 1
[1710151844.443231] [m051:6032 :0] ib_device.c:1295 UCX DEBUG RoCE LAG level on irdma1:1 (e810_roce_p2) is 1
[1710151844.443267] [m051:6032 :0] ib_device.c:1295 UCX DEBUG RoCE LAG level on irdma1:1 (e810_roce_p2) is 1
[1710151844.443304] [m051:6032 :0] ib_device.c:1295 UCX DEBUG RoCE LAG level on irdma1:1 (e810_roce_p2) is 1
[1710151844.443340] [m051:6032 :0] ib_device.c:1295 UCX DEBUG RoCE LAG level on irdma1:1 (e810_roce_p2) is 1
[1710151844.443387] [m051:6032 :0] ib_device.c:1295 UCX DEBUG RoCE LAG level on irdma1:1 (e810_roce_p2) is 1
[1710151844.443425] [m051:6032 :0] ib_device.c:1295 UCX DEBUG RoCE LAG level on irdma1:1 (e810_roce_p2) is 1
[1710151844.443462] [m051:6032 :0] ib_device.c:1295 UCX DEBUG RoCE LAG level on irdma1:1 (e810_roce_p2) is 1
[1710151844.443499] [m051:6032 :0] ib_device.c:1295 UCX DEBUG RoCE LAG level on irdma1:1 (e810_roce_p2) is 1
[1710151844.443536] [m051:6032 :0] ib_device.c:1295 UCX DEBUG RoCE LAG level on irdma1:1 (e810_roce_p2) is 1
[1710151844.443824] [m051:6032 :0] async.c:231 UCX DEBUG added async handler 0x5617c4504120 [id=12 ref 1] ???() to hash
[1710151844.443844] [m051:6032 :0] async.c:493 UCX DEBUG listening to async event fd 12 events 0x1 mode thread_spinlock
[1710151844.443852] [m051:6032 :0] rdmacm_cm.c:974 UCX DEBUG created rdmacm_cm 0x5617c4513500 with event_channel 0x5617c45154b0 (fd=12)
[1710151844.443871] [m051:6032 :0] tcp_sockcm.c:223 UCX DEBUG created tcp_sockcm 0x5617c44fbf60
[1710151844.443881] [m051:6032 :0] mpool.c:136 UCX DEBUG mpool ucp_requests: align 64, maxelems 4294967295, elemsize 280
[1710151844.443888] [m051:6032 :0] mpool.c:136 UCX DEBUG mpool ucp_rkeys: align 64, maxelems 4294967295, elemsize 104
[1710151844.443895] [m051:6032 :0] mpool.c:136 UCX DEBUG mpool ucp_reg_bufs: align 64, maxelems 4294967295, elemsize 8216
[1710151844.443902] [m051:6032 :0] mpool.c:136 UCX DEBUG mpool ucp_am_bufs: align 64, maxelems 4294967295, elemsize 153
[1710151844.443909] [m051:6032 :0] mpool.c:136 UCX DEBUG mpool ucp_am_bufs: align 64, maxelems 4294967295, elemsize 1105
[1710151844.443916] [m051:6032 :0] mpool_set.c:129 UCX DEBUG mpool_set:ucp_am_bufs, sizes map 0x80000040, largest size 1016, mpools num 2
[1710151844.443984] [m051:6032 :0] parser.c:1998 UCX INFO UCX_* env variables: UCX_TLS=ud UCX_NET_DEVICES=irdma1:1 UCX_LOG_LEVEL=TRACE
[1710151844.466424] [m051:6032 :0] ucp_ep.c:362 UCX DEBUG created ep 0x7f59f7490000 to <no debug data> from api call
[1710151844.466554] [m051:6032 :0] ib_device.c:1295 UCX DEBUG RoCE LAG level on irdma1:1 (e810_roce_p2) is 1
[1710151844.466569] [m051:6032 :0] ucp_worker.c:1783 UCX INFO ep_cfg[0]: tag(ud_verbs/irdma1:1)
[1710151844.466580] [m051:6032 :0] wireup.c:1192 UCX DEBUG ep 0x7f59f7490000: am_lane 0 wireup_msg_lane 0 cm_lane <none> keepalive_lane <none> reachable_mds 0x1
[1710151844.466590] [m051:6032 :0] wireup.c:1215 UCX DEBUG ep 0x7f59f7490000: lane[0]: 0:ud_verbs/irdma1:1.0 md[0] -> addr[0].md[0]/ib/sysdev[255] am am_bw#0 wireup
[1710151844.466603] [m051:6032 :0] ib_iface.c:755 UCX DEBUG iface 0x5617c4512bf0: ah_attr dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:10.2.0.201 sgid_index=1 traffic_class=106
[1710151844.466756] [m051:6032 :0] ib_iface.c:755 UCX DEBUG iface 0x5617c4512bf0: ah_attr dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:10.2.0.201 sgid_index=1 traffic_class=106
[1710151844.466771] [m051:6032 :0] ud_ep.c:406 UCX DEBUG created ep ep=0x5617c4514aa0 iface=0x5617c4512bf0 id=0
[1710151844.466781] [m051:6032 :0] ud_ep.c:510 UCX DEBUG irdma1:1/RoCE lid 1 qpn 0x14 epid 0 ep 0x5617c4514aa0 connected to IFACE ::ffff:10.2.0.201mtu 1024 pkey 0xffff qpn 0x922
[1710151844.466790] [m051:6032 :0] ib_iface.c:755 UCX DEBUG iface 0x5617c4512bf0: ah_attr dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:10.2.0.201 sgid_index=1 traffic_class=106
[1710151844.466800] [m051:6032 :0] ib_iface.c:755 UCX DEBUG iface 0x5617c4512bf0: ah_attr dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:10.2.0.201 sgid_index=1 traffic_class=106
[1710151844.467430] [m051:6032 :0] mpool.c:281 UCX DEBUG mpool ud_tx_skb: allocated chunk 0x7f59f7200018 of 2097128 bytes with 1820 elements
[1710151844.467553] [m051:6032 :0] ib_device.c:1295 UCX DEBUG RoCE LAG level on irdma1:1 (e810_roce_p2) is 1
[1710151844.467594] [m051:6032 :0] ib_device.c:1295 UCX DEBUG RoCE LAG level on irdma1:1 (e810_roce_p2) is 1
[1710151844.467632] [m051:6032 :0] ib_device.c:1295 UCX DEBUG RoCE LAG level on irdma1:1 (e810_roce_p2) is 1
[1710151844.467668] [m051:6032 :0] ib_device.c:1295 UCX DEBUG RoCE LAG level on irdma1:1 (e810_roce_p2) is 1
[1710151844.467705] [m051:6032 :0] ib_device.c:1295 UCX DEBUG RoCE LAG level on irdma1:1 (e810_roce_p2) is 1
[1710151844.467748] [m051:6032 :0] ib_device.c:1295 UCX DEBUG RoCE LAG level on irdma1:1 (e810_roce_p2) is 1
[1710151844.467785] [m051:6032 :0] ib_device.c:1295 UCX DEBUG RoCE LAG level on irdma1:1 (e810_roce_p2) is 1
[1710151844.467821] [m051:6032 :0] ib_device.c:1295 UCX DEBUG RoCE LAG level on irdma1:1 (e810_roce_p2) is 1
[1710151844.467858] [m051:6032 :0] ib_device.c:1295 UCX DEBUG RoCE LAG level on irdma1:1 (e810_roce_p2) is 1
[1710151844.467894] [m051:6032 :0] ib_device.c:1295 UCX DEBUG RoCE LAG level on irdma1:1 (e810_roce_p2) is 1
[1710151844.467930] [m051:6032 :0] ib_device.c:1295 UCX DEBUG RoCE LAG level on irdma1:1 (e810_roce_p2) is 1
[1710151844.467966] [m051:6032 :0] ib_device.c:1295 UCX DEBUG RoCE LAG level on irdma1:1 (e810_roce_p2) is 1
[1710151844.468003] [m051:6032 :0] ib_device.c:1295 UCX DEBUG RoCE LAG level on irdma1:1 (e810_roce_p2) is 1
[1710151844.468024] [m051:6032 :0] async.c:231 UCX DEBUG added async handler 0x5617c44fc810 [id=1000006 ref 1] ???() to hash
[1710151844.469623] [m051:6032 :a] mpool.c:281 UCX DEBUG mpool ud_recv_skb: allocated chunk 0x7f59f6a00018 of 6291432 bytes with 5461 elements
[1710151844.469805] [m051:6032 :0] mpool.c:281 UCX DEBUG mpool ucp_requests: allocated chunk 0x5617c4564054 of 41044 bytes with 128 elements
[1710151844.469827] [m051:6032 :0] flush.c:377 UCX DEBUG flush_worker ep 0x7f59f7490000
[1710151844.504398] [m051:6032 :0] ud_ep.c:93 UCX DEBUG ep: 0x5617c4514aa0 ca drop@cwnd = 2 in flight: 1
[1710151844.504418] [m051:6032 :0] ud_ep.c:1427 UCX DEBUG ep(0x5617c4514aa0): resending rt_psn 1 rt_max_psn 1 acked_psn 0 max_psn 2 ack_req 1
[1710151844.504437] [m051:6032 :0] ud_ep.c:1433 UCX DEBUG ep(0x5617c4514aa0): resending completed
[1710151844.543878] [m051:6032 :0] ud_ep.c:93 UCX DEBUG ep: 0x5617c4514aa0 ca drop@cwnd = 2 in flight: 1
[1710151844.543887] [m051:6032 :0] ud_ep.c:1427 UCX DEBUG ep(0x5617c4514aa0): resending rt_psn 1 rt_max_psn 1 acked_psn 0 max_psn 2 ack_req 1
[1710151844.543906] [m051:6032 :0] ud_ep.c:1433 UCX DEBUG ep(0x5617c4514aa0): resending completed
[1710151844.622835] [m051:6032 :0] ud_ep.c:93 UCX DEBUG ep: 0x5617c4514aa0 ca drop@cwnd = 2 in flight: 1
[1710151844.622844] [m051:6032 :0] ud_ep.c:1427 UCX DEBUG ep(0x5617c4514aa0): resending rt_psn 1 rt_max_psn 1 acked_psn 0 max_psn 2 ack_req 1
[1710151844.622849] [m051:6032 :0] ud_ep.c:1433 UCX DEBUG ep(0x5617c4514aa0): resending completed
[1710151844.780752] [m051:6032 :0] ud_ep.c:93 UCX DEBUG ep: 0x5617c4514aa0 ca drop@cwnd = 2 in flight: 1
[1710151844.780760] [m051:6032 :0] ud_ep.c:1427 UCX DEBUG ep(0x5617c4514aa0): resending rt_psn 1 rt_max_psn 1 acked_psn 0 max_psn 2 ack_req 1
[1710151844.780765] [m051:6032 :0] ud_ep.c:1433 UCX DEBUG ep(0x5617c4514aa0): resending completed
[1710151845.096589] [m051:6032 :0] ud_ep.c:93 UCX DEBUG ep: 0x5617c4514aa0 ca drop@cwnd = 2 in flight: 1
[1710151845.096598] [m051:6032 :0] ud_ep.c:1427 UCX DEBUG ep(0x5617c4514aa0): resending rt_psn 1 rt_max_psn 1 acked_psn 0 max_psn 2 ack_req 1
[1710151845.096603] [m051:6032 :0] ud_ep.c:1433 UCX DEBUG ep(0x5617c4514aa0): resending completed
[1710151845.733204] [m051:6032 :0] ud_ep.c:93 UCX DEBUG ep: 0x5617c4514aa0 ca drop@cwnd = 2 in flight: 1
[1710151845.733223] [m051:6032 :0] ud_ep.c:1427 UCX DEBUG ep(0x5617c4514aa0): resending rt_psn 1 rt_max_psn 1 acked_psn 0 max_psn 2 ack_req 1
[1710151845.733242] [m051:6032 :0] ud_ep.c:1433 UCX DEBUG ep(0x5617c4514aa0): resending completed
[1710151847.011345] [m051:6032 :0] ud_ep.c:93 UCX DEBUG ep: 0x5617c4514aa0 ca drop@cwnd = 2 in flight: 1
[1710151847.011355] [m051:6032 :0] ud_ep.c:1427 UCX DEBUG ep(0x5617c4514aa0): resending rt_psn 1 rt_max_psn 1 acked_psn 0 max_psn 2 ack_req 1
[1710151847.011378] [m051:6032 :0] ud_ep.c:1433 UCX DEBUG ep(0x5617c4514aa0): resending completed
[1710151849.567643] [m051:6032 :0] ud_ep.c:93 UCX DEBUG ep: 0x5617c4514aa0 ca drop@cwnd = 2 in flight: 1
[1710151849.567659] [m051:6032 :0] ud_ep.c:1427 UCX DEBUG ep(0x5617c4514aa0): resending rt_psn 1 rt_max_psn 1 acked_psn 0 max_psn 2 ack_req 1
[1710151849.567679] [m051:6032 :0] ud_ep.c:1433 UCX DEBUG ep(0x5617c4514aa0): resending completed
[1710151854.616084] [m051:6032 :0] ud_ep.c:93 UCX DEBUG ep: 0x5617c4514aa0 ca drop@cwnd = 2 in flight: 1
[1710151854.616115] [m051:6032 :0] ud_ep.c:1427 UCX DEBUG ep(0x5617c4514aa0): resending rt_psn 1 rt_max_psn 1 acked_psn 0 max_psn 2 ack_req 1
[1710151854.616121] [m051:6032 :0] ud_ep.c:1433 UCX DEBUG ep(0x5617c4514aa0): resending completed
[1710151859.664507] [m051:6032 :0] ud_ep.c:93 UCX DEBUG ep: 0x5617c4514aa0 ca drop@cwnd = 2 in flight: 1
[1710151859.664535] [m051:6032 :0] ud_ep.c:1427 UCX DEBUG ep(0x5617c4514aa0): resending rt_psn 1 rt_max_psn 1 acked_psn 0 max_psn 2 ack_req 1
[1710151859.664541] [m051:6032 :0] ud_ep.c:1433 UCX DEBUG ep(0x5617c4514aa0): resending completed
[1710151864.712957] [m051:6032 :0] ud_ep.c:93 UCX DEBUG ep: 0x5617c4514aa0 ca drop@cwnd = 2 in flight: 1
[1710151864.712973] [m051:6032 :0] ud_ep.c:1427 UCX DEBUG ep(0x5617c4514aa0): resending rt_psn 1 rt_max_psn 1 acked_psn 0 max_psn 2 ack_req 1
[1710151864.712992] [m051:6032 :0] ud_ep.c:1433 UCX DEBUG ep(0x5617c4514aa0): resending completed
[1710151869.761395] [m051:6032 :0] ud_ep.c:93 UCX DEBUG ep: 0x5617c4514aa0 ca drop@cwnd = 2 in flight: 1
[1710151869.761418] [m051:6032 :0] ud_ep.c:1427 UCX DEBUG ep(0x5617c4514aa0): resending rt_psn 1 rt_max_psn 1 acked_psn 0 max_psn 2 ack_req 1
[1710151869.761437] [m051:6032 :0] ud_ep.c:1433 UCX DEBUG ep(0x5617c4514aa0): resending completed
[1710151874.809816] [m051:6032 :0] ud_ep.c:93 UCX DEBUG ep: 0x5617c4514aa0 ca drop@cwnd = 2 in flight: 1
[1710151874.809847] [m051:6032 :0] ud_ep.c:375 UCX DEBUG ep 0x5617c4514aa0: timeout of 30.34 sec, config::peer_timeout - 30.00 sec
[1710151874.809858] [m051:6032 :0] ud_ep.c:1427 UCX DEBUG ep(0x5617c4514aa0): resending rt_psn 1 rt_max_psn 1 acked_psn 0 max_psn 2 ack_req 1
[1710151874.809863] [m051:6032 :0] ud_ep.c:1433 UCX DEBUG ep(0x5617c4514aa0): resending completed
[1710151874.809875] [m051:6032 :0] ucp_worker.c:530 UCX DEBUG worker 0x5617c45312f0: error handler called for UCT EP 0x5617c4514aa0: Endpoint timeout
[1710151874.809886] [m051:6032 :0] ucp_ep.c:1410 UCX DEBUG ep 0x7f59f7490000: set_ep_failed status Endpoint timeout on lane[0]=0x5617c4514aa0
[1710151874.809894] [m051:6032 :0] ucp_ep.c:1373 UCX DEBUG ep 0x7f59f7490000: discarding lanes
[1710151874.809914] [m051:6032 :0] ucp_ep.c:1381 UCX DEBUG ep 0x7f59f7490000: discard uct_ep[0]=0x5617c4514aa0
[1710151874.809928] [m051:6032 :0] ucp_ep.c:1449 UCX DIAG ep 0x7f59f7490000: error 'Endpoint timeout' on ud_verbs/irdma1:1 will not be handled since no error callback is installed
[m051:6032 :0:6032] ud_ep.c:278 Fatal: UD endpoint 0x5617c4514aa0 to <no debug data>: unhandled timeout error
==== backtrace (tid: 6032) ====
0 /lib64/libucs.so.0(ucs_handle_error+0x2e4) [0x7f59f8464d74]
1 /lib64/libucs.so.0(ucs_fatal_error_message+0xbc) [0x7f59f8464f2c]
2 /lib64/libucs.so.0(ucs_fatal_error_format+0x111) [0x7f59f8465051]
3 /lib64/ucx/libuct_ib.so.0(+0x57440) [0x7f59f7fcd440]
4 /lib64/libucs.so.0(+0x18917) [0x7f59f8454917]
5 /lib64/libucp.so.0(ucp_worker_progress+0x2a) [0x7f59f8519f6a]
6 /lib64/libucp.so.0(ucp_worker_flush+0x48) [0x7f59f8537d08]
7 ucx_perftest(+0xde16) [0x5617c24e5e16]
8 ucx_perftest(+0x8bce) [0x5617c24e0bce]
9 ucx_perftest(+0x5aa4) [0x5617c24ddaa4]
10 /lib64/libc.so.6(+0x3feb0) [0x7f59f803feb0]
11 /lib64/libc.so.6(__libc_start_main+0x80) [0x7f59f803ff60]
12 ucx_perftest(+0x63b5) [0x5617c24de3b5]
=================================
timeout: the monitored command dumped core
Aborted
[root@m051 ~]#
it seems like there is a connectivity issue on the cluster, does basic ib_send_bw test work in this environment?
[root@m201 ~]# ib_send_bw -c UD -d mlx5_1 -F -a -b
************************************
* Waiting for client to connect... *
************************************
Max msg size in UD is MTU 1024
Changing to this MTU
---------------------------------------------------------------------------------------
Send Bidirectional BW Test
Dual-port : OFF Device : mlx5_1
Number of qps : 1 Transport type : IB
Connection type : UD Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
TX depth : 128
RX depth : 1000
CQ Moderation : 100
Mtu : 1024[B]
Link type : Ethernet
GID index : 3
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x0928 PSN 0x136bae
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:02:00:201
remote address: LID 0x01 QPN 0x0018 PSN 0xd8117e
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:02:00:51
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
2 1000 23.25 20.70 10.851089
4 1000 45.07 43.06 11.286887
8 1000 88.78 85.00 11.141716
16 1000 183.49 172.32 11.293055
32 1000 360.85 340.14 11.145629
64 1000 729.59 693.38 11.360355
128 1000 1458.41 1334.46 10.931875
256 1000 2862.08 2721.39 11.146794
512 1000 4929.65 4299.06 8.804469
1024 1000 7904.25 7357.28 7.533854
---------------------------------------------------------------------------------------
[root@m201 ~]#
[root@m051 ~]# ib_send_bw -c UD -d irdma1 -F -a -b 10.2.0.201
Max msg size in UD is MTU 1024
Changing to this MTU
---------------------------------------------------------------------------------------
Send Bidirectional BW Test
Dual-port : OFF Device : irdma1
Number of qps : 1 Transport type : IB
Connection type : UD Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : OFF
TX depth : 128
RX depth : 1000
CQ Moderation : 100
Mtu : 1024[B]
Link type : Ethernet
GID index : 1
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x01 QPN 0x0018 PSN 0xd8117e
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:02:00:51
remote address: LID 0000 QPN 0x0928 PSN 0x136bae
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:02:00:201
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
2 1000 23.25 20.70 10.851089
4 1000 45.07 43.06 11.286887
8 1000 88.78 85.00 11.141716
16 1000 183.49 172.32 11.293055
32 1000 360.85 340.14 11.145629
64 1000 729.59 693.38 11.360355
128 1000 1458.41 1334.46 10.931875
256 1000 2862.08 2721.39 11.146794
512 1000 4929.65 4299.06 8.804469
1024 1000 7904.25 7357.28 7.533854
---------------------------------------------------------------------------------------
[root@m051 ~]#
irdmaxx
maybe the HCA from Intel
irdmaxx
maybe the HCA from Intel
yes, irdmaxx is e810 from intel.
Was there ever a resolution to this? I have this exact same issue with ucx/roce on a Broadcom BCM57414. UCX fails with that same timeout, but ib_send_bw works fine.
AFAIK, UCX/roce was not tested on a Broadcom HW. I'd suggest checking with Broadcom about adding such support.
@yosefe We are seeing the same issue with another system using irdma. ib_send_bw works with UD but UCX just hits the timeout. I can probably spend some time in the next month debugging.
On our system the ibv_wc.sl is not set correctly so it is assuming the grh is IPv6. Forcing it to treat it like IPv4 works. I am still investigating why the service level is not set correctly but the irdma driver is likely to blame. I am looking at how to work around this in the code but it looks like setting UCX_ETH_DGID_CHECK=n is a good temporary solution (at least in my case). @Honggang-LI Can you give that a try? If it works for you as well I think Intel's driver is incorrect. I think setting sl is required... Looking into that.
Looking at the iRDMA provider it never sets the work completion service level. That is why UD does not working with iRDMA. My PR is likely the correct fix.
Following test based on commit 9029d9ca5c8e743f956b5c16e3f0e892c94ec48a .
Thanks for the fix.
[root@m161 ~]$ UCX_TLS=ud UCX_NET_DEVICES=mlx5_1:1 timeout 5m /tmp/ucx/bin/ucx_perftest -t tag_bw
[1732849777.425595] [m161:11617:0] perftest.c:800 UCX WARN CPU affinity is not set (bound to 8 cpus). Performance may be impacted.
Waiting for connection...
Accepted connection from 10.2.0.51:56186
+----------------------------------------------------------------------------------------------------------+
| API: protocol layer |
| Test: tag match bandwidth |
| Data layout: (automatic) |
| Send memory: host |
| Recv memory: host |
| Message size: 8 |
| Window size: 32 |
+----------------------------------------------------------------------------------------------------------+
[root@m161 ~]$
[root@m051 ~]$ UCX_TLS=ud UCX_NET_DEVICES=irdma1:1 timeout 5m /tmp/ucx/bin/ucx_perftest -t tag_bw 10.2.0.161
[1732849788.577876] [m051:137956:0] perftest.c:800 UCX WARN CPU affinity is not set (bound to 24 cpus). Performance may be impacted.
+--------------+--------------+------------------------------+---------------------+-----------------------+
| | | overhead (usec) | bandwidth (MB/s) | message rate (msg/s) |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
| Stage | # iterations | 50.0%ile | average | overall | average | overall | average | overall |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
Final: 1000000 0.228 0.703 0.703 10.85 10.85 1421816 1421816
We have RoCE hardware which does not support SRQ, in other words it does not support
ibv_create_srq
.Is ibv_create_srq mandatory for ucx? It seems ucx is really unhappy and almost all ucx tests are failed.
thanks