@schluenz can you pls provide OS and driver information as described in the GH issue template
ok, more complete info:
CentOS Linux release 7.7.1908 (Core) Linux max-exfl200.desy.de 3.10.0-1062.4.1.el7.x86_64 #1 SMP Fri Oct 18 17:15:30 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
rdma-core-22.1-3.el7.x86_64 libibverbs-22.1-3.el7.x86_64 ibv_devinfo at the end
System/RedHat ucx:
/usr/bin/ucx_info -v
# UCT version=1.4.0 revision 0000000
# configured with: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-optimizations --disable-logging --disable-debug --disable-assertions --disable-params-check
/usr/bin/ucx_info -d
# Device: mlx5_0:1
#
# capabilities:
[1575906412.846334] [max-exfl200:137500:0] ib_iface.c:947 UCX ERROR Invalid active_width on mlx5_0:1: 16
# < failed to query interface >
#
#
# Transport: ud
#
# Device: mlx5_0:1
#
# capabilities:
[1575906412.855268] [max-exfl200:137500:0] ib_iface.c:947 UCX ERROR Invalid active_width on mlx5_0:1: 16
# < failed to query interface >
--------------------------------------------------------------------------------------------------------------------
Self-compiled 1.6.1:
/software/ucx/1.6.1/bin/ucx_info -v
# UCT version=1.6.1 revision
# configured with: --prefix=/software/ucx/1.6.1 --with-cuda=/usr/local/cuda-9.0 --enable-mt
# Device: ib0
#
# capabilities:
# bandwidth: 11142.51 MB/sec
# latency: 5206 nsec
# overhead: 50000 nsec
# am_short: <= 8K
# am_bcopy: <= 8K
# connection: to iface
# priority: 1
# device address: 4 bytes
# iface address: 2 bytes
# error handling: none
#
[1575906773.636572] [max-exfl200:138596:0] ib_mlx5_dv.c:157 UCX ERROR ibv_create_cq() failed: Invalid argument
# < failed to open memory domain ib/mlx5_0 >
--------------------------------------------------------------------------------------------------------------------
Self-compiled 1.7.0:
/software/ucx/1.7.0/bin/ucx_info -v
# UCT version=1.7.0 revision 7a5460d
# configured with: --prefix=/software/ucx/1.7.0
and self-compiled git-master:
/software/ucx/1.7.0/bin/ucx_info -v
# UCT version=1.8.0 revision 0dc670d
# configured with: --prefix=/software/ucx/1.7.0 --without-cuda
give the same result:
/software/ucx/1.7.0/bin/ucx_info -d
# Memory domain: mlx5_0
# Component: ib
# register: unlimited, cost: 90 nsec
# remote key: 8 bytes
# local memory handle is required for zcopy
#
# Transport: rc_verbs
# Device: mlx5_0:1
[1575906843.530270] [max-exfl200:138808:0] ib_iface.c:629 UCX ERROR ibv_create_cq(cqe=4096) failed: Invalid argument
# < failed to open interface >
#
# Transport: rc_mlx5
# Device: mlx5_0:1
[1575906843.532303] [max-exfl200:138808:0] ib_mlx5.c:73 UCX ERROR mlx5dv_create_cq(cqe=4096) failed: Invalid argument
# < failed to open interface >
#
# Transport: dc_mlx5
# Device: mlx5_0:1
[1575906843.534152] [max-exfl200:138808:0] ib_mlx5.c:73 UCX ERROR mlx5dv_create_cq(cqe=4096) failed: Invalid argument
# < failed to open interface >
#
# Transport: ud_verbs
# Device: mlx5_0:1
[1575906843.534672] [max-exfl200:138808:0] ib_iface.c:629 UCX ERROR ibv_create_cq(cqe=256) failed: Invalid argument
# < failed to open interface >
#
# Transport: ud_mlx5
# Device: mlx5_0:1
[1575906843.535087] [max-exfl200:138808:0] ib_mlx5.c:73 UCX ERROR mlx5dv_create_cq(cqe=256) failed: Invalid argument
# < failed to open interface >
#
# Memory domain: rdmacm
# Component: rdmacm
# supports client-server connection establishment via sockaddr
# < no supported devices found >
--------------------------------------------------------------------------------------------------------------------
ibv_devinfo -vv
hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 20.26.1040
node_guid: b859:9f03:004e:bd14
sys_image_guid: b859:9f03:004e:bd14
vendor_id: 0x02c9
vendor_part_id: 4123
hw_ver: 0x0
board_id: MT_0000000222
phys_port_cnt: 1
max_mr_size: 0xffffffffffffffff
page_size_cap: 0xfffffffffffff000
max_qp: 262144
max_qp_wr: 32768
device_cap_flags: 0xe97e1c36
BAD_PKEY_CNTR
BAD_QKEY_CNTR
AUTO_PATH_MIG
CHANGE_PHY_PORT
PORT_ACTIVE_EVENT
SYS_IMAGE_GUID
RC_RNR_NAK_GEN
MEM_WINDOW
UD_IP_CSUM
XRC
MEM_MGT_EXTENSIONS
MEM_WINDOW_TYPE_2B
MANAGED_FLOW_STEERING
Unknown flags: 0xC8480000
max_sge: 30
max_sge_rd: 30
max_cq: 16777216
max_cqe: 4194303
max_mr: 16777216
max_pd: 16777216
max_qp_rd_atom: 16
max_ee_rd_atom: 0
max_res_rd_atom: 4194304
max_qp_init_rd_atom: 16
max_ee_init_rd_atom: 0
atomic_cap: ATOMIC_HCA (1)
max_ee: 0
max_rdd: 0
max_mw: 16777216
max_raw_ipv6_qp: 0
max_raw_ethy_qp: 0
max_mcast_grp: 2097152
max_mcast_qp_attach: 240
max_total_mcast_qp_attach: 503316480
max_ah: 2147483647
max_fmr: 0
max_srq: 8388608
max_srq_wr: 32767
max_srq_sge: 31
max_pkeys: 128
local_ca_ack_delay: 16
general_odp_caps:
ODP_SUPPORT
Unknown flags: 0x2
rc_odp_caps:
SUPPORT_SEND
SUPPORT_RECV
SUPPORT_WRITE
SUPPORT_READ
uc_odp_caps:
NO SUPPORT
ud_odp_caps:
SUPPORT_SEND
completion timestamp_mask: 0x7fffffffffffffff
hca_core_clock: 156250kHZ
device_cap_flags_ex: 0x11E97E1C36
PCI_WRITE_END_PADDING
Unknown flags: 0x100000000
tso_caps:
max_tso: 0
rss_caps:
max_rwq_indirection_tables: 0
max_rwq_indirection_table_size: 0
rx_hash_function: 0x0
rx_hash_fields_mask: 0x0
max_wq_type_rq: 0
packet_pacing_caps:
qp_rate_limit_min: 0kbps
qp_rate_limit_max: 0kbps
max_rndv_hdr_size: 64
max_num_tags: 127
max_ops: 32768
max_sge: 1
flags:
IBV_TM_CAP_RC
cq moderation caps:
max_cq_count: 65535
max_cq_period: 4095 us
maximum available device memory: 262144Bytes
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 17
port_lid: 102
port_lmc: 0x00
link_layer: InfiniBand
max_msg_sz: 0x40000000
port_cap_flags: 0x2259e848
port_cap_flags2: 0x0000
max_vl_num: 4 (3)
bad_pkey_cntr: 0x0
qkey_viol_cntr: 0x0
sm_sl: 0
pkey_tbl_len: 128
gid_tbl_len: 8
subnet_timeout: 18
init_type_reply: 0
active_width: 2X (16)
active_speed: 50.0 Gbps (64)
phys_state: LINK_UP (5)
GID[ 0]: fe80:0000:0000:0000:b859:9f03:004e:bd14
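A note on the 1.4.0 error "Invalid active_width on mlx5_0:1: 16": the number is the verbs active_width code, not a lane count. ibv_devinfo above decodes 16 as 2X, which fits the split-cable HDR100 link (2 lanes at 50 Gbps each). The following is a minimal sketch of that decoding; the function names are illustrative and are not part of UCX or rdma-core.

```bash
#!/usr/bin/env bash
# Map an InfiniBand active_width code (as printed in parentheses by
# ibv_devinfo) to the number of lanes it stands for.
ib_width_lanes() {
    case "$1" in
        1)  echo 1 ;;
        2)  echo 4 ;;
        4)  echo 8 ;;
        8)  echo 12 ;;
        16) echo 2 ;;
        *)  echo "unknown width code: $1" >&2; return 1 ;;
    esac
}

# Approximate link rate in Gbps from a width code and a per-lane rate in Gbps.
ib_link_gbps() {
    local lanes
    lanes=$(ib_width_lanes "$1") || return 1
    echo $(( lanes * $2 ))
}
```

With the values reported above, `ib_link_gbps 16 50` prints 100, consistent with an HDR100 port.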
Thanks, it seems similar to the issue reported on the mailing list, @Artemy-Mellanox pls take a look
@schluenz could you please also check ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 3088715
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 3088715
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
Need to collect some debug info:
enable dynamic debug: echo "module mlx5_ib +p" | sudo tee /sys/kernel/debug/dynamic_debug/control
run ucx_info with strace: strace /software/ucx/1.7.0/bin/ucx_info -d
please collect the ucx_info output and dmesg and send them to us
I did
echo "module mlx5_ib +p" | tee /sys/kernel/debug/dynamic_debug/control
strace -o ucx_info.strace.txt /software/ucx/1.7.0/bin/ucx_info -d > ucx_info.out.txt
dmesg -T | grep "everything after adding mlx5_ib" > ucx_info.dmesg.txt
outputs are attached. dmesg contains three ucx_info -d runs.
thanks. Could you please repeat it with more debug: echo "module mlx5_core +p" | tee /sys/kernel/debug/dynamic_debug/control
and strace -s2048 ...
attached, just the strace -s2048 output
mlx5_core dynamic_debug dmesg actually would be more informative
sorry, attached
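The collection steps requested above (dynamic debug for mlx5_ib and mlx5_core, strace with a larger string limit, then dmesg) can be gathered into one helper. This is only a sketch under assumptions: it needs root, the function names and output file names are made up for illustration, and it saves the whole dmesg buffer rather than filtering it.

```bash
#!/usr/bin/env bash
# Build a dynamic-debug rule that turns on debug printing for one module.
dyndbg_rule() {
    printf 'module %s +p\n' "$1"
}

# Run "<ucx_info binary> -d" under strace and save strace, ucx_info and
# dmesg output into the given directory (default: current directory).
collect_ucx_debug() {
    local ucx_info_bin=$1 outdir=${2:-.}
    local control=/sys/kernel/debug/dynamic_debug/control

    dyndbg_rule mlx5_ib   | sudo tee "$control" >/dev/null
    dyndbg_rule mlx5_core | sudo tee "$control" >/dev/null

    # -s2048 raises strace's maximum printed string size.
    strace -s2048 -o "$outdir/ucx_info.strace.txt" \
        "$ucx_info_bin" -d >"$outdir/ucx_info.out.txt" 2>&1

    sudo dmesg -T >"$outdir/ucx_info.dmesg.txt"
}
```

A possible invocation would be `collect_ucx_debug /software/ucx/1.7.0/bin/ucx_info /tmp/ucx-debug`, assuming the target directory already exists.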
Could you please do the following check:
look at a dmesg snapshot
dmesg | tail -500 | grep INPUT
From the log you attached, the output is:
[Tue Dec 10 12:14:03 2019] mlx5_core 0000:5e:00.0: dump_command:772:(pid 239539): dump command MODIFY_CQ(0x403) INPUT
[Tue Dec 10 12:14:03 2019] mlx5_core 0000:5e:00.0: dump_command:772:(pid 239539): dump command MODIFY_CQ(0x403) INPUT
[Tue Dec 10 12:14:03 2019] mlx5_core 0000:5e:00.0: dump_command:772:(pid 239539): dump command MODIFY_CQ(0x403) INPUT
[Tue Dec 10 12:14:03 2019] mlx5_core 0000:5e:00.0: dump_command:772:(pid 239539): dump command MODIFY_CQ(0x403) INPUT
[Tue Dec 10 12:14:03 2019] mlx5_core 0000:5e:00.0: dump_command:772:(pid 239539): dump command MODIFY_CQ(0x403) INPUT
Please check with ps auxww | grep 239539
(use pid from dmesg). I want to know where those commands come from.
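The check described here (pull the pid out of the dump_command lines, then ask ps who owns it) can be sketched as a small pipeline. The helper names are hypothetical; the sed pattern assumes the "dump_command:<line>:(pid <n>)" layout seen in the dmesg lines above.

```bash
#!/usr/bin/env bash
# Print the unique PIDs found in mlx5_core dump_command lines read from stdin.
dump_command_pids() {
    sed -n 's/.*dump_command:[0-9]*:(pid \([0-9][0-9]*\)).*/\1/p' | sort -un
}

# For each PID on stdin, show the owning process. Kernel threads are listed
# with their name in square brackets.
who_owns() {
    local pid
    while read -r pid; do
        ps -o pid=,user=,args= -p "$pid" || echo "$pid: no such process"
    done
}
```

A typical use would be `dmesg | tail -500 | grep INPUT | dump_command_pids | who_owns`.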
[3086989.576394] mlx5_core 0000:5e:00.0: dump_command:772:(pid 33983): dump command MODIFY_CQ(0x403) INPUT
ps auxww | grep 33983
root 33983 0.0 0.0 0 0 ? S 09:51 0:00 [kworker/u592:1]
Could it be that the second port is in RoCE mode and is active? Can we check it with ibv_devinfo and ifconfig and shut it down, to silence its debug traffic so that we can see the debug output related to the ucx_info command?
not sure I understand the question correctly.
It's a single-port adapter:
Product Name: ConnectX-6 VPI adapter card, 100Gb/s (HDR100, EDR IB and 100GbE), single-port QSFP56,
The nodes are connected to HDR200 switches over split-port cables (hdr200->2x100).
I don't see any indication for RoCE being active.
We unavoidably use ipoib.
ip addr sh
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether a4:bf:01:5c:45:ad brd ff:ff:ff:ff:ff:ff
inet 131.169.179.44/24 brd 131.169.179.255 scope global noprefixroute dynamic eno1
valid_lft 400059sec preferred_lft 400059sec
inet6 fe80::a6bf:1ff:fe5c:45ad/64 scope link noprefixroute
valid_lft forever preferred_lft forever
3: eno2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
link/ether a4:bf:01:5c:45:ae brd ff:ff:ff:ff:ff:ff
4: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP group default qlen 256
link/infiniband 00:00:03:40:fe:80:00:00:00:00:00:00:b8:59:9f:03:00:4e:bd:14 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
inet 10.255.33.7/20 brd 10.255.47.255 scope global noprefixroute ib0
valid_lft forever preferred_lft forever
inet6 fe80::ba59:9f03:4e:bd14/64 scope link
valid_lft forever preferred_lft forever
5: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
link/ether 02:42:d2:75:94:92 brd ff:ff:ff:ff:ff:ff
inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
valid_lft forever preferred_lft forever
Could you please try simple ibv_rc_pingpong loopback test
output looks like this:
local address: LID 0x0068, QPN 0x002035, PSN 0xefdd06, GID ::
remote address: LID 0x0068, QPN 0x002035, PSN 0xefdd06, GID ::
remote address: LID 0x0068, QPN 0x002034, PSN 0xd4cad5, GID ::
8192000 bytes in 0.01 seconds = 11336.45 Mbit/sec
8192000 bytes in 0.01 seconds = 11793.41 Mbit/sec
1000 iters in 0.01 seconds = 5.78 usec/iter
1000 iters in 0.01 seconds = 5.56 usec/iter
ps: I got the hint that the RedHat 7.6 hpc-x inbox build still works, which is indeed the case.
hpcx-v2.5.0-gcc-inbox-redhat7.6-x86_64 ucx_info -d works as expected.
hpcx-v2.5.0-gcc-inbox-redhat7.7-x86_64 ucx_info -d throws errors like those mentioned earlier:
ib_iface.c:618 UCX ERROR ibv_create_cq(cqe=4096) failed: Invalid argument
ib_mlx5.c:73 UCX ERROR mlx5dv_create_cq(cqe=4096) failed: Invalid argument
@Artemy-Mellanox @schluenz I've got the same problem - let me know if there's anything I can do to help debug this... BTW reproduces with Ethernet (RoCE) on the latest CX6 firmware (20.26.4012) and UCX version (master).
Could you please build UCX from this debug branch and try again: https://github.com/Artemy-Mellanox/ucx/tree/dbg-4556
@Artemy-Mellanox
> #> UCX_LOG_LEVEL=debug ./build/bin/ucx_info -d
...
# Memory domain: mlx5_0
# Component: ib
# register: unlimited, cost: 90 nsec
# remote key: 24 bytes
# local memory handle is required for zcopy
#
[1577793869.845859] [thunder6:167167:0] ib_md.c:266 UCX DEBUG mlx5_0: cuda GPUDirect RDMA is disabled
[1577793869.845878] [thunder6:167167:0] ib_md.c:266 UCX DEBUG mlx5_0: rocm GPUDirect RDMA is disabled
# Transport: rc_verbs
# Device: mlx5_0:1
[1577793869.846028] [thunder6:167167:0] ib_md.c:266 UCX DEBUG mlx5_0: cuda GPUDirect RDMA is disabled
[1577793869.846044] [thunder6:167167:0] ib_md.c:266 UCX DEBUG mlx5_0: rocm GPUDirect RDMA is disabled
[1577793869.846169] [thunder6:167167:0] ib_iface.c:466 UCX DEBUG using pkey[0] 0xffff on mlx5_0:1
[1577793869.846276] [thunder6:167167:0] ib_device.c:529 UCX DEBUG testing addr_family on gid index 1: fe80::1e34:daff:fe66:fd34
[1577793869.846363] [thunder6:167167:0] ib_device.c:529 UCX DEBUG testing addr_family on gid index 3: ::ffff:192.168.30.16
[1577793869.846473] [thunder6:167167:0] ib_device.c:671 UCX DEBUG mlx5_0:1 using gid_index 3
[1577793869.846537] [thunder6:167167:0] ib_iface.c:516 UCX DEBUG Not using value 1 for path_bits - must be < 2^lmc (lmc=0)
[1577793869.847324] [thunder6:167167:0] mpool.c:205 UCX DEBUG mpool rcache_inv_mp: allocated chunk 0x7ff4453b2008 of 36856 bytes with 1151 elements
uct_ib_verbs_create_cq:626 (nil) 22
uct_ib_verbs_create_cq:636 (nil)
[1577793869.848130] [thunder6:167167:0] ib_iface.c:638 UCX ERROR ibv_create_cq(cqe=4096) failed: Invalid argument
# < failed to open interface >
#
[1577793869.848218] [thunder6:167167:0] ib_md.c:266 UCX DEBUG mlx5_0: cuda GPUDirect RDMA is disabled
[1577793869.848236] [thunder6:167167:0] ib_md.c:266 UCX DEBUG mlx5_0: rocm GPUDirect RDMA is disabled
# Transport: rc_mlx5
# Device: mlx5_0:1
[1577793869.848380] [thunder6:167167:0] ib_md.c:266 UCX DEBUG mlx5_0: cuda GPUDirect RDMA is disabled
[1577793869.848401] [thunder6:167167:0] ib_md.c:266 UCX DEBUG mlx5_0: rocm GPUDirect RDMA is disabled
[1577793869.848503] [thunder6:167167:0] ib_iface.c:466 UCX DEBUG using pkey[0] 0xffff on mlx5_0:1
[1577793869.848600] [thunder6:167167:0] ib_device.c:529 UCX DEBUG testing addr_family on gid index 1: fe80::1e34:daff:fe66:fd34
[1577793869.848690] [thunder6:167167:0] ib_device.c:529 UCX DEBUG testing addr_family on gid index 3: ::ffff:192.168.30.16
[1577793869.848757] [thunder6:167167:0] ib_device.c:671 UCX DEBUG mlx5_0:1 using gid_index 3
[1577793869.848816] [thunder6:167167:0] ib_iface.c:516 UCX DEBUG Not using value 1 for path_bits - must be < 2^lmc (lmc=0)
[1577793869.848853] [thunder6:167167:0] ib_device.c:1067 UCX DEBUG max IB CQE size is 128
[1577793869.850178] [thunder6:167167:0] ib_mlx5.c:73 UCX ERROR mlx5dv_create_cq(cqe=4096) failed: Invalid argument
# < failed to open interface >
#
[1577793869.850264] [thunder6:167167:0] ib_md.c:266 UCX DEBUG mlx5_0: cuda GPUDirect RDMA is disabled
[1577793869.850283] [thunder6:167167:0] ib_md.c:266 UCX DEBUG mlx5_0: rocm GPUDirect RDMA is disabled
# Transport: dc_mlx5
# Device: mlx5_0:1
[1577793869.850437] [thunder6:167167:0] ib_md.c:266 UCX DEBUG mlx5_0: cuda GPUDirect RDMA is disabled
[1577793869.850453] [thunder6:167167:0] ib_md.c:266 UCX DEBUG mlx5_0: rocm GPUDirect RDMA is disabled
[1577793869.850569] [thunder6:167167:0] ib_iface.c:466 UCX DEBUG using pkey[0] 0xffff on mlx5_0:1
[1577793869.850664] [thunder6:167167:0] ib_device.c:529 UCX DEBUG testing addr_family on gid index 1: fe80::1e34:daff:fe66:fd34
[1577793869.850753] [thunder6:167167:0] ib_device.c:529 UCX DEBUG testing addr_family on gid index 3: ::ffff:192.168.30.16
[1577793869.850822] [thunder6:167167:0] ib_device.c:671 UCX DEBUG mlx5_0:1 using gid_index 3
[1577793869.850880] [thunder6:167167:0] ib_iface.c:516 UCX DEBUG Not using value 1 for path_bits - must be < 2^lmc (lmc=0)
[1577793869.852090] [thunder6:167167:0] ib_mlx5.c:73 UCX ERROR mlx5dv_create_cq(cqe=4096) failed: Invalid argument
# < failed to open interface >
#
[1577793869.852163] [thunder6:167167:0] ib_md.c:266 UCX DEBUG mlx5_0: cuda GPUDirect RDMA is disabled
[1577793869.852181] [thunder6:167167:0] ib_md.c:266 UCX DEBUG mlx5_0: rocm GPUDirect RDMA is disabled
# Transport: ud_verbs
# Device: mlx5_0:1
[1577793869.852290] [thunder6:167167:0] ib_md.c:266 UCX DEBUG mlx5_0: cuda GPUDirect RDMA is disabled
[1577793869.852310] [thunder6:167167:0] ib_md.c:266 UCX DEBUG mlx5_0: rocm GPUDirect RDMA is disabled
[1577793869.852420] [thunder6:167167:0] ib_iface.c:466 UCX DEBUG using pkey[0] 0xffff on mlx5_0:1
[1577793869.852515] [thunder6:167167:0] ib_device.c:529 UCX DEBUG testing addr_family on gid index 1: fe80::1e34:daff:fe66:fd34
[1577793869.852604] [thunder6:167167:0] ib_device.c:529 UCX DEBUG testing addr_family on gid index 3: ::ffff:192.168.30.16
[1577793869.852673] [thunder6:167167:0] ib_device.c:671 UCX DEBUG mlx5_0:1 using gid_index 3
[1577793869.852726] [thunder6:167167:0] ib_iface.c:516 UCX DEBUG Not using value 1 for path_bits - must be < 2^lmc (lmc=0)
uct_ib_verbs_create_cq:626 (nil) 22
uct_ib_verbs_create_cq:636 (nil)
[1577793869.853124] [thunder6:167167:0] ib_iface.c:638 UCX ERROR ibv_create_cq(cqe=256) failed: Invalid argument
# < failed to open interface >
#
[1577793869.853179] [thunder6:167167:0] ib_md.c:266 UCX DEBUG mlx5_0: cuda GPUDirect RDMA is disabled
[1577793869.853196] [thunder6:167167:0] ib_md.c:266 UCX DEBUG mlx5_0: rocm GPUDirect RDMA is disabled
# Transport: ud_mlx5
# Device: mlx5_0:1
[1577793869.853300] [thunder6:167167:0] ib_md.c:266 UCX DEBUG mlx5_0: cuda GPUDirect RDMA is disabled
[1577793869.853316] [thunder6:167167:0] ib_md.c:266 UCX DEBUG mlx5_0: rocm GPUDirect RDMA is disabled
[1577793869.853396] [thunder6:167167:0] ib_iface.c:466 UCX DEBUG using pkey[0] 0xffff on mlx5_0:1
[1577793869.853486] [thunder6:167167:0] ib_device.c:529 UCX DEBUG testing addr_family on gid index 1: fe80::1e34:daff:fe66:fd34
[1577793869.853576] [thunder6:167167:0] ib_device.c:529 UCX DEBUG testing addr_family on gid index 3: ::ffff:192.168.30.16
[1577793869.853640] [thunder6:167167:0] ib_device.c:671 UCX DEBUG mlx5_0:1 using gid_index 3
[1577793869.853698] [thunder6:167167:0] ib_iface.c:516 UCX DEBUG Not using value 1 for path_bits - must be < 2^lmc (lmc=0)
[1577793869.853864] [thunder6:167167:0] ib_mlx5.c:73 UCX ERROR mlx5dv_create_cq(cqe=256) failed: Invalid argument
# < failed to open interface >
[1577793869.854050] [thunder6:167167:0] mpool.c:142 UCX DEBUG mpool devx dbrec destroyed
[1577793869.854123] [thunder6:167167:0] mpool.c:142 UCX DEBUG mpool rcache_inv_mp destroyed
[1577793869.854217] [thunder6:167167:0] ib_device.c:388 UCX DEBUG destroying ib device mlx5_0
[1577793869.860968] [thunder6:167167:0] ib_device.c:273 UCX DEBUG mlx5_1 vendor_id: 0x15b3 device_id: 4123
[1577793869.861165] [thunder6:167167:0] ib_mlx5dv_md.c:447 UCX DEBUG mlx5_1: disable ODP on RoCE
[1577793869.861361] [thunder6:167167:0] ib_device.c:368 UCX DEBUG initialized device 'mlx5_1' (InfiniBand channel adapter) with 1 ports
[1577793869.861450] [thunder6:167167:0] ib_md.c:266 UCX DEBUG mlx5_1: cuda GPUDirect RDMA is disabled
[1577793869.861495] [thunder6:167167:0] ib_md.c:266 UCX DEBUG mlx5_1: rocm GPUDirect RDMA is disabled
[1577793869.861520] [thunder6:167167:0] mpool.c:88 UCX DEBUG mpool rcache_inv_mp: align 1, maxelems 4294967295, elemsize 32
[1577793869.861614] [thunder6:167167:0] ib_md.c:1160 UCX DEBUG mlx5_1: using registration cache
[1577793869.861706] [thunder6:167167:0] mpool.c:88 UCX DEBUG mpool devx dbrec: align 64, maxelems 4294967295, elemsize 40
[1577793869.861920] [thunder6:167167:0] ib_device.c:827 UCX DEBUG no compatible IB ports found for flags 0x0
[1577793869.861955] [thunder6:167167:0] uct_md.c:85 UCX DEBUG failed to query rc_verbs resources: No such device
[1577793869.861978] [thunder6:167167:0] ib_device.c:827 UCX DEBUG no compatible IB ports found for flags 0x4
[1577793869.861997] [thunder6:167167:0] uct_md.c:85 UCX DEBUG failed to query rc_mlx5 resources: No such device
[1577793869.862012] [thunder6:167167:0] ib_device.c:827 UCX DEBUG no compatible IB ports found for flags 0xc4
[1577793869.862025] [thunder6:167167:0] uct_md.c:85 UCX DEBUG failed to query dc_mlx5 resources: No such device
[1577793869.862045] [thunder6:167167:0] ib_device.c:827 UCX DEBUG no compatible IB ports found for flags 0x0
[1577793869.862061] [thunder6:167167:0] uct_md.c:85 UCX DEBUG failed to query ud_verbs resources: No such device
[1577793869.862074] [thunder6:167167:0] ib_device.c:827 UCX DEBUG no compatible IB ports found for flags 0x4
[1577793869.862087] [thunder6:167167:0] uct_md.c:85 UCX DEBUG failed to query ud_mlx5 resources: No such device
[1577793869.862108] [thunder6:167167:0] ib_md.c:266 UCX DEBUG mlx5_1: cuda GPUDirect RDMA is disabled
[1577793869.862126] [thunder6:167167:0] ib_md.c:266 UCX DEBUG mlx5_1: rocm GPUDirect RDMA is disabled
#
# Memory domain: mlx5_1
...
@alex--m , could you please try again, force-pushed more debug
@Artemy-Mellanox
# Memory domain: mlx5_0
# Component: ib
# register: unlimited, cost: 90 nsec
# remote key: 24 bytes
# local memory handle is required for zcopy
#
# Transport: rc_verbs
# Device: mlx5_0:1
uct_ib_iface_t_init:858 0x12ebce0
uct_ib_verbs_create_cq:626 (nil) 22
uct_ib_verbs_create_cq:636 (nil) 0x12ebce0 0
[1577801148.897735] [thunder2:8084 :0] ib_iface.c:638 UCX ERROR ibv_create_cq(cqe=4096) failed: Invalid argument
# < failed to open interface >
#
# Transport: rc_mlx5
# Device: mlx5_0:1
uct_ib_iface_t_init:858 0x12ebce0
[1577801148.902712] [thunder2:8084 :0] ib_mlx5.c:73 UCX ERROR mlx5dv_create_cq(cqe=4096) failed: Invalid argument
# < failed to open interface >
#
# Transport: dc_mlx5
# Device: mlx5_0:1
uct_ib_iface_t_init:858 0x12ebce0
[1577801148.907219] [thunder2:8084 :0] ib_mlx5.c:73 UCX ERROR mlx5dv_create_cq(cqe=4096) failed: Invalid argument
# < failed to open interface >
#
# Transport: ud_verbs
# Device: mlx5_0:1
uct_ib_iface_t_init:858 0x12ebce0
uct_ib_verbs_create_cq:626 (nil) 22
uct_ib_verbs_create_cq:636 (nil) 0x12ebce0 0
[1577801148.910754] [thunder2:8084 :0] ib_iface.c:638 UCX ERROR ibv_create_cq(cqe=256) failed: Invalid argument
# < failed to open interface >
#
# Transport: ud_mlx5
# Device: mlx5_0:1
uct_ib_iface_t_init:858 0x12ebce0
[1577801148.914428] [thunder2:8084 :0] ib_mlx5.c:73 UCX ERROR mlx5dv_create_cq(cqe=256) failed: Invalid argument
# < failed to open interface >
@alex--m please try echo "module mlx5_core +p" | sudo tee /sys/kernel/debug/dynamic_debug/control, run the command and send me dmesg
@Artemy-Mellanox dmesg_afer_ucx_info.thunder6.txt dmesg.thunder6.txt
And also - Happy New Year! :)
Happy New Year! ;)
please send the output of sudo find /sys/kernel/debug/mlx5/ -path '*EQ*' -type f -print -exec cat {} \;
@Artemy-Mellanox output.txt
please add the UCX_IB_MLX5_DEVX=n environment variable and rerun the command
@Artemy-Mellanox Do you mean the ucx_info command? (see below)
# Memory domain: mlx5_0
# Component: ib
# register: unlimited, cost: 90 nsec
# remote key: 24 bytes
# local memory handle is required for zcopy
#
# Transport: rc_verbs
# Device: mlx5_0:1
uct_ib_iface_t_init:858 0x8f3810
uct_ib_verbs_create_cq:626 0x91afd0 93
uct_ib_verbs_create_cq:636 0x91afd0 0x8f3810 0
uct_ib_verbs_create_cq:626 0x91b1f0 93
uct_ib_verbs_create_cq:636 0x91b1f0 0x8f3810 0
#
# capabilities:
# bandwidth: 0.00 + 10957.84 MB/sec
# latency: 800 nsec + 1 * N
# overhead: 75 nsec
# put_short: <= 124
# put_bcopy: <= 8K
# put_zcopy: <= 1G, up to 3 iov
# put_opt_zcopy_align: <= 512
# put_align_mtu: <= 1K
# get_bcopy: <= 8K
# get_zcopy: 65..1G, up to 3 iov
# get_opt_zcopy_align: <= 512
# get_align_mtu: <= 1K
# am_short: <= 123
# am_bcopy: <= 8191
# am_zcopy: <= 8191, up to 2 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 1K
# am header: <= 127
# domain: device
# atomic_add: 64 bit
# atomic_fadd: 64 bit
# atomic_cswap: 64 bit
# connection: to ep
# priority: 50
# device address: 17 bytes
# ep address: 3 bytes
# error handling: peer failure
#
#
# Transport: rc_mlx5
# Device: mlx5_0:1
uct_ib_iface_t_init:858 0x911f20
#
# capabilities:
# bandwidth: 0.00 + 10957.84 MB/sec
# latency: 800 nsec + 1 * N
# overhead: 40 nsec
# put_short: <= 2K
# put_bcopy: <= 8K
# put_zcopy: <= 1G, up to 8 iov
# put_opt_zcopy_align: <= 512
# put_align_mtu: <= 1K
# get_bcopy: <= 8K
# get_zcopy: 65..1G, up to 8 iov
# get_opt_zcopy_align: <= 512
# get_align_mtu: <= 1K
# am_short: <= 2046
# am_bcopy: <= 8190
# am_zcopy: <= 8190, up to 3 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 1K
# am header: <= 186
# domain: device
# atomic_add: 64 bit
# atomic_fadd: 64 bit
# atomic_cswap: 64 bit
# connection: to ep
# priority: 50
# device address: 17 bytes
# ep address: 7 bytes
# error handling: buffer (zcopy), remote access, peer failure
#
#
# Transport: ud_verbs
# Device: mlx5_0:1
uct_ib_iface_t_init:858 0x911f20
uct_ib_verbs_create_cq:626 0x915580 0
uct_ib_verbs_create_cq:636 0x915580 0x911f20 0
uct_ib_verbs_create_cq:626 0x9157a0 0
uct_ib_verbs_create_cq:636 0x9157a0 0x911f20 0
#
# capabilities:
# bandwidth: 0.00 + 10957.84 MB/sec
# latency: 810 nsec
# overhead: 105 nsec
# am_short: <= 116
# am_bcopy: <= 1016
# am_zcopy: <= 1016, up to 1 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 1K
# am header: <= 912
# connection: to ep, to iface
# priority: 50
# device address: 17 bytes
# iface address: 3 bytes
# ep address: 6 bytes
# error handling: peer failure
#
#
# Transport: ud_mlx5
# Device: mlx5_0:1
uct_ib_iface_t_init:858 0x911f20
#
# capabilities:
# bandwidth: 0.00 + 10957.84 MB/sec
# latency: 810 nsec
# overhead: 80 nsec
# am_short: <= 180
# am_bcopy: <= 1016
# am_zcopy: <= 1016, up to 3 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 1K
# am header: <= 132
# connection: to ep, to iface
# priority: 50
# device address: 17 bytes
# iface address: 3 bytes
# ep address: 6 bytes
# error handling: peer failure
#
#
@schluenz Ok, that was the problem: on the CentOS 7.7 kernel, UCX wrongly detects DEVX support.
@yosefe We will come up with a proper fix; meanwhile UCX_IB_MLX5_DEVX=n is a workaround.
@alex--m Please pull the master branch to remove my debug prints.
Thanks @Artemy-Mellanox !
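To check whether the UCX_IB_MLX5_DEVX=n workaround takes effect on a given node, one option is to count the "failed to open interface" markers in the ucx_info -d output with and without the variable. The sketch below assumes ucx_info is on PATH; the helper name is illustrative.

```bash
#!/usr/bin/env bash
# Count transports that ucx_info could not open, reading its output on stdin.
# "|| true" keeps the exit status at 0 when grep finds no match.
count_failed_ifaces() {
    grep -c '< failed to open interface >' || true
}

# Example (not executed here):
#   ucx_info -d 2>&1 | count_failed_ifaces
#   UCX_IB_MLX5_DEVX=n ucx_info -d 2>&1 | count_failed_ifaces
```

On an affected node the first count should be non-zero, and the second should drop to zero if the workaround applies.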
Describe the bug
ucx_info -d shows various errors (depending on the UCX version) on nodes with a ConnectX-6 HCA. I tried several versions of UCX but didn't succeed in using it with ConnectX-6. Before posting lots of details: is ConnectX-6 supported at all?