openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

no ucx with connectx-6 #4556

Closed schluenz closed 4 years ago

schluenz commented 4 years ago

Describe the bug

ucx_info -d shows various errors (depending on the ucx version) on nodes with connectx-6 hca. I tried several versions of ucx but didn't succeed using it with connectx-6. Before posting lots of details: is connectx-6 supported at all?

Some Errors for ucx_info:
ucx 1.4.0:
[1575906412.855268] [max-exfl200:137500:0]       ib_iface.c:947  UCX  ERROR Invalid active_width on mlx5_0:1: 16
ucx 1.6.1:
[1575906773.636572] [max-exfl200:138596:0]     ib_mlx5_dv.c:157  UCX  ERROR ibv_create_cq() failed: Invalid argument
ucx 1.7.0:
[1575908314.020997] [max-exfl200:168320:0]       ib_iface.c:618  UCX  ERROR ibv_create_cq(cqe=4096) failed: Invalid argument

Short extract from ibv_devinfo -v:
hca_id:        mlx5_0
        transport:                        InfiniBand (0)
        fw_ver:                                20.26.1040
        node_guid:                        b859:9f03:004e:bd14
        sys_image_guid:                        b859:9f03:004e:bd14
        vendor_id:                        0x02c9
        vendor_part_id:                        4123
        hw_ver:                                0x0
        board_id:                        MT_0000000222
        phys_port_cnt:                        1
        max_mr_size:                        0xffffffffffffffff
        page_size_cap:                        0xfffffffffffff000
        max_qp:                                262144
        max_qp_wr:                        32768
yosefe commented 4 years ago

@schluenz can you pls provide OS and driver information as described in the GH issue template

schluenz commented 4 years ago

ok, more complete info:

CentOS Linux release 7.7.1908 (Core) Linux max-exfl200.desy.de 3.10.0-1062.4.1.el7.x86_64 #1 SMP Fri Oct 18 17:15:30 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

rdma-core-22.1-3.el7.x86_64 libibverbs-22.1-3.el7.x86_64 ibv_devinfo at the end


System/RedHat ucx:
/usr/bin/ucx_info -v
# UCT version=1.4.0 revision 0000000
# configured with: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-optimizations --disable-logging --disable-debug --disable-assertions --disable-params-check

/usr/bin/ucx_info -d
#   Device: mlx5_0:1
#
#      capabilities:
[1575906412.846334] [max-exfl200:137500:0]       ib_iface.c:947  UCX  ERROR Invalid active_width on mlx5_0:1: 16
#   < failed to query interface >
#
#
#   Transport: ud
#
#   Device: mlx5_0:1
#
#      capabilities:
[1575906412.855268] [max-exfl200:137500:0]       ib_iface.c:947  UCX  ERROR Invalid active_width on mlx5_0:1: 16
#   < failed to query interface >
--------------------------------------------------------------------------------------------------------------------

Self-compiled 1.6.1:
/software/ucx/1.6.1/bin/ucx_info -v
# UCT version=1.6.1 revision
# configured with: --prefix=/software/ucx/1.6.1 --with-cuda=/usr/local/cuda-9.0 --enable-mt

#   Device: ib0
#
#      capabilities:
#            bandwidth: 11142.51 MB/sec
#              latency: 5206 nsec
#             overhead: 50000 nsec
#             am_short: <= 8K
#             am_bcopy: <= 8K
#           connection: to iface
#             priority: 1
#       device address: 4 bytes
#        iface address: 2 bytes
#       error handling: none
#
[1575906773.636572] [max-exfl200:138596:0]     ib_mlx5_dv.c:157  UCX  ERROR ibv_create_cq() failed: Invalid argument
# < failed to open memory domain ib/mlx5_0 >
--------------------------------------------------------------------------------------------------------------------

Self-compiled 1.7.0:
/software/ucx/1.7.0/bin/ucx_info -v
# UCT version=1.7.0 revision 7a5460d
# configured with: --prefix=/software/ucx/1.7.0

and self-compiled git-master:
/software/ucx/1.7.0/bin/ucx_info -v
# UCT version=1.8.0 revision 0dc670d
# configured with: --prefix=/software/ucx/1.7.0 --without-cuda

give the same result:

/software/ucx/1.7.0/bin/ucx_info -d
# Memory domain: mlx5_0
#     Component: ib
#             register: unlimited, cost: 90 nsec
#           remote key: 8 bytes
#           local memory handle is required for zcopy
#
#   Transport: rc_verbs
#      Device: mlx5_0:1
[1575906843.530270] [max-exfl200:138808:0]       ib_iface.c:629  UCX  ERROR ibv_create_cq(cqe=4096) failed: Invalid argument
#   < failed to open interface >
#
#   Transport: rc_mlx5
#      Device: mlx5_0:1
[1575906843.532303] [max-exfl200:138808:0]        ib_mlx5.c:73   UCX  ERROR mlx5dv_create_cq(cqe=4096) failed: Invalid argument
#   < failed to open interface >
#
#   Transport: dc_mlx5
#      Device: mlx5_0:1
[1575906843.534152] [max-exfl200:138808:0]        ib_mlx5.c:73   UCX  ERROR mlx5dv_create_cq(cqe=4096) failed: Invalid argument
#   < failed to open interface >
#
#   Transport: ud_verbs
#      Device: mlx5_0:1
[1575906843.534672] [max-exfl200:138808:0]       ib_iface.c:629  UCX  ERROR ibv_create_cq(cqe=256) failed: Invalid argument
#   < failed to open interface >
#
#   Transport: ud_mlx5
#      Device: mlx5_0:1
[1575906843.535087] [max-exfl200:138808:0]        ib_mlx5.c:73   UCX  ERROR mlx5dv_create_cq(cqe=256) failed: Invalid argument
#   < failed to open interface >
#
# Memory domain: rdmacm
#     Component: rdmacm
#           supports client-server connection establishment via sockaddr
#   < no supported devices found >

--------------------------------------------------------------------------------------------------------------------

ibv_devinfo -vv
hca_id: mlx5_0
    transport:          InfiniBand (0)
    fw_ver:             20.26.1040
    node_guid:          b859:9f03:004e:bd14
    sys_image_guid:         b859:9f03:004e:bd14
    vendor_id:          0x02c9
    vendor_part_id:         4123
    hw_ver:             0x0
    board_id:           MT_0000000222
    phys_port_cnt:          1
    max_mr_size:            0xffffffffffffffff
    page_size_cap:          0xfffffffffffff000
    max_qp:             262144
    max_qp_wr:          32768
    device_cap_flags:       0xe97e1c36
                    BAD_PKEY_CNTR
                    BAD_QKEY_CNTR
                    AUTO_PATH_MIG
                    CHANGE_PHY_PORT
                    PORT_ACTIVE_EVENT
                    SYS_IMAGE_GUID
                    RC_RNR_NAK_GEN
                    MEM_WINDOW
                    UD_IP_CSUM
                    XRC
                    MEM_MGT_EXTENSIONS
                    MEM_WINDOW_TYPE_2B
                    MANAGED_FLOW_STEERING
                    Unknown flags: 0xC8480000
    max_sge:            30
    max_sge_rd:         30
    max_cq:             16777216
    max_cqe:            4194303
    max_mr:             16777216
    max_pd:             16777216
    max_qp_rd_atom:         16
    max_ee_rd_atom:         0
    max_res_rd_atom:        4194304
    max_qp_init_rd_atom:        16
    max_ee_init_rd_atom:        0
    atomic_cap:         ATOMIC_HCA (1)
    max_ee:             0
    max_rdd:            0
    max_mw:             16777216
    max_raw_ipv6_qp:        0
    max_raw_ethy_qp:        0
    max_mcast_grp:          2097152
    max_mcast_qp_attach:        240
    max_total_mcast_qp_attach:  503316480
    max_ah:             2147483647
    max_fmr:            0
    max_srq:            8388608
    max_srq_wr:         32767
    max_srq_sge:            31
    max_pkeys:          128
    local_ca_ack_delay:     16
    general_odp_caps:
                    ODP_SUPPORT
                    Unknown flags: 0x2
    rc_odp_caps:
                    SUPPORT_SEND
                    SUPPORT_RECV
                    SUPPORT_WRITE
                    SUPPORT_READ
    uc_odp_caps:
                    NO SUPPORT
    ud_odp_caps:
                    SUPPORT_SEND
    completion timestamp_mask:          0x7fffffffffffffff
    hca_core_clock:         156250kHZ
    device_cap_flags_ex:        0x11E97E1C36
                    PCI_WRITE_END_PADDING
                    Unknown flags: 0x100000000
    tso_caps:
    max_tso:            0
    rss_caps:
        max_rwq_indirection_tables:         0
        max_rwq_indirection_table_size:         0
        rx_hash_function:               0x0
        rx_hash_fields_mask:                0x0
    max_wq_type_rq:         0
    packet_pacing_caps:
        qp_rate_limit_min:  0kbps
        qp_rate_limit_max:  0kbps
    max_rndv_hdr_size:      64
    max_num_tags:           127
    max_ops:            32768
    max_sge:            1
    flags:
                    IBV_TM_CAP_RC

    cq moderation caps:
        max_cq_count:   65535
        max_cq_period:  4095 us

    maximum available device memory:    262144Bytes

        port:   1
            state:          PORT_ACTIVE (4)
            max_mtu:        4096 (5)
            active_mtu:     4096 (5)
            sm_lid:         17
            port_lid:       102
            port_lmc:       0x00
            link_layer:     InfiniBand
            max_msg_sz:     0x40000000
            port_cap_flags:     0x2259e848
            port_cap_flags2:    0x0000
            max_vl_num:     4 (3)
            bad_pkey_cntr:      0x0
            qkey_viol_cntr:     0x0
            sm_sl:          0
            pkey_tbl_len:       128
            gid_tbl_len:        8
            subnet_timeout:     18
            init_type_reply:    0
            active_width:       2X (16)
            active_speed:       50.0 Gbps (64)
            phys_state:     LINK_UP (5)
            GID[  0]:       fe80:0000:0000:0000:b859:9f03:004e:bd14
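
The `Invalid active_width on mlx5_0:1: 16` error from the system UCX 1.4.0 lines up with the `active_width: 2X (16)` value in the port listing above: it suggests that this UCX build does not recognize width code 16, though the thread does not confirm that. As a minimal sketch, the code-to-width mapping that ibv_devinfo prints can be written as a small shell function (the name `decode_active_width` is hypothetical, not part of UCX or rdma-core):

```shell
# Map the numeric active_width code reported by ibv_devinfo to a link width.
# Hypothetical helper for illustration; only the codes listed here are handled.
decode_active_width() {
    case "$1" in
        1)  echo "1X"  ;;
        2)  echo "4X"  ;;
        4)  echo "8X"  ;;
        8)  echo "12X" ;;
        16) echo "2X"  ;;
        *)  echo "unknown"; return 1 ;;
    esac
}

# Example: the value reported for mlx5_0 port 1 in this issue.
# decode_active_width 16
```

With the 50.0 Gbps per-lane `active_speed` shown above, a 2X width is consistent with a 100 Gb/s link.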
yosefe commented 4 years ago

Thanks, it seems similar to the issue reported on the mailing list. @Artemy-Mellanox, please take a look.

Artemy-Mellanox commented 4 years ago

@schluenz could you please also check ulimit -a

schluenz commented 4 years ago
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 3088715
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 3088715
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
Artemy-Mellanox commented 4 years ago

We need to collect some debug info. Enable dynamic debug:
echo "module mlx5_ib +p" | sudo tee /sys/kernel/debug/dynamic_debug/control
Then run ucx_info under strace:
strace /software/ucx/1.7.0/bin/ucx_info -d
Please collect the ucx_info output and dmesg and send them to us.
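
The collection steps requested here can be wrapped in a few shell functions. This is only a sketch under stated assumptions: the function names, the `DYNDBG_CONTROL` variable, and the output file names are hypothetical. The control file path is the one used in this thread, and writing to it requires root.

```shell
# Path of the kernel dynamic-debug control file (overridable, e.g. for a dry run).
DYNDBG_CONTROL="${DYNDBG_CONTROL:-/sys/kernel/debug/dynamic_debug/control}"

# Enable debug messages for one kernel module, e.g. mlx5_ib or mlx5_core.
enable_dyndbg() {
    printf 'module %s +p\n' "$1" > "$DYNDBG_CONTROL"
}

# Disable them again once the logs have been collected.
disable_dyndbg() {
    printf 'module %s -p\n' "$1" > "$DYNDBG_CONTROL"
}

# Run "ucx_info -d" under strace and save its output plus the kernel log.
# $1: path to the ucx_info binary, $2: directory for the output files.
collect_ucx_debug() {
    strace -s2048 -o "$2/ucx_info.strace.txt" "$1" -d > "$2/ucx_info.out.txt" 2>&1
    dmesg -T > "$2/ucx_info.dmesg.txt"
}

# Example (as root):
#   enable_dyndbg mlx5_ib
#   collect_ucx_debug /software/ucx/1.7.0/bin/ucx_info /tmp
#   disable_dyndbg mlx5_ib
```

Turning the debug output off afterwards keeps later dmesg captures from filling up with unrelated driver messages.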

schluenz commented 4 years ago

I did:
echo "module mlx5_ib +p" | tee /sys/kernel/debug/dynamic_debug/control
strace -o ucx_info.strace.txt /software/ucx/1.7.0/bin/ucx_info -d > ucx_info.out.txt
dmesg -T | grep "everything after adding mlx5_ib" > ucx_info.dmesg.txt

outputs are attached. dmesg contains three ucx_info -d runs.

ucx_info.dmesg.txt ucx_info.out.txt ucx_info.strace.txt

Artemy-Mellanox commented 4 years ago

Thanks. Could you please repeat it with more debug:
echo "module mlx5_core +p" | tee /sys/kernel/debug/dynamic_debug/control
and
strace -s2048 ...

schluenz commented 4 years ago

attached, just the strace -s2048 output

ucx_info.strace-2048.txt

Artemy-Mellanox commented 4 years ago

mlx5_core dynamic_debug dmesg actually would be more informative

schluenz commented 4 years ago

sorry, attached

ucx_info.dmesg-2048.txt

Artemy-Mellanox commented 4 years ago

Could you please do the following check: look at a dmesg snapshot:
dmesg | tail -500 | grep INPUT

From the log you attached, the output is:

[Tue Dec 10 12:14:03 2019] mlx5_core 0000:5e:00.0: dump_command:772:(pid 239539): dump command MODIFY_CQ(0x403) INPUT
[Tue Dec 10 12:14:03 2019] mlx5_core 0000:5e:00.0: dump_command:772:(pid 239539): dump command MODIFY_CQ(0x403) INPUT
[Tue Dec 10 12:14:03 2019] mlx5_core 0000:5e:00.0: dump_command:772:(pid 239539): dump command MODIFY_CQ(0x403) INPUT
[Tue Dec 10 12:14:03 2019] mlx5_core 0000:5e:00.0: dump_command:772:(pid 239539): dump command MODIFY_CQ(0x403) INPUT
[Tue Dec 10 12:14:03 2019] mlx5_core 0000:5e:00.0: dump_command:772:(pid 239539): dump command MODIFY_CQ(0x403) INPUT

Please check with ps auxww | grep 239539 (use pid from dmesg). I want to know where those commands come from.
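
To look up every process that shows up in those lines rather than one pid by hand, the pids can be pulled out of the dmesg text first. A minimal sketch, assuming the `dump_command:<line>:(pid N)` format shown above (the function name `dump_command_pids` is hypothetical):

```shell
# Print the unique PIDs found in mlx5_core "dump_command" lines read from stdin.
dump_command_pids() {
    sed -n 's/.*dump_command:[0-9]*:(pid \([0-9]*\)).*/\1/p' | sort -un
}

# Example: resolve each PID to its process (kernel threads appear in brackets).
#   dmesg | tail -500 | grep INPUT | dump_command_pids |
#       while read -r pid; do ps auxww | grep -w "$pid"; done
```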

schluenz commented 4 years ago

[3086989.576394] mlx5_core 0000:5e:00.0: dump_command:772:(pid 33983): dump command MODIFY_CQ(0x403) INPUT

ps auxww | grep 33983
root 33983 0.0 0.0 0 0 ? S 09:51 0:00 [kworker/u592:1]

Artemy-Mellanox commented 4 years ago

Could it be that the second port is in RoCE mode and is active? Can we check it with ibv_devinfo and ifconfig and shut it down, to silence its debug traffic so that we can see the debug output related to the ucx_info command?

schluenz commented 4 years ago

not sure I understand the question correctly.

It's a single-port adapter: Product Name: ConnectX-6 VPI adapter card, 100Gb/s (HDR100, EDR IB and 100GbE), single-port QSFP56,
The nodes are connected to HDR200 switches over split-port cables (hdr200->2x100).
I don't see any indication for RoCE being active. We unavoidably use ipoib.

ip addr sh
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether a4:bf:01:5c:45:ad brd ff:ff:ff:ff:ff:ff
    inet 131.169.179.44/24 brd 131.169.179.255 scope global noprefixroute dynamic eno1
       valid_lft 400059sec preferred_lft 400059sec
    inet6 fe80::a6bf:1ff:fe5c:45ad/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
3: eno2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether a4:bf:01:5c:45:ae brd ff:ff:ff:ff:ff:ff
4: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP group default qlen 256
    link/infiniband 00:00:03:40:fe:80:00:00:00:00:00:00:b8:59:9f:03:00:4e:bd:14 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    inet 10.255.33.7/20 brd 10.255.47.255 scope global noprefixroute ib0
       valid_lft forever preferred_lft forever
    inet6 fe80::ba59:9f03:4e:bd14/64 scope link 
       valid_lft forever preferred_lft forever
5: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default 
    link/ether 02:42:d2:75:94:92 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
Artemy-Mellanox commented 4 years ago

Could you please try a simple ibv_rc_pingpong loopback test?

schluenz commented 4 years ago

The output looks like this:
local address: LID 0x0068, QPN 0x002035, PSN 0xefdd06, GID ::
remote address: LID 0x0068, QPN 0x002035, PSN 0xefdd06, GID ::
remote address: LID 0x0068, QPN 0x002034, PSN 0xd4cad5, GID ::
8192000 bytes in 0.01 seconds = 11336.45 Mbit/sec
8192000 bytes in 0.01 seconds = 11793.41 Mbit/sec
1000 iters in 0.01 seconds = 5.78 usec/iter
1000 iters in 0.01 seconds = 5.56 usec/iter

PS: I got the hint that the RedHat 7.6 HPC-X inbox build still works, which is indeed the case.
hpcx-v2.5.0-gcc-inbox-redhat7.6-x86_64: ucx_info -d works as expected.
hpcx-v2.5.0-gcc-inbox-redhat7.7-x86_64: ucx_info -d throws errors like those mentioned earlier:
ib_iface.c:618 UCX ERROR ibv_create_cq(cqe=4096) failed: Invalid argument
ib_mlx5.c:73 UCX ERROR mlx5dv_create_cq(cqe=4096) failed: Invalid argument

alex--m commented 4 years ago

@Artemy-Mellanox @schluenz I've got the same problem - let me know if there's anything I can do to help debug this... BTW reproduces with Ethernet (RoCE) on the latest CX6 firmware (20.26.4012) and UCX version (master).

Artemy-Mellanox commented 4 years ago

Could you please build UCX from this debug branch and try again: https://github.com/Artemy-Mellanox/ucx/tree/dbg-4556

alex--m commented 4 years ago

@Artemy-Mellanox

> #> UCX_LOG_LEVEL=debug ./build/bin/ucx_info -d

...

# Memory domain: mlx5_0
#     Component: ib
#             register: unlimited, cost: 90 nsec
#           remote key: 24 bytes
#           local memory handle is required for zcopy
#
[1577793869.845859] [thunder6:167167:0]          ib_md.c:266  UCX  DEBUG mlx5_0: cuda GPUDirect RDMA is disabled
[1577793869.845878] [thunder6:167167:0]          ib_md.c:266  UCX  DEBUG mlx5_0: rocm GPUDirect RDMA is disabled
#   Transport: rc_verbs
#      Device: mlx5_0:1
[1577793869.846028] [thunder6:167167:0]          ib_md.c:266  UCX  DEBUG mlx5_0: cuda GPUDirect RDMA is disabled
[1577793869.846044] [thunder6:167167:0]          ib_md.c:266  UCX  DEBUG mlx5_0: rocm GPUDirect RDMA is disabled
[1577793869.846169] [thunder6:167167:0]       ib_iface.c:466  UCX  DEBUG using pkey[0] 0xffff on mlx5_0:1
[1577793869.846276] [thunder6:167167:0]      ib_device.c:529  UCX  DEBUG testing addr_family on gid index 1: fe80::1e34:daff:fe66:fd34
[1577793869.846363] [thunder6:167167:0]      ib_device.c:529  UCX  DEBUG testing addr_family on gid index 3: ::ffff:192.168.30.16
[1577793869.846473] [thunder6:167167:0]      ib_device.c:671  UCX  DEBUG mlx5_0:1 using gid_index 3
[1577793869.846537] [thunder6:167167:0]       ib_iface.c:516  UCX  DEBUG Not using value 1 for path_bits - must be < 2^lmc (lmc=0)
[1577793869.847324] [thunder6:167167:0]          mpool.c:205  UCX  DEBUG mpool rcache_inv_mp: allocated chunk 0x7ff4453b2008 of 36856 bytes with 1151 elements
uct_ib_verbs_create_cq:626 (nil) 22
uct_ib_verbs_create_cq:636 (nil)
[1577793869.848130] [thunder6:167167:0]       ib_iface.c:638  UCX  ERROR ibv_create_cq(cqe=4096) failed: Invalid argument
#   < failed to open interface >
#
[1577793869.848218] [thunder6:167167:0]          ib_md.c:266  UCX  DEBUG mlx5_0: cuda GPUDirect RDMA is disabled
[1577793869.848236] [thunder6:167167:0]          ib_md.c:266  UCX  DEBUG mlx5_0: rocm GPUDirect RDMA is disabled
#   Transport: rc_mlx5
#      Device: mlx5_0:1
[1577793869.848380] [thunder6:167167:0]          ib_md.c:266  UCX  DEBUG mlx5_0: cuda GPUDirect RDMA is disabled
[1577793869.848401] [thunder6:167167:0]          ib_md.c:266  UCX  DEBUG mlx5_0: rocm GPUDirect RDMA is disabled
[1577793869.848503] [thunder6:167167:0]       ib_iface.c:466  UCX  DEBUG using pkey[0] 0xffff on mlx5_0:1
[1577793869.848600] [thunder6:167167:0]      ib_device.c:529  UCX  DEBUG testing addr_family on gid index 1: fe80::1e34:daff:fe66:fd34
[1577793869.848690] [thunder6:167167:0]      ib_device.c:529  UCX  DEBUG testing addr_family on gid index 3: ::ffff:192.168.30.16
[1577793869.848757] [thunder6:167167:0]      ib_device.c:671  UCX  DEBUG mlx5_0:1 using gid_index 3
[1577793869.848816] [thunder6:167167:0]       ib_iface.c:516  UCX  DEBUG Not using value 1 for path_bits - must be < 2^lmc (lmc=0)
[1577793869.848853] [thunder6:167167:0]      ib_device.c:1067 UCX  DEBUG max IB CQE size is 128
[1577793869.850178] [thunder6:167167:0]        ib_mlx5.c:73   UCX  ERROR mlx5dv_create_cq(cqe=4096) failed: Invalid argument
#   < failed to open interface >
#
[1577793869.850264] [thunder6:167167:0]          ib_md.c:266  UCX  DEBUG mlx5_0: cuda GPUDirect RDMA is disabled
[1577793869.850283] [thunder6:167167:0]          ib_md.c:266  UCX  DEBUG mlx5_0: rocm GPUDirect RDMA is disabled
#   Transport: dc_mlx5
#      Device: mlx5_0:1
[1577793869.850437] [thunder6:167167:0]          ib_md.c:266  UCX  DEBUG mlx5_0: cuda GPUDirect RDMA is disabled
[1577793869.850453] [thunder6:167167:0]          ib_md.c:266  UCX  DEBUG mlx5_0: rocm GPUDirect RDMA is disabled
[1577793869.850569] [thunder6:167167:0]       ib_iface.c:466  UCX  DEBUG using pkey[0] 0xffff on mlx5_0:1
[1577793869.850664] [thunder6:167167:0]      ib_device.c:529  UCX  DEBUG testing addr_family on gid index 1: fe80::1e34:daff:fe66:fd34
[1577793869.850753] [thunder6:167167:0]      ib_device.c:529  UCX  DEBUG testing addr_family on gid index 3: ::ffff:192.168.30.16
[1577793869.850822] [thunder6:167167:0]      ib_device.c:671  UCX  DEBUG mlx5_0:1 using gid_index 3
[1577793869.850880] [thunder6:167167:0]       ib_iface.c:516  UCX  DEBUG Not using value 1 for path_bits - must be < 2^lmc (lmc=0)
[1577793869.852090] [thunder6:167167:0]        ib_mlx5.c:73   UCX  ERROR mlx5dv_create_cq(cqe=4096) failed: Invalid argument
#   < failed to open interface >
#
[1577793869.852163] [thunder6:167167:0]          ib_md.c:266  UCX  DEBUG mlx5_0: cuda GPUDirect RDMA is disabled
[1577793869.852181] [thunder6:167167:0]          ib_md.c:266  UCX  DEBUG mlx5_0: rocm GPUDirect RDMA is disabled
#   Transport: ud_verbs
#      Device: mlx5_0:1
[1577793869.852290] [thunder6:167167:0]          ib_md.c:266  UCX  DEBUG mlx5_0: cuda GPUDirect RDMA is disabled
[1577793869.852310] [thunder6:167167:0]          ib_md.c:266  UCX  DEBUG mlx5_0: rocm GPUDirect RDMA is disabled
[1577793869.852420] [thunder6:167167:0]       ib_iface.c:466  UCX  DEBUG using pkey[0] 0xffff on mlx5_0:1
[1577793869.852515] [thunder6:167167:0]      ib_device.c:529  UCX  DEBUG testing addr_family on gid index 1: fe80::1e34:daff:fe66:fd34
[1577793869.852604] [thunder6:167167:0]      ib_device.c:529  UCX  DEBUG testing addr_family on gid index 3: ::ffff:192.168.30.16
[1577793869.852673] [thunder6:167167:0]      ib_device.c:671  UCX  DEBUG mlx5_0:1 using gid_index 3
[1577793869.852726] [thunder6:167167:0]       ib_iface.c:516  UCX  DEBUG Not using value 1 for path_bits - must be < 2^lmc (lmc=0)
uct_ib_verbs_create_cq:626 (nil) 22
uct_ib_verbs_create_cq:636 (nil)
[1577793869.853124] [thunder6:167167:0]       ib_iface.c:638  UCX  ERROR ibv_create_cq(cqe=256) failed: Invalid argument
#   < failed to open interface >
#
[1577793869.853179] [thunder6:167167:0]          ib_md.c:266  UCX  DEBUG mlx5_0: cuda GPUDirect RDMA is disabled
[1577793869.853196] [thunder6:167167:0]          ib_md.c:266  UCX  DEBUG mlx5_0: rocm GPUDirect RDMA is disabled
#   Transport: ud_mlx5
#      Device: mlx5_0:1
[1577793869.853300] [thunder6:167167:0]          ib_md.c:266  UCX  DEBUG mlx5_0: cuda GPUDirect RDMA is disabled
[1577793869.853316] [thunder6:167167:0]          ib_md.c:266  UCX  DEBUG mlx5_0: rocm GPUDirect RDMA is disabled
[1577793869.853396] [thunder6:167167:0]       ib_iface.c:466  UCX  DEBUG using pkey[0] 0xffff on mlx5_0:1
[1577793869.853486] [thunder6:167167:0]      ib_device.c:529  UCX  DEBUG testing addr_family on gid index 1: fe80::1e34:daff:fe66:fd34
[1577793869.853576] [thunder6:167167:0]      ib_device.c:529  UCX  DEBUG testing addr_family on gid index 3: ::ffff:192.168.30.16
[1577793869.853640] [thunder6:167167:0]      ib_device.c:671  UCX  DEBUG mlx5_0:1 using gid_index 3
[1577793869.853698] [thunder6:167167:0]       ib_iface.c:516  UCX  DEBUG Not using value 1 for path_bits - must be < 2^lmc (lmc=0)
[1577793869.853864] [thunder6:167167:0]        ib_mlx5.c:73   UCX  ERROR mlx5dv_create_cq(cqe=256) failed: Invalid argument
#   < failed to open interface >
[1577793869.854050] [thunder6:167167:0]          mpool.c:142  UCX  DEBUG mpool devx dbrec destroyed
[1577793869.854123] [thunder6:167167:0]          mpool.c:142  UCX  DEBUG mpool rcache_inv_mp destroyed
[1577793869.854217] [thunder6:167167:0]      ib_device.c:388  UCX  DEBUG destroying ib device mlx5_0
[1577793869.860968] [thunder6:167167:0]      ib_device.c:273  UCX  DEBUG mlx5_1 vendor_id: 0x15b3 device_id: 4123
[1577793869.861165] [thunder6:167167:0]   ib_mlx5dv_md.c:447  UCX  DEBUG mlx5_1: disable ODP on RoCE
[1577793869.861361] [thunder6:167167:0]      ib_device.c:368  UCX  DEBUG initialized device 'mlx5_1' (InfiniBand channel adapter) with 1 ports
[1577793869.861450] [thunder6:167167:0]          ib_md.c:266  UCX  DEBUG mlx5_1: cuda GPUDirect RDMA is disabled
[1577793869.861495] [thunder6:167167:0]          ib_md.c:266  UCX  DEBUG mlx5_1: rocm GPUDirect RDMA is disabled
[1577793869.861520] [thunder6:167167:0]          mpool.c:88   UCX  DEBUG mpool rcache_inv_mp: align 1, maxelems 4294967295, elemsize 32
[1577793869.861614] [thunder6:167167:0]          ib_md.c:1160 UCX  DEBUG mlx5_1: using registration cache
[1577793869.861706] [thunder6:167167:0]          mpool.c:88   UCX  DEBUG mpool devx dbrec: align 64, maxelems 4294967295, elemsize 40
[1577793869.861920] [thunder6:167167:0]      ib_device.c:827  UCX  DEBUG no compatible IB ports found for flags 0x0
[1577793869.861955] [thunder6:167167:0]         uct_md.c:85   UCX  DEBUG failed to query rc_verbs resources: No such device
[1577793869.861978] [thunder6:167167:0]      ib_device.c:827  UCX  DEBUG no compatible IB ports found for flags 0x4
[1577793869.861997] [thunder6:167167:0]         uct_md.c:85   UCX  DEBUG failed to query rc_mlx5 resources: No such device
[1577793869.862012] [thunder6:167167:0]      ib_device.c:827  UCX  DEBUG no compatible IB ports found for flags 0xc4
[1577793869.862025] [thunder6:167167:0]         uct_md.c:85   UCX  DEBUG failed to query dc_mlx5 resources: No such device
[1577793869.862045] [thunder6:167167:0]      ib_device.c:827  UCX  DEBUG no compatible IB ports found for flags 0x0
[1577793869.862061] [thunder6:167167:0]         uct_md.c:85   UCX  DEBUG failed to query ud_verbs resources: No such device
[1577793869.862074] [thunder6:167167:0]      ib_device.c:827  UCX  DEBUG no compatible IB ports found for flags 0x4
[1577793869.862087] [thunder6:167167:0]         uct_md.c:85   UCX  DEBUG failed to query ud_mlx5 resources: No such device
[1577793869.862108] [thunder6:167167:0]          ib_md.c:266  UCX  DEBUG mlx5_1: cuda GPUDirect RDMA is disabled
[1577793869.862126] [thunder6:167167:0]          ib_md.c:266  UCX  DEBUG mlx5_1: rocm GPUDirect RDMA is disabled
#
# Memory domain: mlx5_1
...
Artemy-Mellanox commented 4 years ago

@alex--m, could you please try again? I force-pushed more debug output.

alex--m commented 4 years ago

@Artemy-Mellanox

# Memory domain: mlx5_0
#     Component: ib
#             register: unlimited, cost: 90 nsec
#           remote key: 24 bytes
#           local memory handle is required for zcopy
#
#   Transport: rc_verbs
#      Device: mlx5_0:1
uct_ib_iface_t_init:858 0x12ebce0
uct_ib_verbs_create_cq:626 (nil) 22
uct_ib_verbs_create_cq:636 (nil) 0x12ebce0 0
[1577801148.897735] [thunder2:8084 :0]       ib_iface.c:638  UCX  ERROR ibv_create_cq(cqe=4096) failed: Invalid argument
#   < failed to open interface >
#
#   Transport: rc_mlx5
#      Device: mlx5_0:1
uct_ib_iface_t_init:858 0x12ebce0
[1577801148.902712] [thunder2:8084 :0]        ib_mlx5.c:73   UCX  ERROR mlx5dv_create_cq(cqe=4096) failed: Invalid argument
#   < failed to open interface >
#
#   Transport: dc_mlx5
#      Device: mlx5_0:1
uct_ib_iface_t_init:858 0x12ebce0
[1577801148.907219] [thunder2:8084 :0]        ib_mlx5.c:73   UCX  ERROR mlx5dv_create_cq(cqe=4096) failed: Invalid argument
#   < failed to open interface >
#
#   Transport: ud_verbs
#      Device: mlx5_0:1
uct_ib_iface_t_init:858 0x12ebce0
uct_ib_verbs_create_cq:626 (nil) 22
uct_ib_verbs_create_cq:636 (nil) 0x12ebce0 0
[1577801148.910754] [thunder2:8084 :0]       ib_iface.c:638  UCX  ERROR ibv_create_cq(cqe=256) failed: Invalid argument
#   < failed to open interface >
#
#   Transport: ud_mlx5
#      Device: mlx5_0:1
uct_ib_iface_t_init:858 0x12ebce0
[1577801148.914428] [thunder2:8084 :0]        ib_mlx5.c:73   UCX  ERROR mlx5dv_create_cq(cqe=256) failed: Invalid argument
#   < failed to open interface >
Artemy-Mellanox commented 4 years ago

@alex--m please try:
echo "module mlx5_core +p" | sudo tee /sys/kernel/debug/dynamic_debug/control
Then run the command and send me dmesg.

alex--m commented 4 years ago

@Artemy-Mellanox dmesg_afer_ucx_info.thunder6.txt dmesg.thunder6.txt

And also: Happy New Year! :)

Artemy-Mellanox commented 4 years ago

Happy New Year! ;) Please send the output of:
sudo find /sys/kernel/debug/mlx5/ -path '*EQ*' -type f -print -exec cat {} \;

alex--m commented 4 years ago

@Artemy-Mellanox output.txt

Artemy-Mellanox commented 4 years ago

please add UCX_IB_MLX5_DEVX=n environment variable and rerun the command

alex--m commented 4 years ago

@Artemy-Mellanox Do you mean the ucx_info command? (see below)

# Memory domain: mlx5_0
#     Component: ib
#             register: unlimited, cost: 90 nsec
#           remote key: 24 bytes
#           local memory handle is required for zcopy
#
#   Transport: rc_verbs
#      Device: mlx5_0:1
uct_ib_iface_t_init:858 0x8f3810
uct_ib_verbs_create_cq:626 0x91afd0 93
uct_ib_verbs_create_cq:636 0x91afd0 0x8f3810 0
uct_ib_verbs_create_cq:626 0x91b1f0 93
uct_ib_verbs_create_cq:636 0x91b1f0 0x8f3810 0
#
#      capabilities:
#            bandwidth: 0.00 + 10957.84 MB/sec
#              latency: 800 nsec + 1 * N
#             overhead: 75 nsec
#            put_short: <= 124
#            put_bcopy: <= 8K
#            put_zcopy: <= 1G, up to 3 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 1K
#            get_bcopy: <= 8K
#            get_zcopy: 65..1G, up to 3 iov
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 1K
#             am_short: <= 123
#             am_bcopy: <= 8191
#             am_zcopy: <= 8191, up to 2 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 1K
#            am header: <= 127
#               domain: device
#           atomic_add: 64 bit
#          atomic_fadd: 64 bit
#         atomic_cswap: 64 bit
#           connection: to ep
#             priority: 50
#       device address: 17 bytes
#           ep address: 3 bytes
#       error handling: peer failure
#
#
#   Transport: rc_mlx5
#      Device: mlx5_0:1
uct_ib_iface_t_init:858 0x911f20
#
#      capabilities:
#            bandwidth: 0.00 + 10957.84 MB/sec
#              latency: 800 nsec + 1 * N
#             overhead: 40 nsec
#            put_short: <= 2K
#            put_bcopy: <= 8K
#            put_zcopy: <= 1G, up to 8 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 1K
#            get_bcopy: <= 8K
#            get_zcopy: 65..1G, up to 8 iov
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 1K
#             am_short: <= 2046
#             am_bcopy: <= 8190
#             am_zcopy: <= 8190, up to 3 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 1K
#            am header: <= 186
#               domain: device
#           atomic_add: 64 bit
#          atomic_fadd: 64 bit
#         atomic_cswap: 64 bit
#           connection: to ep
#             priority: 50
#       device address: 17 bytes
#           ep address: 7 bytes
#       error handling: buffer (zcopy), remote access, peer failure
#
#
#   Transport: ud_verbs
#      Device: mlx5_0:1
uct_ib_iface_t_init:858 0x911f20
uct_ib_verbs_create_cq:626 0x915580 0
uct_ib_verbs_create_cq:636 0x915580 0x911f20 0
uct_ib_verbs_create_cq:626 0x9157a0 0
uct_ib_verbs_create_cq:636 0x9157a0 0x911f20 0
#
#      capabilities:
#            bandwidth: 0.00 + 10957.84 MB/sec
#              latency: 810 nsec
#             overhead: 105 nsec
#             am_short: <= 116
#             am_bcopy: <= 1016
#             am_zcopy: <= 1016, up to 1 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 1K
#            am header: <= 912
#           connection: to ep, to iface
#             priority: 50
#       device address: 17 bytes
#        iface address: 3 bytes
#           ep address: 6 bytes
#       error handling: peer failure
#
#
#   Transport: ud_mlx5
#      Device: mlx5_0:1
uct_ib_iface_t_init:858 0x911f20
#
#      capabilities:
#            bandwidth: 0.00 + 10957.84 MB/sec
#              latency: 810 nsec
#             overhead: 80 nsec
#             am_short: <= 180
#             am_bcopy: <= 1016
#             am_zcopy: <= 1016, up to 3 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 1K
#            am header: <= 132
#           connection: to ep, to iface
#             priority: 50
#       device address: 17 bytes
#        iface address: 3 bytes
#           ep address: 6 bytes
#       error handling: peer failure
#
#
Artemy-Mellanox commented 4 years ago

@schluenz OK, that was the problem: on the CentOS 7.7 kernel, UCX wrongly detects DEVX support. @yosefe We will come up with a proper fix; meanwhile, UCX_IB_MLX5_DEVX=n is a workaround. @alex--m Please pull the master branch to remove my debug prints.
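
For anyone landing here with the same `ibv_create_cq` / `mlx5dv_create_cq` failures, the workaround above can be tried roughly as follows. This is a minimal sketch, assuming a POSIX shell and that `ucx_info` is on `PATH`; the `grep` filter is only an illustration for spotting transports that still fail, not part of UCX.

```shell
# Disable DEVX for this shell session (the workaround named in this thread).
export UCX_IB_MLX5_DEVX=n

# Re-run the device query only if ucx_info is available; otherwise just say so.
if command -v ucx_info >/dev/null 2>&1; then
    # Show transport headers and any remaining failure lines.
    ucx_info -d | grep -E 'Transport:|failed' || true
else
    echo "ucx_info not found on PATH"
fi
```

If the transports now list capabilities instead of `< failed to open interface >`, the same variable can be exported in the job environment until a release with the proper fix is installed.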

yosefe commented 4 years ago

Thanks @Artemy-Mellanox !

FaDee1 commented 4 years ago

ok, more complete info:

CentOS Linux release 7.7.1908 (Core) Linux max-exfl200.desy.de 3.10.0-1062.4.1.el7.x86_64 #1 SMP Fri Oct 18 17:15:30 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

rdma-core-22.1-3.el7.x86_64 libibverbs-22.1-3.el7.x86_64 ibv_devinfo at the end


System/RedHat ucx:
/usr/bin/ucx_info -v
# UCT version=1.4.0 revision 0000000
# configured with: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-optimizations --disable-logging --disable-debug --disable-assertions --disable-params-check

/usr/bin/ucx_info -d
#   Device: mlx5_0:1
#
#      capabilities:
[1575906412.846334] [max-exfl200:137500:0]       ib_iface.c:947  UCX  ERROR Invalid active_width on mlx5_0:1: 16
#   < failed to query interface >
#
#
#   Transport: ud
#
#   Device: mlx5_0:1
#
#      capabilities:
[1575906412.855268] [max-exfl200:137500:0]       ib_iface.c:947  UCX  ERROR Invalid active_width on mlx5_0:1: 16
#   < failed to query interface >
--------------------------------------------------------------------------------------------------------------------

Self-compiled 1.6.1:
/software/ucx/1.6.1/bin/ucx_info -v
# UCT version=1.6.1 revision
# configured with: --prefix=/software/ucx/1.6.1 --with-cuda=/usr/local/cuda-9.0 --enable-mt

#   Device: ib0
#
#      capabilities:
#            bandwidth: 11142.51 MB/sec
#              latency: 5206 nsec
#             overhead: 50000 nsec
#             am_short: <= 8K
#             am_bcopy: <= 8K
#           connection: to iface
#             priority: 1
#       device address: 4 bytes
#        iface address: 2 bytes
#       error handling: none
#
[1575906773.636572] [max-exfl200:138596:0]     ib_mlx5_dv.c:157  UCX  ERROR ibv_create_cq() failed: Invalid argument
# < failed to open memory domain ib/mlx5_0 >
--------------------------------------------------------------------------------------------------------------------

Self-compiled 1.7.0:
/software/ucx/1.7.0/bin/ucx_info -v
# UCT version=1.7.0 revision 7a5460d
# configured with: --prefix=/software/ucx/1.7.0

and self-compiled git-master:
/software/ucx/1.7.0/bin/ucx_info -v
# UCT version=1.8.0 revision 0dc670d
# configured with: --prefix=/software/ucx/1.7.0 --without-cuda

give the same result:

/software/ucx/1.7.0/bin/ucx_info -d
# Memory domain: mlx5_0
#     Component: ib
#             register: unlimited, cost: 90 nsec
#           remote key: 8 bytes
#           local memory handle is required for zcopy
#
#   Transport: rc_verbs
#      Device: mlx5_0:1
[1575906843.530270] [max-exfl200:138808:0]       ib_iface.c:629  UCX  ERROR ibv_create_cq(cqe=4096) failed: Invalid argument
#   < failed to open interface >
#
#   Transport: rc_mlx5
#      Device: mlx5_0:1
[1575906843.532303] [max-exfl200:138808:0]        ib_mlx5.c:73   UCX  ERROR mlx5dv_create_cq(cqe=4096) failed: Invalid argument
#   < failed to open interface >
#
#   Transport: dc_mlx5
#      Device: mlx5_0:1
[1575906843.534152] [max-exfl200:138808:0]        ib_mlx5.c:73   UCX  ERROR mlx5dv_create_cq(cqe=4096) failed: Invalid argument
#   < failed to open interface >
#
#   Transport: ud_verbs
#      Device: mlx5_0:1
[1575906843.534672] [max-exfl200:138808:0]       ib_iface.c:629  UCX  ERROR ibv_create_cq(cqe=256) failed: Invalid argument
#   < failed to open interface >
#
#   Transport: ud_mlx5
#      Device: mlx5_0:1
[1575906843.535087] [max-exfl200:138808:0]        ib_mlx5.c:73   UCX  ERROR mlx5dv_create_cq(cqe=256) failed: Invalid argument
#   < failed to open interface >
#
# Memory domain: rdmacm
#     Component: rdmacm
#           supports client-server connection establishment via sockaddr
#   < no supported devices found >

--------------------------------------------------------------------------------------------------------------------

ibv_devinfo -vv
hca_id:   mlx5_0
  transport:          InfiniBand (0)
  fw_ver:             20.26.1040
  node_guid:          b859:9f03:004e:bd14
  sys_image_guid:         b859:9f03:004e:bd14
  vendor_id:          0x02c9
  vendor_part_id:         4123
  hw_ver:             0x0
  board_id:           MT_0000000222
  phys_port_cnt:          1
  max_mr_size:            0xffffffffffffffff
  page_size_cap:          0xfffffffffffff000
  max_qp:             262144
  max_qp_wr:          32768
  device_cap_flags:       0xe97e1c36
                  BAD_PKEY_CNTR
                  BAD_QKEY_CNTR
                  AUTO_PATH_MIG
                  CHANGE_PHY_PORT
                  PORT_ACTIVE_EVENT
                  SYS_IMAGE_GUID
                  RC_RNR_NAK_GEN
                  MEM_WINDOW
                  UD_IP_CSUM
                  XRC
                  MEM_MGT_EXTENSIONS
                  MEM_WINDOW_TYPE_2B
                  MANAGED_FLOW_STEERING
                  Unknown flags: 0xC8480000
  max_sge:            30
  max_sge_rd:         30
  max_cq:             16777216
  max_cqe:            4194303
  max_mr:             16777216
  max_pd:             16777216
  max_qp_rd_atom:         16
  max_ee_rd_atom:         0
  max_res_rd_atom:        4194304
  max_qp_init_rd_atom:        16
  max_ee_init_rd_atom:        0
  atomic_cap:         ATOMIC_HCA (1)
  max_ee:             0
  max_rdd:            0
  max_mw:             16777216
  max_raw_ipv6_qp:        0
  max_raw_ethy_qp:        0
  max_mcast_grp:          2097152
  max_mcast_qp_attach:        240
  max_total_mcast_qp_attach:  503316480
  max_ah:             2147483647
  max_fmr:            0
  max_srq:            8388608
  max_srq_wr:         32767
  max_srq_sge:            31
  max_pkeys:          128
  local_ca_ack_delay:     16
  general_odp_caps:
                  ODP_SUPPORT
                  Unknown flags: 0x2
  rc_odp_caps:
                  SUPPORT_SEND
                  SUPPORT_RECV
                  SUPPORT_WRITE
                  SUPPORT_READ
  uc_odp_caps:
                  NO SUPPORT
  ud_odp_caps:
                  SUPPORT_SEND
  completion timestamp_mask:          0x7fffffffffffffff
  hca_core_clock:         156250kHZ
  device_cap_flags_ex:        0x11E97E1C36
                  PCI_WRITE_END_PADDING
                  Unknown flags: 0x100000000
  tso_caps:
  max_tso:            0
  rss_caps:
      max_rwq_indirection_tables:         0
      max_rwq_indirection_table_size:         0
      rx_hash_function:               0x0
      rx_hash_fields_mask:                0x0
  max_wq_type_rq:         0
  packet_pacing_caps:
      qp_rate_limit_min:  0kbps
      qp_rate_limit_max:  0kbps
  max_rndv_hdr_size:      64
  max_num_tags:           127
  max_ops:            32768
  max_sge:            1
  flags:
                  IBV_TM_CAP_RC

  cq moderation caps:
      max_cq_count:   65535
      max_cq_period:  4095 us

  maximum available device memory:    262144Bytes

      port:   1
          state:          PORT_ACTIVE (4)
          max_mtu:        4096 (5)
          active_mtu:     4096 (5)
          sm_lid:         17
          port_lid:       102
          port_lmc:       0x00
          link_layer:     InfiniBand
          max_msg_sz:     0x40000000
          port_cap_flags:     0x2259e848
          port_cap_flags2:    0x0000
          max_vl_num:     4 (3)
          bad_pkey_cntr:      0x0
          qkey_viol_cntr:     0x0
          sm_sl:          0
          pkey_tbl_len:       128
          gid_tbl_len:        8
          subnet_timeout:     18
          init_type_reply:    0
          active_width:       2X (16)
          active_speed:       50.0 Gbps (64)
          phys_state:     LINK_UP (5)
          GID[  0]:       fe80:0000:0000:0000:b859:9f03:004e:bd14