openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

UCX does not include InfiniBand when building with NVHPC compilers #8397

Open mcuma opened 1 year ago

mcuma commented 1 year ago

Describe the bug

We are building OpenMPI with UCX as the fabric, using the Spack package manager, on our clusters, which have various generations of Mellanox InfiniBand. We specify the verbs and mlx5 providers. When building with the GNU or Intel compilers, ucx_info correctly shows these, but with NVHPC it does not. I have verified in the Spack build logs that libibverbs and libmlx5 are found by configure and linked in both the GNU/Intel and NVHPC builds, which makes me perplexed as to why ucx_info does not report them as active in the NVHPC build.

In the NVHPC case, the OpenMPI built atop this UCX runs TCP over IB, resulting in ~20 us latencies, while the GCC/Intel builds run natively over IB with ~1.7 us latencies.

Any thoughts on this would be appreciated. I have also done the same build on our older CentOS 7 system, and the issue is present there too, so I don't think it is OS/driver-stack related.

I'll be happy to provide more info, but first I'd be curious to hear whether you have had similar reports in the past, or whether someone has successfully built UCX with NVHPC for IB.

Steps to Reproduce

Spack command: spack install openmpi@4.1.3%nvhpc@21.5~pmi target=nehalem fabrics=ucx +internal-hwloc+thread_multiple schedulers=slurm +legacylaunchers ^ucx +mlx5-dv+verbs+cm+ud+dc+rc+cma ^diffutils@3.7 ^perl@5.30.0

For the GNU build (on Rocky Linux 8), replace nvhpc@21.5 with gcc@8.5.0.

This results in the following configure arguments: --with-verbs=/usr --disable-mt --enable-cma --disable-params-check --without-avx --enable-optimizations --disable-assertions --disable-logging --with-pic --with-rc --with-ud --with-dc --with-mlx5-dv --without-ib-hw-tm --without-dm --with-cm --without-rocm --without-java --without-cuda --without-gdrcopy --without-knem --without-xpmem

NVHPC build:

$ ucx_info -v

UCT version=1.11.2 revision ef2bbcf

configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --prefix=/uufs/chpc.utah.edu/sys/spack/linux-rocky8-nehalem/nvhpc-21.5/ucx-1.11.2-asrhvd26hyucdhokcp6l5ufukmgxync7 --with-verbs=/usr --enable-mt --enable-cma --disable-params-check --without-avx --enable-optimizations --disable-assertions --disable-logging --with-pic --with-rc --with-ud --with-dc --with-mlx5-dv --without-ib-hw-tm --without-dm --with-cm --without-rocm --without-java --without-cuda --without-gdrcopy --without-knem --without-xpmem

GCC build:

$ ucx_info -v

UCT version=1.11.2 revision ef2bbcf

configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --prefix=/uufs/chpc.utah.edu/sys/spack/linux-rocky8-nehalem/gcc-8.5.0/ucx-1.11.2-etdhrh4gzoj2nroomy7bipr2p2e3ly4l --with-verbs=/usr --enable-mt --enable-cma --disable-params-check --without-avx --enable-optimizations --disable-assertions --disable-logging --with-pic --with-rc --with-ud --with-dc --with-mlx5-dv --without-ib-hw-tm --without-dm --with-cm --without-rocm --without-java --without-cuda --without-gdrcopy --without-knem --without-xpmem

$ /uufs/chpc.utah.edu/sys/spack/linux-rocky8-nehalem/nvhpc-21.7/ucx-1.11.2-jvwucdhwoqpn2xsttr55wgb5kzzbo32v/bin/ucx_info -d | grep verbs
(nothing)

$ /uufs/chpc.utah.edu/sys/spack/linux-rocky8-nehalem/gcc-8.5.0/ucx-1.11.2-ujc57b4cyrldztpxujdb7v3kaaww54tt/bin/ucx_info -d | grep verbs

Transport: rc_verbs

Transport: ud_verbs

$ /uufs/chpc.utah.edu/sys/spack/linux-rocky8-nehalem/nvhpc-21.7/ucx-1.11.2-jvwucdhwoqpn2xsttr55wgb5kzzbo32v/bin/ucx_info -d | grep mlx

< failed to open memory domain mlx4_0 >

$ /uufs/chpc.utah.edu/sys/spack/linux-rocky8-nehalem/gcc-8.5.0/ucx-1.11.2-ujc57b4cyrldztpxujdb7v3kaaww54tt/bin/ucx_info -d | grep mlx

Memory domain: mlx4_0

Device: mlx4_0:1

Device: mlx4_0:1
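For reference, a quick way to compare what two builds expose is to extract the transport names from ucx_info -d output and diff them. A minimal sh sketch (the here-doc input below is an abbreviated sample, not a full dump):

```shell
#!/bin/sh
# Extract unique transport names from a `ucx_info -d` dump, so that the
# lists from two build prefixes can be compared with diff.
list_transports() {
    awk -F': *' '/Transport:/ { print $2 }' | sort -u
}

# Sample input for illustration; in real use, pipe `ucx_info -d` in.
cat <<'EOF' | list_transports
#      Transport: tcp
#      Transport: rc_verbs
#      Transport: ud_verbs
#      Transport: tcp
EOF
```

Running this over the full dumps from both prefixes and diffing the results shows at a glance that rc_verbs and ud_verbs are absent from the NVHPC build.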

Setup and versions

$ cat /etc/os-release
NAME="Rocky Linux"
VERSION="8.5 (Green Obsidian)"
ID="rocky"
ID_LIKE="rhel centos fedora"
VERSION_ID="8.5"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Rocky Linux 8.5 (Green Obsidian)"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:rocky:rocky:8:GA"
HOME_URL="https://rockylinux.org/"
BUG_REPORT_URL="https://bugs.rockylinux.org/"
ROCKY_SUPPORT_PRODUCT="Rocky Linux"
ROCKY_SUPPORT_PRODUCT_VERSION="8"

$ uname -a
Linux notchpeak2 4.18.0-348.20.1.el8_5.x86_64 #1 SMP Thu Mar 10 20:59:28 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

$ rpm -q rdma-core
rdma-core-35.0-1.el8.x86_64

$ rpm -q libibverbs
libibverbs-35.0-1.el8.x86_64

$ ibv_devinfo -vv
hca_id: mlx4_0
    transport:          InfiniBand (0)
    fw_ver:             2.42.5000
    node_guid:          0002:c903:00a4:faa0
    sys_image_guid:     0002:c903:00a4:faa3
    vendor_id:          0x02c9
    vendor_part_id:     4099
    hw_ver:             0x1
    board_id:           MT_1090120019
    phys_port_cnt:      2
    max_mr_size:        0xffffffffffffffff
    page_size_cap:      0xfffffe00
    max_qp:             131000
    max_qp_wr:          16351
    device_cap_flags:   0x057e9c76
                        BAD_PKEY_CNTR
                        BAD_QKEY_CNTR
                        AUTO_PATH_MIG
                        CHANGE_PHY_PORT
                        UD_AV_PORT_ENFORCE
                        PORT_ACTIVE_EVENT
                        SYS_IMAGE_GUID
                        RC_RNR_NAK_GEN
                        MEM_WINDOW
                        UD_IP_CSUM
                        XRC
                        MEM_MGT_EXTENSIONS
                        MEM_WINDOW_TYPE_2B
                        RAW_IP_CSUM
                        Unknown flags: 0x488000
    max_sge:            32
    max_sge_rd:         30
    max_cq:             65408
    max_cqe:            4194303
    max_mr:             524032
    max_pd:             32764
    max_qp_rd_atom:     16
    max_ee_rd_atom:     0
    max_res_rd_atom:    2096000
    max_qp_init_rd_atom: 128
    max_ee_init_rd_atom: 0
    atomic_cap:         ATOMIC_HCA (1)
    max_ee:             0
    max_rdd:            0
    max_mw:             0
    max_raw_ipv6_qp:    0
    max_raw_ethy_qp:    0
    max_mcast_grp:      8192
    max_mcast_qp_attach: 248
    max_total_mcast_qp_attach: 2031616
    max_ah:             2147483647
    max_fmr:            0
    max_srq:            65472
    max_srq_wr:         16383
    max_srq_sge:        31
    max_pkeys:          128
    local_ca_ack_delay: 15
    general_odp_caps:
    rc_odp_caps:        NO SUPPORT
    uc_odp_caps:        NO SUPPORT
    ud_odp_caps:        NO SUPPORT
    xrc_odp_caps:       NO SUPPORT
    completion timestamp_mask: 0x0000ffffffffffff
    hca_core_clock:     427000kHZ
    device_cap_flags_ex: 0x57E9C76
    tso_caps:
        max_tso:        0
    rss_caps:
        max_rwq_indirection_tables:     0
        max_rwq_indirection_table_size: 0
        rx_hash_function:       0x0
        rx_hash_fields_mask:    0x0
    max_wq_type_rq:     0
    packet_pacing_caps:
        qp_rate_limit_min:  0kbps
        qp_rate_limit_max:  0kbps
    tag matching not supported

cq moderation caps:
    max_cq_count:   65535
    max_cq_period:  65535 us

num_comp_vectors:       64
    port:   1
        state:          PORT_ACTIVE (4)
        max_mtu:        4096 (5)
        active_mtu:     4096 (5)
        sm_lid:         1
        port_lid:       513
        port_lmc:       0x00
        link_layer:     InfiniBand
        max_msg_sz:     0x40000000
        port_cap_flags:     0x02594868
        port_cap_flags2:    0x0000
        max_vl_num:     8 (4)
        bad_pkey_cntr:      0x0
        qkey_viol_cntr:     0x0
        sm_sl:          0
        pkey_tbl_len:       128
        gid_tbl_len:        128
        subnet_timeout:     18
        init_type_reply:    0
        active_width:       4X (2)
        active_speed:       14.0 Gbps (16)
        phys_state:     LINK_UP (5)
        GID[  0]:       fe80:0000:0000:0000:0002:c903:00a4:faa1

    port:   2
        state:          PORT_DOWN (1)
        max_mtu:        4096 (5)
        active_mtu:     4096 (5)
        sm_lid:         0
        port_lid:       0
        port_lmc:       0x00
        link_layer:     InfiniBand
        max_msg_sz:     0x40000000
        port_cap_flags:     0x02594868
        port_cap_flags2:    0x0000
        max_vl_num:     8 (4)
        bad_pkey_cntr:      0x0
        qkey_viol_cntr:     0x0
        sm_sl:          0
        pkey_tbl_len:       128
        gid_tbl_len:        128
        subnet_timeout:     0
        init_type_reply:    0
        active_width:       4X (2)
        active_speed:       2.5 Gbps (1)
        phys_state:     POLLING (2)
        GID[  0]:       fe80:0000:0000:0000:0002:c903:00a4:faa2

Additional information (depending on the issue)

$ /uufs/chpc.utah.edu/sys/spack/linux-rocky8-nehalem/nvhpc-21.7/ucx-1.11.2-jvwucdhwoqpn2xsttr55wgb5kzzbo32v/bin/ucx_info -d
#

Memory domain: posix

Component: posix

allocate: unlimited

remote key: 24 bytes

rkey_ptr is supported

#

Transport: posix

Device: memory

System device:

#

capabilities:

bandwidth: 0.00/ppn + 12179.00 MB/sec

latency: 80 nsec

overhead: 10 nsec

put_short: <= 4294967295

put_bcopy: unlimited

get_bcopy: unlimited

am_short: <= 100

am_bcopy: <= 8256

domain: cpu

atomic_add: 32, 64 bit

atomic_and: 32, 64 bit

atomic_or: 32, 64 bit

atomic_xor: 32, 64 bit

atomic_fadd: 32, 64 bit

atomic_fand: 32, 64 bit

atomic_for: 32, 64 bit

atomic_fxor: 32, 64 bit

atomic_swap: 32, 64 bit

atomic_cswap: 32, 64 bit

connection: to iface

device priority: 0

device num paths: 1

max eps: inf

device address: 8 bytes

iface address: 8 bytes

error handling: ep_check

# #

Memory domain: sysv

Component: sysv

allocate: unlimited

remote key: 12 bytes

rkey_ptr is supported

#

Transport: sysv

Device: memory

System device:

#

capabilities:

bandwidth: 0.00/ppn + 12179.00 MB/sec

latency: 80 nsec

overhead: 10 nsec

put_short: <= 4294967295

put_bcopy: unlimited

get_bcopy: unlimited

am_short: <= 100

am_bcopy: <= 8256

domain: cpu

atomic_add: 32, 64 bit

atomic_and: 32, 64 bit

atomic_or: 32, 64 bit

atomic_xor: 32, 64 bit

atomic_fadd: 32, 64 bit

atomic_fand: 32, 64 bit

atomic_for: 32, 64 bit

atomic_fxor: 32, 64 bit

atomic_swap: 32, 64 bit

atomic_cswap: 32, 64 bit

connection: to iface

device priority: 0

device num paths: 1

max eps: inf

device address: 8 bytes

iface address: 8 bytes

error handling: ep_check

# #

Memory domain: self

Component: self

register: unlimited, cost: 0 nsec

remote key: 0 bytes

#

Transport: self

Device: memory0

System device:

#

capabilities:

bandwidth: 0.00/ppn + 6911.00 MB/sec

latency: 0 nsec

overhead: 10 nsec

put_short: <= 4294967295

put_bcopy: unlimited

get_bcopy: unlimited

am_short: <= 8K

am_bcopy: <= 8K

domain: cpu

atomic_add: 32, 64 bit

atomic_and: 32, 64 bit

atomic_or: 32, 64 bit

atomic_xor: 32, 64 bit

atomic_fadd: 32, 64 bit

atomic_fand: 32, 64 bit

atomic_for: 32, 64 bit

atomic_fxor: 32, 64 bit

atomic_swap: 32, 64 bit

atomic_cswap: 32, 64 bit

connection: to iface

device priority: 0

device num paths: 1

max eps: inf

device address: 0 bytes

iface address: 8 bytes

error handling: ep_check

# #

Memory domain: tcp

Component: tcp

register: unlimited, cost: 0 nsec

remote key: 0 bytes

#

Transport: tcp

Device: eth0

System device:

#

capabilities:

bandwidth: 113.16/ppn + 0.00 MB/sec

latency: 5776 nsec

overhead: 50000 nsec

put_zcopy: <= 18446744073709551590, up to 6 iov

put_opt_zcopy_align: <= 1

put_align_mtu: <= 0

am_short: <= 8K

am_bcopy: <= 8K

am_zcopy: <= 64K, up to 6 iov

am_opt_zcopy_align: <= 1

am_align_mtu: <= 0

am header: <= 8037

connection: to ep, to iface

device priority: 1

device num paths: 1

max eps: 256

device address: 6 bytes

iface address: 2 bytes

ep address: 10 bytes

error handling: peer failure, ep_check, keepalive

#

Transport: tcp

Device: lo

System device:

#

capabilities:

bandwidth: 11.91/ppn + 0.00 MB/sec

latency: 10960 nsec

overhead: 50000 nsec

put_zcopy: <= 18446744073709551590, up to 6 iov

put_opt_zcopy_align: <= 1

put_align_mtu: <= 0

am_short: <= 8K

am_bcopy: <= 8K

am_zcopy: <= 64K, up to 6 iov

am_opt_zcopy_align: <= 1

am_align_mtu: <= 0

am header: <= 8037

connection: to ep, to iface

device priority: 1

device num paths: 1

max eps: 256

device address: 18 bytes

iface address: 2 bytes

ep address: 10 bytes

error handling: peer failure, ep_check, keepalive

#

Transport: tcp

Device: eth0.26

System device:

#

capabilities:

bandwidth: 113.16/ppn + 0.00 MB/sec

latency: 5776 nsec

overhead: 50000 nsec

put_zcopy: <= 18446744073709551590, up to 6 iov

put_opt_zcopy_align: <= 1

put_align_mtu: <= 0

am_short: <= 8K

am_bcopy: <= 8K

am_zcopy: <= 64K, up to 6 iov

am_opt_zcopy_align: <= 1

am_align_mtu: <= 0

am header: <= 8037

connection: to ep, to iface

device priority: 0

device num paths: 1

max eps: 256

device address: 6 bytes

iface address: 2 bytes

ep address: 10 bytes

error handling: peer failure, ep_check, keepalive

#

Transport: tcp

Device: ib0

System device:

#

capabilities:

bandwidth: 6239.81/ppn + 0.00 MB/sec

latency: 5210 nsec

overhead: 50000 nsec

put_zcopy: <= 18446744073709551590, up to 6 iov

put_opt_zcopy_align: <= 1

put_align_mtu: <= 0

am_short: <= 8K

am_bcopy: <= 8K

am_zcopy: <= 64K, up to 6 iov

am_opt_zcopy_align: <= 1

am_align_mtu: <= 0

am header: <= 8037

connection: to ep, to iface

device priority: 1

device num paths: 1

max eps: 256

device address: 6 bytes

iface address: 2 bytes

ep address: 10 bytes

error handling: peer failure, ep_check, keepalive

# #

Connection manager: tcp

max_conn_priv: 2064 bytes

< failed to open memory domain mlx4_0 >

#

Connection manager: rdmacm

max_conn_priv: 54 bytes

#

Memory domain: cma

Component: cma

register: unlimited, cost: 9 nsec

#

Transport: cma

Device: memory

System device:

#

capabilities:

bandwidth: 0.00/ppn + 11145.00 MB/sec

latency: 80 nsec

overhead: 400 nsec

put_zcopy: unlimited, up to 16 iov

put_opt_zcopy_align: <= 1

put_align_mtu: <= 1

get_zcopy: unlimited, up to 16 iov

get_opt_zcopy_align: <= 1

get_align_mtu: <= 1

connection: to iface

device priority: 0

device num paths: 1

max eps: inf

device address: 8 bytes

iface address: 4 bytes

error handling: peer failure, ep_check

#

$ /uufs/chpc.utah.edu/sys/spack/linux-rocky8-nehalem/gcc-8.5.0/ucx-1.11.2-ujc57b4cyrldztpxujdb7v3kaaww54tt/bin/ucx_info -d
#

Memory domain: posix

Component: posix

allocate: unlimited

remote key: 24 bytes

rkey_ptr is supported

#

Transport: posix

Device: memory

System device:

#

capabilities:

bandwidth: 0.00/ppn + 12179.00 MB/sec

latency: 80 nsec

overhead: 10 nsec

put_short: <= 4294967295

put_bcopy: unlimited

get_bcopy: unlimited

am_short: <= 100

am_bcopy: <= 8256

domain: cpu

atomic_add: 32, 64 bit

atomic_and: 32, 64 bit

atomic_or: 32, 64 bit

atomic_xor: 32, 64 bit

atomic_fadd: 32, 64 bit

atomic_fand: 32, 64 bit

atomic_for: 32, 64 bit

atomic_fxor: 32, 64 bit

atomic_swap: 32, 64 bit

atomic_cswap: 32, 64 bit

connection: to iface

device priority: 0

device num paths: 1

max eps: inf

device address: 8 bytes

iface address: 8 bytes

error handling: ep_check

# #

Memory domain: sysv

Component: sysv

allocate: unlimited

remote key: 12 bytes

rkey_ptr is supported

#

Transport: sysv

Device: memory

System device:

#

capabilities:

bandwidth: 0.00/ppn + 12179.00 MB/sec

latency: 80 nsec

overhead: 10 nsec

put_short: <= 4294967295

put_bcopy: unlimited

get_bcopy: unlimited

am_short: <= 100

am_bcopy: <= 8256

domain: cpu

atomic_add: 32, 64 bit

atomic_and: 32, 64 bit

atomic_or: 32, 64 bit

atomic_xor: 32, 64 bit

atomic_fadd: 32, 64 bit

atomic_fand: 32, 64 bit

atomic_for: 32, 64 bit

atomic_fxor: 32, 64 bit

atomic_swap: 32, 64 bit

atomic_cswap: 32, 64 bit

connection: to iface

device priority: 0

device num paths: 1

max eps: inf

device address: 8 bytes

iface address: 8 bytes

error handling: ep_check

# #

Memory domain: self

Component: self

register: unlimited, cost: 0 nsec

remote key: 0 bytes

#

Transport: self

Device: memory0

System device:

#

capabilities:

bandwidth: 0.00/ppn + 6911.00 MB/sec

latency: 0 nsec

overhead: 10 nsec

put_short: <= 4294967295

put_bcopy: unlimited

get_bcopy: unlimited

am_short: <= 8K

am_bcopy: <= 8K

domain: cpu

atomic_add: 32, 64 bit

atomic_and: 32, 64 bit

atomic_or: 32, 64 bit

atomic_xor: 32, 64 bit

atomic_fadd: 32, 64 bit

atomic_fand: 32, 64 bit

atomic_for: 32, 64 bit

atomic_fxor: 32, 64 bit

atomic_swap: 32, 64 bit

atomic_cswap: 32, 64 bit

connection: to iface

device priority: 0

device num paths: 1

max eps: inf

device address: 0 bytes

iface address: 8 bytes

error handling: ep_check

# #

Memory domain: tcp

Component: tcp

register: unlimited, cost: 0 nsec

remote key: 0 bytes

#

Transport: tcp

Device: eth0

System device:

#

capabilities:

bandwidth: 113.16/ppn + 0.00 MB/sec

latency: 5776 nsec

overhead: 50000 nsec

put_zcopy: <= 18446744073709551590, up to 6 iov

put_opt_zcopy_align: <= 1

put_align_mtu: <= 0

am_short: <= 8K

am_bcopy: <= 8K

am_zcopy: <= 64K, up to 6 iov

am_opt_zcopy_align: <= 1

am_align_mtu: <= 0

am header: <= 8037

connection: to ep, to iface

device priority: 1

device num paths: 1

max eps: 256

device address: 6 bytes

iface address: 2 bytes

ep address: 10 bytes

error handling: peer failure, ep_check, keepalive

#

Transport: tcp

Device: lo

System device:

#

capabilities:

bandwidth: 11.91/ppn + 0.00 MB/sec

latency: 10960 nsec

overhead: 50000 nsec

put_zcopy: <= 18446744073709551590, up to 6 iov

put_opt_zcopy_align: <= 1

put_align_mtu: <= 0

am_short: <= 8K

am_bcopy: <= 8K

am_zcopy: <= 64K, up to 6 iov

am_opt_zcopy_align: <= 1

am_align_mtu: <= 0

am header: <= 8037

connection: to ep, to iface

device priority: 1

device num paths: 1

max eps: 256

device address: 18 bytes

iface address: 2 bytes

ep address: 10 bytes

error handling: peer failure, ep_check, keepalive

#

Transport: tcp

Device: eth0.26

System device:

#

capabilities:

bandwidth: 113.16/ppn + 0.00 MB/sec

latency: 5776 nsec

overhead: 50000 nsec

put_zcopy: <= 18446744073709551590, up to 6 iov

put_opt_zcopy_align: <= 1

put_align_mtu: <= 0

am_short: <= 8K

am_bcopy: <= 8K

am_zcopy: <= 64K, up to 6 iov

am_opt_zcopy_align: <= 1

am_align_mtu: <= 0

am header: <= 8037

connection: to ep, to iface

device priority: 0

device num paths: 1

max eps: 256

device address: 6 bytes

iface address: 2 bytes

ep address: 10 bytes

error handling: peer failure, ep_check, keepalive

#

Transport: tcp

Device: ib0

System device:

#

capabilities:

bandwidth: 6239.81/ppn + 0.00 MB/sec

latency: 5210 nsec

overhead: 50000 nsec

put_zcopy: <= 18446744073709551590, up to 6 iov

put_opt_zcopy_align: <= 1

put_align_mtu: <= 0

am_short: <= 8K

am_bcopy: <= 8K

am_zcopy: <= 64K, up to 6 iov

am_opt_zcopy_align: <= 1

am_align_mtu: <= 0

am header: <= 8037

connection: to ep, to iface

device priority: 1

device num paths: 1

max eps: 256

device address: 6 bytes

iface address: 2 bytes

ep address: 10 bytes

error handling: peer failure, ep_check, keepalive

# #

Connection manager: tcp

max_conn_priv: 2064 bytes

#

Memory domain: mlx4_0

Component: ib

register: unlimited, cost: 180 nsec

remote key: 8 bytes

local memory handle is required for zcopy

#

Transport: rc_verbs

Device: mlx4_0:1

System device: 0000:42:00.0 (0)

#

capabilities:

bandwidth: 6433.22/ppn + 0.00 MB/sec

latency: 700 + 1.000 * N nsec

overhead: 75 nsec

put_short: <= 88

put_bcopy: <= 8256

put_zcopy: <= 1G, up to 6 iov

put_opt_zcopy_align: <= 512

put_align_mtu: <= 2K

get_bcopy: <= 8256

get_zcopy: 65..1G, up to 6 iov

get_opt_zcopy_align: <= 512

get_align_mtu: <= 2K

am_short: <= 87

am_bcopy: <= 8255

am_zcopy: <= 8255, up to 5 iov

am_opt_zcopy_align: <= 512

am_align_mtu: <= 2K

am header: <= 127

domain: device

atomic_add: 64 bit

atomic_fadd: 64 bit

atomic_cswap: 64 bit

connection: to ep

device priority: 10

device num paths: 1

max eps: 256

device address: 4 bytes

ep address: 4 bytes

error handling: peer failure, ep_check

# #

Transport: ud_verbs

Device: mlx4_0:1

System device: 0000:42:00.0 (0)

#

capabilities:

bandwidth: 6433.22/ppn + 0.00 MB/sec

latency: 730 nsec

overhead: 105 nsec

am_short: <= 172

am_bcopy: <= 4088

am_zcopy: <= 4088, up to 8 iov

am_opt_zcopy_align: <= 512

am_align_mtu: <= 4K

am header: <= 3952

connection: to ep, to iface

device priority: 10

device num paths: 1

max eps: inf

device address: 4 bytes

iface address: 3 bytes

ep address: 6 bytes

error handling: peer failure, ep_check

# #

Connection manager: rdmacm

max_conn_priv: 54 bytes

#

Memory domain: cma

Component: cma

register: unlimited, cost: 9 nsec

#

Transport: cma

Device: memory

System device:

#

capabilities:

bandwidth: 0.00/ppn + 11145.00 MB/sec

latency: 80 nsec

overhead: 400 nsec

put_zcopy: unlimited, up to 16 iov

put_opt_zcopy_align: <= 1

put_align_mtu: <= 1

get_zcopy: unlimited, up to 16 iov

get_opt_zcopy_align: <= 1

get_align_mtu: <= 1

connection: to iface

device priority: 0

device num paths: 1

max eps: inf

device address: 8 bytes

iface address: 4 bytes

error handling: peer failure, ep_check

#

Configure output excerpts for mlx and verbs:

NVHPC:

$ grep mlx spack-build-02-configure-out.txt
checking infiniband/mlx5_hw.h usability... no
checking infiniband/mlx5_hw.h presence... no
checking for infiniband/mlx5_hw.h... no
checking for mlx5dv_query_device in -lmlx5-rdmav2... no
checking for mlx5dv_query_device in -lmlx5... yes
checking for infiniband/mlx5dv.h... yes
checking whether mlx5dv_init_obj is declared... yes
checking whether mlx5dv_create_qp is declared... yes
checking whether mlx5dv_is_supported is declared... yes
checking whether mlx5dv_devx_subscribe_devx_event is declared... yes
checking for struct mlx5dv_cq.cq_uar... yes
configure: Compiling with mlx5 bare-metal support
checking for struct mlx5_wqe_av.base... no
checking for struct mlx5_grh_av.rmac... no
checking for struct mlx5_cqe64.ib_stride_index... no

$ grep verbs spack-build-02-configure-out.txt
configure: Compiling with verbs support from /usr
checking infiniband/verbs.h usability... yes
checking infiniband/verbs.h presence... yes
checking for infiniband/verbs.h... yes
checking for ibv_get_device_list in -libverbs... yes
checking infiniband/verbs_exp.h usability... no
checking infiniband/verbs_exp.h presence... no
checking for infiniband/verbs_exp.h... no

GCC:

$ grep mlx spack-build-02-configure-out.txt
checking infiniband/mlx5_hw.h usability... no
checking infiniband/mlx5_hw.h presence... no
checking for infiniband/mlx5_hw.h... no
checking for mlx5dv_query_device in -lmlx5-rdmav2... no
checking for mlx5dv_query_device in -lmlx5... yes
checking for infiniband/mlx5dv.h... yes
checking whether mlx5dv_init_obj is declared... yes
checking whether mlx5dv_create_qp is declared... yes
checking whether mlx5dv_is_supported is declared... yes
checking whether mlx5dv_devx_subscribe_devx_event is declared... yes
checking for struct mlx5dv_cq.cq_uar... yes
configure: Compiling with mlx5 bare-metal support
checking for struct mlx5_wqe_av.base... no
checking for struct mlx5_grh_av.rmac... no
checking for struct mlx5_cqe64.ib_stride_index... no

$ grep verbs spack-build-02-configure-out.txt
configure: Compiling with verbs support from /usr
checking infiniband/verbs.h usability... yes
checking infiniband/verbs.h presence... yes
checking for infiniband/verbs.h... yes
checking for ibv_get_device_list in -libverbs... yes
checking infiniband/verbs_exp.h usability... no
checking infiniband/verbs_exp.h presence... no
checking for infiniband/verbs_exp.h... no

hoopoepg commented 1 year ago

hi @mcuma, thank you for the bug report.

Could you provide the complete output from the configure script, plus the config.h and config.log files from the UCX built in the NVHPC environment?

Is it possible to build a debug version of UCX (with logging enabled) in the NVHPC environment and provide the output of the command UCX_LOG_LEVEL=debug ucx_info -d?

thank you

tonycurtis commented 1 year ago

FWIW I've seen the nvidia/pgi compilers "intrude" on the environment. When the module was loaded, even ibv_devinfo was failing. It could be LD_LIBRARY_PATH or something similar that is causing the problem. Maybe ldd/lddtree the executable to see if unexpected libraries are being picked up? There are also various versions of the nvidia compilers that include/exclude an in-built (Open-)MPI.
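One way to check for that kind of intrusion is to walk LD_LIBRARY_PATH and see whether any entry shadows the system copy of a given library. A minimal sh sketch (the library name is only an example):

```shell
#!/bin/sh
# Report the first LD_LIBRARY_PATH entry that contains the named
# library, i.e. the copy the dynamic loader would pick up before the
# system default.
shadow_check() {
    lib=$1
    old_ifs=$IFS; IFS=:
    for d in $LD_LIBRARY_PATH; do
        if [ -e "$d/$lib" ]; then
            IFS=$old_ifs
            echo "$lib shadowed by $d/$lib"
            return 0
        fi
    done
    IFS=$old_ifs
    echo "$lib: no shadow on LD_LIBRARY_PATH (system copy will be used)"
}

shadow_check libibverbs.so.1
```

Running this with and without the compiler module loaded would show whether the module injects its own copies of the IB libraries.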

mcuma commented 1 year ago

Hi @hoopoepg ,

thanks for your reply and the debug suggestion. I am not sure how best to attach files to this ticket, so let me put them at a public link: https://home.chpc.utah.edu/~mcuma/debug/ucx/

In particular, the spack_build directory contains the requested files from the Spack build, debug_build contains the configure and ucx_info output from a build with the --enable-debug option, and nodebug_build contains the same output for the build without debug.

Interestingly, the build with --enable-debug includes the verbs provider, while the build without debug does not. The difference seems to be the optimization flag: --enable-debug forces -O0, while the non-debug build uses -O3.

If one adds the --enable-compiler-opt=1 (or =0) configure option to force -O1 or -O0, it also results in a correct IB build.

Would you like me to open a ticket with the NVHPC group to address this?

Though, even with --enable-compiler-opt=1, the OpenMPI built with this UCX complains when running:

$ mpirun -x UCX_NET_DEVICES=mlx5_0:1 -np 2 ./a.out
Process 0 on notch308
STARTING LATENCY AND BANDWIDTH BENCHMARK
Process 1 on notch309
[notch308:2480858] pml_ucx.c:908 Error: mca_pml_ucx_send_nbr failed: -2, No resources are available to initiate the operation
[notch308:2480858] An error occurred in MPI_Barrier
[notch308:2480858] reported by process [1877147649,0]
[notch308:2480858] on communicator MPI_COMM_WORLD
[notch308:2480858] MPI_ERR_OTHER: known error not in list
[notch308:2480858] MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[notch308:2480858] and potentially your MPI job)
[notch309:2758566] pml_ucx.c:908 Error: mca_pml_ucx_send_nbr failed: -2, No resources are available to initiate the operation
[notch308:2480848] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[notch308:2480848] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

It seems like the device is still not set up correctly. Any thoughts on this are appreciated.

Thanks.

hoopoepg commented 1 year ago

As far as I can see from the logs, all builds successfully detected the verbs library and enabled basic IB support.

Could you run the command ldd -r <PATH-WHERE-UCX-INSTALLED>/lib/ucx/libuct_ib.so on the UCX build which can't detect IB devices, and check that all dependencies are available?
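A sketch of what that check could look like, scanning ldd -r output for unresolved symbols (the here-doc input below is fabricated for illustration; in real use the input would be piped from ldd -r on the installed libuct_ib.so):

```shell
#!/bin/sh
# Flag "undefined symbol" lines in `ldd -r` output and exit nonzero if
# any were found.
flag_undef() {
    awk '/undefined symbol/ { bad = 1; print "MISSING:", $3 }
         END { exit bad ? 1 : 0 }'
}

# Fabricated sample input for illustration only.
cat <<'EOF' | flag_undef || echo "=> library has unresolved symbols"
        libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x00007f9bef67f000)
undefined symbol: ibv_exp_query_device        (./libuct_ib.so)
EOF
```

A clean library produces no output; any MISSING line points at a symbol the loader cannot resolve at dlopen time, which would make UCX silently skip the IB module.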

It seems like that device is still not set up correctly. Any thoughts on this are appreciated.

Yes, it could be the reason, but as far as I can see from your bug report (ibv_devinfo), the devices are configured properly. What is the output of the command ulimit -a?

mcuma commented 1 year ago

Hi Sergey,

the ldd seems to find the IB libraries correctly:

$ ldd -r /uufs/chpc.utah.edu/sys/spack/linux-rocky8-nehalem/nvhpc-21.5/ucx-1.11.2-asrhvd26hyucdhokcp6l5ufukmgxync7/lib/ucx/libuct_ib.so
        linux-vdso.so.1 (0x00007ffd3045e000)
        libibverbs.so.1 => /usr/lib/gcc/x86_64-redhat-linux/8/../../../../lib64/libibverbs.so.1 (0x00007f9bef67f000)
        libmlx5.so.1 => /usr/lib/gcc/x86_64-redhat-linux/8/../../../../lib64/libmlx5.so.1 (0x00007f9bef42c000)
        libuct.so.0 => /uufs/chpc.utah.edu/sys/spack/linux-rocky8-nehalem/nvhpc-21.5/ucx-1.11.2-asrhvd26hyucdhokcp6l5ufukmgxync7/lib/libuct.so.0 (0x00007f9bef1ef000)
        ....

The IB drivers should be OK, we have no issue with UCX built with GNU and Intel compilers, and IB works fine with MVAPICH2 and Intel MPI as well.

Here's the ulimit -a output; anything suspicious there? We did raise a few limits in the past to accommodate MPI buffers, etc.

$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 384899
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 16384
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 65535
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
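As a side note, the limits that most often break IB verbs are locked memory (for memory registration) and open files (for device file descriptors); a minimal sketch of a per-node sanity check:

```shell
#!/bin/sh
# Print the RDMA-relevant limits and warn if locked memory is capped,
# since a finite memlock limit can cause registration failures.
for limit in l n; do
    echo "ulimit -$limit = $(ulimit -$limit)"
done
case "$(ulimit -l)" in
    unlimited) echo "memlock: ok" ;;
    *)         echo "memlock is finite -- may cap RDMA registration" ;;
esac
```

With the limits shown above (memlock unlimited, 16384 open files), this check passes, so the limits look unlikely to be the culprit here.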

I am wondering whether you have had reports like this in the past, or whether you have a platform where you could try to reproduce what we're doing. It seems not too many sites use NVHPC with MPI/IB, so I don't have any contacts to cross-check this with.

Also, the OpenMPI supplied with the NVHPC suite does not have IB support built in, and neither does the UCX shipped with the Rocky Linux 8 that we run - at least ucx_info -d does not show it.

Thanks.

mcuma commented 1 year ago

One more tidbit that I found: on our older cluster that still runs CentOS 7, UCX builds correctly with NVHPC (i.e. it includes the verbs providers), but OpenMPI still crashes with the "No resources available" error.

I ended up with a workaround: build UCX with the stock OS gcc and then build OpenMPI against that UCX, which results in correct OpenMPI behavior.

So, I am good for now, but I am still wondering whether the issue is local to our OS setup, or simply reflects a lack of other sites building UCX with NVHPC and verbs.

hoopoepg commented 1 year ago

hi, glad to see you have a workaround for the issue. we will try to reproduce it locally (unfortunately we can't reproduce it for now) and try to identify the root cause.

thank you for your help