Closed tonycurtis closed 2 years ago
Running v5.8 of the OSU benchmarks, here's the MPI point-to-point bidirectional bandwidth (2 nodes, 1 rank per node).
With Open-MPI 4.1.2 + UCX 1.11.2
# OSU MPI Bi-Directional Bandwidth Test v5.8
# Size      Bandwidth (MB/s)
1                       1.35
2                       2.63
4                       5.37
8                      10.52
16                     20.86
32                     40.74
64                     65.34
128                   117.28
256                   176.25
512                   355.75
1024                  595.62
2048                 1051.08
4096                 1894.89
8192                 3552.10
16384                4096.02
32768                9514.92
65536               13732.91
131072              16355.22
262144              17819.39
524288              18648.07
1048576             19095.27
2097152             19146.01
4194304             18865.67
With Open-MPI 4.1.2 + UCX 1.12 (both 1.12.0 and 1.12.1-rc2)
# OSU MPI Bi-Directional Bandwidth Test v5.8
# Size      Bandwidth (MB/s)
1                       1.49
2                       2.87
4                       5.65
8                      11.63
16                     23.09
32                     44.72
64                     70.89
128                   127.54
256                   176.00
512                   380.11
1024                  653.22
2048                 1242.58
4096                 2153.03
8192                 3659.16
16384                3822.86
32768                3990.97
65536                4104.85
131072               4126.15
262144               4223.85
524288               4175.43
1048576              4224.67
2097152              3748.04
4194304             18545.39
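To make the gap between the two runs easier to see, here is a minimal Python sketch that divides the UCX 1.12 bandwidth by the UCX 1.11.2 bandwidth for each message size and lists the sizes where the newer run drops below half of the older one. The numbers are copied from the two tables above. The names `ucx_1_11`, `ucx_1_12` and `slowdowns`, and the 0.5 threshold, are illustrative choices and not part of the OSU benchmarks or UCX.

```python
# Bandwidth in MB/s keyed by message size in bytes, copied from the two runs above.
ucx_1_11 = {
    1: 1.35, 2: 2.63, 4: 5.37, 8: 10.52, 16: 20.86, 32: 40.74, 64: 65.34,
    128: 117.28, 256: 176.25, 512: 355.75, 1024: 595.62, 2048: 1051.08,
    4096: 1894.89, 8192: 3552.10, 16384: 4096.02, 32768: 9514.92,
    65536: 13732.91, 131072: 16355.22, 262144: 17819.39, 524288: 18648.07,
    1048576: 19095.27, 2097152: 19146.01, 4194304: 18865.67,
}
ucx_1_12 = {
    1: 1.49, 2: 2.87, 4: 5.65, 8: 11.63, 16: 23.09, 32: 44.72, 64: 70.89,
    128: 127.54, 256: 176.00, 512: 380.11, 1024: 653.22, 2048: 1242.58,
    4096: 2153.03, 8192: 3659.16, 16384: 3822.86, 32768: 3990.97,
    65536: 4104.85, 131072: 4126.15, 262144: 4223.85, 524288: 4175.43,
    1048576: 4224.67, 2097152: 3748.04, 4194304: 18545.39,
}


def slowdowns(old, new, threshold=0.5):
    """Return (size, old_bw, new_bw, ratio) rows where new/old is below threshold."""
    rows = []
    for size in sorted(old):
        ratio = new[size] / old[size]
        if ratio < threshold:
            rows.append((size, old[size], new[size], ratio))
    return rows


regressed = slowdowns(ucx_1_11, ucx_1_12)
print(f"{'size':>8} {'1.11.2':>10} {'1.12':>10} {'ratio':>6}")
for size, old_bw, new_bw, ratio in regressed:
    print(f"{size:>8} {old_bw:>10.2f} {new_bw:>10.2f} {ratio:>6.2f}")
```

With this threshold the script flags the message sizes from 32768 to 2097152 bytes. Sizes up to 16384 bytes and the 4194304-byte case stay close to parity between the two UCX versions.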
UCX_LOG_LEVEL=info shows rc_mlx5 is being used inter-node
cat /etc/issue or cat /etc/redhat-release + uname -a:
CentOS Linux release 8.1.1911 (Core)
Linux login1 4.18.0-147.el8.aarch64 #1 SMP Wed Dec 4 21:57:21 UTC 2019 aarch64 aarch64 aarch64 GNU/Linux
(aarch64 == a64fx)
$ ucx_info -d # # Memory domain: self # Component: self # register: unlimited, cost: 0 nsec # remote key: 0 bytes # # Transport: self # Device: memory0 # Type: loopback # System device: <unknown> # # capabilities: # bandwidth: 0.00/ppn + 6911.00 MB/sec # latency: 0 nsec # overhead: 10 nsec # put_short: <= 4294967295 # put_bcopy: unlimited # get_bcopy: unlimited # am_short: <= 8K # am_bcopy: <= 8K # domain: cpu # atomic_add: 32, 64 bit # atomic_and: 32, 64 bit # atomic_or: 32, 64 bit # atomic_xor: 32, 64 bit # atomic_fadd: 32, 64 bit # atomic_fand: 32, 64 bit # atomic_for: 32, 64 bit # atomic_fxor: 32, 64 bit # atomic_swap: 32, 64 bit # atomic_cswap: 32, 64 bit # connection: to iface # device priority: 0 # device num paths: 1 # max eps: inf # device address: 0 bytes # iface address: 8 bytes # error handling: ep_check # # # Memory domain: tcp # Component: tcp # register: unlimited, cost: 0 nsec # remote key: 0 bytes # # Transport: tcp # Device: enp11s0f1 # Type: network # System device: <unknown> # # capabilities: # bandwidth: 1131.64/ppn + 0.00 MB/sec # latency: 5258 nsec # overhead: 50000 nsec # put_zcopy: <= 18446744073709551590, up to 6 iov # put_opt_zcopy_align: <= 1 # put_align_mtu: <= 0 # am_short: <= 8K # am_bcopy: <= 8K # am_zcopy: <= 64K, up to 6 iov # am_opt_zcopy_align: <= 1 # am_align_mtu: <= 0 # am header: <= 8037 # connection: to ep, to iface # device priority: 1 # device num paths: 1 # max eps: 256 # device address: 6 bytes # iface address: 2 bytes # ep address: 10 bytes # error handling: peer failure, ep_check, keepalive # # Transport: tcp # Device: ib0 # Type: network # System device: <unknown> # # capabilities: # bandwidth: 4351.14/ppn + 0.00 MB/sec # latency: 5214 nsec # overhead: 50000 nsec # put_zcopy: <= 18446744073709551590, up to 6 iov # put_opt_zcopy_align: <= 1 # put_align_mtu: <= 0 # am_short: <= 8K # am_bcopy: <= 8K # am_zcopy: <= 64K, up to 6 iov # am_opt_zcopy_align: <= 1 # am_align_mtu: <= 0 # am header: <= 8037 # connection: to ep, to 
iface # device priority: 1 # device num paths: 1 # max eps: 256 # device address: 6 bytes # iface address: 2 bytes # ep address: 10 bytes # error handling: peer failure, ep_check, keepalive # # Transport: tcp # Device: lo # Type: network # System device: <unknown> # # capabilities: # bandwidth: 11.91/ppn + 0.00 MB/sec # latency: 10960 nsec # overhead: 50000 nsec # put_zcopy: <= 18446744073709551590, up to 6 iov # put_opt_zcopy_align: <= 1 # put_align_mtu: <= 0 # am_short: <= 8K # am_bcopy: <= 8K # am_zcopy: <= 64K, up to 6 iov # am_opt_zcopy_align: <= 1 # am_align_mtu: <= 0 # am header: <= 8037 # connection: to ep, to iface # device priority: 1 # device num paths: 1 # max eps: 256 # device address: 18 bytes # iface address: 2 bytes # ep address: 10 bytes # error handling: peer failure, ep_check, keepalive # # Transport: tcp # Device: enp11s0f0 # Type: network # System device: <unknown> # # capabilities: # bandwidth: 113.16/ppn + 0.00 MB/sec # latency: 5776 nsec # overhead: 50000 nsec # put_zcopy: <= 18446744073709551590, up to 6 iov # put_opt_zcopy_align: <= 1 # put_align_mtu: <= 0 # am_short: <= 8K # am_bcopy: <= 8K # am_zcopy: <= 64K, up to 6 iov # am_opt_zcopy_align: <= 1 # am_align_mtu: <= 0 # am header: <= 8037 # connection: to ep, to iface # device priority: 0 # device num paths: 1 # max eps: 256 # device address: 6 bytes # iface address: 2 bytes # ep address: 10 bytes # error handling: peer failure, ep_check, keepalive # # # Connection manager: tcp # max_conn_priv: 2064 bytes # # Memory domain: sysv # Component: sysv # allocate: unlimited # remote key: 12 bytes # rkey_ptr is supported # # Transport: sysv # Device: memory # Type: intra-node # System device: <unknown> # # capabilities: # bandwidth: 0.00/ppn + 12179.00 MB/sec # latency: 80 nsec # overhead: 10 nsec # put_short: <= 4294967295 # put_bcopy: unlimited # get_bcopy: unlimited # am_short: <= 100 # am_bcopy: <= 8256 # domain: cpu # atomic_add: 32, 64 bit # atomic_and: 32, 64 bit # atomic_or: 32, 64 bit 
# atomic_xor: 32, 64 bit # atomic_fadd: 32, 64 bit # atomic_fand: 32, 64 bit # atomic_for: 32, 64 bit # atomic_fxor: 32, 64 bit # atomic_swap: 32, 64 bit # atomic_cswap: 32, 64 bit # connection: to iface # device priority: 0 # device num paths: 1 # max eps: inf # device address: 8 bytes # iface address: 8 bytes # error handling: ep_check # # # Memory domain: posix # Component: posix # allocate: <= 133949824K # remote key: 24 bytes # rkey_ptr is supported # # Transport: posix # Device: memory # Type: intra-node # System device: <unknown> # # capabilities: # bandwidth: 0.00/ppn + 12179.00 MB/sec # latency: 80 nsec # overhead: 10 nsec # put_short: <= 4294967295 # put_bcopy: unlimited # get_bcopy: unlimited # am_short: <= 100 # am_bcopy: <= 8256 # domain: cpu # atomic_add: 32, 64 bit # atomic_and: 32, 64 bit # atomic_or: 32, 64 bit # atomic_xor: 32, 64 bit # atomic_fadd: 32, 64 bit # atomic_fand: 32, 64 bit # atomic_for: 32, 64 bit # atomic_fxor: 32, 64 bit # atomic_swap: 32, 64 bit # atomic_cswap: 32, 64 bit # connection: to iface # device priority: 0 # device num paths: 1 # max eps: inf # device address: 8 bytes # iface address: 8 bytes # error handling: ep_check # # # Memory domain: mlx5_0 # Component: ib # register: unlimited, cost: 180 nsec # remote key: 8 bytes # local memory handle is required for zcopy # memory invalidation is supported # # Transport: rc_verbs # Device: mlx5_0:1 # Type: network # System device: mlx5_0 (0) # # capabilities: # bandwidth: 3774.15/ppn + 0.00 MB/sec # latency: 1300 + 1.000 * N nsec # overhead: 75 nsec # put_short: <= 124 # put_bcopy: <= 8256 # put_zcopy: <= 1G, up to 5 iov # put_opt_zcopy_align: <= 512 # put_align_mtu: <= 4K # get_bcopy: <= 8256 # get_zcopy: 65..1G, up to 5 iov # get_opt_zcopy_align: <= 512 # get_align_mtu: <= 4K # am_short: <= 123 # am_bcopy: <= 8255 # am_zcopy: <= 8255, up to 4 iov # am_opt_zcopy_align: <= 512 # am_align_mtu: <= 4K # am header: <= 127 # domain: device # atomic_add: 64 bit # atomic_fadd: 64 bit # 
atomic_cswap: 64 bit # connection: to ep # device priority: 38 # device num paths: 1 # max eps: 256 # device address: 3 bytes # ep address: 5 bytes # error handling: peer failure, ep_check # # # Transport: rc_mlx5 # Device: mlx5_0:1 # Type: network # System device: mlx5_0 (0) # # capabilities: # bandwidth: 3774.15/ppn + 0.00 MB/sec # latency: 1300 + 1.000 * N nsec # overhead: 40 nsec # put_short: <= 2K # put_bcopy: <= 8256 # put_zcopy: <= 1G, up to 14 iov # put_opt_zcopy_align: <= 512 # put_align_mtu: <= 4K # get_bcopy: <= 8256 # get_zcopy: 65..1G, up to 14 iov # get_opt_zcopy_align: <= 512 # get_align_mtu: <= 4K # am_short: <= 2046 # am_bcopy: <= 8254 # am_zcopy: <= 8254, up to 3 iov # am_opt_zcopy_align: <= 512 # am_align_mtu: <= 4K # am header: <= 186 # domain: device # atomic_add: 32, 64 bit # atomic_and: 32, 64 bit # atomic_or: 32, 64 bit # atomic_xor: 32, 64 bit # atomic_fadd: 32, 64 bit # atomic_fand: 32, 64 bit # atomic_for: 32, 64 bit # atomic_fxor: 32, 64 bit # atomic_swap: 32, 64 bit # atomic_cswap: 32, 64 bit # connection: to ep # device priority: 38 # device num paths: 1 # max eps: 256 # device address: 3 bytes # ep address: 7 bytes # error handling: buffer (zcopy), remote access, peer failure, ep_check # # # Transport: dc_mlx5 # Device: mlx5_0:1 # Type: network # System device: mlx5_0 (0) # # capabilities: # bandwidth: 3774.15/ppn + 0.00 MB/sec # latency: 1360 nsec # overhead: 40 nsec # put_short: <= 2K # put_bcopy: <= 8256 # put_zcopy: <= 1G, up to 11 iov # put_opt_zcopy_align: <= 512 # put_align_mtu: <= 4K # get_bcopy: <= 8256 # get_zcopy: 65..1G, up to 11 iov # get_opt_zcopy_align: <= 512 # get_align_mtu: <= 4K # am_short: <= 2046 # am_bcopy: <= 8254 # am_zcopy: <= 8254, up to 3 iov # am_opt_zcopy_align: <= 512 # am_align_mtu: <= 4K # am header: <= 138 # domain: device # atomic_add: 32, 64 bit # atomic_and: 32, 64 bit # atomic_or: 32, 64 bit # atomic_xor: 32, 64 bit # atomic_fadd: 32, 64 bit # atomic_fand: 32, 64 bit # atomic_for: 32, 64 bit # 
atomic_fxor: 32, 64 bit # atomic_swap: 32, 64 bit # atomic_cswap: 32, 64 bit # connection: to iface # device priority: 38 # device num paths: 1 # max eps: inf # device address: 3 bytes # iface address: 5 bytes # error handling: buffer (zcopy), remote access, peer failure, ep_check # # # Transport: ud_verbs # Device: mlx5_0:1 # Type: network # System device: mlx5_0 (0) # # capabilities: # bandwidth: 3774.15/ppn + 0.00 MB/sec # latency: 1330 nsec # overhead: 105 nsec # am_short: <= 116 # am_bcopy: <= 4088 # am_zcopy: <= 4088, up to 5 iov # am_opt_zcopy_align: <= 512 # am_align_mtu: <= 4K # am header: <= 3952 # connection: to ep, to iface # device priority: 38 # device num paths: 1 # max eps: inf # device address: 3 bytes # iface address: 3 bytes # ep address: 6 bytes # error handling: peer failure, ep_check # # # Transport: ud_mlx5 # Device: mlx5_0:1 # Type: network # System device: mlx5_0 (0) # # capabilities: # bandwidth: 3774.15/ppn + 0.00 MB/sec # latency: 1330 nsec # overhead: 80 nsec # am_short: <= 180 # am_bcopy: <= 4088 # am_zcopy: <= 4088, up to 3 iov # am_opt_zcopy_align: <= 512 # am_align_mtu: <= 4K # am header: <= 132 # connection: to ep, to iface # device priority: 38 # device num paths: 1 # max eps: inf # device address: 3 bytes # iface address: 3 bytes # ep address: 6 bytes # error handling: peer failure, ep_check # # # Memory domain: mlx5_1 # Component: ib # register: unlimited, cost: 180 nsec # remote key: 8 bytes # local memory handle is required for zcopy # memory invalidation is supported # # Transport: rc_verbs # Device: mlx5_1:1 # Type: network # System device: mlx5_1 (1) # # capabilities: # bandwidth: 219.16/ppn + 0.00 MB/sec # latency: 5200 + 1.000 * N nsec # overhead: 75 nsec # put_short: <= 124 # put_bcopy: <= 8256 # put_zcopy: <= 1G, up to 5 iov # put_opt_zcopy_align: <= 512 # put_align_mtu: <= 1K # get_bcopy: <= 8256 # get_zcopy: 65..1G, up to 5 iov # get_opt_zcopy_align: <= 512 # get_align_mtu: <= 1K # am_short: <= 123 # am_bcopy: <= 8255 
# am_zcopy: <= 8255, up to 4 iov # am_opt_zcopy_align: <= 512 # am_align_mtu: <= 1K # am header: <= 127 # domain: device # atomic_add: 64 bit # atomic_fadd: 64 bit # atomic_cswap: 64 bit # connection: to ep # device priority: 28 # device num paths: 1 # max eps: 256 # device address: 18 bytes # ep address: 4 bytes # error handling: peer failure, ep_check # # # Transport: rc_mlx5 # Device: mlx5_1:1 # Type: network # System device: mlx5_1 (1) # # capabilities: # bandwidth: 219.16/ppn + 0.00 MB/sec # latency: 5200 + 1.000 * N nsec # overhead: 40 nsec # put_short: <= 220 # put_bcopy: <= 8256 # put_zcopy: <= 1G, up to 14 iov # put_opt_zcopy_align: <= 512 # put_align_mtu: <= 1K # get_bcopy: <= 8256 # get_zcopy: 65..1G, up to 14 iov # get_opt_zcopy_align: <= 512 # get_align_mtu: <= 1K # am_short: <= 234 # am_bcopy: <= 8254 # am_zcopy: <= 8254, up to 3 iov # am_opt_zcopy_align: <= 512 # am_align_mtu: <= 1K # am header: <= 186 # domain: device # atomic_add: 64 bit # atomic_fadd: 64 bit # atomic_cswap: 64 bit # connection: to ep # device priority: 28 # device num paths: 1 # max eps: 256 # device address: 18 bytes # ep address: 7 bytes # error handling: buffer (zcopy), remote access, peer failure, ep_check # # # Transport: ud_verbs # Device: mlx5_1:1 # Type: network # System device: mlx5_1 (1) # # capabilities: # bandwidth: 219.16/ppn + 0.00 MB/sec # latency: 5230 nsec # overhead: 105 nsec # am_short: <= 116 # am_bcopy: <= 1016 # am_zcopy: <= 1016, up to 5 iov # am_opt_zcopy_align: <= 512 # am_align_mtu: <= 1K # am header: <= 880 # connection: to ep, to iface # device priority: 28 # device num paths: 1 # max eps: inf # device address: 18 bytes # iface address: 3 bytes # ep address: 6 bytes # error handling: peer failure, ep_check # # # Transport: ud_mlx5 # Device: mlx5_1:1 # Type: network # System device: mlx5_1 (1) # # capabilities: # bandwidth: 219.16/ppn + 0.00 MB/sec # latency: 5230 nsec # overhead: 80 nsec # am_short: <= 180 # am_bcopy: <= 1016 # am_zcopy: <= 1016, up to 
3 iov # am_opt_zcopy_align: <= 512 # am_align_mtu: <= 1K # am header: <= 132 # connection: to ep, to iface # device priority: 28 # device num paths: 1 # max eps: inf # device address: 18 bytes # iface address: 3 bytes # ep address: 6 bytes # error handling: peer failure, ep_check # # # Memory domain: mlx5_2 # Component: ib # register: unlimited, cost: 180 nsec # remote key: 8 bytes # local memory handle is required for zcopy # memory invalidation is supported # # Transport: rc_verbs # Device: mlx5_2:1 # Type: network # System device: mlx5_2 (2) # # capabilities: # bandwidth: 1095.78/ppn + 0.00 MB/sec # latency: 1500 + 1.000 * N nsec # overhead: 75 nsec # put_short: <= 124 # put_bcopy: <= 8256 # put_zcopy: <= 1G, up to 5 iov # put_opt_zcopy_align: <= 512 # put_align_mtu: <= 1K # get_bcopy: <= 8256 # get_zcopy: 65..1G, up to 5 iov # get_opt_zcopy_align: <= 512 # get_align_mtu: <= 1K # am_short: <= 123 # am_bcopy: <= 8255 # am_zcopy: <= 8255, up to 4 iov # am_opt_zcopy_align: <= 512 # am_align_mtu: <= 1K # am header: <= 127 # domain: device # atomic_add: 64 bit # atomic_fadd: 64 bit # atomic_cswap: 64 bit # connection: to ep # device priority: 28 # device num paths: 1 # max eps: 256 # device address: 18 bytes # ep address: 4 bytes # error handling: peer failure, ep_check # # # Transport: rc_mlx5 # Device: mlx5_2:1 # Type: network # System device: mlx5_2 (2) # # capabilities: # bandwidth: 1095.78/ppn + 0.00 MB/sec # latency: 1500 + 1.000 * N nsec # overhead: 40 nsec # put_short: <= 220 # put_bcopy: <= 8256 # put_zcopy: <= 1G, up to 14 iov # put_opt_zcopy_align: <= 512 # put_align_mtu: <= 1K # get_bcopy: <= 8256 # get_zcopy: 65..1G, up to 14 iov # get_opt_zcopy_align: <= 512 # get_align_mtu: <= 1K # am_short: <= 234 # am_bcopy: <= 8254 # am_zcopy: <= 8254, up to 3 iov # am_opt_zcopy_align: <= 512 # am_align_mtu: <= 1K # am header: <= 186 # domain: device # atomic_add: 64 bit # atomic_fadd: 64 bit # atomic_cswap: 64 bit # connection: to ep # device priority: 28 # device 
num paths: 1 # max eps: 256 # device address: 18 bytes # ep address: 7 bytes # error handling: buffer (zcopy), remote access, peer failure, ep_check # # # Transport: ud_verbs # Device: mlx5_2:1 # Type: network # System device: mlx5_2 (2) # # capabilities: # bandwidth: 1095.78/ppn + 0.00 MB/sec # latency: 1530 nsec # overhead: 105 nsec # am_short: <= 116 # am_bcopy: <= 1016 # am_zcopy: <= 1016, up to 5 iov # am_opt_zcopy_align: <= 512 # am_align_mtu: <= 1K # am header: <= 880 # connection: to ep, to iface # device priority: 28 # device num paths: 1 # max eps: inf # device address: 18 bytes # iface address: 3 bytes # ep address: 6 bytes # error handling: peer failure, ep_check # # # Transport: ud_mlx5 # Device: mlx5_2:1 # Type: network # System device: mlx5_2 (2) # # capabilities: # bandwidth: 1095.78/ppn + 0.00 MB/sec # latency: 1530 nsec # overhead: 80 nsec # am_short: <= 180 # am_bcopy: <= 1016 # am_zcopy: <= 1016, up to 3 iov # am_opt_zcopy_align: <= 512 # am_align_mtu: <= 1K # am header: <= 132 # connection: to ep, to iface # device priority: 28 # device num paths: 1 # max eps: inf # device address: 18 bytes # iface address: 3 bytes # ep address: 6 bytes # error handling: peer failure, ep_check # # # Memory domain: mlx5_3 # Component: ib # register: unlimited, cost: 180 nsec # remote key: 8 bytes # local memory handle is required for zcopy # memory invalidation is supported # # Transport: rc_verbs # Device: mlx5_3:1 # Type: network # System device: mlx5_3 (3) # # capabilities: # bandwidth: 3774.15/ppn + 0.00 MB/sec # latency: 1300 + 1.000 * N nsec # overhead: 75 nsec # put_short: <= 124 # put_bcopy: <= 8256 # put_zcopy: <= 1G, up to 5 iov # put_opt_zcopy_align: <= 512 # put_align_mtu: <= 4K # get_bcopy: <= 8256 # get_zcopy: 65..1G, up to 5 iov # get_opt_zcopy_align: <= 512 # get_align_mtu: <= 4K # am_short: <= 123 # am_bcopy: <= 8255 # am_zcopy: <= 8255, up to 4 iov # am_opt_zcopy_align: <= 512 # am_align_mtu: <= 4K # am header: <= 127 # domain: device # 
atomic_add: 64 bit # atomic_fadd: 64 bit # atomic_cswap: 64 bit # connection: to ep # device priority: 38 # device num paths: 1 # max eps: 256 # device address: 3 bytes # ep address: 5 bytes # error handling: peer failure, ep_check # # # Transport: rc_mlx5 # Device: mlx5_3:1 # Type: network # System device: mlx5_3 (3) # # capabilities: # bandwidth: 3774.15/ppn + 0.00 MB/sec # latency: 1300 + 1.000 * N nsec # overhead: 40 nsec # put_short: <= 2K # put_bcopy: <= 8256 # put_zcopy: <= 1G, up to 14 iov # put_opt_zcopy_align: <= 512 # put_align_mtu: <= 4K # get_bcopy: <= 8256 # get_zcopy: 65..1G, up to 14 iov # get_opt_zcopy_align: <= 512 # get_align_mtu: <= 4K # am_short: <= 2046 # am_bcopy: <= 8254 # am_zcopy: <= 8254, up to 3 iov # am_opt_zcopy_align: <= 512 # am_align_mtu: <= 4K # am header: <= 186 # domain: device # atomic_add: 32, 64 bit # atomic_and: 32, 64 bit # atomic_or: 32, 64 bit # atomic_xor: 32, 64 bit # atomic_fadd: 32, 64 bit # atomic_fand: 32, 64 bit # atomic_for: 32, 64 bit # atomic_fxor: 32, 64 bit # atomic_swap: 32, 64 bit # atomic_cswap: 32, 64 bit # connection: to ep # device priority: 38 # device num paths: 1 # max eps: 256 # device address: 3 bytes # ep address: 7 bytes # error handling: buffer (zcopy), remote access, peer failure, ep_check # # # Transport: dc_mlx5 # Device: mlx5_3:1 # Type: network # System device: mlx5_3 (3) # # capabilities: # bandwidth: 3774.15/ppn + 0.00 MB/sec # latency: 1360 nsec # overhead: 40 nsec # put_short: <= 2K # put_bcopy: <= 8256 # put_zcopy: <= 1G, up to 11 iov # put_opt_zcopy_align: <= 512 # put_align_mtu: <= 4K # get_bcopy: <= 8256 # get_zcopy: 65..1G, up to 11 iov # get_opt_zcopy_align: <= 512 # get_align_mtu: <= 4K # am_short: <= 2046 # am_bcopy: <= 8254 # am_zcopy: <= 8254, up to 3 iov # am_opt_zcopy_align: <= 512 # am_align_mtu: <= 4K # am header: <= 138 # domain: device # atomic_add: 32, 64 bit # atomic_and: 32, 64 bit # atomic_or: 32, 64 bit # atomic_xor: 32, 64 bit # atomic_fadd: 32, 64 bit # atomic_fand: 
32, 64 bit # atomic_for: 32, 64 bit # atomic_fxor: 32, 64 bit # atomic_swap: 32, 64 bit # atomic_cswap: 32, 64 bit # connection: to iface # device priority: 38 # device num paths: 1 # max eps: inf # device address: 3 bytes # iface address: 5 bytes # error handling: buffer (zcopy), remote access, peer failure, ep_check # # # Transport: ud_verbs # Device: mlx5_3:1 # Type: network # System device: mlx5_3 (3) # # capabilities: # bandwidth: 3774.15/ppn + 0.00 MB/sec # latency: 1330 nsec # overhead: 105 nsec # am_short: <= 116 # am_bcopy: <= 4088 # am_zcopy: <= 4088, up to 5 iov # am_opt_zcopy_align: <= 512 # am_align_mtu: <= 4K # am header: <= 3952 # connection: to ep, to iface # device priority: 38 # device num paths: 1 # max eps: inf # device address: 3 bytes # iface address: 3 bytes # ep address: 6 bytes # error handling: peer failure, ep_check # # # Transport: ud_mlx5 # Device: mlx5_3:1 # Type: network # System device: mlx5_3 (3) # # capabilities: # bandwidth: 3774.15/ppn + 0.00 MB/sec # latency: 1330 nsec # overhead: 80 nsec # am_short: <= 180 # am_bcopy: <= 4088 # am_zcopy: <= 4088, up to 3 iov # am_opt_zcopy_align: <= 512 # am_align_mtu: <= 4K # am header: <= 132 # connection: to ep, to iface # device priority: 38 # device num paths: 1 # max eps: inf # device address: 3 bytes # iface address: 3 bytes # ep address: 6 bytes # error handling: peer failure, ep_check # # # Connection manager: rdmacm # max_conn_priv: 54 bytes # # Memory domain: cma # Component: cma # register: unlimited, cost: 9 nsec # # Transport: cma # Device: memory # Type: intra-node # System device: <unknown> # # capabilities: # bandwidth: 0.00/ppn + 11145.00 MB/sec # latency: 80 nsec # overhead: 2000 nsec # put_zcopy: unlimited, up to 16 iov # put_opt_zcopy_align: <= 1 # put_align_mtu: <= 1 # get_zcopy: unlimited, up to 16 iov # get_opt_zcopy_align: <= 1 # get_align_mtu: <= 1 # connection: to iface # device priority: 0 # device num paths: 1 # max eps: inf # device address: 8 bytes # iface 
address: 4 bytes # error handling: peer failure, ep_check # # # Memory domain: knem # Component: knem # register: unlimited, cost: 180 nsec # remote key: 16 bytes # # Transport: knem # Device: memory # Type: intra-node # System device: <unknown> # # capabilities: # bandwidth: 13862.00/ppn + 0.00 MB/sec # latency: 80 nsec # overhead: 2000 nsec # put_zcopy: unlimited, up to 16 iov # put_opt_zcopy_align: <= 1 # put_align_mtu: <= 1 # get_zcopy: unlimited, up to 16 iov # get_opt_zcopy_align: <= 1 # get_align_mtu: <= 1 # connection: to iface # device priority: 0 # device num paths: 1 # max eps: inf # device address: 8 bytes # iface address: 0 bytes # error handling: none # # # Memory domain: xpmem # Component: xpmem # register: unlimited, cost: 60 nsec # remote key: 24 bytes # rkey_ptr is supported # # Transport: xpmem # Device: memory # Type: intra-node # System device: <unknown> # # capabilities: # bandwidth: 0.00/ppn + 12179.00 MB/sec # latency: 80 nsec # overhead: 10 nsec # put_short: <= 4294967295 # put_bcopy: unlimited # get_bcopy: unlimited # am_short: <= 100 # am_bcopy: <= 8256 # domain: cpu # atomic_add: 32, 64 bit # atomic_and: 32, 64 bit # atomic_or: 32, 64 bit # atomic_xor: 32, 64 bit # atomic_fadd: 32, 64 bit # atomic_fand: 32, 64 bit # atomic_for: 32, 64 bit # atomic_fxor: 32, 64 bit # atomic_swap: 32, 64 bit # atomic_cswap: 32, 64 bit # connection: to iface # device priority: 0 # device num paths: 1 # max eps: inf # device address: 8 bytes # iface address: 16 bytes # error handling: none #
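The device listing above is long, so a condensed view of transport, device and advertised bandwidth can help when comparing HCAs. The following is a rough Python sketch that pulls those three fields out of `ucx_info -d` text with a regular expression. It assumes the `Transport:`, `Device:` and `bandwidth: X/ppn + Y MB/sec` fields appear in that order, as they do in the output above; the names `ENTRY` and `summarize` are hypothetical, and `sample` is a short excerpt of the listing rather than the full dump.

```python
import re

# One transport entry: transport name, device name, then the first bandwidth field.
# "#" and whitespace are both accepted as separators, so the pattern works on the
# normal multi-line output as well as on text pasted as a single line.
ENTRY = re.compile(
    r"Transport:\s*(?P<tl>\S+)[\s#]+Device:\s*(?P<dev>\S+)"
    r".*?bandwidth:\s*(?P<ppn>[\d.]+)/ppn \+ (?P<shared>[\d.]+) MB/sec",
    re.DOTALL,
)


def summarize(text):
    """Return (transport, device, per-process MB/s, shared MB/s) for each entry."""
    return [
        (m["tl"], m["dev"], float(m["ppn"]), float(m["shared"]))
        for m in ENTRY.finditer(text)
    ]


# Excerpt of the listing above (two rc_mlx5 entries), used as sample input.
sample = (
    "# Transport: rc_mlx5 # Device: mlx5_0:1 # Type: network "
    "# System device: mlx5_0 (0) # # capabilities: "
    "# bandwidth: 3774.15/ppn + 0.00 MB/sec # latency: 1300 + 1.000 * N nsec "
    "# Transport: rc_mlx5 # Device: mlx5_1:1 # Type: network "
    "# System device: mlx5_1 (1) # # capabilities: "
    "# bandwidth: 219.16/ppn + 0.00 MB/sec # latency: 5200 + 1.000 * N nsec #"
)

for tl, dev, ppn, shared in summarize(sample):
    print(f"{tl:<10} {dev:<10} {ppn:>10.2f} {shared:>10.2f}")
```

To run it on a full listing, the saved output of `ucx_info -d` can be read from a file and passed to `summarize` in place of `sample`.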
$ ompi_info Package: Open MPI arcurtis@login1 Distribution Open MPI: 4.1.2 Open MPI repo revision: v4.1.2 Open MPI release date: Nov 24, 2021 Open RTE: 4.1.2 Open RTE repo revision: v4.1.2 Open RTE release date: Nov 24, 2021 OPAL: 4.1.2 OPAL repo revision: v4.1.2 OPAL release date: Nov 24, 2021 MPI API: 3.1.0 Ident string: 4.1.2 Prefix: /lustre/home/arcurtis/opt/./openmpi/gcc8/4.1.2 Configured architecture: aarch64-unknown-linux-gnu Configure host: login1 Configured by: arcurtis Configured on: Tue Feb 15 16:15:28 UTC 2022 Configure host: login1 Configure command line: '--prefix=/lustre/home/arcurtis/opt/./openmpi/gcc8/4.1.2' '--with-knem=/opt/knem-1.1.3.90mlnx1' '--with-xpmem=/opt/xpmem' '--with-lustre' '--without-cuda' '--enable-mpi1-compatibility' '--disable-debug' '--with-libevent=internal' '--with-pmix=internal' '--with-hwloc=internal' '--with-hcoll=/opt/mellanox/hcoll' '--enable-mca-no-build=btl-uct' '--enable-mpi-fortran=yes' '--enable-oshmem-fortran=no' '--with-ucx=/lustre/software/ucx/1.11.2' '--enable-orterun-prefix-by-default' '--without-verbs' '--without-ofi' Built by: arcurtis Built on: Tue Feb 15 16:26:50 UTC 2022 Built host: login1 C bindings: yes C++ bindings: no Fort mpif.h: yes (all) Fort use mpi: yes (full: ignore TKR) Fort use mpi size: deprecated-ompi-info-value Fort use mpi_f08: yes Fort mpi_f08 compliance: The mpi_f08 module is available, but due to limitations in the gfortran compiler and/or Open MPI, does not support the following: array subsections, direct passthru (where possible) to underlying Open MPI's C functionality Fort mpi_f08 subarrays: no Java bindings: no Wrapper compiler rpath: runpath C compiler: gcc C compiler absolute: /usr/bin/gcc C compiler family name: GNU C compiler version: 8.4.1 C++ compiler: g++ C++ compiler absolute: /usr/bin/g++ Fort compiler: gfortran Fort compiler abs: /usr/bin/gfortran Fort ignore TKR: yes (!GCC$ ATTRIBUTES NO_ARG_CHECK ::) Fort 08 assumed shape: yes Fort optional args: yes Fort INTERFACE: yes 
Fort ISO_FORTRAN_ENV: yes Fort STORAGE_SIZE: yes Fort BIND(C) (all): yes Fort ISO_C_BINDING: yes Fort SUBROUTINE BIND(C): yes Fort TYPE,BIND(C): yes Fort T,BIND(C,name="a"): yes Fort PRIVATE: yes Fort PROTECTED: yes Fort ABSTRACT: yes Fort ASYNCHRONOUS: yes Fort PROCEDURE: yes Fort USE...ONLY: yes Fort C_FUNLOC: yes Fort f08 using wrappers: yes Fort MPI_SIZEOF: yes C profiling: yes C++ profiling: no Fort mpif.h profiling: yes Fort use mpi profiling: yes Fort use mpi_f08 prof: yes C++ exceptions: no Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes, OMPI progress: no, ORTE progress: yes, Event lib: yes) Sparse Groups: no Internal debug support: no MPI interface warnings: yes MPI parameter check: runtime Memory profiling support: no Memory debugging support: no dl support: yes Heterogeneous support: no mpirun default --prefix: yes MPI_WTIME support: native Symbol vis. support: yes Host topology support: yes IPv6 support: no MPI1 compatibility: yes MPI extensions: affinity, cuda, pcollreq FT Checkpoint support: no (checkpoint thread: no) C/R Enabled Debugging: no MPI_MAX_PROCESSOR_NAME: 256 MPI_MAX_ERROR_STRING: 256 MPI_MAX_OBJECT_NAME: 64 MPI_MAX_INFO_KEY: 36 MPI_MAX_INFO_VAL: 256 MPI_MAX_PORT_NAME: 1024 MPI_MAX_DATAREP_STRING: 128 MCA allocator: bucket (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA allocator: basic (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA backtrace: execinfo (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA btl: vader (MCA v2.1.0, API v3.1.0, Component v4.1.2) MCA btl: tcp (MCA v2.1.0, API v3.1.0, Component v4.1.2) MCA btl: self (MCA v2.1.0, API v3.1.0, Component v4.1.2) MCA compress: bzip (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA compress: gzip (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA crs: none (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA dl: dlopen (MCA v2.1.0, API v1.0.0, Component v4.1.2) MCA event: libevent2022 (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA hwloc: hwloc201 (MCA v2.1.0, API v2.0.0, Component 
v4.1.2) MCA if: linux_ipv6 (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA if: posix_ipv4 (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA installdirs: env (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA installdirs: config (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA memory: patcher (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA mpool: hugepage (MCA v2.1.0, API v3.0.0, Component v4.1.2) MCA patcher: overwrite (MCA v2.1.0, API v1.0.0, Component v4.1.2) MCA pmix: isolated (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA pmix: flux (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA pmix: pmix3x (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA pstat: test (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA rcache: grdma (MCA v2.1.0, API v3.3.0, Component v4.1.2) MCA reachable: weighted (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA reachable: netlink (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA shmem: mmap (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA shmem: posix (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA shmem: sysv (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA timer: linux (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA errmgr: default_app (MCA v2.1.0, API v3.0.0, Component v4.1.2) MCA errmgr: default_hnp (MCA v2.1.0, API v3.0.0, Component v4.1.2) MCA errmgr: default_tool (MCA v2.1.0, API v3.0.0, Component v4.1.2) MCA errmgr: default_orted (MCA v2.1.0, API v3.0.0, Component v4.1.2) MCA ess: singleton (MCA v2.1.0, API v3.0.0, Component v4.1.2) MCA ess: tool (MCA v2.1.0, API v3.0.0, Component v4.1.2) MCA ess: env (MCA v2.1.0, API v3.0.0, Component v4.1.2) MCA ess: slurm (MCA v2.1.0, API v3.0.0, Component v4.1.2) MCA ess: hnp (MCA v2.1.0, API v3.0.0, Component v4.1.2) MCA ess: pmi (MCA v2.1.0, API v3.0.0, Component v4.1.2) MCA filem: raw (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA grpcomm: direct (MCA v2.1.0, API v3.0.0, Component v4.1.2) MCA iof: orted (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA iof: hnp (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA iof: tool (MCA 
v2.1.0, API v2.0.0, Component v4.1.2) MCA odls: default (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA odls: pspawn (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA oob: tcp (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA plm: slurm (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA plm: rsh (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA plm: isolated (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA ras: simulator (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA regx: reverse (MCA v2.1.0, API v1.0.0, Component v4.1.2) MCA regx: fwd (MCA v2.1.0, API v1.0.0, Component v4.1.2) MCA regx: naive (MCA v2.1.0, API v1.0.0, Component v4.1.2) MCA rmaps: round_robin (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA rmaps: seq (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA rmaps: mindist (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA rmaps: rank_file (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA rmaps: resilient (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA rmaps: ppr (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA rml: oob (MCA v2.1.0, API v3.0.0, Component v4.1.2) MCA routed: binomial (MCA v2.1.0, API v3.0.0, Component v4.1.2) MCA routed: direct (MCA v2.1.0, API v3.0.0, Component v4.1.2) MCA routed: radix (MCA v2.1.0, API v3.0.0, Component v4.1.2) MCA rtc: hwloc (MCA v2.1.0, API v1.0.0, Component v4.1.2) MCA schizo: slurm (MCA v2.1.0, API v1.0.0, Component v4.1.2) MCA schizo: flux (MCA v2.1.0, API v1.0.0, Component v4.1.2) MCA schizo: orte (MCA v2.1.0, API v1.0.0, Component v4.1.2) MCA schizo: jsm (MCA v2.1.0, API v1.0.0, Component v4.1.2) MCA schizo: ompi (MCA v2.1.0, API v1.0.0, Component v4.1.2) MCA state: hnp (MCA v2.1.0, API v1.0.0, Component v4.1.2) MCA state: tool (MCA v2.1.0, API v1.0.0, Component v4.1.2) MCA state: novm (MCA v2.1.0, API v1.0.0, Component v4.1.2) MCA state: orted (MCA v2.1.0, API v1.0.0, Component v4.1.2) MCA state: app (MCA v2.1.0, API v1.0.0, Component v4.1.2) MCA bml: r2 (MCA v2.1.0, API v2.0.0, 
Component v4.1.2) MCA coll: adapt (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA coll: libnbc (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA coll: monitoring (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA coll: sm (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA coll: basic (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA coll: tuned (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA coll: sync (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA coll: han (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA coll: hcoll (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA coll: inter (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA coll: self (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA fbtl: posix (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA fcoll: vulcan (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA fcoll: individual (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA fcoll: dynamic_gen2 (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA fcoll: dynamic (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA fcoll: two_phase (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA fs: ufs (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA fs: lustre (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA io: ompio (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA io: romio321 (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA osc: ucx (MCA v2.1.0, API v3.0.0, Component v4.1.2) MCA osc: monitoring (MCA v2.1.0, API v3.0.0, Component v4.1.2) MCA osc: sm (MCA v2.1.0, API v3.0.0, Component v4.1.2) MCA osc: pt2pt (MCA v2.1.0, API v3.0.0, Component v4.1.2) MCA osc: rdma (MCA v2.1.0, API v3.0.0, Component v4.1.2) MCA pml: v (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA pml: cm (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA pml: ob1 (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA pml: monitoring (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA pml: ucx (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA rte: orte (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA sharedfp: lockedfile (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA sharedfp: individual 
(MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA sharedfp: sm (MCA v2.1.0, API v2.0.0, Component v4.1.2) MCA topo: treematch (MCA v2.1.0, API v2.2.0, Component v4.1.2) MCA topo: basic (MCA v2.1.0, API v2.2.0, Component v4.1.2) MCA vprotocol: pessimist (MCA v2.1.0, API v2.0.0, Component v4.1.2)
No idea why this got posted twice