openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

RUN: performance hit from ucx 1.11 to 1.12 with OMPI & OSU benchmarks #7946

Closed tonycurtis closed 2 years ago

tonycurtis commented 2 years ago

Describe the bug

Running v5.8 of the OSU benchmarks, here is the MPI point-to-point bidirectional bandwidth (osu_bibw) on 2 nodes with 1 rank per node.

With Open-MPI 4.1.2 + UCX 1.11.2

# OSU MPI Bi-Directional Bandwidth Test v5.8
# Size      Bandwidth (MB/s)
1                       1.35
2                       2.63
4                       5.37
8                      10.52
16                     20.86
32                     40.74
64                     65.34
128                   117.28
256                   176.25
512                   355.75
1024                  595.62
2048                 1051.08
4096                 1894.89
8192                 3552.10
16384                4096.02
32768                9514.92
65536               13732.91
131072              16355.22
262144              17819.39
524288              18648.07
1048576             19095.27
2097152             19146.01
4194304             18865.67

With Open-MPI 4.1.2 + UCX 1.12 (both 1.12.0 and 1.12.1-rc2)

# OSU MPI Bi-Directional Bandwidth Test v5.8
# Size      Bandwidth (MB/s)
1                       1.49
2                       2.87
4                       5.65
8                      11.63
16                     23.09
32                     44.72
64                     70.89
128                   127.54
256                   176.00
512                   380.11
1024                  653.22
2048                 1242.58
4096                 2153.03
8192                 3659.16
16384                3822.86
32768                3990.97
65536                4104.85
131072               4126.15
262144               4223.85
524288               4175.43
1048576              4224.67
2097152              3748.04
4194304             18545.39
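
The gap between the two runs is easier to judge as a ratio per message size. The following is a minimal sketch, not part of the OSU suite or UCX: the `ratios` helper is a hypothetical name, and the figures are copied by hand from the two tables above for sizes of 8 KiB and larger.

```python
# Bandwidth (MB/s) copied from the two osu_bibw runs above, 8 KiB and up.
ucx_1_11 = {
    8192: 3552.10, 16384: 4096.02, 32768: 9514.92, 65536: 13732.91,
    131072: 16355.22, 262144: 17819.39, 524288: 18648.07,
    1048576: 19095.27, 2097152: 19146.01, 4194304: 18865.67,
}
ucx_1_12 = {
    8192: 3659.16, 16384: 3822.86, 32768: 3990.97, 65536: 4104.85,
    131072: 4126.15, 262144: 4223.85, 524288: 4175.43,
    1048576: 4224.67, 2097152: 3748.04, 4194304: 18545.39,
}


def ratios(old, new):
    """Return {size: new/old} for every message size present in both runs."""
    return {size: new[size] / old[size] for size in sorted(old) if size in new}


# Print size, 1.11.2 bandwidth, 1.12 bandwidth, and the 1.12/1.11.2 ratio.
for size, r in ratios(ucx_1_11, ucx_1_12).items():
    print(f"{size:>8}  {ucx_1_11[size]:>9.2f}  {ucx_1_12[size]:>9.2f}  {r:5.2f}x")
```

A ratio well below 1.0 marks a size where 1.12 is slower. With these numbers the drop covers 32 KiB through 2 MiB, while 4 MiB is back to roughly the 1.11.2 figure.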

Running with UCX_LOG_LEVEL=info shows that rc_mlx5 is being used for inter-node traffic.

Setup and versions

CentOS Linux release 8.1.1911 (Core)

Linux login1 4.18.0-147.el8.aarch64 #1 SMP Wed Dec 4 21:57:21 UTC 2019 aarch64 aarch64 aarch64 GNU/Linux

(The aarch64 CPU here is an A64FX.)

$ ucx_info -d
#
# Memory domain: self
#     Component: self
#             register: unlimited, cost: 0 nsec
#           remote key: 0 bytes
#
#      Transport: self
#         Device: memory0
#           Type: loopback
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 6911.00 MB/sec
#              latency: 0 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 8K
#             am_bcopy: <= 8K
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 0 bytes
#        iface address: 8 bytes
#       error handling: ep_check
#
#
# Memory domain: tcp
#     Component: tcp
#             register: unlimited, cost: 0 nsec
#           remote key: 0 bytes
#
#      Transport: tcp
#         Device: enp11s0f1
#           Type: network
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 1131.64/ppn + 0.00 MB/sec
#              latency: 5258 nsec
#             overhead: 50000 nsec
#            put_zcopy: <= 18446744073709551590, up to 6 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 0
#             am_short: <= 8K
#             am_bcopy: <= 8K
#             am_zcopy: <= 64K, up to 6 iov
#   am_opt_zcopy_align: <= 1
#         am_align_mtu: <= 0
#            am header: <= 8037
#           connection: to ep, to iface
#      device priority: 1
#     device num paths: 1
#              max eps: 256
#       device address: 6 bytes
#        iface address: 2 bytes
#           ep address: 10 bytes
#       error handling: peer failure, ep_check, keepalive
#
#      Transport: tcp
#         Device: ib0
#           Type: network
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 4351.14/ppn + 0.00 MB/sec
#              latency: 5214 nsec
#             overhead: 50000 nsec
#            put_zcopy: <= 18446744073709551590, up to 6 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 0
#             am_short: <= 8K
#             am_bcopy: <= 8K
#             am_zcopy: <= 64K, up to 6 iov
#   am_opt_zcopy_align: <= 1
#         am_align_mtu: <= 0
#            am header: <= 8037
#           connection: to ep, to iface
#      device priority: 1
#     device num paths: 1
#              max eps: 256
#       device address: 6 bytes
#        iface address: 2 bytes
#           ep address: 10 bytes
#       error handling: peer failure, ep_check, keepalive
#
#      Transport: tcp
#         Device: lo
#           Type: network
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 11.91/ppn + 0.00 MB/sec
#              latency: 10960 nsec
#             overhead: 50000 nsec
#            put_zcopy: <= 18446744073709551590, up to 6 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 0
#             am_short: <= 8K
#             am_bcopy: <= 8K
#             am_zcopy: <= 64K, up to 6 iov
#   am_opt_zcopy_align: <= 1
#         am_align_mtu: <= 0
#            am header: <= 8037
#           connection: to ep, to iface
#      device priority: 1
#     device num paths: 1
#              max eps: 256
#       device address: 18 bytes
#        iface address: 2 bytes
#           ep address: 10 bytes
#       error handling: peer failure, ep_check, keepalive
#
#      Transport: tcp
#         Device: enp11s0f0
#           Type: network
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 113.16/ppn + 0.00 MB/sec
#              latency: 5776 nsec
#             overhead: 50000 nsec
#            put_zcopy: <= 18446744073709551590, up to 6 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 0
#             am_short: <= 8K
#             am_bcopy: <= 8K
#             am_zcopy: <= 64K, up to 6 iov
#   am_opt_zcopy_align: <= 1
#         am_align_mtu: <= 0
#            am header: <= 8037
#           connection: to ep, to iface
#      device priority: 0
#     device num paths: 1
#              max eps: 256
#       device address: 6 bytes
#        iface address: 2 bytes
#           ep address: 10 bytes
#       error handling: peer failure, ep_check, keepalive
#
#
# Connection manager: tcp
#      max_conn_priv: 2064 bytes
#
# Memory domain: sysv
#     Component: sysv
#             allocate: unlimited
#           remote key: 12 bytes
#           rkey_ptr is supported
#
#      Transport: sysv
#         Device: memory
#           Type: intra-node
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 12179.00 MB/sec
#              latency: 80 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 100
#             am_bcopy: <= 8256
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 8 bytes
#       error handling: ep_check
#
#
# Memory domain: posix
#     Component: posix
#             allocate: <= 133949824K
#           remote key: 24 bytes
#           rkey_ptr is supported
#
#      Transport: posix
#         Device: memory
#           Type: intra-node
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 12179.00 MB/sec
#              latency: 80 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 100
#             am_bcopy: <= 8256
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 8 bytes
#       error handling: ep_check
#
#
# Memory domain: mlx5_0
#     Component: ib
#             register: unlimited, cost: 180 nsec
#           remote key: 8 bytes
#           local memory handle is required for zcopy
#           memory invalidation is supported
#
#      Transport: rc_verbs
#         Device: mlx5_0:1
#           Type: network
#  System device: mlx5_0 (0)
#
#      capabilities:
#            bandwidth: 3774.15/ppn + 0.00 MB/sec
#              latency: 1300 + 1.000 * N nsec
#             overhead: 75 nsec
#            put_short: <= 124
#            put_bcopy: <= 8256
#            put_zcopy: <= 1G, up to 5 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 4K
#            get_bcopy: <= 8256
#            get_zcopy: 65..1G, up to 5 iov
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 4K
#             am_short: <= 123
#             am_bcopy: <= 8255
#             am_zcopy: <= 8255, up to 4 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 127
#               domain: device
#           atomic_add: 64 bit
#          atomic_fadd: 64 bit
#         atomic_cswap: 64 bit
#           connection: to ep
#      device priority: 38
#     device num paths: 1
#              max eps: 256
#       device address: 3 bytes
#           ep address: 5 bytes
#       error handling: peer failure, ep_check
#
#
#      Transport: rc_mlx5
#         Device: mlx5_0:1
#           Type: network
#  System device: mlx5_0 (0)
#
#      capabilities:
#            bandwidth: 3774.15/ppn + 0.00 MB/sec
#              latency: 1300 + 1.000 * N nsec
#             overhead: 40 nsec
#            put_short: <= 2K
#            put_bcopy: <= 8256
#            put_zcopy: <= 1G, up to 14 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 4K
#            get_bcopy: <= 8256
#            get_zcopy: 65..1G, up to 14 iov
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 4K
#             am_short: <= 2046
#             am_bcopy: <= 8254
#             am_zcopy: <= 8254, up to 3 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 186
#               domain: device
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to ep
#      device priority: 38
#     device num paths: 1
#              max eps: 256
#       device address: 3 bytes
#           ep address: 7 bytes
#       error handling: buffer (zcopy), remote access, peer failure, ep_check
#
#
#      Transport: dc_mlx5
#         Device: mlx5_0:1
#           Type: network
#  System device: mlx5_0 (0)
#
#      capabilities:
#            bandwidth: 3774.15/ppn + 0.00 MB/sec
#              latency: 1360 nsec
#             overhead: 40 nsec
#            put_short: <= 2K
#            put_bcopy: <= 8256
#            put_zcopy: <= 1G, up to 11 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 4K
#            get_bcopy: <= 8256
#            get_zcopy: 65..1G, up to 11 iov
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 4K
#             am_short: <= 2046
#             am_bcopy: <= 8254
#             am_zcopy: <= 8254, up to 3 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 138
#               domain: device
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 38
#     device num paths: 1
#              max eps: inf
#       device address: 3 bytes
#        iface address: 5 bytes
#       error handling: buffer (zcopy), remote access, peer failure, ep_check
#
#
#      Transport: ud_verbs
#         Device: mlx5_0:1
#           Type: network
#  System device: mlx5_0 (0)
#
#      capabilities:
#            bandwidth: 3774.15/ppn + 0.00 MB/sec
#              latency: 1330 nsec
#             overhead: 105 nsec
#             am_short: <= 116
#             am_bcopy: <= 4088
#             am_zcopy: <= 4088, up to 5 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 3952
#           connection: to ep, to iface
#      device priority: 38
#     device num paths: 1
#              max eps: inf
#       device address: 3 bytes
#        iface address: 3 bytes
#           ep address: 6 bytes
#       error handling: peer failure, ep_check
#
#
#      Transport: ud_mlx5
#         Device: mlx5_0:1
#           Type: network
#  System device: mlx5_0 (0)
#
#      capabilities:
#            bandwidth: 3774.15/ppn + 0.00 MB/sec
#              latency: 1330 nsec
#             overhead: 80 nsec
#             am_short: <= 180
#             am_bcopy: <= 4088
#             am_zcopy: <= 4088, up to 3 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 132
#           connection: to ep, to iface
#      device priority: 38
#     device num paths: 1
#              max eps: inf
#       device address: 3 bytes
#        iface address: 3 bytes
#           ep address: 6 bytes
#       error handling: peer failure, ep_check
#
#
# Memory domain: mlx5_1
#     Component: ib
#             register: unlimited, cost: 180 nsec
#           remote key: 8 bytes
#           local memory handle is required for zcopy
#           memory invalidation is supported
#
#      Transport: rc_verbs
#         Device: mlx5_1:1
#           Type: network
#  System device: mlx5_1 (1)
#
#      capabilities:
#            bandwidth: 219.16/ppn + 0.00 MB/sec
#              latency: 5200 + 1.000 * N nsec
#             overhead: 75 nsec
#            put_short: <= 124
#            put_bcopy: <= 8256
#            put_zcopy: <= 1G, up to 5 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 1K
#            get_bcopy: <= 8256
#            get_zcopy: 65..1G, up to 5 iov
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 1K
#             am_short: <= 123
#             am_bcopy: <= 8255
#             am_zcopy: <= 8255, up to 4 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 1K
#            am header: <= 127
#               domain: device
#           atomic_add: 64 bit
#          atomic_fadd: 64 bit
#         atomic_cswap: 64 bit
#           connection: to ep
#      device priority: 28
#     device num paths: 1
#              max eps: 256
#       device address: 18 bytes
#           ep address: 4 bytes
#       error handling: peer failure, ep_check
#
#
#      Transport: rc_mlx5
#         Device: mlx5_1:1
#           Type: network
#  System device: mlx5_1 (1)
#
#      capabilities:
#            bandwidth: 219.16/ppn + 0.00 MB/sec
#              latency: 5200 + 1.000 * N nsec
#             overhead: 40 nsec
#            put_short: <= 220
#            put_bcopy: <= 8256
#            put_zcopy: <= 1G, up to 14 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 1K
#            get_bcopy: <= 8256
#            get_zcopy: 65..1G, up to 14 iov
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 1K
#             am_short: <= 234
#             am_bcopy: <= 8254
#             am_zcopy: <= 8254, up to 3 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 1K
#            am header: <= 186
#               domain: device
#           atomic_add: 64 bit
#          atomic_fadd: 64 bit
#         atomic_cswap: 64 bit
#           connection: to ep
#      device priority: 28
#     device num paths: 1
#              max eps: 256
#       device address: 18 bytes
#           ep address: 7 bytes
#       error handling: buffer (zcopy), remote access, peer failure, ep_check
#
#
#      Transport: ud_verbs
#         Device: mlx5_1:1
#           Type: network
#  System device: mlx5_1 (1)
#
#      capabilities:
#            bandwidth: 219.16/ppn + 0.00 MB/sec
#              latency: 5230 nsec
#             overhead: 105 nsec
#             am_short: <= 116
#             am_bcopy: <= 1016
#             am_zcopy: <= 1016, up to 5 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 1K
#            am header: <= 880
#           connection: to ep, to iface
#      device priority: 28
#     device num paths: 1
#              max eps: inf
#       device address: 18 bytes
#        iface address: 3 bytes
#           ep address: 6 bytes
#       error handling: peer failure, ep_check
#
#
#      Transport: ud_mlx5
#         Device: mlx5_1:1
#           Type: network
#  System device: mlx5_1 (1)
#
#      capabilities:
#            bandwidth: 219.16/ppn + 0.00 MB/sec
#              latency: 5230 nsec
#             overhead: 80 nsec
#             am_short: <= 180
#             am_bcopy: <= 1016
#             am_zcopy: <= 1016, up to 3 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 1K
#            am header: <= 132
#           connection: to ep, to iface
#      device priority: 28
#     device num paths: 1
#              max eps: inf
#       device address: 18 bytes
#        iface address: 3 bytes
#           ep address: 6 bytes
#       error handling: peer failure, ep_check
#
#
# Memory domain: mlx5_2
#     Component: ib
#             register: unlimited, cost: 180 nsec
#           remote key: 8 bytes
#           local memory handle is required for zcopy
#           memory invalidation is supported
#
#      Transport: rc_verbs
#         Device: mlx5_2:1
#           Type: network
#  System device: mlx5_2 (2)
#
#      capabilities:
#            bandwidth: 1095.78/ppn + 0.00 MB/sec
#              latency: 1500 + 1.000 * N nsec
#             overhead: 75 nsec
#            put_short: <= 124
#            put_bcopy: <= 8256
#            put_zcopy: <= 1G, up to 5 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 1K
#            get_bcopy: <= 8256
#            get_zcopy: 65..1G, up to 5 iov
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 1K
#             am_short: <= 123
#             am_bcopy: <= 8255
#             am_zcopy: <= 8255, up to 4 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 1K
#            am header: <= 127
#               domain: device
#           atomic_add: 64 bit
#          atomic_fadd: 64 bit
#         atomic_cswap: 64 bit
#           connection: to ep
#      device priority: 28
#     device num paths: 1
#              max eps: 256
#       device address: 18 bytes
#           ep address: 4 bytes
#       error handling: peer failure, ep_check
#
#
#      Transport: rc_mlx5
#         Device: mlx5_2:1
#           Type: network
#  System device: mlx5_2 (2)
#
#      capabilities:
#            bandwidth: 1095.78/ppn + 0.00 MB/sec
#              latency: 1500 + 1.000 * N nsec
#             overhead: 40 nsec
#            put_short: <= 220
#            put_bcopy: <= 8256
#            put_zcopy: <= 1G, up to 14 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 1K
#            get_bcopy: <= 8256
#            get_zcopy: 65..1G, up to 14 iov
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 1K
#             am_short: <= 234
#             am_bcopy: <= 8254
#             am_zcopy: <= 8254, up to 3 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 1K
#            am header: <= 186
#               domain: device
#           atomic_add: 64 bit
#          atomic_fadd: 64 bit
#         atomic_cswap: 64 bit
#           connection: to ep
#      device priority: 28
#     device num paths: 1
#              max eps: 256
#       device address: 18 bytes
#           ep address: 7 bytes
#       error handling: buffer (zcopy), remote access, peer failure, ep_check
#
#
#      Transport: ud_verbs
#         Device: mlx5_2:1
#           Type: network
#  System device: mlx5_2 (2)
#
#      capabilities:
#            bandwidth: 1095.78/ppn + 0.00 MB/sec
#              latency: 1530 nsec
#             overhead: 105 nsec
#             am_short: <= 116
#             am_bcopy: <= 1016
#             am_zcopy: <= 1016, up to 5 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 1K
#            am header: <= 880
#           connection: to ep, to iface
#      device priority: 28
#     device num paths: 1
#              max eps: inf
#       device address: 18 bytes
#        iface address: 3 bytes
#           ep address: 6 bytes
#       error handling: peer failure, ep_check
#
#
#      Transport: ud_mlx5
#         Device: mlx5_2:1
#           Type: network
#  System device: mlx5_2 (2)
#
#      capabilities:
#            bandwidth: 1095.78/ppn + 0.00 MB/sec
#              latency: 1530 nsec
#             overhead: 80 nsec
#             am_short: <= 180
#             am_bcopy: <= 1016
#             am_zcopy: <= 1016, up to 3 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 1K
#            am header: <= 132
#           connection: to ep, to iface
#      device priority: 28
#     device num paths: 1
#              max eps: inf
#       device address: 18 bytes
#        iface address: 3 bytes
#           ep address: 6 bytes
#       error handling: peer failure, ep_check
#
#
# Memory domain: mlx5_3
#     Component: ib
#             register: unlimited, cost: 180 nsec
#           remote key: 8 bytes
#           local memory handle is required for zcopy
#           memory invalidation is supported
#
#      Transport: rc_verbs
#         Device: mlx5_3:1
#           Type: network
#  System device: mlx5_3 (3)
#
#      capabilities:
#            bandwidth: 3774.15/ppn + 0.00 MB/sec
#              latency: 1300 + 1.000 * N nsec
#             overhead: 75 nsec
#            put_short: <= 124
#            put_bcopy: <= 8256
#            put_zcopy: <= 1G, up to 5 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 4K
#            get_bcopy: <= 8256
#            get_zcopy: 65..1G, up to 5 iov
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 4K
#             am_short: <= 123
#             am_bcopy: <= 8255
#             am_zcopy: <= 8255, up to 4 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 127
#               domain: device
#           atomic_add: 64 bit
#          atomic_fadd: 64 bit
#         atomic_cswap: 64 bit
#           connection: to ep
#      device priority: 38
#     device num paths: 1
#              max eps: 256
#       device address: 3 bytes
#           ep address: 5 bytes
#       error handling: peer failure, ep_check
#
#
#      Transport: rc_mlx5
#         Device: mlx5_3:1
#           Type: network
#  System device: mlx5_3 (3)
#
#      capabilities:
#            bandwidth: 3774.15/ppn + 0.00 MB/sec
#              latency: 1300 + 1.000 * N nsec
#             overhead: 40 nsec
#            put_short: <= 2K
#            put_bcopy: <= 8256
#            put_zcopy: <= 1G, up to 14 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 4K
#            get_bcopy: <= 8256
#            get_zcopy: 65..1G, up to 14 iov
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 4K
#             am_short: <= 2046
#             am_bcopy: <= 8254
#             am_zcopy: <= 8254, up to 3 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 186
#               domain: device
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to ep
#      device priority: 38
#     device num paths: 1
#              max eps: 256
#       device address: 3 bytes
#           ep address: 7 bytes
#       error handling: buffer (zcopy), remote access, peer failure, ep_check
#
#
#      Transport: dc_mlx5
#         Device: mlx5_3:1
#           Type: network
#  System device: mlx5_3 (3)
#
#      capabilities:
#            bandwidth: 3774.15/ppn + 0.00 MB/sec
#              latency: 1360 nsec
#             overhead: 40 nsec
#            put_short: <= 2K
#            put_bcopy: <= 8256
#            put_zcopy: <= 1G, up to 11 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 4K
#            get_bcopy: <= 8256
#            get_zcopy: 65..1G, up to 11 iov
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 4K
#             am_short: <= 2046
#             am_bcopy: <= 8254
#             am_zcopy: <= 8254, up to 3 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 138
#               domain: device
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 38
#     device num paths: 1
#              max eps: inf
#       device address: 3 bytes
#        iface address: 5 bytes
#       error handling: buffer (zcopy), remote access, peer failure, ep_check
#
#
#      Transport: ud_verbs
#         Device: mlx5_3:1
#           Type: network
#  System device: mlx5_3 (3)
#
#      capabilities:
#            bandwidth: 3774.15/ppn + 0.00 MB/sec
#              latency: 1330 nsec
#             overhead: 105 nsec
#             am_short: <= 116
#             am_bcopy: <= 4088
#             am_zcopy: <= 4088, up to 5 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 3952
#           connection: to ep, to iface
#      device priority: 38
#     device num paths: 1
#              max eps: inf
#       device address: 3 bytes
#        iface address: 3 bytes
#           ep address: 6 bytes
#       error handling: peer failure, ep_check
#
#
#      Transport: ud_mlx5
#         Device: mlx5_3:1
#           Type: network
#  System device: mlx5_3 (3)
#
#      capabilities:
#            bandwidth: 3774.15/ppn + 0.00 MB/sec
#              latency: 1330 nsec
#             overhead: 80 nsec
#             am_short: <= 180
#             am_bcopy: <= 4088
#             am_zcopy: <= 4088, up to 3 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 132
#           connection: to ep, to iface
#      device priority: 38
#     device num paths: 1
#              max eps: inf
#       device address: 3 bytes
#        iface address: 3 bytes
#           ep address: 6 bytes
#       error handling: peer failure, ep_check
#
#
# Connection manager: rdmacm
#      max_conn_priv: 54 bytes
#
# Memory domain: cma
#     Component: cma
#             register: unlimited, cost: 9 nsec
#
#      Transport: cma
#         Device: memory
#           Type: intra-node
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 11145.00 MB/sec
#              latency: 80 nsec
#             overhead: 2000 nsec
#            put_zcopy: unlimited, up to 16 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 1
#            get_zcopy: unlimited, up to 16 iov
#  get_opt_zcopy_align: <= 1
#        get_align_mtu: <= 1
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 4 bytes
#       error handling: peer failure, ep_check
#
#
# Memory domain: knem
#     Component: knem
#             register: unlimited, cost: 180 nsec
#           remote key: 16 bytes
#
#      Transport: knem
#         Device: memory
#           Type: intra-node
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 13862.00/ppn + 0.00 MB/sec
#              latency: 80 nsec
#             overhead: 2000 nsec
#            put_zcopy: unlimited, up to 16 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 1
#            get_zcopy: unlimited, up to 16 iov
#  get_opt_zcopy_align: <= 1
#        get_align_mtu: <= 1
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 0 bytes
#       error handling: none
#
#
# Memory domain: xpmem
#     Component: xpmem
#             register: unlimited, cost: 60 nsec
#           remote key: 24 bytes
#           rkey_ptr is supported
#
#      Transport: xpmem
#         Device: memory
#           Type: intra-node
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 12179.00 MB/sec
#              latency: 80 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 100
#             am_bcopy: <= 8256
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 16 bytes
#       error handling: none
#
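
The `ucx_info -d` listing above is long, so a per-transport summary can make the devices easier to compare. This is a minimal sketch under stated assumptions: `summarize` is a hypothetical helper, it relies only on the `Transport:`, `Device:` and `bandwidth:` line shapes visible in the output above, and the `sample` string is a shortened excerpt of that output.

```python
import re


def summarize(ucx_info_text):
    """Return (transport, device, bandwidth) tuples from `ucx_info -d` text."""
    rows = []
    transport = device = None
    for line in ucx_info_text.splitlines():
        m = re.match(r"#\s*Transport:\s*(\S+)", line)
        if m:
            # A new transport section starts; forget the previous device.
            transport, device = m.group(1), None
            continue
        # Matches "Device:" only, not the lowercase "System device:" line.
        m = re.match(r"#\s*Device:\s*(\S+)", line)
        if m:
            device = m.group(1)
            continue
        m = re.match(r"#\s*bandwidth:\s*(.+)", line)
        if m and transport:
            rows.append((transport, device, m.group(1).strip()))
    return rows


# Shortened excerpt of the listing above, used only to exercise the parser.
sample = """\
#      Transport: rc_mlx5
#         Device: mlx5_0:1
#           Type: network
#  System device: mlx5_0 (0)
#      capabilities:
#            bandwidth: 3774.15/ppn + 0.00 MB/sec
#      Transport: rc_mlx5
#         Device: mlx5_1:1
#  System device: mlx5_1 (1)
#      capabilities:
#            bandwidth: 219.16/ppn + 0.00 MB/sec
"""

for row in summarize(sample):
    print(*row)
```

To run it on a full listing, save the output of `ucx_info -d` to a file and pass the file's contents to `summarize`.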
$ ompi_info
                 Package: Open MPI arcurtis@login1 Distribution
                Open MPI: 4.1.2
  Open MPI repo revision: v4.1.2
   Open MPI release date: Nov 24, 2021
                Open RTE: 4.1.2
  Open RTE repo revision: v4.1.2
   Open RTE release date: Nov 24, 2021
                    OPAL: 4.1.2
      OPAL repo revision: v4.1.2
       OPAL release date: Nov 24, 2021
                 MPI API: 3.1.0
            Ident string: 4.1.2
                  Prefix: /lustre/home/arcurtis/opt/./openmpi/gcc8/4.1.2
 Configured architecture: aarch64-unknown-linux-gnu
          Configure host: login1
           Configured by: arcurtis
           Configured on: Tue Feb 15 16:15:28 UTC 2022
          Configure host: login1
  Configure command line: '--prefix=/lustre/home/arcurtis/opt/./openmpi/gcc8/4.1.2'
                          '--with-knem=/opt/knem-1.1.3.90mlnx1'
                          '--with-xpmem=/opt/xpmem' '--with-lustre'
                          '--without-cuda' '--enable-mpi1-compatibility'
                          '--disable-debug' '--with-libevent=internal'
                          '--with-pmix=internal' '--with-hwloc=internal'
                          '--with-hcoll=/opt/mellanox/hcoll'
                          '--enable-mca-no-build=btl-uct'
                          '--enable-mpi-fortran=yes'
                          '--enable-oshmem-fortran=no'
                          '--with-ucx=/lustre/software/ucx/1.11.2'
                          '--enable-orterun-prefix-by-default'
                          '--without-verbs' '--without-ofi'
                Built by: arcurtis
                Built on: Tue Feb 15 16:26:50 UTC 2022
              Built host: login1
              C bindings: yes
            C++ bindings: no
             Fort mpif.h: yes (all)
            Fort use mpi: yes (full: ignore TKR)
       Fort use mpi size: deprecated-ompi-info-value
        Fort use mpi_f08: yes
 Fort mpi_f08 compliance: The mpi_f08 module is available, but due to
                          limitations in the gfortran compiler and/or Open
                          MPI, does not support the following: array
                          subsections, direct passthru (where possible) to
                          underlying Open MPI's C functionality
  Fort mpi_f08 subarrays: no
           Java bindings: no
  Wrapper compiler rpath: runpath
              C compiler: gcc
     C compiler absolute: /usr/bin/gcc
  C compiler family name: GNU
      C compiler version: 8.4.1
            C++ compiler: g++
   C++ compiler absolute: /usr/bin/g++
           Fort compiler: gfortran
       Fort compiler abs: /usr/bin/gfortran
         Fort ignore TKR: yes (!GCC$ ATTRIBUTES NO_ARG_CHECK ::)
   Fort 08 assumed shape: yes
      Fort optional args: yes
          Fort INTERFACE: yes
    Fort ISO_FORTRAN_ENV: yes
       Fort STORAGE_SIZE: yes
      Fort BIND(C) (all): yes
      Fort ISO_C_BINDING: yes
 Fort SUBROUTINE BIND(C): yes
       Fort TYPE,BIND(C): yes
 Fort T,BIND(C,name="a"): yes
            Fort PRIVATE: yes
          Fort PROTECTED: yes
           Fort ABSTRACT: yes
       Fort ASYNCHRONOUS: yes
          Fort PROCEDURE: yes
         Fort USE...ONLY: yes
           Fort C_FUNLOC: yes
 Fort f08 using wrappers: yes
         Fort MPI_SIZEOF: yes
             C profiling: yes
           C++ profiling: no
   Fort mpif.h profiling: yes
  Fort use mpi profiling: yes
   Fort use mpi_f08 prof: yes
          C++ exceptions: no
          Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes,
                          OMPI progress: no, ORTE progress: yes, Event lib:
                          yes)
           Sparse Groups: no
  Internal debug support: no
  MPI interface warnings: yes
     MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
              dl support: yes
   Heterogeneous support: no
 mpirun default --prefix: yes
       MPI_WTIME support: native
     Symbol vis. support: yes
   Host topology support: yes
            IPv6 support: no
      MPI1 compatibility: yes
          MPI extensions: affinity, cuda, pcollreq
   FT Checkpoint support: no (checkpoint thread: no)
   C/R Enabled Debugging: no
  MPI_MAX_PROCESSOR_NAME: 256
    MPI_MAX_ERROR_STRING: 256
     MPI_MAX_OBJECT_NAME: 64
        MPI_MAX_INFO_KEY: 36
        MPI_MAX_INFO_VAL: 256
       MPI_MAX_PORT_NAME: 1024
  MPI_MAX_DATAREP_STRING: 128
           MCA allocator: bucket (MCA v2.1.0, API v2.0.0, Component v4.1.2)
           MCA allocator: basic (MCA v2.1.0, API v2.0.0, Component v4.1.2)
           MCA backtrace: execinfo (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                 MCA btl: vader (MCA v2.1.0, API v3.1.0, Component v4.1.2)
                 MCA btl: tcp (MCA v2.1.0, API v3.1.0, Component v4.1.2)
                 MCA btl: self (MCA v2.1.0, API v3.1.0, Component v4.1.2)
            MCA compress: bzip (MCA v2.1.0, API v2.0.0, Component v4.1.2)
            MCA compress: gzip (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                 MCA crs: none (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                  MCA dl: dlopen (MCA v2.1.0, API v1.0.0, Component v4.1.2)
               MCA event: libevent2022 (MCA v2.1.0, API v2.0.0, Component
                          v4.1.2)
               MCA hwloc: hwloc201 (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                  MCA if: linux_ipv6 (MCA v2.1.0, API v2.0.0, Component
                          v4.1.2)
                  MCA if: posix_ipv4 (MCA v2.1.0, API v2.0.0, Component
                          v4.1.2)
         MCA installdirs: env (MCA v2.1.0, API v2.0.0, Component v4.1.2)
         MCA installdirs: config (MCA v2.1.0, API v2.0.0, Component v4.1.2)
              MCA memory: patcher (MCA v2.1.0, API v2.0.0, Component v4.1.2)
               MCA mpool: hugepage (MCA v2.1.0, API v3.0.0, Component v4.1.2)
             MCA patcher: overwrite (MCA v2.1.0, API v1.0.0, Component
                          v4.1.2)
                MCA pmix: isolated (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                MCA pmix: flux (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                MCA pmix: pmix3x (MCA v2.1.0, API v2.0.0, Component v4.1.2)
               MCA pstat: test (MCA v2.1.0, API v2.0.0, Component v4.1.2)
              MCA rcache: grdma (MCA v2.1.0, API v3.3.0, Component v4.1.2)
           MCA reachable: weighted (MCA v2.1.0, API v2.0.0, Component v4.1.2)
           MCA reachable: netlink (MCA v2.1.0, API v2.0.0, Component v4.1.2)
               MCA shmem: mmap (MCA v2.1.0, API v2.0.0, Component v4.1.2)
               MCA shmem: posix (MCA v2.1.0, API v2.0.0, Component v4.1.2)
               MCA shmem: sysv (MCA v2.1.0, API v2.0.0, Component v4.1.2)
               MCA timer: linux (MCA v2.1.0, API v2.0.0, Component v4.1.2)
              MCA errmgr: default_app (MCA v2.1.0, API v3.0.0, Component
                          v4.1.2)
              MCA errmgr: default_hnp (MCA v2.1.0, API v3.0.0, Component
                          v4.1.2)
              MCA errmgr: default_tool (MCA v2.1.0, API v3.0.0, Component
                          v4.1.2)
              MCA errmgr: default_orted (MCA v2.1.0, API v3.0.0, Component
                          v4.1.2)
                 MCA ess: singleton (MCA v2.1.0, API v3.0.0, Component
                          v4.1.2)
                 MCA ess: tool (MCA v2.1.0, API v3.0.0, Component v4.1.2)
                 MCA ess: env (MCA v2.1.0, API v3.0.0, Component v4.1.2)
                 MCA ess: slurm (MCA v2.1.0, API v3.0.0, Component v4.1.2)
                 MCA ess: hnp (MCA v2.1.0, API v3.0.0, Component v4.1.2)
                 MCA ess: pmi (MCA v2.1.0, API v3.0.0, Component v4.1.2)
               MCA filem: raw (MCA v2.1.0, API v2.0.0, Component v4.1.2)
             MCA grpcomm: direct (MCA v2.1.0, API v3.0.0, Component v4.1.2)
                 MCA iof: orted (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                 MCA iof: hnp (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                 MCA iof: tool (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                MCA odls: default (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                MCA odls: pspawn (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                 MCA oob: tcp (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                 MCA plm: slurm (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                 MCA plm: rsh (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                 MCA plm: isolated (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                 MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                 MCA ras: simulator (MCA v2.1.0, API v2.0.0, Component
                          v4.1.2)
                MCA regx: reverse (MCA v2.1.0, API v1.0.0, Component v4.1.2)
                MCA regx: fwd (MCA v2.1.0, API v1.0.0, Component v4.1.2)
                MCA regx: naive (MCA v2.1.0, API v1.0.0, Component v4.1.2)
               MCA rmaps: round_robin (MCA v2.1.0, API v2.0.0, Component
                          v4.1.2)
               MCA rmaps: seq (MCA v2.1.0, API v2.0.0, Component v4.1.2)
               MCA rmaps: mindist (MCA v2.1.0, API v2.0.0, Component v4.1.2)
               MCA rmaps: rank_file (MCA v2.1.0, API v2.0.0, Component
                          v4.1.2)
               MCA rmaps: resilient (MCA v2.1.0, API v2.0.0, Component
                          v4.1.2)
               MCA rmaps: ppr (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                 MCA rml: oob (MCA v2.1.0, API v3.0.0, Component v4.1.2)
              MCA routed: binomial (MCA v2.1.0, API v3.0.0, Component v4.1.2)
              MCA routed: direct (MCA v2.1.0, API v3.0.0, Component v4.1.2)
              MCA routed: radix (MCA v2.1.0, API v3.0.0, Component v4.1.2)
                 MCA rtc: hwloc (MCA v2.1.0, API v1.0.0, Component v4.1.2)
              MCA schizo: slurm (MCA v2.1.0, API v1.0.0, Component v4.1.2)
              MCA schizo: flux (MCA v2.1.0, API v1.0.0, Component v4.1.2)
              MCA schizo: orte (MCA v2.1.0, API v1.0.0, Component v4.1.2)
              MCA schizo: jsm (MCA v2.1.0, API v1.0.0, Component v4.1.2)
              MCA schizo: ompi (MCA v2.1.0, API v1.0.0, Component v4.1.2)
               MCA state: hnp (MCA v2.1.0, API v1.0.0, Component v4.1.2)
               MCA state: tool (MCA v2.1.0, API v1.0.0, Component v4.1.2)
               MCA state: novm (MCA v2.1.0, API v1.0.0, Component v4.1.2)
               MCA state: orted (MCA v2.1.0, API v1.0.0, Component v4.1.2)
               MCA state: app (MCA v2.1.0, API v1.0.0, Component v4.1.2)
                 MCA bml: r2 (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                MCA coll: adapt (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                MCA coll: libnbc (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                MCA coll: monitoring (MCA v2.1.0, API v2.0.0, Component
                          v4.1.2)
                MCA coll: sm (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                MCA coll: basic (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                MCA coll: tuned (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                MCA coll: sync (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                MCA coll: han (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                MCA coll: hcoll (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                MCA coll: inter (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                MCA coll: self (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                MCA fbtl: posix (MCA v2.1.0, API v2.0.0, Component v4.1.2)
               MCA fcoll: vulcan (MCA v2.1.0, API v2.0.0, Component v4.1.2)
               MCA fcoll: individual (MCA v2.1.0, API v2.0.0, Component
                          v4.1.2)
               MCA fcoll: dynamic_gen2 (MCA v2.1.0, API v2.0.0, Component
                          v4.1.2)
               MCA fcoll: dynamic (MCA v2.1.0, API v2.0.0, Component v4.1.2)
               MCA fcoll: two_phase (MCA v2.1.0, API v2.0.0, Component
                          v4.1.2)
                  MCA fs: ufs (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                  MCA fs: lustre (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                  MCA io: ompio (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                  MCA io: romio321 (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                 MCA osc: ucx (MCA v2.1.0, API v3.0.0, Component v4.1.2)
                 MCA osc: monitoring (MCA v2.1.0, API v3.0.0, Component
                          v4.1.2)
                 MCA osc: sm (MCA v2.1.0, API v3.0.0, Component v4.1.2)
                 MCA osc: pt2pt (MCA v2.1.0, API v3.0.0, Component v4.1.2)
                 MCA osc: rdma (MCA v2.1.0, API v3.0.0, Component v4.1.2)
                 MCA pml: v (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                 MCA pml: cm (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                 MCA pml: ob1 (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                 MCA pml: monitoring (MCA v2.1.0, API v2.0.0, Component
                          v4.1.2)
                 MCA pml: ucx (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                 MCA rte: orte (MCA v2.1.0, API v2.0.0, Component v4.1.2)
            MCA sharedfp: lockedfile (MCA v2.1.0, API v2.0.0, Component
                          v4.1.2)
            MCA sharedfp: individual (MCA v2.1.0, API v2.0.0, Component
                          v4.1.2)
            MCA sharedfp: sm (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                MCA topo: treematch (MCA v2.1.0, API v2.2.0, Component
                          v4.1.2)
                MCA topo: basic (MCA v2.1.0, API v2.2.0, Component v4.1.2)
           MCA vprotocol: pessimist (MCA v2.1.0, API v2.0.0, Component
                          v4.1.2)

tonycurtis commented 2 years ago

No idea why this got posted twice.