open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

UCX Error: failed to parse address: number of ep addresses exceeds 6 #10687

Open omor1 opened 1 year ago

omor1 commented 1 year ago

Background information

What version of Open MPI are you using?

Describe how Open MPI was installed

Built via Spack

Build information

Please describe the system on which you are running

SDSC Expanse:


Details of the issue

I am running a Tiled Low-Rank Cholesky code with PaRSEC using Open MPI as the communication backend. The configuration is somewhat atypical for the problem and is rather inefficient for the problem size and scale; this can result in very many communications.

When using MPI_THREAD_MULTIPLE to use the compute threads to send remote task activations, I intermittently run into issues with UCX. This does not appear to occur when funneling all communications over a single thread. There are up to 127 compute threads per node, with one core reserved to process incoming communications and bulk data transfers. The remote task activations consist of fairly short messages sent using MPI_Send that are matched by tag and wildcard source at the receiver with a persistent receive.

The error I receive is below:

[1660714754.912708] [exp-3-27:2967907:0]         address.c:1588 UCX  ERROR failed to parse address: number of ep addresses exceeds 6
[exp-3-27:2967907] pml_ucx.c:421  Error: ucp_ep_create(proc=1) failed: Invalid parameter
[exp-3-27:2967907] pml_ucx.c:472  Error: Failed to resolve UCX endpoint for rank 1
[exp-3-27:2967907] *** An error occurred in MPI_Send
[exp-3-27:2967907] *** reported by process [929955840,6]
[exp-3-27:2967907] *** on communicator MPI COMMUNICATOR 3 DUP FROM 0
[exp-3-27:2967907] *** MPI_ERR_OTHER: known error not in list
[exp-3-27:2967907] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[exp-3-27:2967907] ***    and potentially your MPI job)

My first guess is that the failure is intermittent either because of a race with delayed initialization or because many threads attempting to communicate concurrently triggers the problem.

Open MPI MCA parameters

OMPI_MCA_btl="self,vader"
OMPI_MCA_pml="ucx"
OMPI_MCA_osc="ucx"

UCX parameters

UCX_NET_DEVICES="mlx5_2:1"
UCX_TLS="shm,rc_x"
UCX_SEG_SIZE="12K"
UCX_IB_SEG_SIZE="12K"
UCX_IV_TX_MIN_INLINE="128"
UCX_IB_TX_INLINE_RESP="128"
UCX_IB_RX_INLINE="128"
UCX_IB_TX_QUEUE_LEN="1024"
UCX_IB_TX_MAX_BATCH="1"
UCX_IB_RCACHE_ADDR_ALIGN="4096"
UCX_IB_RCACHE_MAX_REGIONS="65536"
UCX_RC_TX_POLL_ALWAYS="y"
UCX_BCOPY_THRESH="128"
UCX_RNDV_THRESH="12K"
UCX_ZCOPY_THRESH="12K"
UCX_RNDV_SCHEME="put_zcopy"
UCX_UNIFIED_MODE="y"

The default parameters appeared to make some strange choices about when to switch between protocols; this set empirically appears to improve performance by a couple of percent, though not very significantly, at least for the funneled communication mode. The only parameter that is strictly required is UCX_IB_RCACHE_MAX_REGIONS; see openucx/ucx#6264 for details.

omor1 commented 1 year ago

I have also gotten other UCX errors:

[exp-8-18:3546090] COPY-OPAL-VALUE: UNSUPPORTED TYPE 0
[exp-8-18:3546090] OPAL ERROR: Error in file base/pmix_base_hash.c at line 256
[1660858547.102648] [exp-8-18:3546090:0]         address.c:877  UCX  ERROR failed to unpack address, resource[21] is not valid
[exp-8-18:3546090] pml_ucx.c:421  Error: ucp_ep_create(proc=3) failed: Invalid parameter
[exp-8-18:3546090] pml_ucx.c:472  Error: Failed to resolve UCX endpoint for rank 3
[exp-8-18:3546090] *** An error occurred in MPI_Send
[exp-8-18:3546090] *** reported by process [3131834368,12]
[exp-8-18:3546090] *** on communicator MPI COMMUNICATOR 3 DUP FROM 0
[exp-8-18:3546090] *** MPI_ERR_OTHER: known error not in list
[exp-8-18:3546090] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[exp-8-18:3546090] ***    and potentially your MPI job)

brminich commented 1 year ago

Can you please clarify a couple of questions?

omor1 commented 1 year ago

  • Is this a regression (did you run the same task with older OMPI/UCX)?

I have not yet attempted to replicate this particular issue with older OMPI/UCX.

  • Please try to reproduce it without UCX_UNIFIED_MODE="y" or set it explicitly to UCX_UNIFIED_MODE="n"

Using the same parameters with UCX_UNIFIED_MODE="n" I got the following crash:

[1660942324.015975] [exp-13-16:933027:0]          ucp_ep.c:980  UCX  ERROR the parameter params->address must not be NULL
[exp-13-16:933027] pml_ucx.c:421  Error: ucp_ep_create(proc=3) failed: Invalid parameter
[exp-13-16:933027] pml_ucx.c:472  Error: Failed to resolve UCX endpoint for rank 3
[exp-13-16:933027] *** An error occurred in MPI_Send
[exp-13-16:933027] *** reported by process [51904512,12]
[exp-13-16:933027] *** on communicator MPI COMMUNICATOR 3 DUP FROM 0
[exp-13-16:933027] *** MPI_ERR_OTHER: known error not in list
[exp-13-16:933027] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[exp-13-16:933027] ***    and potentially your MPI job)

I also had a successful run, but I have had successful runs with UCX_UNIFIED_MODE="y" too; the failure doesn't always occur.

Using the default parameters and UCX_IB_RCACHE_MAX_REGIONS="16777216" (limiting cache entries to maximum number of hardware registrations) I don't think I've seen this issue. Scratch that, I just replicated this using stock UCX parameters, except for UCX_IB_RCACHE_MAX_REGIONS="65536" which is necessary to avoid ibv_reg_mr(address=0x15445450f010, length=384000, access=0x10000f) failed: Cannot allocate memory.

I have only seen this with MPI_THREAD_MULTIPLE, though; that, combined with how sporadic and inconsistent the failure is, leads me to suspect a race condition.

  • Provide output of ucx_info -d from each machine (afaiu you use only 2 nodes)

I'm running with 16 nodes. Since I run on a public cluster I get a different set of nodes each time—but notably, between the failing and succeeding run, I did get the same exact nodes. ucx_info -d for some arbitrary node is:

#
# Memory domain: self
#     Component: self
#             register: unlimited, cost: 0 nsec
#           remote key: 0 bytes
#
#      Transport: self
#         Device: memory0
#           Type: loopback
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 6911.00 MB/sec
#              latency: 0 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 8K
#             am_bcopy: <= 8K
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 0 bytes
#        iface address: 8 bytes
#       error handling: ep_check
#
#
# Memory domain: tcp
#     Component: tcp
#             register: unlimited, cost: 0 nsec
#           remote key: 0 bytes
#
#      Transport: tcp
#         Device: eno33.450
#           Type: network
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 2954.51/ppn + 0.00 MB/sec
#              latency: 5223 nsec
#             overhead: 50000 nsec
#            put_zcopy: <= 18446744073709551590, up to 6 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 0
#             am_short: <= 8K
#             am_bcopy: <= 8K
#             am_zcopy: <= 64K, up to 6 iov
#   am_opt_zcopy_align: <= 1
#         am_align_mtu: <= 0
#            am header: <= 8037
#           connection: to ep, to iface
#      device priority: 0
#     device num paths: 1
#              max eps: 256
#       device address: 6 bytes
#        iface address: 2 bytes
#           ep address: 10 bytes
#       error handling: peer failure, ep_check, keepalive
#
#      Transport: tcp
#         Device: lo
#           Type: network
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 11.91/ppn + 0.00 MB/sec
#              latency: 10960 nsec
#             overhead: 50000 nsec
#            put_zcopy: <= 18446744073709551590, up to 6 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 0
#             am_short: <= 8K
#             am_bcopy: <= 8K
#             am_zcopy: <= 64K, up to 6 iov
#   am_opt_zcopy_align: <= 1
#         am_align_mtu: <= 0
#            am header: <= 8037
#           connection: to ep, to iface
#      device priority: 1
#     device num paths: 1
#              max eps: 256
#       device address: 18 bytes
#        iface address: 2 bytes
#           ep address: 10 bytes
#       error handling: peer failure, ep_check, keepalive
#
#      Transport: tcp
#         Device: eno33
#           Type: network
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 2954.51/ppn + 0.00 MB/sec
#              latency: 5223 nsec
#             overhead: 50000 nsec
#            put_zcopy: <= 18446744073709551590, up to 6 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 0
#             am_short: <= 8K
#             am_bcopy: <= 8K
#             am_zcopy: <= 64K, up to 6 iov
#   am_opt_zcopy_align: <= 1
#         am_align_mtu: <= 0
#            am header: <= 8037
#           connection: to ep, to iface
#      device priority: 1
#     device num paths: 1
#              max eps: 256
#       device address: 6 bytes
#        iface address: 2 bytes
#           ep address: 10 bytes
#       error handling: peer failure, ep_check, keepalive
#
#      Transport: tcp
#         Device: ib0
#           Type: network
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 11522.81/ppn + 0.00 MB/sec
#              latency: 5206 nsec
#             overhead: 50000 nsec
#            put_zcopy: <= 18446744073709551590, up to 6 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 0
#             am_short: <= 8K
#             am_bcopy: <= 8K
#             am_zcopy: <= 64K, up to 6 iov
#   am_opt_zcopy_align: <= 1
#         am_align_mtu: <= 0
#            am header: <= 8037
#           connection: to ep, to iface
#      device priority: 1
#     device num paths: 1
#              max eps: 256
#       device address: 6 bytes
#        iface address: 2 bytes
#           ep address: 10 bytes
#       error handling: peer failure, ep_check, keepalive
#
#
# Connection manager: tcp
#      max_conn_priv: 2064 bytes
#
# Memory domain: sysv
#     Component: sysv
#             allocate: unlimited
#           remote key: 12 bytes
#           rkey_ptr is supported
#
#      Transport: sysv
#         Device: memory
#           Type: intra-node
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 12179.00 MB/sec
#              latency: 80 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 100
#             am_bcopy: <= 8256
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 8 bytes
#       error handling: ep_check
#
#
# Memory domain: posix
#     Component: posix
#             allocate: <= 131844956K
#           remote key: 24 bytes
#           rkey_ptr is supported
#
#      Transport: posix
#         Device: memory
#           Type: intra-node
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 12179.00 MB/sec
#              latency: 80 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 100
#             am_bcopy: <= 8256
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 8 bytes
#       error handling: ep_check
#
#
# Memory domain: mlx5_0
#     Component: ib
#             register: unlimited, cost: 180 nsec
#           remote key: 8 bytes
#           local memory handle is required for zcopy
#
#      Transport: dc_mlx5
#         Device: mlx5_0:1
#           Type: network
#  System device: mlx5_0 (0)
#
#      capabilities:
#            bandwidth: 2916.16/ppn + 0.00 MB/sec
#              latency: 860 nsec
#             overhead: 40 nsec
#            put_short: <= 2K
#            put_bcopy: <= 8256
#            put_zcopy: <= 1G, up to 11 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 4K
#            get_bcopy: <= 8256
#            get_zcopy: 65..1G, up to 11 iov
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 4K
#             am_short: <= 2046
#             am_bcopy: <= 8254
#             am_zcopy: <= 8254, up to 3 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 138
#               domain: device
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 38
#     device num paths: 1
#              max eps: inf
#       device address: 17 bytes
#        iface address: 5 bytes
#       error handling: buffer (zcopy), remote access, peer failure, ep_check
#
#
#      Transport: rc_verbs
#         Device: mlx5_0:1
#           Type: network
#  System device: mlx5_0 (0)
#
#      capabilities:
#            bandwidth: 2916.16/ppn + 0.00 MB/sec
#              latency: 800 + 1.000 * N nsec
#             overhead: 75 nsec
#            put_short: <= 124
#            put_bcopy: <= 8256
#            put_zcopy: <= 1G, up to 5 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 4K
#            get_bcopy: <= 8256
#            get_zcopy: 65..1G, up to 5 iov
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 4K
#             am_short: <= 123
#             am_bcopy: <= 8255
#             am_zcopy: <= 8255, up to 4 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 127
#               domain: device
#           atomic_add: 64 bit
#          atomic_fadd: 64 bit
#         atomic_cswap: 64 bit
#           connection: to ep
#      device priority: 38
#     device num paths: 1
#              max eps: 256
#       device address: 17 bytes
#           ep address: 5 bytes
#       error handling: peer failure, ep_check
#
#
#      Transport: rc_mlx5
#         Device: mlx5_0:1
#           Type: network
#  System device: mlx5_0 (0)
#
#      capabilities:
#            bandwidth: 2916.16/ppn + 0.00 MB/sec
#              latency: 800 + 1.000 * N nsec
#             overhead: 40 nsec
#            put_short: <= 2K
#            put_bcopy: <= 8256
#            put_zcopy: <= 1G, up to 14 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 4K
#            get_bcopy: <= 8256
#            get_zcopy: 65..1G, up to 14 iov
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 4K
#             am_short: <= 2046
#             am_bcopy: <= 8254
#             am_zcopy: <= 8254, up to 3 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 186
#               domain: device
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to ep
#      device priority: 38
#     device num paths: 1
#              max eps: 256
#       device address: 17 bytes
#           ep address: 7 bytes
#       error handling: buffer (zcopy), remote access, peer failure, ep_check
#
#
#      Transport: ud_verbs
#         Device: mlx5_0:1
#           Type: network
#  System device: mlx5_0 (0)
#
#      capabilities:
#            bandwidth: 2916.16/ppn + 0.00 MB/sec
#              latency: 830 nsec
#             overhead: 105 nsec
#             am_short: <= 116
#             am_bcopy: <= 4088
#             am_zcopy: <= 4088, up to 5 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 3952
#           connection: to ep, to iface
#      device priority: 38
#     device num paths: 1
#              max eps: inf
#       device address: 17 bytes
#        iface address: 3 bytes
#           ep address: 6 bytes
#       error handling: peer failure, ep_check
#
#
#      Transport: ud_mlx5
#         Device: mlx5_0:1
#           Type: network
#  System device: mlx5_0 (0)
#
#      capabilities:
#            bandwidth: 2916.16/ppn + 0.00 MB/sec
#              latency: 830 nsec
#             overhead: 80 nsec
#             am_short: <= 180
#             am_bcopy: <= 4088
#             am_zcopy: <= 4088, up to 3 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 132
#           connection: to ep, to iface
#      device priority: 38
#     device num paths: 1
#              max eps: inf
#       device address: 17 bytes
#        iface address: 3 bytes
#           ep address: 6 bytes
#       error handling: peer failure, ep_check
#
#
# Memory domain: mlx5_1
#     Component: ib
#             register: unlimited, cost: 180 nsec
#           remote key: 8 bytes
#           local memory handle is required for zcopy
#   < no supported devices found >
#
# Memory domain: mlx5_2
#     Component: ib
#             register: unlimited, cost: 180 nsec
#           remote key: 8 bytes
#           local memory handle is required for zcopy
#           memory invalidation is supported
#
#      Transport: dc_mlx5
#         Device: mlx5_2:1
#           Type: network
#  System device: mlx5_2 (2)
#
#      capabilities:
#            bandwidth: 11794.23/ppn + 0.00 MB/sec
#              latency: 660 nsec
#             overhead: 40 nsec
#            put_short: <= 2K
#            put_bcopy: <= 8256
#            put_zcopy: <= 1G, up to 11 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 4K
#            get_bcopy: <= 8256
#            get_zcopy: 65..1G, up to 11 iov
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 4K
#             am_short: <= 2046
#             am_bcopy: <= 8254
#             am_zcopy: <= 8254, up to 3 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 138
#               domain: device
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 50
#     device num paths: 1
#              max eps: inf
#       device address: 3 bytes
#        iface address: 5 bytes
#       error handling: buffer (zcopy), remote access, peer failure, ep_check
#
#
#      Transport: rc_verbs
#         Device: mlx5_2:1
#           Type: network
#  System device: mlx5_2 (2)
#
#      capabilities:
#            bandwidth: 11794.23/ppn + 0.00 MB/sec
#              latency: 600 + 1.000 * N nsec
#             overhead: 75 nsec
#            put_short: <= 124
#            put_bcopy: <= 8256
#            put_zcopy: <= 1G, up to 5 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 4K
#            get_bcopy: <= 8256
#            get_zcopy: 65..1G, up to 5 iov
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 4K
#             am_short: <= 123
#             am_bcopy: <= 8255
#             am_zcopy: <= 8255, up to 4 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 127
#               domain: device
#           atomic_add: 64 bit
#          atomic_fadd: 64 bit
#         atomic_cswap: 64 bit
#           connection: to ep
#      device priority: 50
#     device num paths: 1
#              max eps: 256
#       device address: 3 bytes
#           ep address: 5 bytes
#       error handling: peer failure, ep_check
#
#
#      Transport: rc_mlx5
#         Device: mlx5_2:1
#           Type: network
#  System device: mlx5_2 (2)
#
#      capabilities:
#            bandwidth: 11794.23/ppn + 0.00 MB/sec
#              latency: 600 + 1.000 * N nsec
#             overhead: 40 nsec
#            put_short: <= 2K
#            put_bcopy: <= 8256
#            put_zcopy: <= 1G, up to 14 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 4K
#            get_bcopy: <= 8256
#            get_zcopy: 65..1G, up to 14 iov
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 4K
#             am_short: <= 2046
#             am_bcopy: <= 8254
#             am_zcopy: <= 8254, up to 3 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 186
#               domain: device
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to ep
#      device priority: 50
#     device num paths: 1
#              max eps: 256
#       device address: 3 bytes
#           ep address: 7 bytes
#       error handling: buffer (zcopy), remote access, peer failure, ep_check
#
#
#      Transport: ud_verbs
#         Device: mlx5_2:1
#           Type: network
#  System device: mlx5_2 (2)
#
#      capabilities:
#            bandwidth: 11794.23/ppn + 0.00 MB/sec
#              latency: 630 nsec
#             overhead: 105 nsec
#             am_short: <= 116
#             am_bcopy: <= 4088
#             am_zcopy: <= 4088, up to 5 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 3952
#           connection: to ep, to iface
#      device priority: 50
#     device num paths: 1
#              max eps: inf
#       device address: 3 bytes
#        iface address: 3 bytes
#           ep address: 6 bytes
#       error handling: peer failure, ep_check
#
#
#      Transport: ud_mlx5
#         Device: mlx5_2:1
#           Type: network
#  System device: mlx5_2 (2)
#
#      capabilities:
#            bandwidth: 11794.23/ppn + 0.00 MB/sec
#              latency: 630 nsec
#             overhead: 80 nsec
#             am_short: <= 180
#             am_bcopy: <= 4088
#             am_zcopy: <= 4088, up to 3 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 132
#           connection: to ep, to iface
#      device priority: 50
#     device num paths: 1
#              max eps: inf
#       device address: 3 bytes
#        iface address: 3 bytes
#           ep address: 6 bytes
#       error handling: peer failure, ep_check
#
#
# Connection manager: rdmacm
#      max_conn_priv: 54 bytes
#
# Memory domain: cma
#     Component: cma
#             register: unlimited, cost: 9 nsec
#
#      Transport: cma
#         Device: memory
#           Type: intra-node
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 11145.00 MB/sec
#              latency: 80 nsec
#             overhead: 2000 nsec
#            put_zcopy: unlimited, up to 16 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 1
#            get_zcopy: unlimited, up to 16 iov
#  get_opt_zcopy_align: <= 1
#        get_align_mtu: <= 1
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 4 bytes
#       error handling: peer failure, ep_check
#
#
# Memory domain: knem
#     Component: knem
#             register: unlimited, cost: 180 nsec
#           remote key: 16 bytes
#
#      Transport: knem
#         Device: memory
#           Type: intra-node
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 13862.00/ppn + 0.00 MB/sec
#              latency: 80 nsec
#             overhead: 2000 nsec
#            put_zcopy: unlimited, up to 16 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 1
#            get_zcopy: unlimited, up to 16 iov
#  get_opt_zcopy_align: <= 1
#        get_align_mtu: <= 1
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 0 bytes
#       error handling: none
#
#
# Memory domain: xpmem
#     Component: xpmem
#             register: unlimited, cost: 60 nsec
#           remote key: 24 bytes
#           rkey_ptr is supported
#
#      Transport: xpmem
#         Device: memory
#           Type: intra-node
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 12179.00 MB/sec
#              latency: 80 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 100
#             am_bcopy: <= 8256
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 16 bytes
#       error handling: none
#

Note that the correct HCA to use is mlx5_2; while there are two other ConnectX-5 HCAs, one (mlx5_1) isn't connected and is configured for 10GbE, while the other (mlx5_0) is configured for 25GbE.

omor1 commented 1 year ago

Some additional context is that this appears to be related to issues in dynamically connecting processes; setting the mpi_preconnect_mpi MCA option avoids these issues. According to @bosilca there have been some issues in PRRTE recently regarding this. I'm not sure why it only shows up with MPI_THREAD_MULTIPLE, but I might guess that it's a race between connecting the processes and attempting to communicate with them.
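
As a sketch of the workaround mentioned above, the option can be set in the same environment-variable style as the other MCA parameters in this report. The value 1 and the launch line in the comment are assumptions for illustration, not taken from the original job script.

```shell
# Ask Open MPI to wire up connections between all processes during MPI_Init,
# rather than creating endpoints lazily on first communication.
export OMPI_MCA_mpi_preconnect_mpi=1

# Then launch as usual, for example (hypothetical binary name and rank count):
#   mpirun -n 16 ./my_app
```

The equivalent command-line form is `--mca mpi_preconnect_mpi 1`. Since this only moves connection setup ahead of the threaded sends, it is a workaround and does not explain the underlying failure.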

rhc54 commented 1 year ago

Why would PRRTE be involved in this? We don't have anything to do with HCA selection.

bosilca commented 1 year ago

For at least two reasons: 1. without preconnect the modex is asynchronous, and 2. the error indicates that the communication layer was unable to get the information about the peer (which also suggests that the modex data might not yet be available when the connection is established).

rhc54 commented 1 year ago

Kewl - but isn't that a question of how and when OMPI does its modex? I could see if PRRTE decided when it was going to do the modex, or if PRRTE decided what info to include in it or whether to do it async or not. But PRRTE doesn't do any of that - that's all controlled by you folks during MPI_Init.

karasevb commented 1 year ago

@omor1 can it be reproduced with "--mca pml ob1"? And could you please provide some simple reproducer of the problem?
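
A minimal sketch of the configuration being asked about, written in the environment-variable style used earlier in this issue. Adding tcp to the BTL list is an assumption: the original setting of self,vader only covers loopback and shared memory, so ob1 would need some inter-node BTL to reach ranks on other nodes.

```shell
# Select the ob1 PML instead of ucx, so point-to-point traffic bypasses
# the pml_ucx.c endpoint-creation path that appears in the error logs.
export OMPI_MCA_pml="ob1"

# Assumption: tcp is added as the inter-node transport for ob1.
export OMPI_MCA_btl="self,vader,tcp"
```

If the failure disappears with this setting, that would point at the UCX PML's endpoint resolution; if it persists, the modex data itself is the more likely suspect.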