omor1 opened 1 year ago
I have also gotten other UCX errors:
[exp-8-18:3546090] COPY-OPAL-VALUE: UNSUPPORTED TYPE 0
[exp-8-18:3546090] OPAL ERROR: Error in file base/pmix_base_hash.c at line 256
[1660858547.102648] [exp-8-18:3546090:0] address.c:877 UCX ERROR failed to unpack address, resource[21] is not valid
[exp-8-18:3546090] pml_ucx.c:421 Error: ucp_ep_create(proc=3) failed: Invalid parameter
[exp-8-18:3546090] pml_ucx.c:472 Error: Failed to resolve UCX endpoint for rank 3
[exp-8-18:3546090] *** An error occurred in MPI_Send
[exp-8-18:3546090] *** reported by process [3131834368,12]
[exp-8-18:3546090] *** on communicator MPI COMMUNICATOR 3 DUP FROM 0
[exp-8-18:3546090] *** MPI_ERR_OTHER: known error not in list
[exp-8-18:3546090] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[exp-8-18:3546090] *** and potentially your MPI job)
Can you please clarify a couple of questions?
- Is this a regression (did you run the same task with older OMPI/UCX)?
- Please try to reproduce it without UCX_UNIFIED_MODE="y", or set it explicitly to UCX_UNIFIED_MODE="n".
- Provide output of ucx_info -d from each machine (afaiu you use only 2 nodes).
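For reference, the requested diagnostics can be gathered on each node with commands like the following (a sketch; it assumes the `ucx_info` binary from the UCX install is on the PATH):

```shell
# Print the UCX version and build configuration
ucx_info -v

# List memory domains, transports, and per-device capabilities
ucx_info -d
```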
- Is this a regression (did you run the same task with older OMPI/UCX)?
I have not yet attempted to replicate this particular issue with older OMPI/UCX.
- Please try to reproduce it without UCX_UNIFIED_MODE="y", or set it explicitly to UCX_UNIFIED_MODE="n".
Using the same parameters with UCX_UNIFIED_MODE="n", I got the following crash:
[1660942324.015975] [exp-13-16:933027:0] ucp_ep.c:980 UCX ERROR the parameter params->address must not be NULL
[exp-13-16:933027] pml_ucx.c:421 Error: ucp_ep_create(proc=3) failed: Invalid parameter
[exp-13-16:933027] pml_ucx.c:472 Error: Failed to resolve UCX endpoint for rank 3
[exp-13-16:933027] *** An error occurred in MPI_Send
[exp-13-16:933027] *** reported by process [51904512,12]
[exp-13-16:933027] *** on communicator MPI COMMUNICATOR 3 DUP FROM 0
[exp-13-16:933027] *** MPI_ERR_OTHER: known error not in list
[exp-13-16:933027] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[exp-13-16:933027] *** and potentially your MPI job)
I also had a successful run, but I have had successful runs with UCX_UNIFIED_MODE="y" too; this doesn't always occur.
Using the default parameters and UCX_IB_RCACHE_MAX_REGIONS="65536" I don't think I've seen this issue. Scratch that, I just replicated this using stock UCX parameters, except for UCX_IB_RCACHE_MAX_REGIONS="16777216" (limiting cache entries to the maximum number of hardware registrations), which is necessary to avoid ibv_reg_mr(address=0x15445450f010, length=384000, access=0x10000f) failed: Cannot allocate memory.
I have only seen this with MPI_THREAD_MULTIPLE, though; that, combined with its sporadic and inconsistent occurrence, leads me to suspect a race condition.
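When chasing a suspected connection-establishment race like this, verbose UCX logging can help correlate the failure with endpoint creation; the settings below are standard UCX environment variables (the log-file path is just an example):

```shell
# Ask UCX for debug-level logs, one file per host (%h) and pid (%p)
export UCX_LOG_LEVEL=debug
export UCX_LOG_FILE=/tmp/ucx.%h.%p.log
echo "$UCX_LOG_LEVEL"
```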
- Provide output of ucx_info -d from each machine (afaiu you use only 2 nodes)
I'm running with 16 nodes. Since I run on a public cluster I get a different set of nodes each time, but notably, the failing and succeeding runs did land on the exact same nodes. The ucx_info -d output for an arbitrary node is:
#
# Memory domain: self
# Component: self
# register: unlimited, cost: 0 nsec
# remote key: 0 bytes
#
# Transport: self
# Device: memory0
# Type: loopback
# System device: <unknown>
#
# capabilities:
# bandwidth: 0.00/ppn + 6911.00 MB/sec
# latency: 0 nsec
# overhead: 10 nsec
# put_short: <= 4294967295
# put_bcopy: unlimited
# get_bcopy: unlimited
# am_short: <= 8K
# am_bcopy: <= 8K
# domain: cpu
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 0 bytes
# iface address: 8 bytes
# error handling: ep_check
#
#
# Memory domain: tcp
# Component: tcp
# register: unlimited, cost: 0 nsec
# remote key: 0 bytes
#
# Transport: tcp
# Device: eno33.450
# Type: network
# System device: <unknown>
#
# capabilities:
# bandwidth: 2954.51/ppn + 0.00 MB/sec
# latency: 5223 nsec
# overhead: 50000 nsec
# put_zcopy: <= 18446744073709551590, up to 6 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 0
# am_short: <= 8K
# am_bcopy: <= 8K
# am_zcopy: <= 64K, up to 6 iov
# am_opt_zcopy_align: <= 1
# am_align_mtu: <= 0
# am header: <= 8037
# connection: to ep, to iface
# device priority: 0
# device num paths: 1
# max eps: 256
# device address: 6 bytes
# iface address: 2 bytes
# ep address: 10 bytes
# error handling: peer failure, ep_check, keepalive
#
# Transport: tcp
# Device: lo
# Type: network
# System device: <unknown>
#
# capabilities:
# bandwidth: 11.91/ppn + 0.00 MB/sec
# latency: 10960 nsec
# overhead: 50000 nsec
# put_zcopy: <= 18446744073709551590, up to 6 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 0
# am_short: <= 8K
# am_bcopy: <= 8K
# am_zcopy: <= 64K, up to 6 iov
# am_opt_zcopy_align: <= 1
# am_align_mtu: <= 0
# am header: <= 8037
# connection: to ep, to iface
# device priority: 1
# device num paths: 1
# max eps: 256
# device address: 18 bytes
# iface address: 2 bytes
# ep address: 10 bytes
# error handling: peer failure, ep_check, keepalive
#
# Transport: tcp
# Device: eno33
# Type: network
# System device: <unknown>
#
# capabilities:
# bandwidth: 2954.51/ppn + 0.00 MB/sec
# latency: 5223 nsec
# overhead: 50000 nsec
# put_zcopy: <= 18446744073709551590, up to 6 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 0
# am_short: <= 8K
# am_bcopy: <= 8K
# am_zcopy: <= 64K, up to 6 iov
# am_opt_zcopy_align: <= 1
# am_align_mtu: <= 0
# am header: <= 8037
# connection: to ep, to iface
# device priority: 1
# device num paths: 1
# max eps: 256
# device address: 6 bytes
# iface address: 2 bytes
# ep address: 10 bytes
# error handling: peer failure, ep_check, keepalive
#
# Transport: tcp
# Device: ib0
# Type: network
# System device: <unknown>
#
# capabilities:
# bandwidth: 11522.81/ppn + 0.00 MB/sec
# latency: 5206 nsec
# overhead: 50000 nsec
# put_zcopy: <= 18446744073709551590, up to 6 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 0
# am_short: <= 8K
# am_bcopy: <= 8K
# am_zcopy: <= 64K, up to 6 iov
# am_opt_zcopy_align: <= 1
# am_align_mtu: <= 0
# am header: <= 8037
# connection: to ep, to iface
# device priority: 1
# device num paths: 1
# max eps: 256
# device address: 6 bytes
# iface address: 2 bytes
# ep address: 10 bytes
# error handling: peer failure, ep_check, keepalive
#
#
# Connection manager: tcp
# max_conn_priv: 2064 bytes
#
# Memory domain: sysv
# Component: sysv
# allocate: unlimited
# remote key: 12 bytes
# rkey_ptr is supported
#
# Transport: sysv
# Device: memory
# Type: intra-node
# System device: <unknown>
#
# capabilities:
# bandwidth: 0.00/ppn + 12179.00 MB/sec
# latency: 80 nsec
# overhead: 10 nsec
# put_short: <= 4294967295
# put_bcopy: unlimited
# get_bcopy: unlimited
# am_short: <= 100
# am_bcopy: <= 8256
# domain: cpu
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 8 bytes
# iface address: 8 bytes
# error handling: ep_check
#
#
# Memory domain: posix
# Component: posix
# allocate: <= 131844956K
# remote key: 24 bytes
# rkey_ptr is supported
#
# Transport: posix
# Device: memory
# Type: intra-node
# System device: <unknown>
#
# capabilities:
# bandwidth: 0.00/ppn + 12179.00 MB/sec
# latency: 80 nsec
# overhead: 10 nsec
# put_short: <= 4294967295
# put_bcopy: unlimited
# get_bcopy: unlimited
# am_short: <= 100
# am_bcopy: <= 8256
# domain: cpu
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 8 bytes
# iface address: 8 bytes
# error handling: ep_check
#
#
# Memory domain: mlx5_0
# Component: ib
# register: unlimited, cost: 180 nsec
# remote key: 8 bytes
# local memory handle is required for zcopy
#
# Transport: dc_mlx5
# Device: mlx5_0:1
# Type: network
# System device: mlx5_0 (0)
#
# capabilities:
# bandwidth: 2916.16/ppn + 0.00 MB/sec
# latency: 860 nsec
# overhead: 40 nsec
# put_short: <= 2K
# put_bcopy: <= 8256
# put_zcopy: <= 1G, up to 11 iov
# put_opt_zcopy_align: <= 512
# put_align_mtu: <= 4K
# get_bcopy: <= 8256
# get_zcopy: 65..1G, up to 11 iov
# get_opt_zcopy_align: <= 512
# get_align_mtu: <= 4K
# am_short: <= 2046
# am_bcopy: <= 8254
# am_zcopy: <= 8254, up to 3 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 4K
# am header: <= 138
# domain: device
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# device priority: 38
# device num paths: 1
# max eps: inf
# device address: 17 bytes
# iface address: 5 bytes
# error handling: buffer (zcopy), remote access, peer failure, ep_check
#
#
# Transport: rc_verbs
# Device: mlx5_0:1
# Type: network
# System device: mlx5_0 (0)
#
# capabilities:
# bandwidth: 2916.16/ppn + 0.00 MB/sec
# latency: 800 + 1.000 * N nsec
# overhead: 75 nsec
# put_short: <= 124
# put_bcopy: <= 8256
# put_zcopy: <= 1G, up to 5 iov
# put_opt_zcopy_align: <= 512
# put_align_mtu: <= 4K
# get_bcopy: <= 8256
# get_zcopy: 65..1G, up to 5 iov
# get_opt_zcopy_align: <= 512
# get_align_mtu: <= 4K
# am_short: <= 123
# am_bcopy: <= 8255
# am_zcopy: <= 8255, up to 4 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 4K
# am header: <= 127
# domain: device
# atomic_add: 64 bit
# atomic_fadd: 64 bit
# atomic_cswap: 64 bit
# connection: to ep
# device priority: 38
# device num paths: 1
# max eps: 256
# device address: 17 bytes
# ep address: 5 bytes
# error handling: peer failure, ep_check
#
#
# Transport: rc_mlx5
# Device: mlx5_0:1
# Type: network
# System device: mlx5_0 (0)
#
# capabilities:
# bandwidth: 2916.16/ppn + 0.00 MB/sec
# latency: 800 + 1.000 * N nsec
# overhead: 40 nsec
# put_short: <= 2K
# put_bcopy: <= 8256
# put_zcopy: <= 1G, up to 14 iov
# put_opt_zcopy_align: <= 512
# put_align_mtu: <= 4K
# get_bcopy: <= 8256
# get_zcopy: 65..1G, up to 14 iov
# get_opt_zcopy_align: <= 512
# get_align_mtu: <= 4K
# am_short: <= 2046
# am_bcopy: <= 8254
# am_zcopy: <= 8254, up to 3 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 4K
# am header: <= 186
# domain: device
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to ep
# device priority: 38
# device num paths: 1
# max eps: 256
# device address: 17 bytes
# ep address: 7 bytes
# error handling: buffer (zcopy), remote access, peer failure, ep_check
#
#
# Transport: ud_verbs
# Device: mlx5_0:1
# Type: network
# System device: mlx5_0 (0)
#
# capabilities:
# bandwidth: 2916.16/ppn + 0.00 MB/sec
# latency: 830 nsec
# overhead: 105 nsec
# am_short: <= 116
# am_bcopy: <= 4088
# am_zcopy: <= 4088, up to 5 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 4K
# am header: <= 3952
# connection: to ep, to iface
# device priority: 38
# device num paths: 1
# max eps: inf
# device address: 17 bytes
# iface address: 3 bytes
# ep address: 6 bytes
# error handling: peer failure, ep_check
#
#
# Transport: ud_mlx5
# Device: mlx5_0:1
# Type: network
# System device: mlx5_0 (0)
#
# capabilities:
# bandwidth: 2916.16/ppn + 0.00 MB/sec
# latency: 830 nsec
# overhead: 80 nsec
# am_short: <= 180
# am_bcopy: <= 4088
# am_zcopy: <= 4088, up to 3 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 4K
# am header: <= 132
# connection: to ep, to iface
# device priority: 38
# device num paths: 1
# max eps: inf
# device address: 17 bytes
# iface address: 3 bytes
# ep address: 6 bytes
# error handling: peer failure, ep_check
#
#
# Memory domain: mlx5_1
# Component: ib
# register: unlimited, cost: 180 nsec
# remote key: 8 bytes
# local memory handle is required for zcopy
# < no supported devices found >
#
# Memory domain: mlx5_2
# Component: ib
# register: unlimited, cost: 180 nsec
# remote key: 8 bytes
# local memory handle is required for zcopy
# memory invalidation is supported
#
# Transport: dc_mlx5
# Device: mlx5_2:1
# Type: network
# System device: mlx5_2 (2)
#
# capabilities:
# bandwidth: 11794.23/ppn + 0.00 MB/sec
# latency: 660 nsec
# overhead: 40 nsec
# put_short: <= 2K
# put_bcopy: <= 8256
# put_zcopy: <= 1G, up to 11 iov
# put_opt_zcopy_align: <= 512
# put_align_mtu: <= 4K
# get_bcopy: <= 8256
# get_zcopy: 65..1G, up to 11 iov
# get_opt_zcopy_align: <= 512
# get_align_mtu: <= 4K
# am_short: <= 2046
# am_bcopy: <= 8254
# am_zcopy: <= 8254, up to 3 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 4K
# am header: <= 138
# domain: device
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# device priority: 50
# device num paths: 1
# max eps: inf
# device address: 3 bytes
# iface address: 5 bytes
# error handling: buffer (zcopy), remote access, peer failure, ep_check
#
#
# Transport: rc_verbs
# Device: mlx5_2:1
# Type: network
# System device: mlx5_2 (2)
#
# capabilities:
# bandwidth: 11794.23/ppn + 0.00 MB/sec
# latency: 600 + 1.000 * N nsec
# overhead: 75 nsec
# put_short: <= 124
# put_bcopy: <= 8256
# put_zcopy: <= 1G, up to 5 iov
# put_opt_zcopy_align: <= 512
# put_align_mtu: <= 4K
# get_bcopy: <= 8256
# get_zcopy: 65..1G, up to 5 iov
# get_opt_zcopy_align: <= 512
# get_align_mtu: <= 4K
# am_short: <= 123
# am_bcopy: <= 8255
# am_zcopy: <= 8255, up to 4 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 4K
# am header: <= 127
# domain: device
# atomic_add: 64 bit
# atomic_fadd: 64 bit
# atomic_cswap: 64 bit
# connection: to ep
# device priority: 50
# device num paths: 1
# max eps: 256
# device address: 3 bytes
# ep address: 5 bytes
# error handling: peer failure, ep_check
#
#
# Transport: rc_mlx5
# Device: mlx5_2:1
# Type: network
# System device: mlx5_2 (2)
#
# capabilities:
# bandwidth: 11794.23/ppn + 0.00 MB/sec
# latency: 600 + 1.000 * N nsec
# overhead: 40 nsec
# put_short: <= 2K
# put_bcopy: <= 8256
# put_zcopy: <= 1G, up to 14 iov
# put_opt_zcopy_align: <= 512
# put_align_mtu: <= 4K
# get_bcopy: <= 8256
# get_zcopy: 65..1G, up to 14 iov
# get_opt_zcopy_align: <= 512
# get_align_mtu: <= 4K
# am_short: <= 2046
# am_bcopy: <= 8254
# am_zcopy: <= 8254, up to 3 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 4K
# am header: <= 186
# domain: device
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to ep
# device priority: 50
# device num paths: 1
# max eps: 256
# device address: 3 bytes
# ep address: 7 bytes
# error handling: buffer (zcopy), remote access, peer failure, ep_check
#
#
# Transport: ud_verbs
# Device: mlx5_2:1
# Type: network
# System device: mlx5_2 (2)
#
# capabilities:
# bandwidth: 11794.23/ppn + 0.00 MB/sec
# latency: 630 nsec
# overhead: 105 nsec
# am_short: <= 116
# am_bcopy: <= 4088
# am_zcopy: <= 4088, up to 5 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 4K
# am header: <= 3952
# connection: to ep, to iface
# device priority: 50
# device num paths: 1
# max eps: inf
# device address: 3 bytes
# iface address: 3 bytes
# ep address: 6 bytes
# error handling: peer failure, ep_check
#
#
# Transport: ud_mlx5
# Device: mlx5_2:1
# Type: network
# System device: mlx5_2 (2)
#
# capabilities:
# bandwidth: 11794.23/ppn + 0.00 MB/sec
# latency: 630 nsec
# overhead: 80 nsec
# am_short: <= 180
# am_bcopy: <= 4088
# am_zcopy: <= 4088, up to 3 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 4K
# am header: <= 132
# connection: to ep, to iface
# device priority: 50
# device num paths: 1
# max eps: inf
# device address: 3 bytes
# iface address: 3 bytes
# ep address: 6 bytes
# error handling: peer failure, ep_check
#
#
# Connection manager: rdmacm
# max_conn_priv: 54 bytes
#
# Memory domain: cma
# Component: cma
# register: unlimited, cost: 9 nsec
#
# Transport: cma
# Device: memory
# Type: intra-node
# System device: <unknown>
#
# capabilities:
# bandwidth: 0.00/ppn + 11145.00 MB/sec
# latency: 80 nsec
# overhead: 2000 nsec
# put_zcopy: unlimited, up to 16 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 1
# get_zcopy: unlimited, up to 16 iov
# get_opt_zcopy_align: <= 1
# get_align_mtu: <= 1
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 8 bytes
# iface address: 4 bytes
# error handling: peer failure, ep_check
#
#
# Memory domain: knem
# Component: knem
# register: unlimited, cost: 180 nsec
# remote key: 16 bytes
#
# Transport: knem
# Device: memory
# Type: intra-node
# System device: <unknown>
#
# capabilities:
# bandwidth: 13862.00/ppn + 0.00 MB/sec
# latency: 80 nsec
# overhead: 2000 nsec
# put_zcopy: unlimited, up to 16 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 1
# get_zcopy: unlimited, up to 16 iov
# get_opt_zcopy_align: <= 1
# get_align_mtu: <= 1
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 8 bytes
# iface address: 0 bytes
# error handling: none
#
#
# Memory domain: xpmem
# Component: xpmem
# register: unlimited, cost: 60 nsec
# remote key: 24 bytes
# rkey_ptr is supported
#
# Transport: xpmem
# Device: memory
# Type: intra-node
# System device: <unknown>
#
# capabilities:
# bandwidth: 0.00/ppn + 12179.00 MB/sec
# latency: 80 nsec
# overhead: 10 nsec
# put_short: <= 4294967295
# put_bcopy: unlimited
# get_bcopy: unlimited
# am_short: <= 100
# am_bcopy: <= 8256
# domain: cpu
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 8 bytes
# iface address: 16 bytes
# error handling: none
#
Note that the correct HCA to use is mlx5_2; while there are two other ConnectX-5 HCAs, one (mlx5_1) isn't connected and is configured for 10GbE, while the other (mlx5_0) is configured for 25GbE.
Some additional context: this appears to be related to issues in dynamically connecting processes; setting the mpi_preconnect_mpi MCA option avoids them. According to @bosilca there have been some issues in PRRTE recently regarding this. I'm not sure why it only shows up with MPI_THREAD_MULTIPLE, but I would guess it's a race between connecting the processes and attempting to communicate with them.
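For anyone hitting the same symptoms, the workaround mentioned above can be applied at launch time; the command line below is a hypothetical sketch (application name and process count are placeholders):

```shell
# Establish all pairwise connections eagerly during MPI_Init,
# instead of lazily on first communication with each peer
mpirun --mca mpi_preconnect_mpi 1 -np 2048 ./my_app
```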
Why would PRRTE be involved in this? We don't have anything to do with HCA selection.
For at least two reasons: 1. without preconnect the modex is asynchronous, and 2. the error obtained indicates the inability of the communication to get the info about the peer (which also suggests that the modex might not yet be available at the moment of the connection establishment).
Kewl - but isn't that a question of how and when OMPI does its modex? I could see if PRRTE decided when it was going to do the modex, or if PRRTE decided what info to include in it or whether to do it async or not. But PRRTE doesn't do any of that - that's all controlled by you folks during MPI_Init.
@omor1 can it be reproduced with "--mca pml ob1"? And could you please provide some simple reproducer of the problem?
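A run with the ob1 PML forced, taking UCX out of the data path, would look something like this (hypothetical launch line; binary and process count are placeholders):

```shell
# Select the ob1 point-to-point messaging layer instead of pml/ucx
mpirun --mca pml ob1 -np 2048 ./my_app
```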
Background information
What version of Open MPI are you using?
Describe how Open MPI was installed
Built via Spack
Build information
-O3 and LTO for both Open MPI and UCX
Please describe the system on which you are running
SDSC Expanse:
Details of the issue
I am running a Tiled Low-Rank Cholesky code with PaRSEC, using Open MPI as the communication backend. The configuration is somewhat atypical and rather inefficient for the problem size and scale, which can result in a very large number of communications.
When using MPI_THREAD_MULTIPLE to let the compute threads send remote task activations, I intermittently run into issues with UCX. This does not appear to occur when funneling all communications through a single thread. There are up to 127 compute threads per node, with one core reserved to process incoming communications and bulk data transfers. The remote task activations are fairly short messages sent using MPI_Send that are matched by tag and wildcard source at the receiver with a persistent receive. The error I receive is below:
My first guesses for why this is intermittent are that it is some sort of race with delayed initialization, or that many threads attempting to communicate concurrently trigger some issue.
Open MPI MCA parameters
UCX parameters
The default parameters appeared to make some strange choices about when to switch between different protocols; this set empirically improves performance by a couple of percent, though not very significantly, at least for the funneled communication mode. The only parameter that is strictly required is UCX_IB_RCACHE_MAX_REGIONS; see openucx/ucx#6264 for details.
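As a concrete sketch, the one strictly required setting can be exported before launch (the value is the one quoted earlier in this thread; adapt it to your HCA's limits):

```shell
# Cap registration-cache entries so the rcache cannot exceed the
# HCA's hardware registration limit (see openucx/ucx#6264)
export UCX_IB_RCACHE_MAX_REGIONS="16777216"
echo "$UCX_IB_RCACHE_MAX_REGIONS"
```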