open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Mixing UCX pml and osc performance degradation #9080

devreal opened this issue 3 years ago

devreal commented 3 years ago

I have been working on an experimental extension to MPI that is built on top of osc/ucx. Unfortunately, during experiments with PaRSEC I found that the combination of osc/ucx and pml/ucx leads to a significant performance degradation. I have not been able to reproduce this with a small example but I have been able to observe this behavior in a modified PaRSEC runtime where the data payload is transferred through both P2P and RMA operations.

In the PaRSEC runtime, a single thread manages the communication of all task inputs/outputs by sending active messages and posting nonblocking sends and receives. The active messages may potentially lead to a high number of unexpected messages. In the figure below, this is PaRSEC vanilla. A certain loss in per-node performance is expected for such a relatively small Cholesky factorization with small tiles, as more nodes require more active messages and communication becomes dominant.

I have modified the runtime such that the sent payload is also put into a preallocated window, essentially doubling the amount of data transferred (PaRSEC P2P+RMA). This leads to a significant drop in performance at 64 nodes when using pml/ucx. With pml/ob1 (probably using btl/tcp) the baseline is much lower, but the slope matches that of vanilla PaRSEC, outperforming pml/ucx at 64 nodes.

To make sure this is not a bandwidth issue, I used a second modified version in which the data is sent twice using send/recv (PaRSEC 2xP2P). In this case, the performance is mostly similar to the vanilla runtime. This leads me to conclude that this is not a bandwidth issue.
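
Roughly, the two mirrored variants follow the pattern sketched below (illustrative C with made-up function names, not the actual PaRSEC code; the RMA variant assumes a shared lock is already held on the preallocated window, e.g. via MPI_Win_lock_all):

#include <mpi.h>

/* "P2P+RMA": the regular nonblocking send plus a mirrored MPI_Put into a
 * preallocated window. */
void send_output_rma_mirror(const void *payload, int count, int dest, int tag,
                            MPI_Comm comm, MPI_Win win, MPI_Aint disp,
                            MPI_Request *req)
{
    MPI_Isend(payload, count, MPI_BYTE, dest, tag, comm, req);
    MPI_Put(payload, count, MPI_BYTE, dest, disp, count, MPI_BYTE, win);
    MPI_Win_flush(dest, win);   /* complete the mirrored put before reusing the buffer */
}

/* "2xP2P": the same payload is simply sent a second time over point-to-point. */
void send_output_p2p_mirror(const void *payload, int count, int dest, int tag,
                            MPI_Comm comm, MPI_Request req[2])
{
    MPI_Isend(payload, count, MPI_BYTE, dest, tag, comm, &req[0]);
    MPI_Isend(payload, count, MPI_BYTE, dest, tag + 1, comm, &req[1]);
}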

[figure dpotrf_1222148-1: dpotrf performance of PaRSEC vanilla, P2P+RMA, and 2xP2P across node counts]

I manually instrumented osc/ucx to warn about long calls into UCX and found that calls to ucp_put_nbi repeatedly took several milliseconds (up to approx. 20 ms for a single call), which is likely the reason for the performance drop, since other processes starve while one is stuck in a put. However, I would like to know what is happening there, as I want to combine P2P and RMA going forward. My working theory is that the unexpected messages are somehow handled inside the call to put. Interestingly though, the effect seems to get worse when the transfer size is increased (e.g., by increasing the tile size from 128 to 320). Any idea how I can find out what is happening inside UCX?
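
For reference, instrumentation of this kind could look like the following minimal sketch (the macro name and the 1 ms threshold are hypothetical, not the actual patch); it times a single UCX call and warns when it exceeds the threshold:

#include <stdio.h>
#include <time.h>

#define WARN_THRESHOLD_MS 1.0

/* wrap a UCX call and warn if it takes longer than the threshold */
#define TIMED_UCX_CALL(call)                                                \
    do {                                                                    \
        struct timespec ts_, te_;                                           \
        clock_gettime(CLOCK_MONOTONIC, &ts_);                               \
        (void)(call);                                                       \
        clock_gettime(CLOCK_MONOTONIC, &te_);                               \
        double ms_ = (te_.tv_sec - ts_.tv_sec) * 1e3 +                      \
                     (te_.tv_nsec - ts_.tv_nsec) / 1e6;                     \
        if (ms_ > WARN_THRESHOLD_MS) {                                      \
            fprintf(stderr, "long UCX call %s: %.3f ms\n", #call, ms_);     \
        }                                                                   \
    } while (0)

/* e.g. TIMED_UCX_CALL(ucp_put_nbi(ep, buf, len, remote_addr, rkey)); */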

The system under test is an HPE Apollo with dual-socket 64-core AMD Rome processors and an HDR200 interconnect. Everything is compiled with GCC 10.2.0.

This was tested with a recent Open MPI master (from two weeks ago; I have not seen any changes since then that may have had an impact here). I have observed similar behavior with 4.1.1.

UCX 1.10.0 configuration:

$ ~/opt-hawk/ucx-1.10.0/bin/ucx_info -d
#
# Memory domain: mlx5_1
#     Component: ib
#             register: unlimited, cost: 180 nsec
#           remote key: 8 bytes
#           local memory handle is required for zcopy
#
#      Transport: rc_verbs
#         Device: mlx5_1:1
#  System device: 0000:a3:00.0 (0)
#
#      capabilities:
#            bandwidth: 23588.47/ppn + 0.00 MB/sec
#              latency: 600 + 1.000 * N nsec
#             overhead: 75 nsec
#            put_short: <= 124
#            put_bcopy: <= 8256
#            put_zcopy: <= 1G, up to 8 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 4K
#            get_bcopy: <= 8256
#            get_zcopy: 65..1G, up to 8 iov
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 4K
#             am_short: <= 123
#             am_bcopy: <= 8255
#             am_zcopy: <= 8255, up to 7 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 127
#               domain: device
#           atomic_add: 64 bit
#          atomic_fadd: 64 bit
#         atomic_cswap: 64 bit
#           connection: to ep
#      device priority: 50
#     device num paths: 1
#              max eps: 256
#       device address: 11 bytes
#           ep address: 17 bytes
#       error handling: peer failure
#
#
#      Transport: rc_mlx5
#         Device: mlx5_1:1
#  System device: 0000:a3:00.0 (0)
#
#      capabilities:
#            bandwidth: 23588.47/ppn + 0.00 MB/sec
#              latency: 600 + 1.000 * N nsec
#             overhead: 40 nsec
#            put_short: <= 2K
#            put_bcopy: <= 8256
#            put_zcopy: <= 1G, up to 14 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 4K
#            get_bcopy: <= 8256
#            get_zcopy: 65..1G, up to 14 iov
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 4K
#             am_short: <= 2046
#             am_bcopy: <= 8254
#             am_zcopy: <= 8254, up to 3 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 186
#               domain: device
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to ep
#      device priority: 50
#     device num paths: 1
#              max eps: 256
#       device address: 11 bytes
#           ep address: 7 bytes
#       error handling: buffer (zcopy), remote access, peer failure
#
#
#      Transport: dc_mlx5
#         Device: mlx5_1:1
#  System device: 0000:a3:00.0 (0)
#
#      capabilities:
#            bandwidth: 23588.47/ppn + 0.00 MB/sec
#              latency: 660 nsec
#             overhead: 40 nsec
#            put_short: <= 2K
#            put_bcopy: <= 8256
#            put_zcopy: <= 1G, up to 11 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 4K
#            get_bcopy: <= 8256
#            get_zcopy: 65..1G, up to 11 iov
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 4K
#             am_short: <= 2046
#             am_bcopy: <= 8254
#             am_zcopy: <= 8254, up to 3 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 138
#               domain: device
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 50
#     device num paths: 1
#              max eps: inf
#       device address: 11 bytes
#        iface address: 5 bytes
#       error handling: buffer (zcopy), remote access, peer failure
#
#
#      Transport: ud_verbs
#         Device: mlx5_1:1
#  System device: 0000:a3:00.0 (0)
#
#      capabilities:
#            bandwidth: 23588.47/ppn + 0.00 MB/sec
#              latency: 630 nsec
#             overhead: 105 nsec
#             am_short: <= 116
#             am_bcopy: <= 4088
#             am_zcopy: <= 4088, up to 7 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 3952
#           connection: to ep, to iface
#      device priority: 50
#     device num paths: 1
#              max eps: inf
#       device address: 11 bytes
#        iface address: 3 bytes
#           ep address: 6 bytes
#       error handling: peer failure
#
#
#      Transport: ud_mlx5
#         Device: mlx5_1:1
#  System device: 0000:a3:00.0 (0)
#
#      capabilities:
#            bandwidth: 23588.47/ppn + 0.00 MB/sec
#              latency: 630 nsec
#             overhead: 80 nsec
#             am_short: <= 180
#             am_bcopy: <= 4088
#             am_zcopy: <= 4088, up to 3 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 132
#           connection: to ep, to iface
#      device priority: 50
#     device num paths: 1
#              max eps: inf
#       device address: 11 bytes
#        iface address: 3 bytes
#           ep address: 6 bytes
#       error handling: peer failure
#
#
# Memory domain: mlx5_0
#     Component: ib
#             register: unlimited, cost: 180 nsec
#           remote key: 8 bytes
#           local memory handle is required for zcopy
#
#      Transport: rc_verbs
#         Device: mlx5_0:1
#  System device: 0000:43:00.0 (1)
#
#      capabilities:
#            bandwidth: 23588.47/ppn + 0.00 MB/sec
#              latency: 600 + 1.000 * N nsec
#             overhead: 75 nsec
#            put_short: <= 124
#            put_bcopy: <= 8256
#            put_zcopy: <= 1G, up to 8 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 4K
#            get_bcopy: <= 8256
#            get_zcopy: 65..1G, up to 8 iov
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 4K
#             am_short: <= 123
#             am_bcopy: <= 8255
#             am_zcopy: <= 8255, up to 7 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 127
#               domain: device
#           atomic_add: 64 bit
#          atomic_fadd: 64 bit
#         atomic_cswap: 64 bit
#           connection: to ep
#      device priority: 50
#     device num paths: 1
#              max eps: 256
#       device address: 11 bytes
#           ep address: 17 bytes
#       error handling: peer failure
#
#
#      Transport: rc_mlx5
#         Device: mlx5_0:1
#  System device: 0000:43:00.0 (1)
#
#      capabilities:
#            bandwidth: 23588.47/ppn + 0.00 MB/sec
#              latency: 600 + 1.000 * N nsec
#             overhead: 40 nsec
#            put_short: <= 2K
#            put_bcopy: <= 8256
#            put_zcopy: <= 1G, up to 14 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 4K
#            get_bcopy: <= 8256
#            get_zcopy: 65..1G, up to 14 iov
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 4K
#             am_short: <= 2046
#             am_bcopy: <= 8254
#             am_zcopy: <= 8254, up to 3 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 186
#               domain: device
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to ep
#      device priority: 50
#     device num paths: 1
#              max eps: 256
#       device address: 11 bytes
#           ep address: 7 bytes
#       error handling: buffer (zcopy), remote access, peer failure
#
#
#      Transport: dc_mlx5
#         Device: mlx5_0:1
#  System device: 0000:43:00.0 (1)
#
#      capabilities:
#            bandwidth: 23588.47/ppn + 0.00 MB/sec
#              latency: 660 nsec
#             overhead: 40 nsec
#            put_short: <= 2K
#            put_bcopy: <= 8256
#            put_zcopy: <= 1G, up to 11 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 4K
#            get_bcopy: <= 8256
#            get_zcopy: 65..1G, up to 11 iov
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 4K
#             am_short: <= 2046
#             am_bcopy: <= 8254
#             am_zcopy: <= 8254, up to 3 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 138
#               domain: device
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 50
#     device num paths: 1
#              max eps: inf
#       device address: 11 bytes
#        iface address: 5 bytes
#       error handling: buffer (zcopy), remote access, peer failure
#
#
#      Transport: ud_verbs
#         Device: mlx5_0:1
#  System device: 0000:43:00.0 (1)
#
#      capabilities:
#            bandwidth: 23588.47/ppn + 0.00 MB/sec
#              latency: 630 nsec
#             overhead: 105 nsec
#             am_short: <= 116
#             am_bcopy: <= 4088
#             am_zcopy: <= 4088, up to 7 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 3952
#           connection: to ep, to iface
#      device priority: 50
#     device num paths: 1
#              max eps: inf
#       device address: 11 bytes
#        iface address: 3 bytes
#           ep address: 6 bytes
#       error handling: peer failure
#
#
#      Transport: ud_mlx5
#         Device: mlx5_0:1
#  System device: 0000:43:00.0 (1)
#
#      capabilities:
#            bandwidth: 23588.47/ppn + 0.00 MB/sec
#              latency: 630 nsec
#             overhead: 80 nsec
#             am_short: <= 180
#             am_bcopy: <= 4088
#             am_zcopy: <= 4088, up to 3 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 132
#           connection: to ep, to iface
#      device priority: 50
#     device num paths: 1
#              max eps: inf
#       device address: 11 bytes
#        iface address: 3 bytes
#           ep address: 6 bytes
#       error handling: peer failure
#
#
# Memory domain: rdmacm
#     Component: rdmacm
#           supports client-server connection establishment via sockaddr
#   < no supported devices found >
#
# Connection manager: rdmacm
#      max_conn_priv: 54 bytes
#
# Memory domain: cma
#     Component: cma
#             register: unlimited, cost: 9 nsec
#
#      Transport: cma
#         Device: memory
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 11145.00 MB/sec
#              latency: 80 nsec
#             overhead: 400 nsec
#            put_zcopy: unlimited, up to 16 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 1
#            get_zcopy: unlimited, up to 16 iov
#  get_opt_zcopy_align: <= 1
#        get_align_mtu: <= 1
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 4 bytes
#       error handling: none
#
$ ~/opt-hawk/ucx-1.10.0/bin/ucx_info -b
#define UCX_CONFIG_H              
#define ENABLE_ASSERT             1
#define ENABLE_BUILTIN_MEMCPY     1
#define ENABLE_DEBUG_DATA         0
#define ENABLE_MT                 1
#define ENABLE_PARAMS_CHECK       1
#define ENABLE_STATS              1
#define ENABLE_SYMBOL_OVERRIDE    1
#define HAVE_ALLOCA               1
#define HAVE_ALLOCA_H             1
#define HAVE_ATTRIBUTE_NOOPTIMIZE 1
#define HAVE_CLEARENV             1
#define HAVE_CPU_SET_T            1
#define HAVE_DC_EXP               1
#define HAVE_DECL_ASPRINTF        1
#define HAVE_DECL_BASENAME        1
#define HAVE_DECL_CPU_ISSET       1
#define HAVE_DECL_CPU_ZERO        1
#define HAVE_DECL_ETHTOOL_CMD_SPEED 1
#define HAVE_DECL_FMEMOPEN        1
#define HAVE_DECL_F_SETOWN_EX     1
#define HAVE_DECL_IBV_ACCESS_ON_DEMAND 1
#define HAVE_DECL_IBV_ACCESS_RELAXED_ORDERING 0
#define HAVE_DECL_IBV_ADVISE_MR   0
#define HAVE_DECL_IBV_ALLOC_DM    0
#define HAVE_DECL_IBV_ALLOC_TD    0
#define HAVE_DECL_IBV_CMD_MODIFY_QP 1
#define HAVE_DECL_IBV_CREATE_CQ_ATTR_IGNORE_OVERRUN 0
#define HAVE_DECL_IBV_CREATE_QP_EX 1
#define HAVE_DECL_IBV_CREATE_SRQ  1
#define HAVE_DECL_IBV_CREATE_SRQ_EX 1
#define HAVE_DECL_IBV_EVENT_GID_CHANGE 1
#define HAVE_DECL_IBV_EVENT_TYPE_STR 1
#define HAVE_DECL_IBV_EXP_ACCESS_ALLOCATE_MR 1
#define HAVE_DECL_IBV_EXP_ACCESS_ON_DEMAND 1
#define HAVE_DECL_IBV_EXP_ALLOC_DM 1
#define HAVE_DECL_IBV_EXP_ATOMIC_HCA_REPLY_BE 1
#define HAVE_DECL_IBV_EXP_CQ_IGNORE_OVERRUN 1
#define HAVE_DECL_IBV_EXP_CQ_MODERATION 1
#define HAVE_DECL_IBV_EXP_CREATE_QP 1
#define HAVE_DECL_IBV_EXP_CREATE_RES_DOMAIN 1
#define HAVE_DECL_IBV_EXP_CREATE_SRQ 1
#define HAVE_DECL_IBV_EXP_DCT_OOO_RW_DATA_PLACEMENT 1
#define HAVE_DECL_IBV_EXP_DESTROY_RES_DOMAIN 1
#define HAVE_DECL_IBV_EXP_DEVICE_ATTR_PCI_ATOMIC_CAPS 1
#define HAVE_DECL_IBV_EXP_DEVICE_ATTR_RESERVED_2 1
#define HAVE_DECL_IBV_EXP_DEVICE_DC_TRANSPORT 1
#define HAVE_DECL_IBV_EXP_DEVICE_MR_ALLOCATE 1
#define HAVE_DECL_IBV_EXP_MR_FIXED_BUFFER_SIZE 1
#define HAVE_DECL_IBV_EXP_MR_INDIRECT_KLMS 1
#define HAVE_DECL_IBV_EXP_ODP_SUPPORT_IMPLICIT 1
#define HAVE_DECL_IBV_EXP_POST_SEND 1
#define HAVE_DECL_IBV_EXP_PREFETCH_MR 1
#define HAVE_DECL_IBV_EXP_PREFETCH_WRITE_ACCESS 1
#define HAVE_DECL_IBV_EXP_QPT_DC_INI 1
#define HAVE_DECL_IBV_EXP_QP_CREATE_UMR 1
#define HAVE_DECL_IBV_EXP_QP_INIT_ATTR_ATOMICS_ARG 1
#define HAVE_DECL_IBV_EXP_QP_INIT_ATTR_RES_DOMAIN 1
#define HAVE_DECL_IBV_EXP_QP_OOO_RW_DATA_PLACEMENT 1
#define HAVE_DECL_IBV_EXP_QUERY_DEVICE 1
#define HAVE_DECL_IBV_EXP_QUERY_GID_ATTR 1
#define HAVE_DECL_IBV_EXP_REG_MR  1
#define HAVE_DECL_IBV_EXP_RES_DOMAIN_THREAD_MODEL 1
#define HAVE_DECL_IBV_EXP_SEND_EXT_ATOMIC_INLINE 1
#define HAVE_DECL_IBV_EXP_SETENV  1
#define HAVE_DECL_IBV_EXP_WR_EXT_MASKED_ATOMIC_CMP_AND_SWP 1
#define HAVE_DECL_IBV_EXP_WR_EXT_MASKED_ATOMIC_FETCH_AND_ADD 1
#define HAVE_DECL_IBV_EXP_WR_NOP  1
#define HAVE_DECL_IBV_GET_ASYNC_EVENT 1
#define HAVE_DECL_IBV_GET_DEVICE_NAME 1
#define HAVE_DECL_IBV_LINK_LAYER_ETHERNET 1
#define HAVE_DECL_IBV_LINK_LAYER_INFINIBAND 1
#define HAVE_DECL_IBV_MLX5_EXP_GET_CQ_INFO 1
#define HAVE_DECL_IBV_MLX5_EXP_GET_QP_INFO 1
#define HAVE_DECL_IBV_MLX5_EXP_GET_SRQ_INFO 1
#define HAVE_DECL_IBV_MLX5_EXP_UPDATE_CQ_CI 1
#define HAVE_DECL_IBV_ODP_SUPPORT_IMPLICIT 0
#define HAVE_DECL_IBV_QPF_GRH_REQUIRED 0
#define HAVE_DECL_IBV_QUERY_DEVICE_EX 1
#define HAVE_DECL_IBV_QUERY_GID   1
#define HAVE_DECL_IBV_WC_STATUS_STR 1
#define HAVE_DECL_IPPROTO_TCP     1
#define HAVE_DECL_MADV_FREE       1
#define HAVE_DECL_MADV_REMOVE     1
#define HAVE_DECL_MLX5_WQE_CTRL_SOLICITED 1
#define HAVE_DECL_POSIX_MADV_DONTNEED 1
#define HAVE_DECL_PR_SET_PTRACER  1
#define HAVE_DECL_RDMA_ESTABLISH  1
#define HAVE_DECL_RDMA_INIT_QP_ATTR 1
#define HAVE_DECL_SOL_SOCKET      1
#define HAVE_DECL_SO_KEEPALIVE    1
#define HAVE_DECL_SPEED_UNKNOWN   1
#define HAVE_DECL_STRERROR_R      1
#define HAVE_DECL_SYS_BRK         1
#define HAVE_DECL_SYS_IPC         0
#define HAVE_DECL_SYS_MADVISE     1
#define HAVE_DECL_SYS_MMAP        1
#define HAVE_DECL_SYS_MREMAP      1
#define HAVE_DECL_SYS_MUNMAP      1
#define HAVE_DECL_SYS_SHMAT       1
#define HAVE_DECL_SYS_SHMDT       1
#define HAVE_DECL_TCP_KEEPCNT     1
#define HAVE_DECL_TCP_KEEPIDLE    1
#define HAVE_DECL_TCP_KEEPINTVL   1
#define HAVE_DECL___PPC_GET_TIMEBASE_FREQ 0
#define HAVE_DLFCN_H              1
#define HAVE_EXP_UMR              1
#define HAVE_EXP_UMR_KSM          1
#define HAVE_HW_TIMER             1
#define HAVE_IB                   1
#define HAVE_IBV_DM               1
#define HAVE_IBV_EXP_DM           1
#define HAVE_IBV_EXP_QP_CREATE_UMR 1
#define HAVE_IBV_EXP_RES_DOMAIN   1
#define HAVE_IB_EXT_ATOMICS       1
#define HAVE_IN6_ADDR_S6_ADDR32   1
#define HAVE_INFINIBAND_MLX5_HW_H 1
#define HAVE_INTTYPES_H           1
#define HAVE_IP_IP_DST            1
#define HAVE_LIBGEN_H             1
#define HAVE_LIBRT                1
#define HAVE_LINUX_FUTEX_H        1
#define HAVE_LINUX_IP_H           1
#define HAVE_LINUX_MMAN_H         1
#define HAVE_MALLOC_H             1
#define HAVE_MALLOC_HOOK          1
#define HAVE_MALLOC_TRIM          1
#define HAVE_MASKED_ATOMICS_ENDIANNESS 1
#define HAVE_MEMALIGN             1
#define HAVE_MEMORY_H             1
#define HAVE_MLX5_HW              1
#define HAVE_MLX5_HW_UD           1
#define HAVE_MREMAP               1
#define HAVE_NETINET_IP_H         1
#define HAVE_NET_ETHERNET_H       1
#define HAVE_NUMA                 1
#define HAVE_NUMAIF_H             1
#define HAVE_NUMA_H               1
#define HAVE_ODP                  1
#define HAVE_ODP_IMPLICIT         1
#define HAVE_POSIX_MEMALIGN       1
#define HAVE_PREFETCH             1
#define HAVE_PROFILING            1
#define HAVE_RDMACM_QP_LESS       1
#define HAVE_SCHED_GETAFFINITY    1
#define HAVE_SCHED_SETAFFINITY    1
#define HAVE_SIGACTION_SA_RESTORER 1
#define HAVE_SIGEVENT_SIGEV_UN_TID 1
#define HAVE_SIGHANDLER_T         1
#define HAVE_STDINT_H             1
#define HAVE_STDLIB_H             1
#define HAVE_STRERROR_R           1
#define HAVE_STRINGS_H            1
#define HAVE_STRING_H             1
#define HAVE_STRUCT_BITMASK       1
#define HAVE_STRUCT_IBV_ASYNC_EVENT_ELEMENT_DCT 1
#define HAVE_STRUCT_IBV_EXP_CREATE_SRQ_ATTR_DC_OFFLOAD_PARAMS 1
#define HAVE_STRUCT_IBV_EXP_DEVICE_ATTR_EXP_DEVICE_CAP_FLAGS 1
#define HAVE_STRUCT_IBV_EXP_DEVICE_ATTR_ODP_CAPS 1
#define HAVE_STRUCT_IBV_EXP_DEVICE_ATTR_ODP_CAPS_PER_TRANSPORT_CAPS_DC_ODP_CAPS 1
#define HAVE_STRUCT_IBV_EXP_DEVICE_ATTR_ODP_MR_MAX_SIZE 1
#define HAVE_STRUCT_IBV_EXP_QP_INIT_ATTR_MAX_INL_RECV 1
#define HAVE_STRUCT_IBV_MLX5_QP_INFO_BF_NEED_LOCK 1
#define HAVE_STRUCT_MLX5_AH_IBV_AH 1
#define HAVE_STRUCT_MLX5_CQE64_IB_STRIDE_INDEX 1
#define HAVE_STRUCT_MLX5_GRH_AV_RMAC 1
#define HAVE_STRUCT_MLX5_SRQ_CMD_QP 1
#define HAVE_STRUCT_MLX5_WQE_AV_BASE 1
#define HAVE_SYS_EPOLL_H          1
#define HAVE_SYS_EVENTFD_H        1
#define HAVE_SYS_STAT_H           1
#define HAVE_SYS_TYPES_H          1
#define HAVE_SYS_UIO_H            1
#define HAVE_TL_DC                1
#define HAVE_TL_RC                1
#define HAVE_TL_UD                1
#define HAVE_UCM_PTMALLOC286      1
#define HAVE_UNISTD_H             1
#define HAVE_VERBS_EXP_H          1
#define HAVE___CLEAR_CACHE        1
#define HAVE___CURBRK             1
#define HAVE___SIGHANDLER_T       1
#define IBV_HW_TM                 1
#define LT_OBJDIR                 ".libs/"
#define NVALGRIND                 1
#define PACKAGE                   "ucx"
#define PACKAGE_BUGREPORT         ""
#define PACKAGE_NAME              "ucx"
#define PACKAGE_STRING            "ucx 1.10"
#define PACKAGE_TARNAME           "ucx"
#define PACKAGE_URL               ""
#define PACKAGE_VERSION           "1.10"
#define STDC_HEADERS              1
#define STRERROR_R_CHAR_P         1
#define UCM_BISTRO_HOOKS          1
#define UCS_MAX_LOG_LEVEL         UCS_LOG_LEVEL_INFO
#define UCT_TCP_EP_KEEPALIVE      1
#define UCT_UD_EP_DEBUG_HOOKS     0
#define UCX_CONFIGURE_FLAGS       "--prefix=/zhome/academic/HLRS/hlrs/hpcjschu/opt-hawk/ucx-1.10.0 --enable-profiling --enable-stats --enable-mt --disable-backtrace-detail --without-mpi --enable-mt --enable-cma --without-cuda --without-gdrcopy --with-verbs --with-rdmacm --disable-logging CC=gcc CXX=g++"
#define UCX_MODULE_SUBDIR         "ucx"
#define VERSION                   "1.10"
#define restrict                  __restrict
#define test_MODULES              ":module"
#define ucm_MODULES               ""
#define uct_MODULES               ":ib:rdmacm:cma"
#define uct_cuda_MODULES          ""
#define uct_ib_MODULES            ""
#define uct_rocm_MODULES          ""
#define ucx_perftest_MODULES      ""
yosefe commented 3 years ago

Can you please try setting -mca pml_ucx_progress_iterations 1?

devreal commented 3 years ago

I will rerun the experiments with pml_ucx_progress_iterations 1, but I don't see how that can help with the long calls into ucp_put_nbi, since there is only one thread calling into MPI and the puts don't require anything else to make progress. Will report back :)

yosefe commented 3 years ago

Potentially, put_nbi takes longer because of internal UCX request memory allocation (you would also see memory usage grow), because osc/ucx does not progress frequently enough to release completed put_nbi operations. Setting pml_ucx_progress_iterations 1 should make pml/ucx yield more progress cycles to osc/ucx.
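
Conceptually, the theory amounts to the sketch below (illustrative only, not Open MPI's osc/ucx code; endpoint, worker, and rkey are assumed to be set up already): if the worker is progressed between puts, completed operations and their internal requests are recycled instead of piling up.

#include <ucp/api/ucp.h>

void put_many(ucp_ep_h ep, ucp_worker_h worker, ucp_rkey_h rkey,
              const void *buf, size_t len, uint64_t remote_addr, int n)
{
    for (int i = 0; i < n; i++) {
        ucs_status_t status = ucp_put_nbi(ep, buf, len, remote_addr, rkey);
        (void)status;   /* UCS_OK or UCS_INPROGRESS in the common case */

        /* drain completions periodically so internal requests and TX
         * resources are released instead of accumulating */
        if ((i % 16) == 0) {
            while (ucp_worker_progress(worker) != 0) {}
        }
    }
}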

devreal commented 3 years ago

Sorry for the delay, I had to do some more experiments. First, setting pml_ucx_progress_iterations 1 does not make a difference; the results are the same.

I ran with UCX profiling enabled and had a hard time interpreting the information: the aggregated output only provides averages, not maximums, which makes it hard to find outliers. I also ran with log-mode profiling enabled, but I'm not sure how to interpret the output of ucx_read_prof for log profiles. Assuming that the number following the function name is the runtime in milliseconds, I see several high runtimes between 4 and 6 ms:

     0.013  uct_ep_put_zcopy 6.702 {                             rma_basic.c:63   ucp_rma_basic_progress_put()
     6.702  }

This is the maximum I could find in the traces. Unfortunately, it does not say what is happening there.

Is my interpretation of the profiling output correct here? Any chance I could get more detailed information out of a run? I don't mind adding additional instrumentation points but I'm not sure where to put them :)

yosefe commented 3 years ago

Yes, the output interpretation is correct. Which UCT transport(s) are being used (you can add UCX_LOG_LEVEL=info to print them)? I would not expect it to take this long for an IB transport, since it doesn't wait for anything.

devreal commented 3 years ago

Here is the output of UCX_LOG_LEVEL=info with PaRSEC on 4 nodes. Anything that strikes you?

[1624984398.083715] [r39c2t5n1:22453:0]     ucp_worker.c:1720 UCX  INFO  ep_cfg[0]: tag(self/memory rc_mlx5/mlx5_0:1 rc_mlx5/mlx5_1:1);
[1624984398.083927] [r39c2t5n3:22452:0]     ucp_worker.c:1720 UCX  INFO  ep_cfg[0]: tag(self/memory rc_mlx5/mlx5_0:1 rc_mlx5/mlx5_1:1);
[1624984398.083977] [r39c2t5n4:22213:0]     ucp_worker.c:1720 UCX  INFO  ep_cfg[0]: tag(self/memory rc_mlx5/mlx5_0:1 rc_mlx5/mlx5_1:1);
[1624984398.083983] [r39c2t5n2:22162:0]     ucp_worker.c:1720 UCX  INFO  ep_cfg[0]: tag(self/memory rc_mlx5/mlx5_0:1 rc_mlx5/mlx5_1:1);
[1624984398.153330] [r39c2t5n4:22213:0]     ucp_worker.c:1720 UCX  INFO  ep_cfg[1]: tag(rc_mlx5/mlx5_0:1 rc_mlx5/mlx5_1:1);
[1624984398.154154] [r39c2t5n2:22162:0]     ucp_worker.c:1720 UCX  INFO  ep_cfg[1]: tag(rc_mlx5/mlx5_0:1 rc_mlx5/mlx5_1:1);
[1624984398.156997] [r39c2t5n3:22452:1]     ucp_worker.c:1720 UCX  INFO  ep_cfg[1]: tag(rc_mlx5/mlx5_0:1 rc_mlx5/mlx5_1:1);
[1624984398.157904] [r39c2t5n1:22453:1]     ucp_worker.c:1720 UCX  INFO  ep_cfg[1]: tag(rc_mlx5/mlx5_0:1 rc_mlx5/mlx5_1:1);
[1624984398.162001] [r39c2t5n3:22452:0]     ucp_worker.c:1720 UCX  INFO  ep_cfg[2]: tag(rc_mlx5/mlx5_0:1 rc_mlx5/mlx5_1:1);
[1624984398.171165] [r39c2t5n4:22213:1]     ucp_worker.c:1720 UCX  INFO  ep_cfg[2]: tag(rc_mlx5/mlx5_0:1 rc_mlx5/mlx5_1:1);
[1624984398.254931] [r39c2t5n1:22453:2]     ucp_worker.c:1720 UCX  INFO  ep_cfg[2]: tag(rc_mlx5/mlx5_0:1 rc_mlx5/mlx5_1:1);
[1624984398.264687] [r39c2t5n1:22453:2]     ucp_worker.c:1720 UCX  INFO  ep_cfg[0]: rma(self/memory posix/memory sysv/memory); amo(rc_mlx5/mlx5_1:1);
[1624984398.264696] [r39c2t5n4:22213:2]     ucp_worker.c:1720 UCX  INFO  ep_cfg[0]: rma(rc_mlx5/mlx5_1:1); amo(rc_mlx5/mlx5_1:1);
[1624984398.264683] [r39c2t5n2:22162:1]     ucp_worker.c:1720 UCX  INFO  ep_cfg[0]: rma(rc_mlx5/mlx5_1:1); amo(rc_mlx5/mlx5_1:1);
[1624984398.264686] [r39c2t5n3:22452:2]     ucp_worker.c:1720 UCX  INFO  ep_cfg[0]: rma(rc_mlx5/mlx5_1:1); amo(rc_mlx5/mlx5_1:1);
[1624984398.273323] [r39c2t5n1:22453:2]     ucp_worker.c:1720 UCX  INFO  ep_cfg[1]: rma(rc_mlx5/mlx5_1:1); amo(rc_mlx5/mlx5_1:1);
[1624984398.279303] [r39c2t5n2:22162:1]     ucp_worker.c:1720 UCX  INFO  ep_cfg[1]: rma(self/memory posix/memory sysv/memory); amo(rc_mlx5/mlx5_1:1);
[1624984398.280913] [r39c2t5n2:22162:1]     ucp_worker.c:1720 UCX  INFO  ep_cfg[2]: rma(rc_mlx5/mlx5_1:1); amo(rc_mlx5/mlx5_1:1);
#+++++ cores detected       : 32
#+++++ nodes x cores + gpu  : 4 x 32 + 0 (128+0)
#+++++ thread mode          : THREAD_SERIALIZED
#+++++ P x Q                : 2 x 2 (4/4)
#+++++ M x N x K|NRHS       : 30720 x 30720 x 1
#+++++ MB x NB              : 128 x 128
[1624984398.284071] [r39c2t5n3:22452:2]     ucp_worker.c:1720 UCX  INFO  ep_cfg[1]: rma(self/memory posix/memory sysv/memory); amo(rc_mlx5/mlx5_1:1);
[1624984398.285763] [r39c2t5n3:22452:2]     ucp_worker.c:1720 UCX  INFO  ep_cfg[2]: rma(rc_mlx5/mlx5_1:1); amo(rc_mlx5/mlx5_1:1);
[1624984398.287430] [r39c2t5n4:22213:2]     ucp_worker.c:1720 UCX  INFO  ep_cfg[1]: rma(self/memory posix/memory sysv/memory); amo(rc_mlx5/mlx5_1:1);
[****] TIME(s)      3.66817 : dpotrf    PxQ=   2 2   NB=  128 N=   30720 :    2634.596259 gflops - ENQ&PROG&DEST      3.66848 :    2634.377315 gflops - ENQ      0.00006 - DEST      0.00024
[1624984402.377128] [r39c2t5n3:22452:0]     ucp_worker.c:1720 UCX  INFO  ep_cfg[0]: rma(rc_mlx5/mlx5_0:1); amo(rc_mlx5/mlx5_0:1);
[1624984402.377731] [r39c2t5n2:22162:0]     ucp_worker.c:1720 UCX  INFO  ep_cfg[0]: rma(rc_mlx5/mlx5_0:1); amo(rc_mlx5/mlx5_0:1);
[1624984402.383505] [r39c2t5n4:22213:0]     ucp_worker.c:1720 UCX  INFO  ep_cfg[0]: rma(rc_mlx5/mlx5_0:1); amo(rc_mlx5/mlx5_0:1);
[1624984402.385741] [r39c2t5n1:22453:1]     ucp_worker.c:1720 UCX  INFO  ep_cfg[2]: rma(rc_mlx5/mlx5_1:1); amo(rc_mlx5/mlx5_1:1);
[1624984402.386517] [r39c2t5n3:22452:0]     ucp_worker.c:1720 UCX  INFO  ep_cfg[1]: rma(rc_mlx5/mlx5_0:1 posix/memory sysv/memory); amo(rc_mlx5/mlx5_0:1);
[1624984402.386587] [r39c2t5n2:22162:0]     ucp_worker.c:1720 UCX  INFO  ep_cfg[1]: rma(rc_mlx5/mlx5_0:1 posix/memory sysv/memory); amo(rc_mlx5/mlx5_0:1);
[1624984402.388995] [r39c2t5n1:22453:0]     ucp_worker.c:1720 UCX  INFO  ep_cfg[0]: rma(rc_mlx5/mlx5_0:1 posix/memory sysv/memory); amo(rc_mlx5/mlx5_0:1);
[1624984402.391044] [r39c2t5n3:22452:0]     ucp_worker.c:1720 UCX  INFO  ep_cfg[3]: rma(rc_mlx5/mlx5_1:1 posix/memory sysv/memory); amo(rc_mlx5/mlx5_1:1);
[1624984402.391484] [r39c2t5n2:22162:2]     ucp_worker.c:1720 UCX  INFO  ep_cfg[3]: rma(rc_mlx5/mlx5_1:1 posix/memory sysv/memory); amo(rc_mlx5/mlx5_1:1);
[1624984402.393319] [r39c2t5n2:22162:2]     ucp_worker.c:1720 UCX  INFO  ep_cfg[4]: rma(rc_mlx5/mlx5_1:1); amo(rc_mlx5/mlx5_1:1);
[1624984402.393740] [r39c2t5n4:22213:1]     ucp_worker.c:1720 UCX  INFO  ep_cfg[2]: rma(rc_mlx5/mlx5_1:1); amo(rc_mlx5/mlx5_1:1);
[1624984402.395155] [r39c2t5n3:22452:1]     ucp_worker.c:1720 UCX  INFO  ep_cfg[4]: rma(rc_mlx5/mlx5_1:1); amo(rc_mlx5/mlx5_1:1);
[1624984402.397034] [r39c2t5n4:22213:0]     ucp_worker.c:1720 UCX  INFO  ep_cfg[1]: rma(rc_mlx5/mlx5_0:1 posix/memory sysv/memory); amo(rc_mlx5/mlx5_0:1);
[1624984402.397423] [r39c2t5n1:22453:0]     ucp_worker.c:1720 UCX  INFO  ep_cfg[1]: rma(rc_mlx5/mlx5_0:1); amo(rc_mlx5/mlx5_0:1);
[1624984402.401135] [r39c2t5n4:22213:1]     ucp_worker.c:1720 UCX  INFO  ep_cfg[3]: rma(rc_mlx5/mlx5_1:1 posix/memory sysv/memory); amo(rc_mlx5/mlx5_1:1);
[1624984402.402516] [r39c2t5n1:22453:0]     ucp_worker.c:1720 UCX  INFO  ep_cfg[3]: rma(rc_mlx5/mlx5_1:1 posix/memory sysv/memory); amo(rc_mlx5/mlx5_1:1);
devreal commented 3 years ago

I added some more instrumentation points into the UCX code but the results are more confusing than helpful:

     0.033  uct_ep_put_zcopy 4.938 {                             rma_basic.c:63   ucp_rma_basic_progress_put()
     0.020      UCT_DC_MLX5_CHECK_RES 0.053 {                   dc_mlx5_ep.c:476  uct_dc_mlx5_ep_put_zcopy()
     0.053      }
     0.007      UCT_DC_MLX5_IFACE_TXQP_GET 0.020 {              dc_mlx5_ep.c:477  uct_dc_mlx5_ep_put_zcopy()
     0.020      }
     0.007      uct_rc_mlx5_ep_fence_put 0.006 {                dc_mlx5_ep.c:478  uct_dc_mlx5_ep_put_zcopy()
     0.006      }
     0.007      uct_dc_mlx5_iface_zcopy_post 0.033 {            dc_mlx5_ep.c:481  uct_dc_mlx5_ep_put_zcopy()
     0.033      }
     4.740      UCT_TL_EP_STAT_OP 0.020 {                       dc_mlx5_ep.c:486  uct_dc_mlx5_ep_put_zcopy()
     0.020      }
     0.026  }

Note that the individual calls take only a few microseconds, but the offset of UCT_TL_EP_STAT_OP is 4.74 ms (if my interpretation is correct). I don't see anything happening between the two functions, so the only explanation I can come up with right now is that it is a scheduling issue. There is no oversubscription on the node though; I will have to investigate that further.

I am seeing similar long runtimes for other functions in the traces I have collected (ucs_pgtable_lookup, for example). Maybe I am chasing the wrong issue here (although the original observation that mixing RMA and P2P leads to poor performance is consistent).

yosefe commented 3 years ago

@devreal can you try using a UCX build in release mode (it will not support profiling, but it will make UCT_TL_EP_STAT_OP a NULL operation)?

devreal commented 3 years ago

@yosefe With release mode, do you mean --disable-debug and --disable-profiling?

yosefe commented 3 years ago

Yeah, you can use the script ./contrib/configure-release (a wrapper for ./configure), which turns off all the debug stuff.

devreal commented 3 years ago

There is no difference in release mode: I see the same behavior for the version using RMA versus the ones not using RMA.

devreal commented 3 years ago

Is my interpretation of the trace data correct that there is unaccounted time between uct_dc_mlx5_iface_zcopy_post and UCT_TL_EP_STAT_OP?

hjelmn commented 3 years ago

Can you try the same with btl/uct and osc/rdma? osc/ucx still has some serious issues that need to be resolved.

devreal commented 3 years ago

I am having a hard time using osc/rdma on that machine. I'm seeing a segfault in ompi_osc_rdma_lock_all_atomic (if that's fixed in latest master I will try to update and run again)

hjelmn commented 3 years ago

Hmmm. I can take a look. It was working. Not sure what could have broken.

hjelmn commented 3 years ago

Though I did add support for 1.10 in master. Not sure what difference that will make.

devreal commented 3 years ago

Running with osc/rdma without UCX support (probably over tcp instead), I get a segfault in MPI_Win_lock_all:

#5 main () (at 0x000000000040119d)
#4 PMPI_Win_lock_all () at build/ompi/mpi/c/profile/pwin_lock_all.c:56 (at 0x0000155554ebe9f3)
#3 ompi_osc_rdma_lock_all_atomic (mpi_assert=<optimized out>, win=0x60e260) at ompi/mca/osc/rdma/osc_rdma_passive_target.c:355 (at 0x0000155554fd9b02)
#2 ompi_osc_rdma_lock_acquire_shared () at ompi/mca/osc/rdma/osc_rdma_lock.h:311 (at 0x0000155554fd7a3c)
#1 ompi_osc_rdma_lock_btl_fop (wait_for_completion=true, result=0x7fffffffa128, operand=<optimized out>, op=1, address=<optimized out>, peer=0x8b7f50, module=0x155555301320) at ompi/mca/osc/rdma/osc_rdma_lock.h:113 (at 0x0000155554fd7a3c)
#0 ompi_osc_rdma_btl_fop (cbcontext=0x0, cbdata=0x0, cbfunc=0x0, wait_for_completion=true, result=0x7fffffffa128, flags=0, operand=<optimized out>, op=1, address_handle=<optimized out>, address=<optimized out>, endpoint=<optimized out>, btl_index=<optimized out>, module=0x155555301320) at ompi/mca/osc/rdma/osc_rdma_lock.h:75 (at 0x0000155554fd7a3c)

The issue is that selected_btl is NULL.

If I enable btl/uct by setting --mca btl_uct_memory_domains all (which seems to be required for btl/uct to work), I get an assert in a simple test app I'm using:

#20 main () (at 0x000000000040120c)
#19 PMPI_Rget () at build/ompi/mpi/c/profile/prget.c:82 (at 0x0000155554eae4b3)
#18 ompi_osc_rdma_rget (origin_addr=0xa6f070, origin_count=1000000, origin_datatype=0x4042a0, source_rank=<optimized out>, source_disp=0, source_count=1000000, source_datatype=0x4042a0, win=0x772d20, request=0x7fffffffa1e8) at ompi/mca/osc/rdma/osc_rdma_comm.c:909 (at 0x0000155554fc348b)
#17 ompi_osc_rdma_module_sync_lookup (peer=0x7fffffffa0d0, target=1, module=0x9f90b0) at ompi/mca/osc/rdma/osc_rdma.h:505 (at 0x0000155554fc348b)
#16 ompi_osc_rdma_module_peer (peer_id=1, module=0x9f90b0) at ompi/mca/osc/rdma/osc_rdma.h:364 (at 0x0000155554fc348b)
#15 ompi_osc_rdma_peer_lookup (module=module@entry=0x9f90b0, peer_id=peer_id@entry=1) at ompi/mca/osc/rdma/osc_rdma_peer.c:312 (at 0x0000155554fdaeb1)
#14 ompi_osc_rdma_peer_lookup_internal (peer_id=1, module=0x9f90b0) at ompi/mca/osc/rdma/osc_rdma_peer.c:288 (at 0x0000155554fdaeb1)
#13 ompi_osc_rdma_peer_setup (module=module@entry=0x9f90b0, peer=0xe46470) at ompi/mca/osc/rdma/osc_rdma_peer.c:161 (at 0x0000155554fda81d)
#12 ompi_osc_get_data_blocking (module=module@entry=0x9f90b0, btl_index=<optimized out>, endpoint=0xe41c40, source_address=source_address@entry=23455759704080, source_handle=source_handle@entry=0x15553835a1b8, data=data@entry=0x7fffffffa018, len=8) at ompi/mca/osc/rdma/osc_rdma_comm.c:121 (at 0x0000155554fc0195)
#11 ompi_osc_rdma_progress (module=0x155554fbaf60) at ompi/mca/osc/rdma/osc_rdma.h:409 (at 0x0000155554fc0195)
#10 opal_progress () at opal/runtime/opal_progress.c:224 (at 0x00001555544b5383)
#9 mca_btl_uct_component_progress () at opal/mca/btl/uct/btl_uct_component.c:623 (at 0x0000155554559dcf)
#8 mca_btl_uct_tl_progress (starting_index=<optimized out>, tl=<optimized out>) at opal/mca/btl/uct/btl_uct_component.c:566 (at 0x0000155554559dcf)
#7 mca_btl_uct_tl_progress (tl=<optimized out>, starting_index=<optimized out>) at opal/mca/btl/uct/btl_uct_component.c:572 (at 0x000015555455999e)
#6 mca_btl_uct_context_progress (context=0x62a970) at opal/mca/btl/uct/btl_uct_device_context.h:168 (at 0x000015555455999e)
#5 mca_btl_uct_device_handle_completions (dev_context=<optimized out>) at opal/mca/btl/uct/btl_uct_device_context.h:150 (at 0x000015555455999e)
#4 ompi_osc_get_data_complete (btl=<optimized out>, endpoint=<optimized out>, local_address=<optimized out>, local_handle=<optimized out>, context=<optimized out>, data=<optimized out>, status=-1) at ompi/mca/osc/rdma/osc_rdma_comm.c:49 (at 0x0000155554fbaf94)
#3 ompi_osc_get_data_complete (btl=<optimized out>, endpoint=<optimized out>, local_address=<optimized out>, local_handle=<optimized out>, context=<optimized out>, data=<optimized out>, status=-1) at ompi/mca/osc/rdma/osc_rdma_comm.c:53 (at 0x0000155554fbaf94)
#2 __assert_fail () from /lib64/libc.so.6 (at 0x0000155554813de6)
#1 __assert_fail_base.cold.0 () from /lib64/libc.so.6 (at 0x0000155554805b09)
#0 abort () from /lib64/libc.so.6 (at 0x0000155554805b0e)

I don't have the resources at this time to debug either of these issues...

devreal commented 3 years ago

Here are two more interesting data points running the same PaRSEC/DPLASMA benchmarks with MPT and MPICH:

1) MPT (HPE's MPI implementation) does not use UCX but sits on ibverbs directly:

[figure dpotrf_1277742-01: dpotrf performance with MPT for all three variants]

The lines are identical for all three variants. Hence: No UCX, no problem.

2) Running on MPICH (recent master) built against UCX:

[figure dpotrf_1279492-01: dpotrf performance with MPICH built against UCX]

Here we see the exact same pattern as with Open MPI. To me this suggests that the problem is neither in pml/ucx nor in osc/ucx of Open MPI but inside UCX itself.

@yosefe Any more ideas how to track this down?

brminich commented 3 years ago

@devreal, can you please try to rerun the test with UCX_RC_TX_POLL_ALWAYS=y? This will help to check the theory that the originator is flooded by unexpected messages.

devreal commented 3 years ago

@brminich

can you please try to rerun the test with UCX_RC_TX_POLL_ALWAYS=y? This will help to check the theory that the originator is flooded by unexpected messages.

Thanks for the suggestion. Setting UCX_RC_TX_POLL_ALWAYS=y does help somewhat (~50% speedup), but the performance is still an order of magnitude lower than without RMA.

One observation that may not have come across clearly earlier: performance actually drops further for larger tile sizes (with constant matrix size), which (a) reduces the number of messages/transfers and (b) increases the per-transfer size. That, too, seems to be an argument against the theory that a flood of messages is causing the issue.

Any other UCX env variables I could try? There are too many to tinker with all of them but I sure can do a sweep over a selected subset. I simply cannot judge which ones are worth a try...

brminich commented 3 years ago

@devreal, is this MPI extension publicly available, so I could try to reproduce it locally?

brminich commented 3 years ago

Another difference is that we support multirail for p2p operations, while we do not support it for RMA yet. How many HCAs per node do you have? If you have more than one mlnx device, can you please set UCX_MAX_RNDV_RAILS=1 and try to run the app without RMA? Would it match the RMA results?

devreal commented 3 years ago

@brminich I tried running both benchmarks with UCX_MAX_RNDV_RAILS=1 and it didn't make any difference.

There is no need to use the experimental MPI feature I'm developing. All of the measurements here were done using a branch of the PaRSEC runtime that mirrors communication, either through another pair of send/recv or through a put into a window. The benchmark itself is the Cholesky factorization test of DPLASMA.

I set up a DPLASMA branch that contains PaRSEC with the RMA-mirror. You can get it with:

git clone --recursive -b dplasma_rma_parsec https://bitbucket.org/schuchart/dplasma/ dplasma_git_schuchart_fakerma

To build:

$ mkdir build
$ cd build
$ cmake .. -DDPLASMA_PRECISIONS=d

You'll need the MKL to be available (by setting MKLROOT) and HWLOC (if not in your default search path, you can point CMake to it through -DHWLOC_ROOT=...).

You can then run the benchmark through:

# compute square process grid (NUM_NODES and NTHREADS are set by the user)
Q=$((${NUM_NODES} / $(echo "scale=0; sqrt ( ${NUM_NODES} ) " | bc -l)))
P=$((${NUM_NODES} / $Q))
# the per-node matrix size
NPN=$((15*1024))
# total matrix size (computed after P is known)
N=$(($P*$NPN))
# tile size
NB=128
mpirun -n $NUM_NODES build/tests/testing_dpotrf -N $N -v -c $NTHREADS -NB $NB -P $P -Q $Q

On Hawk, I'm running with NTHREADS=32 and one process per node. Let me know if you need any support building or running the benchmark.

brminich commented 3 years ago

@devreal, how do you specify whether you want to run PaRSEC P2P or PaRSEC RMA?

devreal commented 3 years ago

@brminich Sorry, forgot to mention that. The version in the branch I gave you contains the RMA version. The other variants are in different branches. They are all ABI compatible and can be simply interposed by setting LD_LIBRARY_PATH when running the DPLASMA benchmark. I didn't have the time to combine them into one before going on vacation.

If you want to compare with the duplicated send/recv, you can use the following PaRSEC branch:

$ git clone --branch topic/revamp_comm_engine_symbols_doublesend https://bitbucket.org/schuchart/parsec.git parsec_doublesend

The baseline (without duplicated traffic) would be:

$ git clone --branch topic/revamp_comm_engine_symbols https://bitbucket.org/schuchart/parsec.git parsec_vanilla

In both cases you can use a simple cmake && make to build the code; no special arguments to CMake should be necessary if MPI is in your PATH.

brminich commented 3 years ago

@devreal, do you see a noticeable degradation only starting from ~16 nodes? Is there any valid parameter combination that would show the difference at a smaller scale? (The smaller the scale, the better.)

brminich commented 3 years ago

@devreal, ping

devreal commented 3 years ago

@brminich Sorry I was on vacation with no access to the system. I played around with different problem/tile sizes but couldn't reproduce it at lower scales. I also noticed that the problem does not occur before 64 nodes (49 nodes still yield the expected performance). However, at 81 the RMA performance is bad too. There seems to be some threshold.

Is there a hash table or something similar whose size I could tune to see if that has any impact?

brminich commented 3 years ago

@devreal, Do I understand correctly that you see a performance drop of the RMA-based implementation compared to the 2-sided version? If yes, what do you mean by "at 81 the RMA performance is bad too"?

Do you have adaptive routing configured on the system? Can you please try to add UCX_IB_AR_ENABLE=y?

devreal commented 3 years ago

@brminich Sorry for the delay, it took me some time to get runs though on the machine...

Do I understand correctly that you see a performance drop of the RMA-based implementation compared to the 2-sided version?

Correct. The experiments are such that the RMA version performs the same send/recv communication as vanilla PaRSEC plus mirroring the traffic through MPI RMA. The send/recv version mirrors the traffic using a second pair of send/recv. The version mirroring traffic through RMA shows poor performance starting at 64 nodes while the version mirroring through send/recv shows the same performance as vanilla PaRSEC.

If yes, what do you mean by "at 81 the RMA performance is bad too"?

Sorry, that was a typo: it should have been "81 nodes", meaning that the problem occurs not just at 64 nodes but also at scales beyond that.

Do you have adaptive routing configured on the system?

It is enabled, yes. I don't think I have the permissions to change that, unfortunately.

Can you please try to add UCX_IB_AR_ENABLE=y?

If I set that I get the following error:

[1628215239.795079] [r3c3t1n1:343788:0]        ib_mlx5.c:836  UCX  ERROR AR=y was requested for SL=auto, but could not detect AR mask for SLs. Please, set SL manually with AR on mlx5_1:1, SLs with AR support = { <none> }, SLs without AR support = { <none> }

How do I get the SL mask?

brminich commented 3 years ago

@devreal, do you know if it is configured on all SLs? If you do not know the exact SLs where AR is configured, you can try to find them this way:

  1. On one of compute machines get the GUID of the interface you want to check
  2. Run: sudo ibdiagnet -g ${YOUR_GUID} -r --r_opt=vs
  3. After ibdiagnet is done, grep for en_sl in /var/tmp/ibdiagnet2/ibdiagnet2.ar

Then please use the UCX_IB_SL env variable to set a specific SL for UCX.

omor1 commented 3 years ago

@devreal is this related to the severe performance variation I encountered on Expanse (AMD Rome) with DPLASMA, Open MPI, and UCX? I had reported it to and discussed it with @bosilca; I also encountered similar problems with MPICH and UCX.

devreal commented 3 years ago

@omor1 Not sure what your variations look like; I will talk to George. It's unlikely, though, since this issue is related to MPI RMA, which is normally not used in DPLASMA.

@brminich Sorry for the delay. Unfortunately, I don't have sudo privileges on that system. I will try to find another system to run the benchmarks on to verify whether this is specific to this machine.

omor1 commented 3 years ago

@devreal I was seeing significant (as much as 4x) run-to-run variation with dgemm; using btl/openib got rid of this massive variance, but I ran into other issues with it (there's a reason it's no longer maintained and has been removed). George mentioned he sent this to you several months ago, in late May. I will retest and play around with some of the UCX env vars mentioned to see if they make a difference.

brminich commented 3 years ago

@omor1, is there a corresponding issue for this performance variance?