openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

Potential memory leak coming from ucs_mpool_hugetlb_malloc during ucp_atomic_op_nbx #6439

Open · krakowski opened this issue 3 years ago

krakowski commented 3 years ago

Describe the bug

I use Oracle's Project Panama (which will replace JNI in the near future) to generate Java bindings for ucp. Right now I am in the process of benchmarking my implementation using JMH. Sending messages using ucp_tag_send_nbx and ucp_stream_send_nbx, as well as accessing remote memory using ucp_put_nbx and ucp_get_nbx, works and gives decent results. Continuously calling ucp_atomic_op_nbx, on the other hand, fills the local RAM within a matter of seconds and therefore crashes the process.

I used jemalloc to see if I could identify which function is responsible for the huge amount of allocated memory and found that ucs_mpool_hugetlb_malloc could be it.

Here is the top of the report jemalloc generated (before the process was killed).

Total: 62596688095 B
60482653466  96.6%  96.6% 60482653466  96.6% ucs_mpool_hugetlb_malloc
1337982976   2.1%  98.8% 1337982976   2.1% uct_mem_alloc
601234012   1.0%  99.7% 601234012   1.0% AllocateHeap@3ad0d0 (inline)
71788578   0.1%  99.8% 71788578   0.1% MLX5_1.0
66201113   0.1%  99.9% 66201113   0.1% ChunkPool::allocate
 4552007   0.0%  99.9%  4683591   0.0% Arena::Arena (inline)
 4552007   0.0% 100.0% 550811466   0.9% ucp_worker_create
 3733020   0.0% 100.0%  3733020   0.0% AllocateHeap@3ad070
 3053808   0.0% 100.0%  3053808   0.0% _verbs_init_and_alloc_context
 2928874   0.0% 100.0% 74717453   0.1% mlx5dv_open_device
 2783848   0.0% 100.0%  2783848   0.0% init
 2525931   0.0% 100.0%  2525931   0.0% uct_rc_iface_fc_handler
 1938982   0.0% 100.0%  1938982   0.0% _GLOBAL__sub_I_eh_alloc.cc
 1678786   0.0% 100.0% 10888470   0.0% uct_rc_verbs_iface_post_recv_always
 1629517   0.0% 100.0%  1629517   0.0% inflatePrime
 1533780   0.0% 100.0%  1533780   0.0% ucs_twheel_init
 1498845   0.0% 100.0% 11188275   0.0% uct_ud_verbs_qp_max_send_sge
 1487317   0.0% 100.0%  1487317   0.0% Unsafe_AllocateMemory0
 1487317   0.0% 100.0%  1487317   0.0% numa_node_to_cpus
 1443905   0.0% 100.0%  1443905   0.0% allocate_dtv

If you are interested in the Java code, here are the relevant parts called during the Benchmark.

  1. LatencyBenchmark#atomicAdd
  2. AtomicIntegerContext#blockingAtomicAdd
  3. BenchmarkConnection#blockingAtomicAdd32
  4. Endpoint#atomic

All parameters passed to ucp_tag_send_nbx are allocated exactly once during the benchmark.

Steps to Reproduce

Perform the following steps in a loop (a minimal C sketch follows the list):

  1. Call ucp_atomic_op_nbx with UCP_ATOMIC_OP_ADD
  2. Wait for the request to finish by polling ucp_request_check_status
  3. Release the request with ucp_request_free
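
Here is a minimal C sketch of that loop, assuming an already-created ucp_worker_h/ucp_ep_h and a remote address with an unpacked ucp_rkey_h (not shown); blocking_atomic_add32 is just an illustrative helper name mirroring the Java code above:

```c
#include <ucp/api/ucp.h>
#include <stdint.h>

/* Illustrative helper: issue one 32-bit atomic ADD and wait for completion.
 * ep/worker setup and the remote_addr/rkey exchange are assumed to have
 * happened elsewhere. */
static ucs_status_t blocking_atomic_add32(ucp_worker_h worker, ucp_ep_h ep,
                                          uint32_t *operand,
                                          uint64_t remote_addr, ucp_rkey_h rkey)
{
    /* Step 1: issue the atomic ADD (the operand datatype must be given) */
    ucp_request_param_t param = {
        .op_attr_mask = UCP_OP_ATTR_FIELD_DATATYPE,
        .datatype     = ucp_dt_make_contig(sizeof(uint32_t))
    };
    ucs_status_ptr_t req = ucp_atomic_op_nbx(ep, UCP_ATOMIC_OP_ADD, operand, 1,
                                             remote_addr, rkey, &param);

    if (req == NULL) {
        return UCS_OK;              /* completed immediately, nothing to free */
    }
    if (UCS_PTR_IS_ERR(req)) {
        return UCS_PTR_STATUS(req); /* failed immediately, nothing to free */
    }

    /* Step 2: poll until the request completes */
    ucs_status_t status;
    while ((status = ucp_request_check_status(req)) == UCS_INPROGRESS) {
        ucp_worker_progress(worker);
    }

    /* Step 3: release the request */
    ucp_request_free(req);
    return status;
}
```

Note that a NULL return means the operation completed in place and there is no request object to release.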

Operating System

Packages

Hardware

Additional information

yosefe commented 3 years ago

@krakowski is it possible there are many unexpected tagged messages which are not being received? It's possible to check this by probing with tag_mask=0: call it repeatedly to drain the unexpected queue.
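
For illustration, a sketch of such a check under the assumption that draining (rather than just inspecting) the queue is acceptable; drain_unexpected is a made-up helper that uses ucp_tag_probe_nb with remove=1 followed by ucp_tag_msg_recv_nbx:

```c
#include <ucp/api/ucp.h>
#include <stdio.h>
#include <stdlib.h>

/* Illustrative helper: drain the worker's unexpected-message queue.
 * tag_mask = 0 matches any tag; remove = 1 takes the matched message out of
 * the queue, so the next probe returns the next one. Each matched message
 * still has to be received with ucp_tag_msg_recv_nbx(). */
static size_t drain_unexpected(ucp_worker_h worker)
{
    ucp_tag_recv_info_t info;
    ucp_tag_message_h   msg;
    size_t              n = 0;

    while ((msg = ucp_tag_probe_nb(worker, 0, 0, 1, &info)) != NULL) {
        void                *buf   = malloc(info.length);
        ucp_request_param_t  param = { .op_attr_mask = 0 };
        ucs_status_ptr_t     req   = ucp_tag_msg_recv_nbx(worker, buf,
                                                          info.length, msg,
                                                          &param);

        if (UCS_PTR_IS_PTR(req)) {
            /* wait for the receive to finish, then release the request */
            while (ucp_request_check_status(req) == UCS_INPROGRESS) {
                ucp_worker_progress(worker);
            }
            ucp_request_free(req);
        }
        printf("drained unexpected message: tag=0x%lx length=%zu\n",
               (unsigned long)info.sender_tag, info.length);
        free(buf);
        n++;
    }
    return n;
}
```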

yosefe commented 3 years ago

Also, is it possible that some send/recv requests are not being released? Is there any warning about leaked objects during ucp_worker_destroy()?

krakowski commented 3 years ago

Hi @yosefe,

Running the benchmark for a short amount of time works, and no warnings about leaked resources are logged. Before running the benchmark I exchange some messages to set up the environment and synchronize, but everything runs in a single thread, so these calls should not be reached a second time once setup (connection establishment and exchange of remote keys) is finished.

Does ucp_atomic_op_nbx also send tagged messages I have to poll, or are there any other resources (apart from the request, if there is one) I have to release after this function call?

yosefe commented 3 years ago

ucp_atomic_op_nbx does not send a tagged message. Can you double-check that ucp_request_free is actually called the expected number of times? Another way: build UCX in debug mode (./contrib/configure-devel) and upload a log file produced with UCX_LOG_LEVEL=data. Then we can see the lifetime of all objects and why they are not destroyed.

krakowski commented 3 years ago

Just checked this. The pointers I get back from ucp_atomic_op_nbx are all UCS_OK (null pointers). From what I read, this means the request finished immediately. Do I have to call ucp_request_free on them, too? At the moment I check whether the returned value/pointer is within the range of a status code and only call ucp_request_free on it if it isn't.

Edit

Just tried it. Calling ucp_request_free on them results in a segfault.

yosefe commented 3 years ago

I think the issue is that too many requests are being queued internally. I will try to provide a fix which creates some backpressure (so at some point it will return a request instead of UCS_OK).

krakowski commented 3 years ago

I can check this by waiting some amount of time between calls to ucp_atomic_op_nbx and seeing if the memory consumption goes down. I will report my results soon.

Do you still need the log output with debug mode? I could try to build the RPM packages in debug mode and run my benchmark with UCX_LOG_LEVEL=data.

yosefe commented 3 years ago

> I can check this by waiting some amount of time between calls to ucp_atomic_op_nbx and seeing if the memory consumption goes down. I will report my results soon.
>
> Do you still need the log output with debug mode? I could try to build the RPM packages in debug mode and run my benchmark with UCX_LOG_LEVEL=data.

No need for now. As another experiment, I'd suggest trying a blocking {ucp_ep_flush_nbx + poll/wait} every 1k iterations or so.
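
A hedged sketch of such a blocking flush, assuming the same worker/endpoint handles as in the earlier sketch; flush_blocking is an illustrative name:

```c
#include <ucp/api/ucp.h>

/* Illustrative helper: block until all operations issued on the endpoint so
 * far have completed. Calling this every ~1000 ucp_atomic_op_nbx() calls
 * bounds the number of internally queued requests. */
static ucs_status_t flush_blocking(ucp_worker_h worker, ucp_ep_h ep)
{
    ucp_request_param_t param = { .op_attr_mask = 0 };
    ucs_status_ptr_t    req   = ucp_ep_flush_nbx(ep, &param);

    if (req == NULL) {
        return UCS_OK;              /* nothing was in flight */
    }
    if (UCS_PTR_IS_ERR(req)) {
        return UCS_PTR_STATUS(req);
    }

    ucs_status_t status;
    while ((status = ucp_request_check_status(req)) == UCS_INPROGRESS) {
        ucp_worker_progress(worker);
    }
    ucp_request_free(req);
    return status;
}
```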

krakowski commented 3 years ago

@yosefe Your suggestion is working! :)

Calling ucp_ep_flush_nbx and waiting for it to complete after every 1000 invocations of ucp_atomic_op_nbx keeps the memory consumption steady. I also got better performance results (~0.95 us per atomic add) presumably by preventing new memory allocations.

Since I am already talking to someone who knows how UCX works, I would like to ask another off-topic question, if that is OK. Is it normal that RDMA reads are a lot slower than RDMA writes? With a buffer size of 64 bytes I get ~0.2 us for writes, while reads take ~2.0 us (until ucp_request_check_status reports UCS_OK).

shamisp commented 3 years ago

Writes complete locally the moment the user data is copied to the device or to a local bounce buffer. For a read to complete (i.e., for the data to be delivered to the user) you have to wait for a full round trip.

krakowski commented 3 years ago

@shamisp Thanks for the explanation! :)

I already suspected something along those lines. Is it somehow possible to measure write latency on the initiator's side without adding extra overhead? Since the receiver is not aware of the RDMA write, I can only think of polling the memory region for changes and sending a message back once a change is detected. Unfortunately, that would add a considerable amount of overhead to the measurement.

shamisp commented 3 years ago

@krakowski We already have such a test implemented (for both the uct and ucp layers) in ucx_perftest. The UCT/UCP put latency test measures half-round-trip latency using a ping-pong protocol, which is similar to what you describe.

krakowski commented 3 years ago

@shamisp Thanks, I will take a look at it and try to port it. I should have mentioned that I am using a custom Java binding for UCX, which I want to benchmark. That is why I am looking for a way to reliably measure the latency :slightly_smiling_face:

shamisp commented 3 years ago

Adding @petro-rudenko who implemented the Java binding. Maybe he has some benchmarks to share.