openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

Assertion `uct_tcp_cm_ep_accept_conn(ep)' failed: stack trace #9547

Closed ziegenbalg closed 7 months ago

ziegenbalg commented 9 months ago

Describe the bug

UCX is deployed with the Intel DAOS filesystem. Running 'daos pool query tank' produces the following stack trace.

Steps to Reproduce

Setup and versions

Additional information (depending on the issue)

Attached files: packages.list.txt, error.trace.txt

yosefe commented 9 months ago

@ziegenbalg could you please provide the following information:

  1. System details on 10.0.0.19 and 73.93.84.167 machines
  2. Output of ucx_info -vdepwb -u a on each of these machines
  3. Is it possible to get remote access to debug the issue?

ziegenbalg commented 9 months ago

@yosefe, sorry for the delay. Here's the requested info

  1. System details

10.0.0.19: AWS micro instance:

-bash-5.2# cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
NAME="Debian GNU/Linux"
VERSION_ID="12"
VERSION="12 (bookworm)"
VERSION_CODENAME=bookworm
ID=debian

73.93.84.167: Bare-metal computer at my home, running Debian 12

  2. ucx_info output from 10.0.0.19:

-bash-5.2# ucx_info -vdepwb -u a
# Version 1.13.1
# Git branch '<unknown>', revision 0000000
# Configured with: --build=x86_64-linux-gnu --prefix=/usr --includedir=${prefix}/include --mandir=${prefix}/share/man --infodir=${prefix}/share/info --sysconfdir=/etc --localstatedir=/var --disable-option-checking --disable-silent-rules --libdir=${prefix}/lib/x86_64-linux-gnu --runstatedir=/run --disable-maintainer-mode --disable-dependency-tracking --enable-mt --with-verbs --disable-backtrace-detail --disable-logging --enable-devel-headers --enable-examples --enable-cma
#define UCX_CONFIG_H              
#define ENABLE_ASSERT             1
#define ENABLE_BUILTIN_MEMCPY     1
#define ENABLE_DEBUG_DATA         0
#define ENABLE_MT                 1
#define ENABLE_PARAMS_CHECK       1
#define HAVE_ALLOCA               1
#define HAVE_ALLOCA_H             1
#define HAVE_ATTRIBUTE_NOOPTIMIZE 1
#define HAVE_CLEARENV             1
#define HAVE_CPU_SET_T            1
#define HAVE_DC_DV                1
#define HAVE_DECL_ASPRINTF        1
#define HAVE_DECL_BASENAME        1
#define HAVE_DECL_CPU_ISSET       1
#define HAVE_DECL_CPU_ZERO        1
#define HAVE_DECL_ETHTOOL_CMD_SPEED 1
#define HAVE_DECL_FMEMOPEN        1
#define HAVE_DECL_FUSE_MOUNT      0
#define HAVE_DECL_FUSE_OPEN_CHANNEL 0
#define HAVE_DECL_FUSE_UNMOUNT    0
#define HAVE_DECL_F_SETOWN_EX     1
#define HAVE_DECL_GETAUXVAL       1
#define HAVE_DECL_IBV_ACCESS_ON_DEMAND 1
#define HAVE_DECL_IBV_ACCESS_RELAXED_ORDERING 1
#define HAVE_DECL_IBV_ADVISE_MR   1
#define HAVE_DECL_IBV_ALLOC_DM    1
#define HAVE_DECL_IBV_ALLOC_TD    1
#define HAVE_DECL_IBV_CMD_MODIFY_QP 0
#define HAVE_DECL_IBV_CREATE_CQ_ATTR_IGNORE_OVERRUN 1
#define HAVE_DECL_IBV_CREATE_CQ_EX 1
#define HAVE_DECL_IBV_CREATE_QP_EX 1
#define HAVE_DECL_IBV_CREATE_SRQ  1
#define HAVE_DECL_IBV_CREATE_SRQ_EX 1
#define HAVE_DECL_IBV_EVENT_GID_CHANGE 1
#define HAVE_DECL_IBV_EVENT_TYPE_STR 1
#define HAVE_DECL_IBV_EXP_ACCESS_ALLOCATE_MR 0
#define HAVE_DECL_IBV_EXP_ACCESS_ON_DEMAND 0
#define HAVE_DECL_IBV_EXP_ALLOC_DM 0
#define HAVE_DECL_IBV_EXP_ATOMIC_HCA_REPLY_BE 0
#define HAVE_DECL_IBV_EXP_CQ_IGNORE_OVERRUN 0
#define HAVE_DECL_IBV_EXP_CQ_MODERATION 0
#define HAVE_DECL_IBV_EXP_CREATE_QP 0
#define HAVE_DECL_IBV_EXP_CREATE_SRQ 0
#define HAVE_DECL_IBV_EXP_DCT_OOO_RW_DATA_PLACEMENT 0
#define HAVE_DECL_IBV_EXP_DEVICE_ATTR_PCI_ATOMIC_CAPS 0
#define HAVE_DECL_IBV_EXP_DEVICE_ATTR_RESERVED_2 0
#define HAVE_DECL_IBV_EXP_DEVICE_DC_TRANSPORT 0
#define HAVE_DECL_IBV_EXP_DEVICE_MR_ALLOCATE 0
#define HAVE_DECL_IBV_EXP_MR_FIXED_BUFFER_SIZE 0
#define HAVE_DECL_IBV_EXP_MR_INDIRECT_KLMS 0
#define HAVE_DECL_IBV_EXP_ODP_SUPPORT_IMPLICIT 0
#define HAVE_DECL_IBV_EXP_POST_SEND 0
#define HAVE_DECL_IBV_EXP_PREFETCH_MR 0
#define HAVE_DECL_IBV_EXP_PREFETCH_WRITE_ACCESS 0
#define HAVE_DECL_IBV_EXP_QPT_DC_INI 0
#define HAVE_DECL_IBV_EXP_QP_CREATE_UMR 0
#define HAVE_DECL_IBV_EXP_QP_INIT_ATTR_ATOMICS_ARG 0
#define HAVE_DECL_IBV_EXP_QP_OOO_RW_DATA_PLACEMENT 0
#define HAVE_DECL_IBV_EXP_QUERY_DEVICE 0
#define HAVE_DECL_IBV_EXP_QUERY_GID_ATTR 0
#define HAVE_DECL_IBV_EXP_REG_MR  0
#define HAVE_DECL_IBV_EXP_SEND_EXT_ATOMIC_INLINE 0
#define HAVE_DECL_IBV_EXP_SETENV  0
#define HAVE_DECL_IBV_EXP_WR_EXT_MASKED_ATOMIC_CMP_AND_SWP 0
#define HAVE_DECL_IBV_EXP_WR_EXT_MASKED_ATOMIC_FETCH_AND_ADD 0
#define HAVE_DECL_IBV_EXP_WR_NOP  0
#define HAVE_DECL_IBV_GET_ASYNC_EVENT 1
#define HAVE_DECL_IBV_GET_DEVICE_NAME 1
#define HAVE_DECL_IBV_LINK_LAYER_ETHERNET 1
#define HAVE_DECL_IBV_LINK_LAYER_INFINIBAND 1
#define HAVE_DECL_IBV_ODP_SUPPORT_IMPLICIT 1
#define HAVE_DECL_IBV_QPF_GRH_REQUIRED 1
#define HAVE_DECL_IBV_QUERY_DEVICE_EX 1
#define HAVE_DECL_IBV_QUERY_GID   1
#define HAVE_DECL_IBV_WC_STATUS_STR 1
#define HAVE_DECL_INOTIFY_ADD_WATCH 1
#define HAVE_DECL_INOTIFY_INIT    1
#define HAVE_DECL_IN_ATTRIB       1
#define HAVE_DECL_IPPROTO_TCP     1
#define HAVE_DECL_MADV_FREE       1
#define HAVE_DECL_MADV_REMOVE     1
#define HAVE_DECL_MLX5DV_CQ_INIT_ATTR_MASK_COMPRESSED_CQE 1
#define HAVE_DECL_MLX5DV_CQ_INIT_ATTR_MASK_CQE_SIZE 1
#define HAVE_DECL_MLX5DV_CREATE_QP 1
#define HAVE_DECL_MLX5DV_DCTYPE_DCT 1
#define HAVE_DECL_MLX5DV_DEVX_SUBSCRIBE_DEVX_EVENT 1
#define HAVE_DECL_MLX5DV_INIT_OBJ 1
#define HAVE_DECL_MLX5DV_IS_SUPPORTED 1
#define HAVE_DECL_MLX5DV_OBJ_AH   1
#define HAVE_DECL_MLX5DV_QP_CREATE_ALLOW_SCATTER_TO_CQE 1
#define HAVE_DECL_MLX5DV_UAR_ALLOC_TYPE_BF 1
#define HAVE_DECL_MLX5DV_UAR_ALLOC_TYPE_NC 1
#define HAVE_DECL_POSIX_MADV_DONTNEED 1
#define HAVE_DECL_PR_SET_PTRACER  1
#define HAVE_DECL_SOL_SOCKET      1
#define HAVE_DECL_SO_KEEPALIVE    1
#define HAVE_DECL_SPEED_UNKNOWN   1
#define HAVE_DECL_STRERROR_R      1
#define HAVE_DECL_SYS_BRK         1
#define HAVE_DECL_SYS_IPC         0
#define HAVE_DECL_SYS_MADVISE     1
#define HAVE_DECL_SYS_MMAP        1
#define HAVE_DECL_SYS_MREMAP      1
#define HAVE_DECL_SYS_MUNMAP      1
#define HAVE_DECL_SYS_SHMAT       1
#define HAVE_DECL_SYS_SHMDT       1
#define HAVE_DECL_TCP_KEEPCNT     1
#define HAVE_DECL_TCP_KEEPIDLE    1
#define HAVE_DECL_TCP_KEEPINTVL   1
#define HAVE_DECL___PPC_GET_TIMEBASE 0
#define HAVE_DECL___PPC_GET_TIMEBASE_FREQ 0
#define HAVE_DEVX                 1
#define HAVE_DLFCN_H              1
#define HAVE_HW_TIMER             1
#define HAVE_IB                   1
#define HAVE_IBV_DM               1
#define HAVE_IN6_ADDR_S6_ADDR32   1
#define HAVE_INFINIBAND_MLX5DV_H  1
#define HAVE_INFINIBAND_TM_TYPES_H 1
#define HAVE_INOTIFY              1
#define HAVE_INTTYPES_H           1
#define HAVE_IP_IP_DST            1
#define HAVE_LIBGEN_H             1
#define HAVE_LIBRT                1
#define HAVE_LINUX_FUTEX_H        1
#define HAVE_LINUX_IP_H           1
#define HAVE_LINUX_MMAN_H         1
#define HAVE_MALLOC_H             1
#define HAVE_MALLOC_TRIM          1
#define HAVE_MEMALIGN             1
#define HAVE_MLX5_DV              1
#define HAVE_MLX5_HW              1
#define HAVE_MLX5_HW_UD           1
#define HAVE_MREMAP               1
#define HAVE_NETINET_IP_H         1
#define HAVE_NET_ETHERNET_H       1
#define HAVE_NUMA                 1
#define HAVE_NUMAIF_H             1
#define HAVE_NUMA_H               1
#define HAVE_ODP                  1
#define HAVE_ODP_IMPLICIT         1
#define HAVE_POSIX_MEMALIGN       1
#define HAVE_PREFETCH             1
#define HAVE_SCHED_GETAFFINITY    1
#define HAVE_SCHED_SETAFFINITY    1
#define HAVE_SIGACTION_SA_RESTORER 1
#define HAVE_SIGEVENT_SIGEV_UN_TID 1
#define HAVE_SIGHANDLER_T         1
#define HAVE_STDINT_H             1
#define HAVE_STDIO_H              1
#define HAVE_STDLIB_H             1
#define HAVE_STRERROR_R           1
#define HAVE_STRINGS_H            1
#define HAVE_STRING_H             1
#define HAVE_STRUCT_BITMASK       1
#define HAVE_STRUCT_DL_PHDR_INFO  1
#define HAVE_STRUCT_IBV_DEVICE_ATTR_EX_PCI_ATOMIC_CAPS 1
#define HAVE_STRUCT_IBV_TM_CAPS_FLAGS 1
#define HAVE_STRUCT_MLX5DV_CQ_CQ_UAR 1
#define HAVE_SYS_EPOLL_H          1
#define HAVE_SYS_EVENTFD_H        1
#define HAVE_SYS_STAT_H           1
#define HAVE_SYS_TYPES_H          1
#define HAVE_SYS_UIO_H            1
#define HAVE_TL_DC                1
#define HAVE_TL_RC                1
#define HAVE_TL_UD                1
#define HAVE_UCM_PTMALLOC286      1
#define HAVE_UNISTD_H             1
#define HAVE_WCHAR_H              1
#define HAVE___CLEAR_CACHE        1
#define HAVE___CURBRK             1
#define HAVE___SIGHANDLER_T       1
#define IBV_HW_TM                 1
#define LT_OBJDIR                 ".libs/"
#define NVALGRIND                 1
#define PACKAGE                   "ucx"
#define PACKAGE_BUGREPORT         ""
#define PACKAGE_NAME              "ucx"
#define PACKAGE_STRING            "ucx 1.13"
#define PACKAGE_TARNAME           "ucx"
#define PACKAGE_URL               ""
#define PACKAGE_VERSION           "1.13"
#define STDC_HEADERS              1
#define STRERROR_R_CHAR_P         1
#define UCM_BISTRO_HOOKS          1
#define UCS_MAX_LOG_LEVEL         UCS_LOG_LEVEL_DEBUG
#define UCT_TCP_EP_KEEPALIVE      1
#define UCT_UD_EP_DEBUG_HOOKS     0
#define UCX_CONFIGURE_FLAGS       "--build=x86_64-linux-gnu --prefix=/usr --includedir=${prefix}/include --mandir=${prefix}/share/man --infodir=${prefix}/share/info --sysconfdir=/etc --localstatedir=/var --disable-option-checking --disable-silent-rules --libdir=${prefix}/lib/x86_64-linux-gnu --runstatedir=/run --disable-maintainer-mode --disable-dependency-tracking --enable-mt --with-verbs --disable-backtrace-detail --disable-logging --enable-devel-headers --enable-examples --enable-cma"
#define UCX_MODULE_SUBDIR         "ucx"
#define VERSION                   "1.13"
#define restrict                  __restrict__
#define test_MODULES              ":module"
#define ucm_MODULES               ""
#define ucs_MODULES               ""
#define uct_MODULES               ":ib:rdmacm:cma"
#define uct_cuda_MODULES          ""
#define uct_ib_MODULES            ""
#define uct_rocm_MODULES          ""
#define ucx_perftest_MODULES      ""
#
# Memory domain: self
#     Component: self
#             register: unlimited, cost: 0 nsec
#           remote key: 0 bytes
#
#      Transport: self
#         Device: memory0
#           Type: loopback
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 6911.00 MB/sec
#              latency: 0 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 8K
#             am_bcopy: <= 8K
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 0 bytes
#        iface address: 8 bytes
#       error handling: ep_check
#
#
# Memory domain: tcp
#     Component: tcp
#             register: unlimited, cost: 0 nsec
#           remote key: 0 bytes
#
#      Transport: tcp
#         Device: enX0
#           Type: network
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 11.82/ppn + 0.00 MB/sec
#              latency: 10960 nsec
#             overhead: 50000 nsec
#            put_zcopy: <= 18446744073709551590, up to 6 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 0
#             am_short: <= 8K
#             am_bcopy: <= 8K
#             am_zcopy: <= 64K, up to 6 iov
#   am_opt_zcopy_align: <= 1
#         am_align_mtu: <= 0
#            am header: <= 8037
#           connection: to ep, to iface
#      device priority: 0
#     device num paths: 1
#              max eps: 256
#       device address: 6 bytes
#        iface address: 2 bytes
#           ep address: 10 bytes
#       error handling: peer failure, ep_check, keepalive
#
#      Transport: tcp
#         Device: lo
#           Type: network
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 11.91/ppn + 0.00 MB/sec
#              latency: 10960 nsec
#             overhead: 50000 nsec
#            put_zcopy: <= 18446744073709551590, up to 6 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 0
#             am_short: <= 8K
#             am_bcopy: <= 8K
#             am_zcopy: <= 64K, up to 6 iov
#   am_opt_zcopy_align: <= 1
#         am_align_mtu: <= 0
#            am header: <= 8037
#           connection: to ep, to iface
#      device priority: 1
#     device num paths: 1
#              max eps: 256
#       device address: 18 bytes
#        iface address: 2 bytes
#           ep address: 10 bytes
#       error handling: peer failure, ep_check, keepalive
#
#
# Connection manager: tcp
#      max_conn_priv: 2064 bytes
#
# Memory domain: sysv
#     Component: sysv
#             allocate: unlimited
#           remote key: 12 bytes
#           rkey_ptr is supported
#
#      Transport: sysv
#         Device: memory
#           Type: intra-node
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 12179.00 MB/sec
#              latency: 80 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 100
#             am_bcopy: <= 8256
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 8 bytes
#       error handling: ep_check
#
#
# Memory domain: posix
#     Component: posix
#             allocate: <= 498668K
#           remote key: 24 bytes
#           rkey_ptr is supported
#
#      Transport: posix
#         Device: memory
#           Type: intra-node
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 12179.00 MB/sec
#              latency: 80 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 100
#             am_bcopy: <= 8256
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 8 bytes
#       error handling: ep_check
#
# < failed to open connection manager rdmacm >
#
# Memory domain: cma
#     Component: cma
#             register: unlimited, cost: 9 nsec
#
#      Transport: cma
#         Device: memory
#           Type: intra-node
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 11145.00 MB/sec
#              latency: 80 nsec
#             overhead: 2000 nsec
#            put_zcopy: unlimited, up to 16 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 1
#            get_zcopy: unlimited, up to 16 iov
#  get_opt_zcopy_align: <= 1
#        get_align_mtu: <= 1
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 4 bytes
#       error handling: peer failure, ep_check
#
#
# UCP context
#
#     component 0  :  self
#     component 1  :  tcp
#     component 2  :  sysv
#     component 3  :  posix
#     component 4  :  ib
#     component 5  :  rdmacm
#     component 6  :  cma
#
#            md 0  :  component 0  self 
#            md 1  :  component 1  tcp 
#            md 2  :  component 2  sysv 
#            md 3  :  component 3  posix 
#            md 4  :  component 6  cma 
#
#      resource 0  :  md 0  dev 0  flags -- self/memory0
#      resource 1  :  md 1  dev 1  flags -- tcp/enX0
#      resource 2  :  md 1  dev 2  flags -- tcp/lo
#      resource 3  :  md 2  dev 3  flags -- sysv/memory
#      resource 4  :  md 3  dev 3  flags -- posix/memory
#      resource 5  :  md 4  dev 3  flags -- cma/memory
#
# memory: 0.00MB, file descriptors: 5
# create time: 2.963 ms
#
#
# UCP worker 'ip-10-0-0-189:721'
#
#                 address: 215 bytes
#                 atomics: 0:self/memory0, 3:sysv/memory, 4:posix/memory
#
# memory: 2.25MB, file descriptors: 8
# create time: 5.484 ms
#
#
# UCP endpoint 
#
#               peer: <no debug data>
#                 lane[0]:  0:self/memory0.0 md[0]          -> md[0]/self/sysdev[255] amo#0
#
#
  3. You can access the DAOS client (10.0.0.19) here:

ssh admin@35.91.1.68 -i key.pem

[RSA private key redacted]

Run:

sudo -i
daos pool query tank

ziegenbalg commented 8 months ago

@yosefe, just tried an Azure Debian 12 machine. Same issue. It seems to be a problem with the virtual NICs these cloud instances expose.
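
Editor's note: one way to look into the virtual-NIC hypothesis is to compare the TCP devices that UCX reports on each machine. The following is a minimal sketch, not part of UCX or DAOS. It assumes the output of `ucx_info -d` (or the `ucx_info -vdepwb -u a` dumps in this thread) has been saved as text, and the helper name `tcp_devices` is hypothetical. It only reads the `Transport:`, `Device:`, `bandwidth:` and `latency:` lines as they appear in those dumps.

```python
def tcp_devices(ucx_info_text):
    """Return {device: {"bandwidth": ..., "latency": ...}} for every tcp transport.

    The parser relies only on the line layout seen in the dumps in this
    thread; other UCX versions may format the output differently.
    """
    devices = {}
    transport = None
    device = None
    for raw in ucx_info_text.splitlines():
        # Each line of the dump starts with '#'; drop it and the padding.
        line = raw.lstrip("#").strip()
        if line.startswith("Transport:"):
            transport = line.split(":", 1)[1].strip()
            device = None  # a new transport block begins
        elif line.startswith("Device:") and transport == "tcp":
            device = line.split(":", 1)[1].strip()
            devices[device] = {}
        elif device is not None and line.startswith(("bandwidth:", "latency:")):
            key, value = line.split(":", 1)
            devices[device][key] = value.strip()
    return devices


# Excerpt in the same layout as the AWS dump earlier in this thread.
sample = """\
#      Transport: tcp
#         Device: enX0
#            bandwidth: 11.82/ppn + 0.00 MB/sec
#              latency: 10960 nsec
#      Transport: tcp
#         Device: lo
#            bandwidth: 11.91/ppn + 0.00 MB/sec
#              latency: 10960 nsec
#      Transport: sysv
#         Device: memory
#            bandwidth: 0.00/ppn + 12179.00 MB/sec
"""

for name, caps in tcp_devices(sample).items():
    print(name, caps["bandwidth"], caps["latency"])
```

Running the same function over the dump from each machine gives a side-by-side view of the reported TCP devices. For example, the AWS dump lists enX0 at 11.82/ppn MB/sec, while the Azure dump lists eth0 at 11316.36/ppn MB/sec. Whether that difference is related to the assertion is not established here.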

ziegenbalg commented 8 months ago

Here's the ucx_info output from the Azure machine:

# Version 1.13.1
# Git branch '<unknown>', revision 0000000
# Configured with: --build=x86_64-linux-gnu --prefix=/usr --includedir=${prefix}/include --mandir=${prefix}/share/man --infodir=${prefix}/share/info --sysconfdir=/etc --localstatedir=/var --disable-option-checking --disable-silent-rules --libdir=${prefix}/lib/x86_64-linux-gnu --runstatedir=/run --disable-maintainer-mode --disable-dependency-tracking --enable-mt --with-verbs --disable-backtrace-detail --disable-logging --enable-devel-headers --enable-examples --enable-cma
#define UCX_CONFIG_H              
#define ENABLE_ASSERT             1
#define ENABLE_BUILTIN_MEMCPY     1
#define ENABLE_DEBUG_DATA         0
#define ENABLE_MT                 1
#define ENABLE_PARAMS_CHECK       1
#define HAVE_ALLOCA               1
#define HAVE_ALLOCA_H             1
#define HAVE_ATTRIBUTE_NOOPTIMIZE 1
#define HAVE_CLEARENV             1
#define HAVE_CPU_SET_T            1
#define HAVE_DC_DV                1
#define HAVE_DECL_ASPRINTF        1
#define HAVE_DECL_BASENAME        1
#define HAVE_DECL_CPU_ISSET       1
#define HAVE_DECL_CPU_ZERO        1
#define HAVE_DECL_ETHTOOL_CMD_SPEED 1
#define HAVE_DECL_FMEMOPEN        1
#define HAVE_DECL_FUSE_MOUNT      0
#define HAVE_DECL_FUSE_OPEN_CHANNEL 0
#define HAVE_DECL_FUSE_UNMOUNT    0
#define HAVE_DECL_F_SETOWN_EX     1
#define HAVE_DECL_GETAUXVAL       1
#define HAVE_DECL_IBV_ACCESS_ON_DEMAND 1
#define HAVE_DECL_IBV_ACCESS_RELAXED_ORDERING 1
#define HAVE_DECL_IBV_ADVISE_MR   1
#define HAVE_DECL_IBV_ALLOC_DM    1
#define HAVE_DECL_IBV_ALLOC_TD    1
#define HAVE_DECL_IBV_CMD_MODIFY_QP 0
#define HAVE_DECL_IBV_CREATE_CQ_ATTR_IGNORE_OVERRUN 1
#define HAVE_DECL_IBV_CREATE_CQ_EX 1
#define HAVE_DECL_IBV_CREATE_QP_EX 1
#define HAVE_DECL_IBV_CREATE_SRQ  1
#define HAVE_DECL_IBV_CREATE_SRQ_EX 1
#define HAVE_DECL_IBV_EVENT_GID_CHANGE 1
#define HAVE_DECL_IBV_EVENT_TYPE_STR 1
#define HAVE_DECL_IBV_EXP_ACCESS_ALLOCATE_MR 0
#define HAVE_DECL_IBV_EXP_ACCESS_ON_DEMAND 0
#define HAVE_DECL_IBV_EXP_ALLOC_DM 0
#define HAVE_DECL_IBV_EXP_ATOMIC_HCA_REPLY_BE 0
#define HAVE_DECL_IBV_EXP_CQ_IGNORE_OVERRUN 0
#define HAVE_DECL_IBV_EXP_CQ_MODERATION 0
#define HAVE_DECL_IBV_EXP_CREATE_QP 0
#define HAVE_DECL_IBV_EXP_CREATE_SRQ 0
#define HAVE_DECL_IBV_EXP_DCT_OOO_RW_DATA_PLACEMENT 0
#define HAVE_DECL_IBV_EXP_DEVICE_ATTR_PCI_ATOMIC_CAPS 0
#define HAVE_DECL_IBV_EXP_DEVICE_ATTR_RESERVED_2 0
#define HAVE_DECL_IBV_EXP_DEVICE_DC_TRANSPORT 0
#define HAVE_DECL_IBV_EXP_DEVICE_MR_ALLOCATE 0
#define HAVE_DECL_IBV_EXP_MR_FIXED_BUFFER_SIZE 0
#define HAVE_DECL_IBV_EXP_MR_INDIRECT_KLMS 0
#define HAVE_DECL_IBV_EXP_ODP_SUPPORT_IMPLICIT 0
#define HAVE_DECL_IBV_EXP_POST_SEND 0
#define HAVE_DECL_IBV_EXP_PREFETCH_MR 0
#define HAVE_DECL_IBV_EXP_PREFETCH_WRITE_ACCESS 0
#define HAVE_DECL_IBV_EXP_QPT_DC_INI 0
#define HAVE_DECL_IBV_EXP_QP_CREATE_UMR 0
#define HAVE_DECL_IBV_EXP_QP_INIT_ATTR_ATOMICS_ARG 0
#define HAVE_DECL_IBV_EXP_QP_OOO_RW_DATA_PLACEMENT 0
#define HAVE_DECL_IBV_EXP_QUERY_DEVICE 0
#define HAVE_DECL_IBV_EXP_QUERY_GID_ATTR 0
#define HAVE_DECL_IBV_EXP_REG_MR  0
#define HAVE_DECL_IBV_EXP_SEND_EXT_ATOMIC_INLINE 0
#define HAVE_DECL_IBV_EXP_SETENV  0
#define HAVE_DECL_IBV_EXP_WR_EXT_MASKED_ATOMIC_CMP_AND_SWP 0
#define HAVE_DECL_IBV_EXP_WR_EXT_MASKED_ATOMIC_FETCH_AND_ADD 0
#define HAVE_DECL_IBV_EXP_WR_NOP  0
#define HAVE_DECL_IBV_GET_ASYNC_EVENT 1
#define HAVE_DECL_IBV_GET_DEVICE_NAME 1
#define HAVE_DECL_IBV_LINK_LAYER_ETHERNET 1
#define HAVE_DECL_IBV_LINK_LAYER_INFINIBAND 1
#define HAVE_DECL_IBV_ODP_SUPPORT_IMPLICIT 1
#define HAVE_DECL_IBV_QPF_GRH_REQUIRED 1
#define HAVE_DECL_IBV_QUERY_DEVICE_EX 1
#define HAVE_DECL_IBV_QUERY_GID   1
#define HAVE_DECL_IBV_WC_STATUS_STR 1
#define HAVE_DECL_INOTIFY_ADD_WATCH 1
#define HAVE_DECL_INOTIFY_INIT    1
#define HAVE_DECL_IN_ATTRIB       1
#define HAVE_DECL_IPPROTO_TCP     1
#define HAVE_DECL_MADV_FREE       1
#define HAVE_DECL_MADV_REMOVE     1
#define HAVE_DECL_MLX5DV_CQ_INIT_ATTR_MASK_COMPRESSED_CQE 1
#define HAVE_DECL_MLX5DV_CQ_INIT_ATTR_MASK_CQE_SIZE 1
#define HAVE_DECL_MLX5DV_CREATE_QP 1
#define HAVE_DECL_MLX5DV_DCTYPE_DCT 1
#define HAVE_DECL_MLX5DV_DEVX_SUBSCRIBE_DEVX_EVENT 1
#define HAVE_DECL_MLX5DV_INIT_OBJ 1
#define HAVE_DECL_MLX5DV_IS_SUPPORTED 1
#define HAVE_DECL_MLX5DV_OBJ_AH   1
#define HAVE_DECL_MLX5DV_QP_CREATE_ALLOW_SCATTER_TO_CQE 1
#define HAVE_DECL_MLX5DV_UAR_ALLOC_TYPE_BF 1
#define HAVE_DECL_MLX5DV_UAR_ALLOC_TYPE_NC 1
#define HAVE_DECL_POSIX_MADV_DONTNEED 1
#define HAVE_DECL_PR_SET_PTRACER  1
#define HAVE_DECL_SOL_SOCKET      1
#define HAVE_DECL_SO_KEEPALIVE    1
#define HAVE_DECL_SPEED_UNKNOWN   1
#define HAVE_DECL_STRERROR_R      1
#define HAVE_DECL_SYS_BRK         1
#define HAVE_DECL_SYS_IPC         0
#define HAVE_DECL_SYS_MADVISE     1
#define HAVE_DECL_SYS_MMAP        1
#define HAVE_DECL_SYS_MREMAP      1
#define HAVE_DECL_SYS_MUNMAP      1
#define HAVE_DECL_SYS_SHMAT       1
#define HAVE_DECL_SYS_SHMDT       1
#define HAVE_DECL_TCP_KEEPCNT     1
#define HAVE_DECL_TCP_KEEPIDLE    1
#define HAVE_DECL_TCP_KEEPINTVL   1
#define HAVE_DECL___PPC_GET_TIMEBASE 0
#define HAVE_DECL___PPC_GET_TIMEBASE_FREQ 0
#define HAVE_DEVX                 1
#define HAVE_DLFCN_H              1
#define HAVE_HW_TIMER             1
#define HAVE_IB                   1
#define HAVE_IBV_DM               1
#define HAVE_IN6_ADDR_S6_ADDR32   1
#define HAVE_INFINIBAND_MLX5DV_H  1
#define HAVE_INFINIBAND_TM_TYPES_H 1
#define HAVE_INOTIFY              1
#define HAVE_INTTYPES_H           1
#define HAVE_IP_IP_DST            1
#define HAVE_LIBGEN_H             1
#define HAVE_LIBRT                1
#define HAVE_LINUX_FUTEX_H        1
#define HAVE_LINUX_IP_H           1
#define HAVE_LINUX_MMAN_H         1
#define HAVE_MALLOC_H             1
#define HAVE_MALLOC_TRIM          1
#define HAVE_MEMALIGN             1
#define HAVE_MLX5_DV              1
#define HAVE_MLX5_HW              1
#define HAVE_MLX5_HW_UD           1
#define HAVE_MREMAP               1
#define HAVE_NETINET_IP_H         1
#define HAVE_NET_ETHERNET_H       1
#define HAVE_NUMA                 1
#define HAVE_NUMAIF_H             1
#define HAVE_NUMA_H               1
#define HAVE_ODP                  1
#define HAVE_ODP_IMPLICIT         1
#define HAVE_POSIX_MEMALIGN       1
#define HAVE_PREFETCH             1
#define HAVE_SCHED_GETAFFINITY    1
#define HAVE_SCHED_SETAFFINITY    1
#define HAVE_SIGACTION_SA_RESTORER 1
#define HAVE_SIGEVENT_SIGEV_UN_TID 1
#define HAVE_SIGHANDLER_T         1
#define HAVE_STDINT_H             1
#define HAVE_STDIO_H              1
#define HAVE_STDLIB_H             1
#define HAVE_STRERROR_R           1
#define HAVE_STRINGS_H            1
#define HAVE_STRING_H             1
#define HAVE_STRUCT_BITMASK       1
#define HAVE_STRUCT_DL_PHDR_INFO  1
#define HAVE_STRUCT_IBV_DEVICE_ATTR_EX_PCI_ATOMIC_CAPS 1
#define HAVE_STRUCT_IBV_TM_CAPS_FLAGS 1
#define HAVE_STRUCT_MLX5DV_CQ_CQ_UAR 1
#define HAVE_SYS_EPOLL_H          1
#define HAVE_SYS_EVENTFD_H        1
#define HAVE_SYS_STAT_H           1
#define HAVE_SYS_TYPES_H          1
#define HAVE_SYS_UIO_H            1
#define HAVE_TL_DC                1
#define HAVE_TL_RC                1
#define HAVE_TL_UD                1
#define HAVE_UCM_PTMALLOC286      1
#define HAVE_UNISTD_H             1
#define HAVE_WCHAR_H              1
#define HAVE___CLEAR_CACHE        1
#define HAVE___CURBRK             1
#define HAVE___SIGHANDLER_T       1
#define IBV_HW_TM                 1
#define LT_OBJDIR                 ".libs/"
#define NVALGRIND                 1
#define PACKAGE                   "ucx"
#define PACKAGE_BUGREPORT         ""
#define PACKAGE_NAME              "ucx"
#define PACKAGE_STRING            "ucx 1.13"
#define PACKAGE_TARNAME           "ucx"
#define PACKAGE_URL               ""
#define PACKAGE_VERSION           "1.13"
#define STDC_HEADERS              1
#define STRERROR_R_CHAR_P         1
#define UCM_BISTRO_HOOKS          1
#define UCS_MAX_LOG_LEVEL         UCS_LOG_LEVEL_DEBUG
#define UCT_TCP_EP_KEEPALIVE      1
#define UCT_UD_EP_DEBUG_HOOKS     0
#define UCX_CONFIGURE_FLAGS       "--build=x86_64-linux-gnu --prefix=/usr --includedir=${prefix}/include --mandir=${prefix}/share/man --infodir=${prefix}/share/info --sysconfdir=/etc --localstatedir=/var --disable-option-checking --disable-silent-rules --libdir=${prefix}/lib/x86_64-linux-gnu --runstatedir=/run --disable-maintainer-mode --disable-dependency-tracking --enable-mt --with-verbs --disable-backtrace-detail --disable-logging --enable-devel-headers --enable-examples --enable-cma"
#define UCX_MODULE_SUBDIR         "ucx"
#define VERSION                   "1.13"
#define restrict                  __restrict__
#define test_MODULES              ":module"
#define ucm_MODULES               ""
#define ucs_MODULES               ""
#define uct_MODULES               ":ib:rdmacm:cma"
#define uct_cuda_MODULES          ""
#define uct_ib_MODULES            ""
#define uct_rocm_MODULES          ""
#define ucx_perftest_MODULES      ""
#
# Memory domain: self
#     Component: self
#             register: unlimited, cost: 0 nsec
#           remote key: 0 bytes
#
#      Transport: self
#         Device: memory0
#           Type: loopback
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 6911.00 MB/sec
#              latency: 0 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 8K
#             am_bcopy: <= 8K
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 0 bytes
#        iface address: 8 bytes
#       error handling: ep_check
#
#
# Memory domain: tcp
#     Component: tcp
#             register: unlimited, cost: 0 nsec
#           remote key: 0 bytes
#
#      Transport: tcp
#         Device: lo
#           Type: network
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 11.91/ppn + 0.00 MB/sec
#              latency: 10960 nsec
#             overhead: 50000 nsec
#            put_zcopy: <= 18446744073709551590, up to 6 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 0
#             am_short: <= 8K
#             am_bcopy: <= 8K
#             am_zcopy: <= 64K, up to 6 iov
#   am_opt_zcopy_align: <= 1
#         am_align_mtu: <= 0
#            am header: <= 8037
#           connection: to ep, to iface
#      device priority: 1
#     device num paths: 1
#              max eps: 256
#       device address: 18 bytes
#        iface address: 2 bytes
#           ep address: 10 bytes
#       error handling: peer failure, ep_check, keepalive
#
#      Transport: tcp
#         Device: eth0
#           Type: network
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 11316.36/ppn + 0.00 MB/sec
#              latency: 5206 nsec
#             overhead: 50000 nsec
#            put_zcopy: <= 18446744073709551590, up to 6 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 0
#             am_short: <= 8K
#             am_bcopy: <= 8K
#             am_zcopy: <= 64K, up to 6 iov
#   am_opt_zcopy_align: <= 1
#         am_align_mtu: <= 0
#            am header: <= 8037
#           connection: to ep, to iface
#      device priority: 0
#     device num paths: 1
#              max eps: 256
#       device address: 6 bytes
#        iface address: 2 bytes
#           ep address: 10 bytes
#       error handling: peer failure, ep_check, keepalive
#
#
# Connection manager: tcp
#      max_conn_priv: 2064 bytes
#
# Memory domain: sysv
#     Component: sysv
#             allocate: unlimited
#           remote key: 12 bytes
#           rkey_ptr is supported
#
#      Transport: sysv
#         Device: memory
#           Type: intra-node
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 12179.00 MB/sec
#              latency: 80 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 100
#             am_bcopy: <= 8256
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 8 bytes
#       error handling: ep_check
#
#
# Memory domain: posix
#     Component: posix
#             allocate: <= 183600K
#           remote key: 24 bytes
#           rkey_ptr is supported
#
#      Transport: posix
#         Device: memory
#           Type: intra-node
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 12179.00 MB/sec
#              latency: 80 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 100
#             am_bcopy: <= 8256
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 8 bytes
#       error handling: ep_check
#
# < failed to open connection manager rdmacm >
#
# Memory domain: cma
#     Component: cma
#             register: unlimited, cost: 9 nsec
#
#      Transport: cma
#         Device: memory
#           Type: intra-node
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 11145.00 MB/sec
#              latency: 80 nsec
#             overhead: 2000 nsec
#            put_zcopy: unlimited, up to 16 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 1
#            get_zcopy: unlimited, up to 16 iov
#  get_opt_zcopy_align: <= 1
#        get_align_mtu: <= 1
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 4 bytes
#       error handling: peer failure, ep_check
#
#
# UCP context
#
#     component 0  :  self
#     component 1  :  tcp
#     component 2  :  sysv
#     component 3  :  posix
#     component 4  :  ib
#     component 5  :  rdmacm
#     component 6  :  cma
#
#            md 0  :  component 0  self 
#            md 1  :  component 1  tcp 
#            md 2  :  component 2  sysv 
#            md 3  :  component 3  posix 
#            md 4  :  component 6  cma 
#
#      resource 0  :  md 0  dev 0  flags -- self/memory0
#      resource 1  :  md 1  dev 1  flags -- tcp/lo
#      resource 2  :  md 1  dev 2  flags -- tcp/eth0
#      resource 3  :  md 2  dev 3  flags -- sysv/memory
#      resource 4  :  md 3  dev 3  flags -- posix/memory
#      resource 5  :  md 4  dev 3  flags -- cma/memory
#
# memory: 0.00MB, file descriptors: 5
# create time: 1.811 ms
#
#
# UCP worker 'goldengoose-data-client:2052'
#
#                 address: 215 bytes
#                 atomics: 0:self/memory0, 3:sysv/memory, 4:posix/memory
#
# memory: 2.30MB, file descriptors: 8
# create time: 4.412 ms
#
#
# UCP endpoint 
#
#               peer: <no debug data>
#                 lane[0]:  0:self/memory0.0 md[0]          -> md[0]/self/sysdev[255] amo#0
#
#
evgeny-leksikov commented 8 months ago

@ziegenbalg it seems a passphrase is required to access the DAOS client (10.0.0.19):

$ ssh admin@35.91.1.68 -i key.pem
Enter passphrase for key 'key.pem': 
Permission denied (publickey).

Could you please check whether there is NAT on the route of the failed connection? UCX does not support NAT.

ziegenbalg commented 8 months ago

@evgeny-leksikov, could you double-check the key? I've just tried on two different machines and no passphrase is needed to decrypt the key. I did need to run 'chmod 0600 key.pem' before it allowed me to use it.

There's no NAT, all ports are being forwarded. FWIW, this does work on Google Cloud interfaces.

evgeny-leksikov commented 8 months ago

@ziegenbalg thanks, indeed it was an incomplete key file. I was able to reproduce a failure, but I'm not sure yet whether it is the same as the initial one. I will try to debug it tomorrow.

-bash-5.2# daos pool query tank
# [672564.425698] mercury->cls: [debug] ./src/na/na.c:584
 # na_plugin_open(): Opening plugin /lib/x86_64-linux-gnu//libna_plugin_ucx.so
[1705418837.320787] [ip-10-0-0-189:21283:0]            sock.c:325  UCX  ERROR     connect(fd=19, dest_addr=73.93.84.167:44593) failed: Connection refused
external ERR  # [672564.603695] mercury->msg: [error] ./src/na/na_ucx.c:1824
 # na_ucp_am_send_cb(): ucp_am_send_nbx() failed (Destination is unreachable)
external ERR  ### ----------------------
ziegenbalg commented 8 months ago

Yeah that does not look like the same error. Hmm, strange, when I ssh into the box and simply run 'daos pool query tank' I get the stack trace (both as admin and as root).

admin@ip-10-0-0-189:~$ ping 73.93.84.167
PING 73.93.84.167 (73.93.84.167) 56(84) bytes of data.
64 bytes from 73.93.84.167: icmp_seq=1 ttl=49 time=34.4 ms
64 bytes from 73.93.84.167: icmp_seq=2 ttl=49 time=33.8 ms

My daos server is up.

evgeny-leksikov commented 8 months ago

@ziegenbalg I built UCX 1.13.x from source with an additional debug print; you can find it in /root/ucx/install

-bash-5.2# LD_LIBRARY_PATH=/root/ucx/install/lib:$LD_LIBRARY_PATH daos pool query tank 
# [743203.076287] mercury->cls: [debug] ./src/na/na.c:584
 # na_plugin_open(): Opening plugin /lib/x86_64-linux-gnu//libna_plugin_ucx.so
[ip-10-0-0-189:77386:0:77386]      tcp_cm.c:603  Assertion `uct_tcp_cm_ep_accept_conn(ep)' failed: remote_addr=10.35.0.110:44593 local_addr=10.0.0.189:59725

ping 10.35.0.110 does not respond. Do you know any details about 10.35.0.110? If UCX selects the wrong interface on the remote side, can you try to exclude it using the UCX_NET_DEVICES variable?
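
For anyone trying the UCX_NET_DEVICES suggestion, here is a minimal sketch of building an environment that restricts UCX to named interfaces. The helper name `ucx_env` is hypothetical and only for illustration. It assumes UCX_NET_DEVICES takes a comma-separated list of devices to use, so listing only the wanted interfaces leaves out the rest. The same variable would have to be set for the process on the remote side as well.

```python
import os


def ucx_env(devices, base_env=None):
    """Return a copy of the environment with UCX_NET_DEVICES set.

    `devices` is a list of interface names (e.g. ["eth0"]); the copy is
    based on os.environ unless `base_env` is given explicitly.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Comma-separated list of the devices UCX is allowed to use.
    env["UCX_NET_DEVICES"] = ",".join(devices)
    return env


# Example (not executed here): rerun the failing command on eth0 only.
# import subprocess
# subprocess.run(["daos", "pool", "query", "tank"], env=ucx_env(["eth0"]))
```

The device names should match the "Device:" lines printed by ucx_info -d on that machine (lo and eth0 in the output above).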

evgeny-leksikov commented 8 months ago

@ziegenbalg do you have any update on the issue?

ziegenbalg commented 8 months ago

@evgeny-leksikov, not really. Apparently there's a patch from Intel to get this working on AWS instances. I'm trying to track that down and will report back here once I get it. In the meantime I'm trying xfrm interfaces to see if that gets around the connect issue.

ziegenbalg commented 8 months ago

Just saw your last comment. Sorry, that slipped under my radar. That 10.35.0.110 address should have been changed to the external IP address (NAT traversal issue here: https://github.com/openucx/ucx/issues/9526). Again, I'm looking to see if ipsec/xfrm interfaces solve this.

Nothing to check from your end right now. I'll report back here once I get stuck/find a viable answer.
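
To make the mismatch concrete, here is a rough sketch (a heuristic, not anything UCX itself does) that compares the address the client dialed with the address the peer reported in the assertion message. The function name `looks_like_nat` is made up for this example. It flags the case where the reported address is private while the dialed one is public, which is the pattern seen in the logs above.

```python
import ipaddress


def looks_like_nat(dialed, advertised):
    """Heuristic: the peer advertises a private address that differs
    from the public address the client actually connected to."""
    d = ipaddress.ip_address(dialed)
    a = ipaddress.ip_address(advertised)
    return d != a and a.is_private and not d.is_private


# Addresses taken from the logs in this thread:
# connect() went to 73.93.84.167, the assertion reported 10.35.0.110.
print(looks_like_nat("73.93.84.167", "10.35.0.110"))  # prints True
```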

evgeny-leksikov commented 8 months ago

@ziegenbalg thanks for the update!

evgeny-leksikov commented 7 months ago

@ziegenbalg can we close this issue as a duplicate since we have another one for NAT?

ziegenbalg commented 7 months ago

Please go ahead and close this ticket. Here are my results for a workaround.

I ended up using an IPsec connection to get around NAT traversal. Specifically for DAOS, you need to bind to an IPsec interface (I'm using an xfrm interface) as described here: https://docs.strongswan.org/docs/5.9/features/routeBasedVpn.html

Here are some sample configs:

Client swanctl.conf

connections {
   net {
      if_id_in = 42
      if_id_out = 42
      proposals = aes256-sha256-modp1024
      remote_addrs = public_endpoint.ipv4
      version = 2
      vips = 0.0.0.0

      local {
         auth = eap-tls
         cacerts = root_ca.crt
         certs = client.crt
      }
      remote {
         auth = pubkey
         id = %any
         cacerts = root_ca.crt
      }
      children {
         net-1 {
            remote_ts = 172.16.252.100/24
            local_ts = dynamic
            mode = tunnel
            if_id_in = 42
            if_id_out = 42
            start_action = start
            esp_proposals = aes256-sha256-modp1024
            policies = yes
         }
      }
   }
}

An IPsec xfrm interface needs to be present before loading the strongSwan connection. Set up the interface using:

ip link add ipsec0 type xfrm dev eth0 if_id 42
ip link set ipsec0 up
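
Since the interface has to exist before the connection is loaded, a quick existence check can save some head-scratching. This is only a sketch: `interface_exists` is a hypothetical helper, and it relies on the standard `socket.if_nametoindex` call, which raises OSError for an unknown interface name. It does not check that the interface is up or that the if_id matches.

```python
import socket


def interface_exists(name):
    """Return True if the OS knows a network interface called `name`."""
    try:
        socket.if_nametoindex(name)
        return True
    except OSError:
        # Raised when no interface with that name exists.
        return False


if not interface_exists("ipsec0"):
    print("ipsec0 is missing: create it before loading the connection")
```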

Important: you need to be running strongSwan version 5.9.13 for this to work! Please enable "charon.plugins.kernel-netlink.install_routes_xfrmi" in /etc/strongswan/strongswan.d/charon/kernel-netlink.conf. This ensures that routes are properly propagated to the xfrm interface.

From there, have DAOS select the ipsec0 interface. I know mine was specifically a DAOS issue, but this should also solve other UCX NAT traversal connection issues. Though I was really hoping not to have to use an IPsec solution, since it is quite the hammer.

Hope this helps anyone else facing the same issues.

evgeny-leksikov commented 7 months ago

NAT related issue, duplicate of https://github.com/openucx/ucx/issues/9526