open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

HCOLL: Broken allreduce with MPI_MIN #10521

Closed devreal closed 2 years ago

devreal commented 2 years ago

I'm tracking down failures in the ARMCI test suite occurring with osc/rdma that are similar to https://github.com/open-mpi/ompi/issues/10328. I found that on my system hcoll is broken and does not properly support MPI_MIN, which breaks the detection of same_size and same_disp in osc/rdma. Here is a test case:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {

  int rank, size;
  long values[2], values2[2];

  MPI_Init(&argc, &argv);

  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  values[0] = rank;
  values[1] = -rank;

  printf("[%d] Input: %ld %ld\n", rank, values[0], values[1]);
  MPI_Allreduce(values, values2, 2, MPI_LONG, MPI_MIN, MPI_COMM_WORLD);
  printf("[%d] Output: %ld %ld\n", rank, values2[0], values2[1]);

  MPI_Finalize();

  return 0;
}

The output:

$ mpirun -n 2 ./test_allreduce
[1] Input: 1 -1
[1] Output: 0 0
[0] Input: 0 0
[0] Output: 0 0

The expected output is 0 -1 on both processes. I get the correct result if I disable hcoll:

$ mpirun -n 2 --mca coll ^hcoll ./test_allreduce
[0] Input: 0 0
[1] Input: 1 -1
[1] Output: 0 -1
[0] Output: 0 -1

If I replace MPI_MIN with MPI_SUM the output is correct:

$ mpirun -n 2 ./test_allreduce
[1] Input: 1 -1
[1] Output: 1 -1
[0] Input: 0 0
[0] Output: 1 -1

I should mention that I am seeing the following warning, which points out potential performance issues but does not hint at correctness issues:

[hawk-login03:3405267][:33:hmca_rcache_ucs_query]  UCS version mismatch. Libhcoll binary was compiled with UCS 1.8 while the runtime version of UCS is 1.11. UCS Rcache framework will be disabled. Performance of ZCOPY BCAST algorithm may be degraded. Add -x HCOLL_RCACHE=^ucs in order to suppress this message.

If I set -x HCOLL_RCACHE=^ucs the warning disappears but the result stays incorrect.

I built a current Open MPI main branch against UCX 1.11.2 provided by the system. The UCX configuration is:

$ /opt/hlrs/non-spack/mpi/openmpi/ucx/1.11.2/bin/ucx_info -c
UCX_LOG_LEVEL=WARN
UCX_LOG_FILE_FILTER=*
UCX_LOG_FILE=
UCX_LOG_FILE_SIZE=inf
UCX_LOG_FILE_ROTATE=0
UCX_LOG_BUFFER=1K
UCX_LOG_DATA_SIZE=0
UCX_LOG_PRINT_ENABLE=n
UCX_HANDLE_ERRORS=bt
UCX_ERROR_SIGNALS=ILL,SEGV,BUS,FPE
UCX_ERROR_MAIL_TO=
UCX_ERROR_MAIL_FOOTER=
UCX_GDB_COMMAND=gdb -quiet
UCX_DEBUG_SIGNO=HUP
UCX_LOG_LEVEL_TRIGGER=FATAL
UCX_WARN_UNUSED_ENV_VARS=y
UCX_ASYNC_MAX_EVENTS=1024
UCX_ASYNC_SIGNO=ALRM
UCX_VFS_ENABLE=y
UCX_PROFILE_MODE=
UCX_PROFILE_FILE=ucx_%h_%p.prof
UCX_PROFILE_LOG_SIZE=4M
UCX_RCACHE_CHECK_PFN=0
UCX_MODULE_DIR=/opt/hlrs/non-spack/mpi/openmpi/ucx/1.11.2/lib/ucx
UCX_MODULE_LOG_LEVEL=TRACE
UCX_BUILTIN_MEMCPY_MIN=auto
UCX_BUILTIN_MEMCPY_MAX=auto
UCX_MEM_LOG_LEVEL=WARN
UCX_MEM_ALLOC_ALIGN=16
UCX_MEM_EVENTS=y
UCX_MEM_MMAP_HOOK_MODE=bistro
UCX_MEM_MALLOC_HOOKS=y
UCX_MEM_MALLOC_RELOC=y
UCX_MEM_CUDA_HOOK_MODE=reloc,bistro
UCX_MEM_DYNAMIC_MMAP_THRESH=y
UCX_MEM_DLOPEN_PROCESS_RPATH=y
UCX_MEM_MODULE_UNLOAD_PREVENT_MODE=lazy
UCX_POSIX_HUGETLB_MODE=try
UCX_POSIX_DIR=/dev/shm
UCX_POSIX_USE_PROC_LINK=y
UCX_POSIX_ALLOC=md,mmap,heap
UCX_POSIX_FAILURE=DIAG
UCX_POSIX_MAX_NUM_EPS=inf
UCX_POSIX_BW=12179.00MBps
UCX_POSIX_FIFO_SIZE=64
UCX_POSIX_SEG_SIZE=8256
UCX_POSIX_FIFO_RELEASE_FACTOR=0.500
UCX_POSIX_RX_MAX_BUFS=-1
UCX_POSIX_RX_BUFS_GROW=512
UCX_POSIX_FIFO_HUGETLB=n
UCX_POSIX_FIFO_ELEM_SIZE=128
UCX_POSIX_FIFO_MAX_POLL=16
UCX_POSIX_ERROR_HANDLING=n
UCX_SYSV_HUGETLB_MODE=try
UCX_SYSV_ALLOC=md,mmap,heap
UCX_SYSV_FAILURE=DIAG
UCX_SYSV_MAX_NUM_EPS=inf
UCX_SYSV_BW=12179.00MBps
UCX_SYSV_FIFO_SIZE=64
UCX_SYSV_SEG_SIZE=8256
UCX_SYSV_FIFO_RELEASE_FACTOR=0.500
UCX_SYSV_RX_MAX_BUFS=-1
UCX_SYSV_RX_BUFS_GROW=512
UCX_SYSV_FIFO_HUGETLB=n
UCX_SYSV_FIFO_ELEM_SIZE=128
UCX_SYSV_FIFO_MAX_POLL=16
UCX_SYSV_ERROR_HANDLING=n
UCX_SELF_ALLOC=huge,thp,md,mmap,heap
UCX_SELF_FAILURE=DIAG
UCX_SELF_MAX_NUM_EPS=inf
UCX_SELF_SEG_SIZE=8K
UCX_SELF_NUM_DEVICES=1
UCX_TCP_ALLOC=huge,thp,md,mmap,heap
UCX_TCP_FAILURE=DIAG
UCX_TCP_MAX_NUM_EPS=256
UCX_TCP_TX_SEG_SIZE=8K
UCX_TCP_RX_SEG_SIZE=64K
UCX_TCP_MAX_IOV=6
UCX_TCP_SENDV_THRESH=2K
UCX_TCP_PREFER_DEFAULT=y
UCX_TCP_PUT_ENABLE=y
UCX_TCP_CONN_NB=n
UCX_TCP_MAX_POLL=16
UCX_TCP_MAX_CONN_RETRIES=25
UCX_TCP_NODELAY=y
UCX_TCP_SNDBUF=auto
UCX_TCP_RCVBUF=auto
UCX_TCP_SYN_CNT=auto
UCX_TCP_TX_MAX_BUFS=-1
UCX_TCP_TX_BUFS_GROW=8
UCX_TCP_RX_MAX_BUFS=-1
UCX_TCP_RX_BUFS_GROW=8
UCX_TCP_PORT_RANGE=0
UCX_TCP_KEEPIDLE=10000000.00us
UCX_TCP_KEEPCNT=auto
UCX_TCP_KEEPINTVL=2000000.00us
UCX_TCP_CM_FAILURE=DIAG
UCX_TCP_CM_REUSEADDR=n
UCX_TCP_CM_PRIV_DATA_LEN=2K
UCX_TCP_CM_SNDBUF=auto
UCX_TCP_CM_RCVBUF=auto
UCX_TCP_CM_SYN_CNT=auto
UCX_NET_DEVICES=all
UCX_SHM_DEVICES=all
UCX_ACC_DEVICES=all
UCX_SELF_DEVICES=all
UCX_TLS=all
UCX_ALLOC_PRIO=md:sysv,md:posix,huge,thp,md:*,mmap,heap
UCX_SOCKADDR_TLS_PRIORITY=rdmacm,tcp,sockcm
UCX_SOCKADDR_AUX_TLS=ud
UCX_SELECT_DISTANCE_MD=cuda_cpy
UCX_MEMTYPE_REG_WHOLE_ALLOC_TYPES=
UCX_WARN_INVALID_CONFIG=y
UCX_BCOPY_THRESH=0
UCX_RNDV_THRESH=auto
UCX_RNDV_SEND_NBR_THRESH=256K
UCX_RNDV_THRESH_FALLBACK=inf
UCX_RNDV_PERF_DIFF=1.000
UCX_MULTI_LANE_MAX_RATIO=4.000
UCX_MAX_EAGER_RAILS=1
UCX_MAX_RNDV_RAILS=2
UCX_RNDV_SCHEME=auto
UCX_RKEY_PTR_SEG_SIZE=512K
UCX_ZCOPY_THRESH=auto
UCX_BCOPY_BW=auto
UCX_ATOMIC_MODE=guess
UCX_ADDRESS_DEBUG_INFO=n
UCX_MAX_WORKER_ADDRESS_NAME=32
UCX_USE_MT_MUTEX=n
UCX_ADAPTIVE_PROGRESS=y
UCX_SEG_SIZE=8K
UCX_TM_THRESH=1K
UCX_TM_MAX_BB_SIZE=1K
UCX_TM_FORCE_THRESH=8K
UCX_TM_SW_RNDV=n
UCX_NUM_EPS=auto
UCX_NUM_PPN=auto
UCX_RNDV_FRAG_SIZE=512K
UCX_RNDV_PIPELINE_SEND_THRESH=inf
UCX_MEMTYPE_CACHE=y
UCX_FLUSH_WORKER_EPS=y
UCX_UNIFIED_MODE=n
UCX_CM_USE_ALL_DEVICES=y
UCX_LISTENER_BACKLOG=auto
UCX_PROTO_ENABLE=n
UCX_KEEPALIVE_INTERVAL=60000000.00us
UCX_KEEPALIVE_NUM_EPS=128
UCX_PROTO_INDIRECT_ID=auto
UCX_IB_REG_METHODS=rcache,odp,direct
UCX_IB_RCACHE_MEM_PRIO=1000
UCX_IB_RCACHE_OVERHEAD=0.18us
UCX_IB_RCACHE_ADDR_ALIGN=16
UCX_IB_RCACHE_MAX_REGIONS=inf
UCX_IB_RCACHE_MAX_SIZE=inf
UCX_IB_MEM_REG_OVERHEAD=16.00us
UCX_IB_MEM_REG_GROWTH=0.00us
UCX_IB_FORK_INIT=try
UCX_IB_ASYNC_EVENTS=y
UCX_IB_ETH_PAUSE_ON=y
UCX_IB_ODP_NUMA_POLICY=preferred
UCX_IB_ODP_PREFETCH=n
UCX_IB_ODP_MAX_SIZE=auto
UCX_IB_DEVICE_SPECS=
UCX_IB_PREFER_NEAREST_DEVICE=y
UCX_IB_INDIRECT_ATOMIC=y
UCX_IB_GID_INDEX=auto
UCX_IB_SUBNET_PREFIX=
UCX_IB_GPU_DIRECT_RDMA=try
UCX_IB_MAX_INLINE_KLM_LIST=inf
UCX_IB_PCI_BW=
UCX_IB_MLX5_DEVX=try
UCX_IB_MLX5_DEVX_OBJECTS=rcqp,rcsrq,dct,dcsrq,dci
UCX_IB_REG_MT_THRESH=4G
UCX_IB_REG_MT_CHUNK=2G
UCX_IB_REG_MT_BIND=n
UCX_IB_PCI_RELAXED_ORDERING=auto
UCX_RC_VERBS_ALLOC=huge,thp,md,mmap,heap
UCX_RC_VERBS_FAILURE=DIAG
UCX_RC_VERBS_MAX_NUM_EPS=256
UCX_RC_VERBS_SEG_SIZE=8256
UCX_RC_VERBS_TX_QUEUE_LEN=256
UCX_RC_VERBS_TX_MAX_BATCH=16
UCX_RC_VERBS_TX_MAX_POLL=16
UCX_RC_VERBS_TX_MIN_INLINE=64
UCX_RC_VERBS_TX_INLINE_RESP=64
UCX_RC_VERBS_TX_MIN_SGE=4
UCX_RC_VERBS_TX_EVENT_MOD_COUNT=0
UCX_RC_VERBS_TX_EVENT_MOD_PERIOD=0.00us
UCX_RC_VERBS_RX_EVENT_MOD_COUNT=0
UCX_RC_VERBS_RX_EVENT_MOD_PERIOD=0.00us
UCX_RC_VERBS_TX_MAX_BUFS=-1
UCX_RC_VERBS_TX_BUFS_GROW=1024
UCX_RC_VERBS_RX_QUEUE_LEN=4095
UCX_RC_VERBS_RX_MAX_BATCH=16
UCX_RC_VERBS_RX_MAX_POLL=16
UCX_RC_VERBS_RX_INLINE=64
UCX_RC_VERBS_RX_MAX_BUFS=-1
UCX_RC_VERBS_RX_BUFS_GROW=0
UCX_RC_VERBS_ADDR_TYPE=auto
UCX_RC_VERBS_IS_GLOBAL=n
UCX_RC_VERBS_SL=auto
UCX_RC_VERBS_TRAFFIC_CLASS=auto
UCX_RC_VERBS_HOP_LIMIT=255
UCX_RC_VERBS_NUM_PATHS=auto
UCX_RC_VERBS_ROCE_LOCAL_SUBNET=n
UCX_RC_VERBS_ROCE_PATH_FACTOR=1
UCX_RC_VERBS_LID_PATH_BITS=0
UCX_RC_VERBS_PKEY=auto
UCX_RC_VERBS_PATH_MTU=default
UCX_RC_VERBS_MAX_RD_ATOMIC=4
UCX_RC_VERBS_TIMEOUT=1000000.00us
UCX_RC_VERBS_RETRY_COUNT=7
UCX_RC_VERBS_RNR_TIMEOUT=1000.00us
UCX_RC_VERBS_RNR_RETRY_COUNT=7
UCX_RC_VERBS_FC_ENABLE=y
UCX_RC_VERBS_FC_WND_SIZE=512
UCX_RC_VERBS_FC_HARD_THRESH=0.250
UCX_RC_VERBS_OOO_RW=n
UCX_RC_VERBS_FENCE=auto
UCX_RC_VERBS_MAX_GET_ZCOPY=auto
UCX_RC_VERBS_TX_NUM_GET_BYTES=inf
UCX_RC_VERBS_TX_POLL_ALWAYS=n
UCX_RC_VERBS_FC_SOFT_THRESH=0.500
UCX_RC_VERBS_TX_CQ_MODERATION=64
UCX_RC_VERBS_TX_CQ_LEN=4096
UCX_RC_VERBS_MAX_AM_HDR=128
UCX_RC_VERBS_TX_MAX_WR=inf
UCX_RC_VERBS_FLUSH_MODE=auto
UCX_UD_VERBS_ALLOC=huge,thp,md,mmap,heap
UCX_UD_VERBS_FAILURE=DIAG
UCX_UD_VERBS_MAX_NUM_EPS=inf
UCX_UD_VERBS_SEG_SIZE=8K
UCX_UD_VERBS_TX_QUEUE_LEN=256
UCX_UD_VERBS_TX_MAX_BATCH=16
UCX_UD_VERBS_TX_MAX_POLL=16
UCX_UD_VERBS_TX_MIN_INLINE=64
UCX_UD_VERBS_TX_INLINE_RESP=0
UCX_UD_VERBS_TX_MIN_SGE=4
UCX_UD_VERBS_TX_EVENT_MOD_COUNT=0
UCX_UD_VERBS_TX_EVENT_MOD_PERIOD=0.00us
UCX_UD_VERBS_RX_EVENT_MOD_COUNT=0
UCX_UD_VERBS_RX_EVENT_MOD_PERIOD=0.00us
UCX_UD_VERBS_TX_MAX_BUFS=-1
UCX_UD_VERBS_TX_BUFS_GROW=1024
UCX_UD_VERBS_RX_QUEUE_LEN=4096
UCX_UD_VERBS_RX_MAX_BATCH=16
UCX_UD_VERBS_RX_MAX_POLL=16
UCX_UD_VERBS_RX_INLINE=0
UCX_UD_VERBS_RX_MAX_BUFS=-1
UCX_UD_VERBS_RX_BUFS_GROW=0
UCX_UD_VERBS_ADDR_TYPE=auto
UCX_UD_VERBS_IS_GLOBAL=n
UCX_UD_VERBS_SL=auto
UCX_UD_VERBS_TRAFFIC_CLASS=auto
UCX_UD_VERBS_HOP_LIMIT=255
UCX_UD_VERBS_NUM_PATHS=auto
UCX_UD_VERBS_ROCE_LOCAL_SUBNET=n
UCX_UD_VERBS_ROCE_PATH_FACTOR=1
UCX_UD_VERBS_LID_PATH_BITS=0
UCX_UD_VERBS_PKEY=auto
UCX_UD_VERBS_PATH_MTU=default
UCX_UD_VERBS_RX_QUEUE_LEN_INIT=128
UCX_UD_VERBS_TIMEOUT=300000000.00us
UCX_UD_VERBS_TIMER_TICK=10000.00us
UCX_UD_VERBS_TIMER_BACKOFF=2.000
UCX_UD_VERBS_ASYNC_TIMER_TICK=100000.00us
UCX_UD_VERBS_MIN_POKE_TIME=250000.00us
UCX_UD_VERBS_ETH_DGID_CHECK=y
UCX_UD_VERBS_MAX_WINDOW=1025
UCX_UD_VERBS_RX_ASYNC_MAX_POLL=64
UCX_CMA_ALLOC=huge,thp,mmap,heap
UCX_CMA_FAILURE=DIAG
UCX_CMA_MAX_NUM_EPS=inf
UCX_CMA_BW=11145.00MBps
UCX_CMA_MAX_IOV=16
UCX_CMA_SEG_SIZE=512K
UCX_CMA_TX_QUOTA=1
UCX_CMA_TX_MAX_BUFS=-1
UCX_CMA_TX_BUFS_GROW=8
UCX_KNEM_ALLOC=huge,thp,md,mmap,heap
UCX_KNEM_FAILURE=DIAG
UCX_KNEM_MAX_NUM_EPS=inf
UCX_KNEM_BW=13862.00MBps
UCX_KNEM_MAX_IOV=16
UCX_KNEM_SEG_SIZE=512K
UCX_KNEM_TX_QUOTA=1
UCX_KNEM_TX_MAX_BUFS=-1
UCX_KNEM_TX_BUFS_GROW=8
UCX_KNEM_RCACHE=try
UCX_KNEM_RCACHE_MEM_PRIO=1000
UCX_KNEM_RCACHE_OVERHEAD=0.18us
UCX_KNEM_RCACHE_ADDR_ALIGN=64
UCX_KNEM_RCACHE_MAX_REGIONS=inf
UCX_KNEM_RCACHE_MAX_SIZE=inf
UCX_XPMEM_HUGETLB_MODE=try
UCX_XPMEM_ALLOC=md,mmap,heap
UCX_XPMEM_FAILURE=DIAG
UCX_XPMEM_MAX_NUM_EPS=inf
UCX_XPMEM_BW=12179.00MBps
UCX_XPMEM_FIFO_SIZE=64
UCX_XPMEM_SEG_SIZE=8256
UCX_XPMEM_FIFO_RELEASE_FACTOR=0.500
UCX_XPMEM_RX_MAX_BUFS=-1
UCX_XPMEM_RX_BUFS_GROW=512
UCX_XPMEM_FIFO_HUGETLB=n
UCX_XPMEM_FIFO_ELEM_SIZE=128
UCX_XPMEM_FIFO_MAX_POLL=16
UCX_XPMEM_ERROR_HANDLING=n

and

$ /opt/hlrs/non-spack/mpi/openmpi/ucx/1.11.2/bin/ucx_info -b
#define UCX_CONFIG_H              
#define ENABLE_ASSERT             1
#define ENABLE_BUILTIN_MEMCPY     1
#define ENABLE_DEBUG_DATA         0
#define ENABLE_MT                 1
#define ENABLE_PARAMS_CHECK       1
#define HAVE_1_ARG_BFD_SECTION_SIZE 0
#define HAVE_ALLOCA               1
#define HAVE_ALLOCA_H             1
#define HAVE_ATTRIBUTE_NOOPTIMIZE 1
#define HAVE_CLEARENV             1
#define HAVE_CPLUS_DEMANGLE       1
#define HAVE_CPU_SET_T            1
#define HAVE_DECL_ASPRINTF        1
#define HAVE_DECL_BASENAME        1
#define HAVE_DECL_BFD_GET_SECTION_FLAGS 1
#define HAVE_DECL_BFD_GET_SECTION_VMA 1
#define HAVE_DECL_BFD_SECTION_FLAGS 0
#define HAVE_DECL_BFD_SECTION_VMA 1
#define HAVE_DECL_CPU_ISSET       1
#define HAVE_DECL_CPU_ZERO        1
#define HAVE_DECL_ETHTOOL_CMD_SPEED 1
#define HAVE_DECL_FMEMOPEN        1
#define HAVE_DECL_FUSE_MOUNT      0
#define HAVE_DECL_FUSE_OPEN_CHANNEL 0
#define HAVE_DECL_FUSE_UNMOUNT    0
#define HAVE_DECL_F_SETOWN_EX     1
#define HAVE_DECL_IBV_ACCESS_ON_DEMAND 1
#define HAVE_DECL_IBV_ACCESS_RELAXED_ORDERING 0
#define HAVE_DECL_IBV_ADVISE_MR   0
#define HAVE_DECL_IBV_ALLOC_DM    0
#define HAVE_DECL_IBV_ALLOC_TD    0
#define HAVE_DECL_IBV_CMD_MODIFY_QP 1
#define HAVE_DECL_IBV_CREATE_CQ_ATTR_IGNORE_OVERRUN 0
#define HAVE_DECL_IBV_CREATE_QP_EX 1
#define HAVE_DECL_IBV_CREATE_SRQ  1
#define HAVE_DECL_IBV_CREATE_SRQ_EX 1
#define HAVE_DECL_IBV_EVENT_GID_CHANGE 1
#define HAVE_DECL_IBV_EVENT_TYPE_STR 1
#define HAVE_DECL_IBV_EXP_ACCESS_ALLOCATE_MR 1
#define HAVE_DECL_IBV_EXP_ACCESS_ON_DEMAND 1
#define HAVE_DECL_IBV_EXP_ALLOC_DM 1
#define HAVE_DECL_IBV_EXP_ATOMIC_HCA_REPLY_BE 1
#define HAVE_DECL_IBV_EXP_CQ_IGNORE_OVERRUN 1
#define HAVE_DECL_IBV_EXP_CQ_MODERATION 1
#define HAVE_DECL_IBV_EXP_CREATE_QP 1
#define HAVE_DECL_IBV_EXP_CREATE_SRQ 1
#define HAVE_DECL_IBV_EXP_DCT_OOO_RW_DATA_PLACEMENT 1
#define HAVE_DECL_IBV_EXP_DEVICE_ATTR_PCI_ATOMIC_CAPS 1
#define HAVE_DECL_IBV_EXP_DEVICE_ATTR_RESERVED_2 1
#define HAVE_DECL_IBV_EXP_DEVICE_DC_TRANSPORT 1
#define HAVE_DECL_IBV_EXP_DEVICE_MR_ALLOCATE 1
#define HAVE_DECL_IBV_EXP_MR_FIXED_BUFFER_SIZE 1
#define HAVE_DECL_IBV_EXP_MR_INDIRECT_KLMS 1
#define HAVE_DECL_IBV_EXP_ODP_SUPPORT_IMPLICIT 1
#define HAVE_DECL_IBV_EXP_POST_SEND 1
#define HAVE_DECL_IBV_EXP_PREFETCH_MR 1
#define HAVE_DECL_IBV_EXP_PREFETCH_WRITE_ACCESS 1
#define HAVE_DECL_IBV_EXP_QPT_DC_INI 1
#define HAVE_DECL_IBV_EXP_QP_CREATE_UMR 1
#define HAVE_DECL_IBV_EXP_QP_INIT_ATTR_ATOMICS_ARG 1
#define HAVE_DECL_IBV_EXP_QP_OOO_RW_DATA_PLACEMENT 1
#define HAVE_DECL_IBV_EXP_QUERY_DEVICE 1
#define HAVE_DECL_IBV_EXP_QUERY_GID_ATTR 1
#define HAVE_DECL_IBV_EXP_REG_MR  1
#define HAVE_DECL_IBV_EXP_SEND_EXT_ATOMIC_INLINE 1
#define HAVE_DECL_IBV_EXP_SETENV  1
#define HAVE_DECL_IBV_EXP_WR_EXT_MASKED_ATOMIC_CMP_AND_SWP 1
#define HAVE_DECL_IBV_EXP_WR_EXT_MASKED_ATOMIC_FETCH_AND_ADD 1
#define HAVE_DECL_IBV_EXP_WR_NOP  1
#define HAVE_DECL_IBV_GET_ASYNC_EVENT 1
#define HAVE_DECL_IBV_GET_DEVICE_NAME 1
#define HAVE_DECL_IBV_LINK_LAYER_ETHERNET 1
#define HAVE_DECL_IBV_LINK_LAYER_INFINIBAND 1
#define HAVE_DECL_IBV_ODP_SUPPORT_IMPLICIT 0
#define HAVE_DECL_IBV_QPF_GRH_REQUIRED 0
#define HAVE_DECL_IBV_QUERY_DEVICE_EX 1
#define HAVE_DECL_IBV_QUERY_GID   1
#define HAVE_DECL_IBV_WC_STATUS_STR 1
#define HAVE_DECL_INOTIFY_ADD_WATCH 1
#define HAVE_DECL_INOTIFY_INIT    1
#define HAVE_DECL_IN_ATTRIB       1
#define HAVE_DECL_IPPROTO_TCP     1
#define HAVE_DECL_MADV_FREE       1
#define HAVE_DECL_MADV_REMOVE     1
#define HAVE_DECL_POSIX_MADV_DONTNEED 1
#define HAVE_DECL_PR_SET_PTRACER  1
#define HAVE_DECL_SOL_SOCKET      1
#define HAVE_DECL_SO_KEEPALIVE    1
#define HAVE_DECL_SPEED_UNKNOWN   1
#define HAVE_DECL_STRERROR_R      1
#define HAVE_DECL_SYS_BRK         1
#define HAVE_DECL_SYS_IPC         0
#define HAVE_DECL_SYS_MADVISE     1
#define HAVE_DECL_SYS_MMAP        1
#define HAVE_DECL_SYS_MREMAP      1
#define HAVE_DECL_SYS_MUNMAP      1
#define HAVE_DECL_SYS_SHMAT       1
#define HAVE_DECL_SYS_SHMDT       1
#define HAVE_DECL_TCP_KEEPCNT     1
#define HAVE_DECL_TCP_KEEPIDLE    1
#define HAVE_DECL_TCP_KEEPINTVL   1
#define HAVE_DECL___PPC_GET_TIMEBASE_FREQ 0
#define HAVE_DETAILED_BACKTRACE   1
#define HAVE_DLFCN_H              1
#define HAVE_EXP_UMR              1
#define HAVE_EXP_UMR_KSM          1
#define HAVE_HW_TIMER             1
#define HAVE_IB                   1
#define HAVE_IBV_DM               1
#define HAVE_IBV_EXP_DM           1
#define HAVE_IBV_EXP_QP_CREATE_UMR 1
#define HAVE_IB_EXT_ATOMICS       1
#define HAVE_IN6_ADDR_S6_ADDR32   1
#define HAVE_INOTIFY              1
#define HAVE_INTTYPES_H           1
#define HAVE_IP_IP_DST            1
#define HAVE_LIBGEN_H             1
#define HAVE_LIBRT                1
#define HAVE_LINUX_FUTEX_H        1
#define HAVE_LINUX_IP_H           1
#define HAVE_LINUX_MMAN_H         1
#define HAVE_MALLOC_H             1
#define HAVE_MALLOC_HOOK          1
#define HAVE_MALLOC_TRIM          1
#define HAVE_MASKED_ATOMICS_ENDIANNESS 1
#define HAVE_MEMALIGN             1
#define HAVE_MEMORY_H             1
#define HAVE_MREMAP               1
#define HAVE_NETINET_IP_H         1
#define HAVE_NET_ETHERNET_H       1
#define HAVE_NUMA                 1
#define HAVE_NUMAIF_H             1
#define HAVE_NUMA_H               1
#define HAVE_ODP                  1
#define HAVE_ODP_IMPLICIT         1
#define HAVE_POSIX_MEMALIGN       1
#define HAVE_PREFETCH             1
#define HAVE_SCHED_GETAFFINITY    1
#define HAVE_SCHED_SETAFFINITY    1
#define HAVE_SIGACTION_SA_RESTORER 1
#define HAVE_SIGEVENT_SIGEV_UN_TID 1
#define HAVE_SIGHANDLER_T         1
#define HAVE_STDINT_H             1
#define HAVE_STDLIB_H             1
#define HAVE_STRERROR_R           1
#define HAVE_STRINGS_H            1
#define HAVE_STRING_H             1
#define HAVE_STRUCT_BITMASK       1
#define HAVE_STRUCT_DL_PHDR_INFO  1
#define HAVE_STRUCT_IBV_ASYNC_EVENT_ELEMENT_DCT 1
#define HAVE_STRUCT_IBV_EXP_CREATE_SRQ_ATTR_DC_OFFLOAD_PARAMS 1
#define HAVE_STRUCT_IBV_EXP_DEVICE_ATTR_EXP_DEVICE_CAP_FLAGS 1
#define HAVE_STRUCT_IBV_EXP_DEVICE_ATTR_ODP_CAPS 1
#define HAVE_STRUCT_IBV_EXP_DEVICE_ATTR_ODP_CAPS_PER_TRANSPORT_CAPS_DC_ODP_CAPS 1
#define HAVE_STRUCT_IBV_EXP_DEVICE_ATTR_ODP_MR_MAX_SIZE 1
#define HAVE_STRUCT_IBV_EXP_QP_INIT_ATTR_MAX_INL_RECV 1
#define HAVE_SYS_EPOLL_H          1
#define HAVE_SYS_EVENTFD_H        1
#define HAVE_SYS_STAT_H           1
#define HAVE_SYS_TYPES_H          1
#define HAVE_SYS_UIO_H            1
#define HAVE_TL_RC                1
#define HAVE_TL_UD                1
#define HAVE_UCM_PTMALLOC286      1
#define HAVE_UNISTD_H             1
#define HAVE_VERBS_EXP_H          1
#define HAVE___CLEAR_CACHE        1
#define HAVE___CURBRK             1
#define HAVE___SIGHANDLER_T       1
#define IBV_HW_TM                 1
#define LT_OBJDIR                 ".libs/"
#define NVALGRIND                 1
#define PACKAGE                   "ucx"
#define PACKAGE_BUGREPORT         ""
#define PACKAGE_NAME              "ucx"
#define PACKAGE_STRING            "ucx 1.11"
#define PACKAGE_TARNAME           "ucx"
#define PACKAGE_URL               ""
#define PACKAGE_VERSION           "1.11"
#define STDC_HEADERS              1
#define STRERROR_R_CHAR_P         1
#define UCM_BISTRO_HOOKS          1
#define UCS_MAX_LOG_LEVEL         UCS_LOG_LEVEL_TRACE_POLL
#define UCT_TCP_EP_KEEPALIVE      1
#define UCT_UD_EP_DEBUG_HOOKS     0
#define UCX_CONFIGURE_FLAGS       "--prefix=/opt/hlrs/non-spack/mpi/openmpi/ucx/1.11.2 --enable-mt --with-xpmem=/opt/hlrs/non-spack/mpi/openmpi/xpmem/2020-09-25-cae8601"
#define UCX_MODULE_SUBDIR         "ucx"
#define VERSION                   "1.11"
#define restrict                  __restrict
#define test_MODULES              ":module"
#define ucm_MODULES               ""
#define ucs_MODULES               ""
#define uct_MODULES               ":ib:cma:knem:xpmem"
#define uct_cuda_MODULES          ""
#define uct_ib_MODULES            ""
#define uct_rocm_MODULES          ""
#define ucx_perftest_MODULES      ""

For the giggles: There seems to be another bug that occurs if I enable hcoll as the only collective implementation (on a debug build):

$ mpirun -n 2 --mca coll hcoll ./test_allreduce
[hawk-login03:3390057] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:243 - mca_coll_hcoll_module_enable() coll_hcol: mca_coll_hcoll_save_coll_handlers failed
test_allreduce: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:127: mca_coll_hcoll_module_destruct: Assertion `OPAL_OBJ_MAGIC_ID == ((opal_object_t *) (hcoll_module->previous_reduce_scatter_block_module))->obj_magic_id' failed.
[hawk-login03:3390056] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:243 - mca_coll_hcoll_module_enable() coll_hcol: mca_coll_hcoll_save_coll_handlers failed
test_allreduce: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:127: mca_coll_hcoll_module_destruct: Assertion `OPAL_OBJ_MAGIC_ID == ((opal_object_t *) (hcoll_module->previous_reduce_scatter_block_module))->obj_magic_id' failed.
[hawk-login03:3390056] *** Process received signal ***
[hawk-login03:3390056] Signal: Aborted (6)
[hawk-login03:3390056] Signal code:  (-6)
[hawk-login03:3390057] *** Process received signal ***
[hawk-login03:3390057] Signal: Aborted (6)
[hawk-login03:3390057] Signal code:  (-6)
[hawk-login03:3390057] [ 0] [hawk-login03:3390056] [ 0] /lib64/libpthread.so.0(+0x12b20)[0x7f6b05c58b20]
[hawk-login03:3390056] [ 1] /lib64/libpthread.so.0(+0x12b20)[0x7f7919e13b20]
[hawk-login03:3390057] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x7f6b058b837f]
[hawk-login03:3390056] [ 2] /lib64/libc.so.6(gsignal+0x10f)[0x7f7919a7337f]
[hawk-login03:3390057] [ 2] /lib64/libc.so.6(abort+0x127)[0x7f7919a5ddb5]
[hawk-login03:3390057] [ 3] /lib64/libc.so.6(abort+0x127)[0x7f6b058a2db5]
[hawk-login03:3390056] [ 3] /lib64/libc.so.6(+0x21c89)[0x7f6b058a2c89]
[hawk-login03:3390056] [ 4] /lib64/libc.so.6(+0x21c89)[0x7f7919a5dc89]
[hawk-login03:3390057] [ 4] /lib64/libc.so.6(+0x2fa76)[0x7f6b058b0a76]
[hawk-login03:3390056] [ 5] /lib64/libc.so.6(+0x2fa76)[0x7f7919a6ba76]
[hawk-login03:3390057] [ 5] /zhome/academic/HLRS/hlrs/hpcjschu/opt-hawk/openmpi-main-ucx-dbg/lib/libmpi.so.0(+0x14fade)[0x7f791a170ade]
[hawk-login03:3390057] [ 6] /zhome/academic/HLRS/hlrs/hpcjschu/opt-hawk/openmpi-main-ucx-dbg/lib/libmpi.so.0(+0x14fade)[0x7f6b05fb5ade]
[hawk-login03:3390056] [ 6] /zhome/academic/HLRS/hlrs/hpcjschu/opt-hawk/openmpi-main-ucx-dbg/lib/libmpi.so.0(mca_coll_base_comm_select+0xa5c1)[0x7f6b05f96848]
[hawk-login03:3390056] [ 7] /zhome/academic/HLRS/hlrs/hpcjschu/opt-hawk/openmpi-main-ucx-dbg/lib/libmpi.so.0(mca_coll_base_comm_select+0xa5c1)[0x7f791a151848]
[hawk-login03:3390057] [ 7] /zhome/academic/HLRS/hlrs/hpcjschu/opt-hawk/openmpi-main-ucx-dbg/lib/libmpi.so.0(ompi_mpi_init+0x54a)[0x7f791a0cbd0f]
[hawk-login03:3390057] [ 8] /zhome/academic/HLRS/hlrs/hpcjschu/opt-hawk/openmpi-main-ucx-dbg/lib/libmpi.so.0(ompi_mpi_init+0x54a)[0x7f6b05f10d0f]
[hawk-login03:3390056] [ 8] /zhome/academic/HLRS/hlrs/hpcjschu/opt-hawk/openmpi-main-ucx-dbg/lib/libmpi.so.0(MPI_Init+0x6c)[0x7f6b05f53847]
[hawk-login03:3390056] [ 9] ./test_allreduce[0x4010a9]
[hawk-login03:3390056] [10] /zhome/academic/HLRS/hlrs/hpcjschu/opt-hawk/openmpi-main-ucx-dbg/lib/libmpi.so.0(MPI_Init+0x6c)[0x7f791a10e847]
[hawk-login03:3390057] [ 9] ./test_allreduce[0x4010a9]
[hawk-login03:3390057] [10] /lib64/libc.so.6(__libc_start_main+0xf3)[0x7f6b058a4493]
[hawk-login03:3390056] [11] ./test_allreduce[0x4011ae]
[hawk-login03:3390056] *** End of error message ***
/lib64/libc.so.6(__libc_start_main+0xf3)[0x7f7919a5f493]
[hawk-login03:3390057] [11] ./test_allreduce[0x4011ae]
[hawk-login03:3390057] *** End of error message ***
--------------------------------------------------------------------------
prterun noticed that process rank 1 with PID 0 on node hawk-login03 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
janjust commented 2 years ago

@devreal which HCOLL version? Is this running on multiple nodes?

devreal commented 2 years ago

HCOLL version is 4.6.3125-fc613b79 and I'm running on a single node.

janjust commented 2 years ago

@devreal regarding the "for the giggles" experiment: if you set -mca coll hcoll,libnbc,basic, does it still fail? I get a failure too with -mca coll hcoll, but setting -mca coll hcoll,libnbc,basic works, in part because hcoll doesn't support all collectives. It shouldn't segfault, though.

As for the reproducer: yep, I can reproduce it, and we'll look into it. Thanks for reporting.

devreal commented 2 years ago

@janjust Thanks for looking into this. I can confirm that setting -mca coll hcoll,libnbc,basic avoids the assertion.

vspetrov commented 2 years ago

@janjust looks like we incorrectly map OPAL_DATATYPE_LONG to DTE_UINT32 in mca/coll/hcoll/coll_hcoll_dtypes.h (it should be DTE_INT32, i.e. signed). The -1 in the test must get cast to UINT32_MAX, hence the wrong result.

janjust commented 2 years ago

@vspetrov why 32? Shouldn't a long be 64? Ah, I see now: that's for cases when long is 4 bytes.

janjust commented 2 years ago

I'll push a PR shortly

vspetrov commented 2 years ago

I think we have it incorrect for both branches (sizeof_long == 4 and == 8)

awlauria commented 2 years ago

Closing - PRs merged.