Self communication very slow for device buffers

lukasm91 commented 3 years ago

Describe the bug

For MPI_Send/MPI_Recv it is understandable that self-communication is slow, but for Alltoall a self-to-self block is a very common and legitimate pattern. Still, self-communication on device buffers is very slow and does not appear to be handled by UCX. Instead, MPI issues a huge number of small MemcpyD2D operations, which makes performance very bad (a 4x slowdown compared to a run with no self-to-self communication).
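For readers less familiar with the pattern, here is a minimal host-only sketch of the Alltoall data layout, showing why every rank always has one self-to-self block. It is not the test.cpp from this report and makes no MPI, CUDA, or UCX calls; the names `simulate_alltoall` and `self_block_offset` are hypothetical helpers introduced only for illustration, and the buffers are assumed to be contiguous `int` blocks of equal size.

```cpp
#include <cstddef>
#include <vector>

// Host-only model of the MPI_Alltoall data layout (illustration only).
// sendbufs[r] is rank r's send buffer: nranks blocks of `count` elements,
// where block d is destined for rank d.
std::vector<std::vector<int>> simulate_alltoall(
    const std::vector<std::vector<int>>& sendbufs, std::size_t count) {
    const std::size_t nranks = sendbufs.size();
    std::vector<std::vector<int>> recvbufs(
        nranks, std::vector<int>(nranks * count));
    for (std::size_t src = 0; src < nranks; ++src) {
        for (std::size_t dst = 0; dst < nranks; ++dst) {
            // Block dst of src's send buffer lands in block src of dst's
            // receive buffer. When src == dst this is the self-to-self
            // block: a purely local copy within one rank.
            for (std::size_t i = 0; i < count; ++i) {
                recvbufs[dst][src * count + i] = sendbufs[src][dst * count + i];
            }
        }
    }
    return recvbufs;
}

// Element offset of the block a rank exchanges with itself. The offset is
// the same in the send buffer and in the receive buffer.
std::size_t self_block_offset(std::size_t rank, std::size_t count) {
    return rank * count;
}
```

With two ranks and `count = 2`, send buffers `{0,1,2,3}` and `{10,11,12,13}` become receive buffers `{0,1,10,11}` and `{2,3,12,13}`; the blocks `{0,1}` on rank 0 and `{12,13}` on rank 1 never leave their rank. In the setup described here, with 8 GPUs, that self block is one of the eight blocks in each rank's buffer.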


Steps to Reproduce

mpic++ -o out test.cpp

# UCT version=1.11.0 revision 6ccb419
# configured with: --prefix=/usr/local/ucx --disable-assertions --disable-backtrace-detail --disable-debug --disable-doxygen-doc --disable-logging --disable-params-check --disable-static --with-cuda=/usr/local/cuda --with-gdrcopy=/usr/local/gdrcopy --with-knem=/usr/local/knem

Environment

UCX_TLS=rc_x,mm,cuda_copy,gdr_copy,cuda_ipc
UCX_MEMTYPE_CACHE=n
OMPI_MCA_pml=ucx
OMPI_MCA_btl="^vader,tcp,openib,smcuda"

Setup and versions

For GPU related issues:
- 8x A100 80 GB
- CUDA:
  - Driver version 460.73.01
  - peer-direct (lsmod|grep nv_peer_mem) and gdrcopy (lsmod|grep gdrdrv) are both loaded

Additional information (depending on the issue)

Open MPI: 4.1.1rc4
UCT version=1.11.0 revision 6ccb419

lukasm91 commented 3 years ago

@bureddy Please add or ask if I missed something important.

Akshay-Venkatesh commented 2 years ago

@lukasm91 Sorry for the delay. Can you share test.cpp? Does the test use a non-contiguous datatype?
