Describe the bug
For MPI_Send/MPI_Recv it is understandable that self-communication is slow, but sending a block to yourself is a very common and legitimate pattern in Alltoall. Even so, self-communication in Alltoall is very slow and does not seem to be handled by UCX. Instead, MPI issues a huge number of small MemcpyD2D operations, which makes performance very bad (about a 4x slowdown compared to the same Alltoall without the self-to-self part).
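To make "self-to-self" concrete, here is a small sketch of the Alltoall data movement. It is only a single-process model under stated assumptions: it uses host memory and plain `memcpy`, makes no MPI, UCX, or CUDA calls, and the function name `alltoall_model` is hypothetical. It illustrates that with P ranks each rank keeps exactly one block for itself, so P of the P*P blocks are local copies.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Single-process model of the Alltoall exchange pattern (not real MPI).
 * `send` and `recv` each hold `nranks` per-rank buffers laid out back to
 * back; every per-rank buffer has `nranks` blocks of `count` ints.
 * Block j of rank i's send buffer lands in block i of rank j's recv buffer.
 * Returns the number of self-to-self blocks (the i == j case).
 */
static size_t alltoall_model(const int *send, int *recv,
                             size_t nranks, size_t count)
{
    assert(nranks > 0 && count > 0);
    size_t self_blocks = 0;
    for (size_t i = 0; i < nranks; ++i) {          /* sending rank   */
        for (size_t j = 0; j < nranks; ++j) {      /* receiving rank */
            const int *src = send + (i * nranks + j) * count;
            int *dst = recv + (j * nranks + i) * count;
            memcpy(dst, src, count * sizeof(int));
            if (i == j) {
                ++self_blocks;  /* this block never leaves the rank */
            }
        }
    }
    return self_blocks;
}
```

In this model the self block for rank i is simply one contiguous copy of `count` elements from block i of its send buffer to block i of its receive buffer, which is why many small copies for that part are surprising.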
Steps to Reproduce
Setup and versions
lsmod | grep nv_peer_mem
and/or gdrcopy: lsmod | grep gdrdrv
==> yes to both
Additional information (depending on the issue)