openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

Self communication very slow for device buffers #6972

Open lukasm91 opened 3 years ago

lukasm91 commented 3 years ago

Describe the bug

While it is understandable that self-communication is slow for MPI_Send/MPI_Recv, it is a very common and legitimate pattern in MPI_Alltoall. Still, self-communication seems very slow and does not seem to be handled by UCX. Instead, MPI issues a huge number of small MemcpyD2D operations, which makes performance very bad (a 4x slowdown compared to running without the self-to-self transfer).


Steps to Reproduce

Setup and versions

Additional information (depending on the issue)

lukasm91 commented 3 years ago

@bureddy Please add or ask if I missed something important.

Akshay-Venkatesh commented 2 years ago

@lukasm91 Sorry for the delay. Can you share test.cpp? Does the test use a non-contiguous datatype?