Open · ShatrovOA opened this issue 3 years ago
I also ran into this issue. It seems that using non-contiguous datatypes (a subarray in my case) results in MPI performing a separate transfer under the hood for each contiguous chunk.
I think this might also happen with regular MPI without CUDA (not 100% sure; it could be a cache-related performance difference). The issue is just far more noticeable in the CUDA-aware version, since CUDA memcpy operations are blocking and carry a comparatively huge overhead.
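For illustration only (this code is not from the original post, and all sizes are invented): a minimal sketch of the kind of non-contiguous subarray datatype being described. Every row of the selected block is a separate contiguous run, so a datatype engine that does not pack on the GPU may move each run with its own host/device copy.

```c
/* Hypothetical example: a 2D subarray datatype selecting a column block
 * of a row-major matrix.  Each row contributes one contiguous run, so a
 * 4096x4096 matrix yields 4096 separate chunks -- and, with CUDA buffers,
 * potentially 4096 individual memcpys.  Sizes are made up. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int sizes[2]    = {4096, 4096};  /* full array            */
    int subsizes[2] = {4096, 512};   /* selected column block  */
    int starts[2]   = {0, 0};
    MPI_Datatype colblock;

    MPI_Type_create_subarray(2, sizes, subsizes, starts,
                             MPI_ORDER_C, MPI_DOUBLE, &colblock);
    MPI_Type_commit(&colblock);

    /* Sending one `colblock` describes 4096 non-contiguous runs of
     * 512 doubles each; a non-pipelined implementation may stage each
     * run with its own host<->device copy. */

    MPI_Type_free(&colblock);
    MPI_Finalize();
    return 0;
}
```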
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
4.0.5, shipped with the NVIDIA HPC SDK 21.2
Please describe the system on which you are running
Details of the problem
I am developing a library that uses MPI derived datatypes to send and receive aligned data. The derived datatypes are built as a combination of vector, hvector, contiguous, and resized types.
It runs fine on the CPU. I then tried to execute the code on the GPU with the help of the CUDA-aware MPI shipped with NVIDIA's HPC SDK. I noticed that when I call MPI_Alltoall with GPU buffers, MPI starts copying data from host to device; a single MPI_Alltoall call triggers more than a million such copies. It is not a surprise that the code runs very slowly.
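The issue does not include the library's actual code, so the following is only a rough reconstruction of what such a call might look like: a strided send type built from vector + resized, used in MPI_Alltoall on cudaMalloc'd buffers. The sizes and the exact datatype layout are assumptions for illustration.

```c
/* Hypothetical reproducer sketch (not the library's actual code). */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int n = 256;                     /* local block edge length (assumed) */
    MPI_Datatype vec, block;

    /* n non-contiguous runs of n doubles, strided by n*nprocs doubles */
    MPI_Type_vector(n, n, n * nprocs, MPI_DOUBLE, &vec);
    /* shrink the extent so consecutive blocks interleave in the buffer */
    MPI_Type_create_resized(vec, 0, n * sizeof(double), &block);
    MPI_Type_commit(&block);

    double *sendbuf, *recvbuf;
    size_t bytes = (size_t)n * n * nprocs * sizeof(double);
    cudaMalloc((void **)&sendbuf, bytes);
    cudaMalloc((void **)&recvbuf, bytes);

    /* With device buffers, each of the n strided runs per block may be
     * staged with its own device<->host copy if the datatype engine does
     * not pack on the GPU. */
    MPI_Alltoall(sendbuf, 1, block, recvbuf, 1, block, MPI_COMM_WORLD);

    cudaFree(sendbuf);
    cudaFree(recvbuf);
    MPI_Type_free(&block);
    MPI_Type_free(&vec);
    MPI_Finalize();
    return 0;
}
```

With a layout like this, every non-contiguous run in every block is a candidate for a separate staging copy, which would be consistent with the huge number of memcpy calls observed above.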
Can you please explain how this works? Are you aware of this behaviour?
Best regards, Oleg