open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Derived types with CUDA-Aware MPI #8720

Open · ShatrovOA opened this issue 3 years ago

ShatrovOA commented 3 years ago

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

4.0.5, shipped with the NVIDIA HPC SDK 21.2

Please describe the system on which you are running


Details of the problem

I am developing a library that uses MPI derived datatypes to send and receive aligned data. The derived datatypes are built as a combination of vector, hvector, contiguous, and resized types.
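For reference, a minimal sketch of how such a type might be assembled. The counts, strides, and the helper name here are hypothetical, not the library's actual code:

```c
/* Hypothetical construction: counts, strides and names are made up
 * for illustration only. */
#include <mpi.h>

MPI_Datatype build_example_type(int nx, int ny, int nz,
                                MPI_Aint plane_stride_bytes,
                                MPI_Aint new_extent_bytes)
{
    MPI_Datatype blk, vec, hvec, resized;

    /* a contiguous run of nx doubles */
    MPI_Type_contiguous(nx, MPI_DOUBLE, &blk);

    /* ny such runs, taking every second block position */
    MPI_Type_vector(ny, 1, 2, blk, &vec);

    /* nz planes of that pattern, strided by an explicit byte count */
    MPI_Type_create_hvector(nz, 1, plane_stride_bytes, vec, &hvec);

    /* shrink the extent so consecutive elements in a collective line up */
    MPI_Type_create_resized(hvec, 0, new_extent_bytes, &resized);
    MPI_Type_commit(&resized);

    /* intermediate types are not needed once the final type is committed */
    MPI_Type_free(&blk);
    MPI_Type_free(&vec);
    MPI_Type_free(&hvec);
    return resized;
}
```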

It runs fine on the CPU. I then tried to run the code on the GPU using the CUDA-aware MPI shipped with NVIDIA's HPC SDK, and noticed that when I call MPI_Alltoall with GPU buffers, MPI starts issuing host-to-device copies. A single MPI_Alltoall call produces more than a million such copies, so it is no surprise that the code runs very slowly.
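A minimal sketch of the call pattern that triggers this (buffer sizes and the helper name are hypothetical). With a non-contiguous derived type and device buffers, Open MPI appears to stage each contiguous chunk separately, which would match the flood of host-to-device copies in the trace:

```c
/* Sketch of the failing pattern; sizes are hypothetical. */
#include <mpi.h>
#include <cuda_runtime.h>

void exchange(MPI_Comm comm, MPI_Datatype dtype,
              size_t bytes_per_rank, int nranks)
{
    void *d_send = NULL, *d_recv = NULL;
    cudaMalloc(&d_send, bytes_per_rank * (size_t)nranks);
    cudaMalloc(&d_recv, bytes_per_rank * (size_t)nranks);

    /* one element of the non-contiguous derived type per destination rank;
     * with device buffers this is where the host-to-device copies appear */
    MPI_Alltoall(d_send, 1, dtype, d_recv, 1, dtype, comm);

    cudaFree(d_send);
    cudaFree(d_recv);
}
```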

Can you please explain how this works? Are you aware of such behaviour?

(Screenshot attached: cuda_mpi_alltoall_issue)

Best regards, Oleg

RuRo commented 2 years ago

I also ran into this issue. It seems that using non-contiguous datatypes (subarray in my case) results in MPI performing a separate transfer under the hood for each contiguous chunk.

I think this might also happen with regular MPI without CUDA (not 100% sure; it might just be a cache-related performance difference). It is simply far more noticeable in the CUDA-aware version, since the CUDA memcpy operations are blocking and have comparatively huge overhead.
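A hedged sketch of the subarray case (the dimensions are invented, not my actual code): a 2D subarray that does not cover full rows decomposes into one contiguous run per row, and each run apparently gets its own transfer:

```c
/* Invented dimensions, for illustration only. */
#include <mpi.h>

MPI_Datatype make_subarray(void)
{
    MPI_Datatype sub;
    int sizes[2]    = {1024, 1024}; /* full 2D array     */
    int subsizes[2] = {1024, 512};  /* half of every row */
    int starts[2]   = {0, 0};

    /* 1024 separate contiguous runs of 512 doubles each: an
     * implementation that cannot pack this on the device may
     * issue one copy per run */
    MPI_Type_create_subarray(2, sizes, subsizes, starts,
                             MPI_ORDER_C, MPI_DOUBLE, &sub);
    MPI_Type_commit(&sub);
    return sub;
}
```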