Open · ShatrovOA opened this issue 3 years ago
I also ran into this issue. It seems that using non-contiguous datatypes (a subarray in my case) results in MPI performing a separate transfer under the hood for each contiguous chunk.
I think this might also happen with regular MPI without CUDA (not 100% sure; it could be a cache-related performance difference). The issue is just far more noticeable in the CUDA-aware version, since CUDA memcpy operations are blocking and carry a comparatively huge overhead.
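For illustration only (this code is not from the original post, and all sizes are invented): a minimal sketch of the kind of non-contiguous subarray datatype being described. Every row of the selected block is a separate contiguous run, so a datatype engine that does not pack on the GPU may move each run with its own host/device copy.

```c
/* Hypothetical example: a 2D subarray datatype selecting a column block
 * of a row-major matrix.  Each row contributes one contiguous run, so a
 * 4096x4096 matrix yields 4096 separate chunks -- and, with CUDA buffers,
 * potentially 4096 individual memcpys.  Sizes are made up. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int sizes[2]    = {4096, 4096};  /* full array            */
    int subsizes[2] = {4096, 512};   /* selected column block  */
    int starts[2]   = {0, 0};
    MPI_Datatype colblock;

    MPI_Type_create_subarray(2, sizes, subsizes, starts,
                             MPI_ORDER_C, MPI_DOUBLE, &colblock);
    MPI_Type_commit(&colblock);

    /* Sending one `colblock` describes 4096 non-contiguous runs of
     * 512 doubles each; a non-pipelined implementation may stage each
     * run with its own host<->device copy. */

    MPI_Type_free(&colblock);
    MPI_Finalize();
    return 0;
}
```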
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
4.0.5, shipped with the NVIDIA HPC SDK 21.2
Please describe the system on which you are running
Details of the problem
I am developing a library that uses MPI derived datatypes to send and receive aligned data. The derived datatypes are built as a combination of vector, hvector, contiguous, and resized types.
It runs fine on the CPU. I then tried to execute the code on the GPU with the help of the CUDA-aware MPI shipped with NVIDIA's HPC SDK. I noticed that when I call MPI_Alltoall with GPU buffers, MPI starts copying data from host to device; a single MPI_Alltoall call triggers more than a million such copies. It is not a surprise that the code runs very slowly.
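The issue does not include the library's actual code, so the following is only a rough reconstruction of what such a call might look like: a strided send type built from vector + resized, used in MPI_Alltoall on cudaMalloc'd buffers. The sizes and the exact datatype layout are assumptions for illustration.

```c
/* Hypothetical reproducer sketch (not the library's actual code). */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int n = 256;                     /* local block edge length (assumed) */
    MPI_Datatype vec, block;

    /* n non-contiguous runs of n doubles, strided by n*nprocs doubles */
    MPI_Type_vector(n, n, n * nprocs, MPI_DOUBLE, &vec);
    /* shrink the extent so consecutive blocks interleave in the buffer */
    MPI_Type_create_resized(vec, 0, n * sizeof(double), &block);
    MPI_Type_commit(&block);

    double *sendbuf, *recvbuf;
    size_t bytes = (size_t)n * n * nprocs * sizeof(double);
    cudaMalloc((void **)&sendbuf, bytes);
    cudaMalloc((void **)&recvbuf, bytes);

    /* With device buffers, each of the n strided runs per block may be
     * staged with its own device<->host copy if the datatype engine does
     * not pack on the GPU. */
    MPI_Alltoall(sendbuf, 1, block, recvbuf, 1, block, MPI_COMM_WORLD);

    cudaFree(sendbuf);
    cudaFree(recvbuf);
    MPI_Type_free(&block);
    MPI_Type_free(&vec);
    MPI_Finalize();
    return 0;
}
```

With a layout like this, every non-contiguous run in every block is a candidate for a separate staging copy, which would be consistent with the huge number of memcpy calls observed above.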
Can you please explain how this works? Are you aware of this behaviour?
Best regards, Oleg