open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

CUDA + MPI_TYPE_INDEXED + MPI_SEND/RECV super slow #12202

Open · chhu opened this issue 8 months ago

chhu commented 8 months ago

Background information

We developed a CFD code some years ago for arbitrary 3D shapes, parallelized via domain decomposition. To exchange halo/ghost-cell data we used gather/scatter loops to pack the data into a linear buffer before sending. That worked very well, surprisingly so even for GPU-to-GPU, despite the detour of copying that buffer into host memory first, which frees MPI from having to know or care about device memory. Now that multi-GPU environments are becoming standard, we wanted to see whether we can improve parallelism by transferring device-to-device with CUDA-aware Open MPI.
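For context, the old approach looks roughly like this (a minimal sketch, not our actual code; `field`, `halo_indices`, `n_halo`, `neighbor` and `sendbuf` are placeholder names):

```c
/* Gather halo values into a contiguous host buffer, then send the buffer
 * with a plain contiguous MPI_Send. Placeholder names throughout. */
#include <mpi.h>

void send_halo_packed(const double *field, const int *halo_indices,
                      int n_halo, int neighbor, double *sendbuf)
{
    for (int i = 0; i < n_halo; ++i)      /* gather loop */
        sendbuf[i] = field[halo_indices[i]];

    MPI_Send(sendbuf, n_halo, MPI_DOUBLE, neighbor, 0, MPI_COMM_WORLD);
}
```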

To make the code even simpler, we switched to MPI_TYPE_INDEXED and sent/received directly with that type instead of first building a linear buffer. This works very well in the classic HPC environment (about 5% faster than the previous method), but applying the same logic to device memory has a tremendous impact on speed (roughly a factor of 500 slower). The detour of linearizing first in device memory via cusparseGather()/cusparseScatter() yielded at least the same speed as before (with or without the extra hop through host memory).
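The indexed variant is essentially this sketch (placeholder names again): the scattered halo cells are described once, and MPI packs on the fly when the type is used with the field pointer, whether that pointer is host or (CUDA-aware) device memory.

```c
/* Describe the scattered halo cells with MPI_Type_indexed and send directly
 * from the field, letting MPI do the packing. Placeholder names throughout. */
#include <stdlib.h>
#include <mpi.h>

MPI_Datatype build_halo_type(int *halo_indices, int n_halo)
{
    MPI_Datatype halo_type;
    int *blocklens = (int *)malloc(n_halo * sizeof(int));
    for (int i = 0; i < n_halo; ++i)
        blocklens[i] = 1;                 /* one cell per block */

    MPI_Type_indexed(n_halo, blocklens, halo_indices, MPI_DOUBLE, &halo_type);
    MPI_Type_commit(&halo_type);
    free(blocklens);                      /* MPI keeps its own copy */
    return halo_type;
}

/* usage: MPI_Send(field, 1, halo_type, neighbor, 0, MPI_COMM_WORLD); */
```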

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

Tested with OMPI 4.1 and the latest 5.1 branch; same outcome.

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Compiled from source on the system, with the latest UCX and CUDA 12.3.1.

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

+088c88499514c3c55aee0c433d6f4fb46b0c3587 3rd-party/openpmix (v1.1.3-3973-g088c8849)
bc5de986e13c14db575141714a8b5e0089f4d0bf 3rd-party/prrte (psrvr-v2.0.0rc1-4741-gbc5de986e1)
c1cfc910d92af43f8c27807a9a84c9c13f4fbc65 config/oac (heads/main)

Please describe the system on which you are running


Details of the problem

Thanks a lot for your help!

bosilca commented 8 months ago

The current approach to packing/unpacking device memory issues one cudaMemcpy per item in the indexed datatype, so the performance will be terrible. This is something I have wanted to look at for a long time, but I can't give you a time estimate for a potential fix. In the meantime, for device memory I would suggest you pack the data yourself (with a device function if possible).
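To illustrate the suggestion (a sketch only, placeholder names): a trivial gather kernel fills a contiguous device buffer, which is then sent as a plain contiguous message.

```cuda
/* Gather the indexed elements into a contiguous device buffer with a simple
 * CUDA kernel, then send the contiguous buffer. Placeholder names. */
__global__ void pack_halo(const double *field, const int *indices,
                          double *sendbuf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        sendbuf[i] = field[indices[i]];
}

/* launch, synchronize, then send the contiguous device buffer:
 *   pack_halo<<<(n + 255) / 256, 256>>>(d_field, d_indices, d_sendbuf, n);
 *   cudaDeviceSynchronize();
 *   MPI_Send(d_sendbuf, n, MPI_DOUBLE, neighbor, 0, MPI_COMM_WORLD);
 */
```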

chhu commented 8 months ago

Thanks, appreciate it. Let me know if I can help. As I wrote, something like https://docs.nvidia.com/cuda/cusparse/#cusparsegather and its scatter counterpart does the trick.
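For reference, the cuSPARSE-based packing I mean looks roughly like this (a sketch with placeholder names; the full field is the dense vector, the halo values plus indices form the sparse vector):

```c
#include <cusparse.h>

/* d_sendbuf[i] = d_field[d_indices[i]] for i < n_halo, all on the device. */
void pack_with_cusparse(cusparseHandle_t handle,
                        double *d_field, int64_t field_size,
                        int *d_indices, double *d_sendbuf, int64_t n_halo)
{
    cusparseDnVecDescr_t vec_field;
    cusparseSpVecDescr_t vec_halo;

    cusparseCreateDnVec(&vec_field, field_size, d_field, CUDA_R_64F);
    cusparseCreateSpVec(&vec_halo, field_size, n_halo, d_indices, d_sendbuf,
                        CUSPARSE_INDEX_32I, CUSPARSE_INDEX_BASE_ZERO, CUDA_R_64F);

    cusparseGather(handle, vec_field, vec_halo);

    cusparseDestroySpVec(vec_halo);
    cusparseDestroyDnVec(vec_field);
}
```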

Is including gdr_copy worth the effort? Or is it only useful in combination with InfiniBand?

Affinity handling by mpirun is also not optimal (just my opinion): a rank should auto-bind close to its CUDA device (or to rank % cuda_device_count). I don't like the manual discovery method suggested by the FAQ, but that's another topic.
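For concreteness, the kind of binding I mean, done by hand (a sketch, nothing OMPI-specific): derive a node-local rank from a shared-memory sub-communicator and pick the device from it.

```c
#include <mpi.h>
#include <cuda_runtime.h>

/* Bind each rank to a device as rank_on_node % device_count. */
void select_device(void)
{
    MPI_Comm node_comm;
    int local_rank, device_count;

    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &local_rank);

    cudaGetDeviceCount(&device_count);
    cudaSetDevice(local_rank % device_count);

    MPI_Comm_free(&node_comm);
}
```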

rhc54 commented 8 months ago

a rank should auto-bind close to its CUDA device (or to rank % cuda_device_count)

FWIW: some of us are working on that now

jsquyres commented 8 months ago

I would suggest that any fixes be applied to v5.0.x and later, and probably not be back-ported to v4.1.x.

chhu commented 8 months ago

But then you would need to mark MPI_TYPE_INDEXED as an experimental feature and not expose it to users without a warning. The way it is implemented right now is worse than not implementing it at all (which would at least force a user to find alternatives). I was lucky to come from the other direction (packing with device functions, then a linear transfer), so I know it can be made fast; but a user who approaches this by using MPI_TYPE_INDEXED first, assuming OMPI always works optimally, will find it rather frustrating and discouraging.