mikesoehner opened this issue 4 months ago
That's because the only OMPI collective component supporting persistent collectives is libnbc, and the performance is, as you noticed, terrible. Let me see if I can come up with a quick solution.
Keep in mind that the main goal of non-blocking collectives is to overlap communication and computation, so you should not naively expect them to be on par performance-wise.
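Just to illustrate what I mean by overlap, here is a minimal sketch of the intended usage pattern (`local_work()` is a placeholder of my own, not anything from the benchmark):

```c
#include <mpi.h>

/* Placeholder for computation that does not touch the buffers
 * involved in the collective. */
static void local_work(void) { /* ... */ }

void overlapped_allreduce(double *sendbuf, double *recvbuf, int count,
                          MPI_Comm comm)
{
    MPI_Request req;

    /* Start the non-blocking reduction. */
    MPI_Iallreduce(sendbuf, recvbuf, count, MPI_DOUBLE, MPI_SUM, comm, &req);

    /* Independent computation can proceed here, hiding (part of)
     * the communication latency. */
    local_work();

    /* Complete the collective before using recvbuf. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}
```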
Thank you for taking the time to submit an issue!
Background information
Tested on an HPE Apollo system.
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
Open MPI 5.0.3
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Installed from source (./configure --with-tm=/opt/pbs).
If you are building/installing from a git clone, please copy-n-paste the output from `git submodule status`

Please describe the system on which you are running
Details of the problem
I wanted to benchmark the persistent collective communication functions. For that I used the OSU benchmark (v7.2) and replaced the collective calls with `MPI_Start` and `MPI_Wait` on the corresponding persistent request. While doing this I noticed severely worse performance of the persistent functions (see below). I then used another Open MPI installation configured with `--enable-debug` to check what happens inside Open MPI.

What I came across is that the system does not choose the hcoll module for non-blocking or persistent communication (non-blocking and persistent collective performance is almost equal). If blocking communication is used, the hcoll module is utilized. Upon inspection of the Open MPI code, it seems the corresponding fields in the `hcoll_collectives` struct (see openmpi-5.0.3/ompi/mca/coll/hcoll/coll_hcoll_module.c) are not set. According to some digging, the values in that struct are set in some Mellanox library, so I cannot investigate them further.

Results of the OSU benchmark (v7.2), run on 2 nodes, totaling 256 cores:
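For reference, the modification follows roughly this pattern (a sketch only, not the exact OSU code; buffer names, datatype, operation, and iteration handling are illustrative):

```c
#include <mpi.h>

/* Sketch of the persistent-collective pattern used in place of the
 * blocking MPI_Allreduce call in the benchmark's timing loop. */
void bench_persistent_allreduce(float *sendbuf, float *recvbuf, int count,
                                int iterations, MPI_Comm comm)
{
    MPI_Request req;

    /* Create the persistent collective request once, outside the timed loop. */
    MPI_Allreduce_init(sendbuf, recvbuf, count, MPI_FLOAT, MPI_SUM,
                       comm, MPI_INFO_NULL, &req);

    for (int i = 0; i < iterations; i++) {
        /* Each iteration replaces one blocking MPI_Allreduce call. */
        MPI_Start(&req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }

    /* Release the persistent request after the loop. */
    MPI_Request_free(&req);
}
```

The point of creating the request once outside the timed loop is that persistent collectives are supposed to amortize the setup cost, so I would have expected them to be at least on par with the blocking calls per iteration.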
Blocking MPI_Allreduce version:
```
# OSU MPI Allreduce Latency Test v7.2
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)
1                       6.53
2                       6.53
4                       6.58
8                       6.81
16                      7.83
32                      8.29
64                      6.89
128                     7.19
256                     8.69
512                     8.25
1024                    9.06
2048                   11.61
4096                   14.59
8192                   21.90
16384                  43.24
32768                1155.47
65536                 712.10
131072               1037.84
262144               1350.60
524288               1858.64
1048576              3054.11
```
Persistent MPI_Allreduce version:
```
# OSU MPI Allreduce Latency Test v7.2
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)
1                      14.44
2                      14.44
4                      14.46
8                      14.44
16                     14.46
32                     15.74
64                     16.19
128                    19.05
256                    19.99
512                    22.85
1024                   26.99
2048                   35.06
4096                   58.76
8192                  106.67
16384                 222.95
32768                 380.89
65536                 520.35
131072                956.96
262144               2071.19
524288               3585.26
1048576              6536.82
```