My application relies on many calls to MPI_Get (a few hundred per synchronization, roughly 200-600) with small messages (about 64 bytes to 9 kB).
I observe a very strong performance decrease when going from one node to multiple nodes.
This issue relates to the comment of @bosilca here:
There seems to be a performance issue with the one-sided support in UCX. I used the OSU get_bw benchmark, with all types of synchronizations (fence, flush, lock/unlock) and while there is some variability the performance is consistently at a fraction of the point-to-point performance (about 1/100). Even switching the RMA support over TCP is about 20 times faster (mpirun -np 2 --map-by node --mca pml ob1 --mca osc rdma --mca btl_tcp_if_include ib0 ../pt2pt/osu_bw).
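For reference, the access pattern in my application looks roughly like the minimal sketch below (the counts, message size, and the lock_all/flush synchronization are illustrative assumptions, not the exact code):

```c
/* Minimal sketch of the access pattern: a few hundred small MPI_Get calls
 * per synchronization. Names and sizes are illustrative. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n_gets    = 400;   /* a few hundred MPI_Get per sync */
    const int msg_bytes = 4096;  /* small messages, 64 B - 9 kB */

    /* each rank exposes enough memory for all remote reads */
    char *base;
    MPI_Win win;
    MPI_Win_allocate((MPI_Aint)n_gets * msg_bytes, 1, MPI_INFO_NULL,
                     MPI_COMM_WORLD, &base, &win);
    char *recv_buf = malloc((size_t)n_gets * msg_bytes);

    MPI_Win_lock_all(MPI_MODE_NOCHECK, win);
    int target = (rank + 1) % size;  /* read from the next rank */
    for (int i = 0; i < n_gets; ++i) {
        MPI_Get(recv_buf + (size_t)i * msg_bytes, msg_bytes, MPI_BYTE,
                target, (MPI_Aint)i * msg_bytes, msg_bytes, MPI_BYTE, win);
    }
    MPI_Win_flush_all(win);          /* one synchronization per batch of gets */
    MPI_Win_unlock_all(win);

    free(recv_buf);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```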
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
OpenMPI 4.1.4 + UCX 1.12.1 (IB cluster); the issue is similar on OpenMPI 4.1.2 with ugni (Cray cluster).
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
OpenMPI 4.1.4 installed with EasyBuild, with UCX 1.12.1 and OFI 1.14.0
OpenMPI 4.1.2 (installed by the Cori support team).
Please describe the system on which you are running
OpenMPI 4.1.4 runs on InfiniBand HDR (200 Gb/s), with large nodes (128 cores/node)
OpenMPI 4.1.2 runs on a Cray network
Details of the problem
application issue
EDIT: The issues on the IB cluster are solved now thanks to the support team
On the Cray cluster, using a weak-scaling approach (5.3M unknowns per rank), the time spent in MPI_Get goes from 0.7308 sec on 1 node to 17.6264 sec on 8 nodes (for the same part of the code).
~~Similar results are observed on the IB cluster (10M unknowns per rank), where on a single node the average measured bandwidth is 260-275 Mb/s while on 8 nodes it drops to 210-220 Mb/s (the theoretical bandwidth is 200 Gb/s). From a timing perspective, the MPI_Get calls experience a more "normal" increase of the computational time, from 1.0665 sec to 1.2820 sec.~~
Those numbers were obtained using MPI_Win_allocate and MPI_Type_create_hvector datatypes. In a previous version of the code that used MPI_Win_create, the one-node case used to be as slow as the 8-node one.
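For context, the setup looks roughly like the sketch below: a window obtained with MPI_Win_allocate plus an MPI_Type_create_hvector datatype describing the strided layout at the target (counts, strides, and the lock/unlock synchronization are illustrative assumptions, not the exact code):

```c
/* Sketch of the current window/datatype setup; sizes are illustrative.
 * The previous version used MPI_Win_create on a user buffer instead. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int      count    = 64;                         /* number of blocks */
    const int      blocklen = 8;                          /* doubles per block */
    const MPI_Aint stride   = 128 * sizeof(double);       /* bytes between block starts */

    /* window memory allocated by MPI (current version of the code) */
    double *base;
    MPI_Win win;
    MPI_Win_allocate((MPI_Aint)count * stride, sizeof(double),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &base, &win);

    /* strided datatype describing the layout at the target */
    MPI_Datatype hvec;
    MPI_Type_create_hvector(count, blocklen, stride, MPI_DOUBLE, &hvec);
    MPI_Type_commit(&hvec);

    double local[64 * 8];                                 /* contiguous at the origin */
    int target = (rank + 1) % size;
    MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
    MPI_Get(local, count * blocklen, MPI_DOUBLE,          /* contiguous origin buffer */
            target, 0, 1, hvec, win);                     /* strided target layout */
    MPI_Win_unlock(target, win);

    MPI_Type_free(&hvec);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```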
OSU benchmarks - IB network
Following the previous comments, I have also run the OSU benchmark osu_get_bw for several numbers of calls per synchronization and for the different memory allocation strategies (see below). I compare the bandwidth measured between 2 ranks on the same node and on different nodes. Both cases barely reach 25 Gb/s while the network is supposed to deliver 200 Gb/s.
questions
on the Cray network: how can I reduce the performance loss?
on the IB network: while the performance seems reasonable, I am confused by the measured bandwidth (both OSU and the real-life application). Is there any good reason for the measured bandwidth to be so low?
other related questions:
what is the expected influence of MPI_Alloc_mem on performance for IB networks? Are the gains specific to RMA, or does it benefit every MPI call? (see the sketch after this list)
what is the influence of export OMPI_MCA_pml_ucx_multi_send_nb=1? It is set to 0 by default in my configuration.
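To make the MPI_Alloc_mem question concrete, this is the kind of difference I have in mind (a minimal sketch with an illustrative buffer size):

```c
/* Sketch contrasting plain malloc vs MPI_Alloc_mem for memory exposed with
 * MPI_Win_create; the buffer size is illustrative. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    MPI_Aint bytes = 1 << 20;
    MPI_Win win;

    /* option A: plain malloc'd memory exposed through MPI_Win_create */
    void *buf_malloc = malloc(bytes);
    MPI_Win_create(buf_malloc, bytes, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    MPI_Win_free(&win);
    free(buf_malloc);

    /* option B: memory obtained from MPI_Alloc_mem, which the library may
     * pre-register with the network for RDMA */
    void *buf_mpi;
    MPI_Alloc_mem(bytes, MPI_INFO_NULL, &buf_mpi);
    MPI_Win_create(buf_mpi, bytes, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    MPI_Win_free(&win);
    MPI_Free_mem(buf_mpi);

    MPI_Finalize();
    return 0;
}
```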
~~At this stage it is not clear to me whether there is indeed a performance issue or whether this is the best the implementation can do. Maybe the configuration is also not appropriate for the way we use MPI-RMA.~~
I will be happy to try any suggestion you might have.
Thanks for your help!