
MPI-RMA - performance issues with `MPI_Get` #10573


thomasgillis commented 2 years ago

Background information

My application relies on many calls to MPI_Get (a few hundred per synchronization call, typically 200-600) with small messages (roughly 64 bytes to 9 kB). I observe a very strong performance decrease when going from one node to multiple nodes.
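
For concreteness, here is a minimal sketch of the access pattern described above: many small `MPI_Get` calls followed by a single synchronization. The passive-target `lock_all`/`flush_all` epoch, the function and buffer names, and the counts are assumptions for illustration only; the application may well use a different synchronization mode.

```c
#include <mpi.h>
#include <stddef.h>

/* Illustration only: a few hundred small MPI_Get calls per synchronization
 * epoch, as described above. Targets, offsets, and sizes are placeholders. */
void gather_remote_blocks(MPI_Win win, char *recv_buf, const int *targets,
                          const MPI_Aint *offsets, const int *nbytes, int n_gets)
{
    MPI_Win_lock_all(MPI_MODE_NOCHECK, win);
    for (int i = 0; i < n_gets; ++i) {        /* typically 200-600 calls      */
        MPI_Get(recv_buf + (size_t)i * 9216,  /* local destination buffer     */
                nbytes[i], MPI_BYTE,          /* 64 B to ~9 kB per message    */
                targets[i], offsets[i],       /* remote rank and displacement */
                nbytes[i], MPI_BYTE, win);
    }
    MPI_Win_flush_all(win);                   /* the single "sync" call       */
    MPI_Win_unlock_all(win);
}
```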

This issue relates to the comment by @bosilca here:

> There seems to be a performance issue with the one-sided support in UCX. I used the OSU get_bw benchmark, with all types of synchronizations (fence, flush, lock/unlock) and while there is some variability the performance is consistently at a fraction of the point-to-point performance (about 1/100). Even switching the RMA support over TCP is about 20 times faster (`mpirun -np 2 --map-by node --mca pml ob1 --mca osc rdma --mca btl_tcp_if_include ib0 ../pt2pt/osu_bw`).

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

Open MPI 4.1.4 + UCX 1.12.1 (IB cluster); the issue is similar on Open MPI 4.1.2 with ugni (Cray cluster).

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Please describe the system on which you are running


Details of the problem

application issue

EDIT: The issues on the IB cluster have since been resolved thanks to the support team.

On the Cray cluster, using a weak-scaling setup (5.3M unknowns per rank), the time spent in MPI_Get goes from 0.7308 s on 1 node to 17.6264 s on 8 nodes (for the same part of the code).

~~Similar results are observed on the IB cluster (10M unknowns per rank), where on a single node the average measured bandwidth is 260-275 Mb/s while 8 nodes are down to 210-220 Mb/s (the theoretical bandwidth is 200 Gb/s). From a timing perspective, the MPI_Get calls experience a more "normal" increase of the time, from 1.0665 s to 1.2820 s.~~

Those numbers were obtained using MPI_Win_allocate and MPI_Type_create_hvector datatypes. In a previous version of the code that used MPI_Win_create, the one-node case was as slow as the 8-node one.
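
For context, a hedged sketch of how those two pieces fit together (`MPI_Win_allocate` for the window, `MPI_Type_create_hvector` for a strided target layout). The element type, block counts, and strides below are made up for illustration and are not taken from the application.

```c
#include <mpi.h>

/* Illustration only: expose library-allocated memory through MPI_Win_allocate
 * and read a strided block from a remote rank with an hvector datatype.
 * The block count, block length, and stride are placeholders. */
void example(MPI_Comm comm, int target_rank)
{
    const MPI_Aint win_bytes = 1 << 20;       /* 1 MiB window, arbitrary size */
    double *base;
    MPI_Win win;

    /* MPI_Win_allocate lets the MPI library pick (and register) the memory,
     * which is the setup the numbers above were obtained with. */
    MPI_Win_allocate(win_bytes, sizeof(double), MPI_INFO_NULL, comm,
                     &base, &win);

    /* Strided remote layout: 16 blocks of 8 doubles, 1024 bytes apart. */
    MPI_Datatype strided;
    MPI_Type_create_hvector(16, 8, 1024, MPI_DOUBLE, &strided);
    MPI_Type_commit(&strided);

    double local[16 * 8];
    MPI_Win_lock(MPI_LOCK_SHARED, target_rank, 0, win);
    MPI_Get(local, 16 * 8, MPI_DOUBLE,        /* contiguous at the origin */
            target_rank, 0, 1, strided, win); /* strided at the target    */
    MPI_Win_unlock(target_rank, win);

    MPI_Type_free(&strided);
    MPI_Win_free(&win);
}
```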

osu benchmarks - IB network

Following previous comments, I have also run the OSU benchmark `osu_get_bw` for several numbers of calls per synchronization and for the different memory-allocation options (see below). I compare the bandwidth measured between 2 ranks on the same node and on different nodes. Both cases barely reach 25 Gb/s, while the network is supposed to deliver 200 Gb/s.

questions

other related questions:

~~At this stage it's not clear to me whether there is indeed a performance issue or whether this is the best the implementation can do. It may also be that the configuration is not appropriate for the way we use MPI-RMA.~~

I will be happy to try any suggestion you might have. Thanks for your help!

jsquyres commented 2 years ago

@open-mpi/ucx FYI