My application relies on many calls to MPI_Get (a few hundred per synchronization, roughly 200-600) with small messages (about 64 bytes to 9 kB).
I observe a very strong performance decrease when going from one node to multiple nodes.
This issue relates to the comment of @bosilca here:
There seems to be a performance issue with the one-sided support in UCX. I used the OSU get_bw benchmark, with all types of synchronizations (fence, flush, lock/unlock) and while there is some variability the performance is consistently at a fraction of the point-to-point performance (about 1/100). Even switching the RMA support over TCP is about 20 times faster (mpirun -np 2 --map-by node --mca pml ob1 --mca osc rdma --mca btl_tcp_if_include ib0 ../pt2pt/osu_bw).
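For reference, the access pattern in my application looks roughly like the minimal sketch below (the counts, message size, and the lock_all/flush synchronization are illustrative assumptions, not the exact code):

```c
/* Minimal sketch of the access pattern: a few hundred small MPI_Get calls
 * per synchronization. Names and sizes are illustrative. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n_gets    = 400;   /* a few hundred MPI_Get per sync */
    const int msg_bytes = 4096;  /* small messages, 64 B - 9 kB */

    /* each rank exposes enough memory for all remote reads */
    char *base;
    MPI_Win win;
    MPI_Win_allocate((MPI_Aint)n_gets * msg_bytes, 1, MPI_INFO_NULL,
                     MPI_COMM_WORLD, &base, &win);
    char *recv_buf = malloc((size_t)n_gets * msg_bytes);

    MPI_Win_lock_all(MPI_MODE_NOCHECK, win);
    int target = (rank + 1) % size;  /* read from the next rank */
    for (int i = 0; i < n_gets; ++i) {
        MPI_Get(recv_buf + (size_t)i * msg_bytes, msg_bytes, MPI_BYTE,
                target, (MPI_Aint)i * msg_bytes, msg_bytes, MPI_BYTE, win);
    }
    MPI_Win_flush_all(win);          /* one synchronization per batch of gets */
    MPI_Win_unlock_all(win);

    free(recv_buf);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```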
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
OpenMPI 4.1.4 + UCX 1.12.1 (IB cluster); the issue is similar on OpenMPI 4.1.2 with ugni (Cray cluster).
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
OpenMPI 4.1.4 installed with EasyBuild, with UCX 1.12.1 and OFI 1.14.0
OpenMPI 4.1.2 (installed by the Cori support team).
Please describe the system on which you are running
OpenMPI 4.1.4 runs on InfiniBand HDR (200 Gb/s), with large nodes (128 cores/node)
OpenMPI 4.1.2 runs on a Cray network
Details of the problem
application issue
EDIT: The issues on the IB cluster are solved now thanks to the support team
On the Cray cluster, using a weak-scaling approach (5.3M unknowns per rank), the time spent in MPI_Get goes from 0.7308 sec on 1 node to 17.6264 sec on 8 nodes (for the same part of the code).
~~Similar results are observed on the IB cluster (10M unknowns per rank), where on a single node the average measured bandwidth is 260-275 Mb/s while on 8 nodes it drops to 210-220 Mb/s (the theoretical bandwidth is 200 Gb/s). From a timing perspective, the MPI_Get calls experience a more "normal" increase of the computational time, from 1.0665 sec to 1.2820 sec.~~
Those numbers were obtained using MPI_Win_allocate and MPI_Type_create_hvector datatypes. In a previous version of the code that used MPI_Win_create, the one-node case used to be as slow as the 8-node one.
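For context, the setup looks roughly like the sketch below: a window obtained with MPI_Win_allocate plus an MPI_Type_create_hvector datatype describing the strided layout at the target (counts, strides, and the lock/unlock synchronization are illustrative assumptions, not the exact code):

```c
/* Sketch of the current window/datatype setup; sizes are illustrative.
 * The previous version used MPI_Win_create on a user buffer instead. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int      count    = 64;                         /* number of blocks */
    const int      blocklen = 8;                          /* doubles per block */
    const MPI_Aint stride   = 128 * sizeof(double);       /* bytes between block starts */

    /* window memory allocated by MPI (current version of the code) */
    double *base;
    MPI_Win win;
    MPI_Win_allocate((MPI_Aint)count * stride, sizeof(double),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &base, &win);

    /* strided datatype describing the layout at the target */
    MPI_Datatype hvec;
    MPI_Type_create_hvector(count, blocklen, stride, MPI_DOUBLE, &hvec);
    MPI_Type_commit(&hvec);

    double local[64 * 8];                                 /* contiguous at the origin */
    int target = (rank + 1) % size;
    MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
    MPI_Get(local, count * blocklen, MPI_DOUBLE,          /* contiguous origin buffer */
            target, 0, 1, hvec, win);                     /* strided target layout */
    MPI_Win_unlock(target, win);

    MPI_Type_free(&hvec);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```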
OSU benchmarks - IB network
Following the previous comments, I have also run the OSU benchmark osu_get_bw for several numbers of calls per synchronization and for the different memory allocation strategies (see below). I compare the bandwidth measured between 2 ranks on the same node and on different nodes. Both cases barely reach 25 Gb/s while the network is supposed to deliver 200 Gb/s.
questions
on the Cray network: how can I reduce the performance loss?
on the IB network: while the performance seems reasonable, I am confused by the measured bandwidth (both OSU and the real-life application). Is there any good reason for the measured bandwidth to be so low?
other related questions:
what is the expected influence of MPI_Alloc_mem on performance for IB networks? Are the gains specific to RMA, or does it benefit every MPI call? (see the sketch after this list)
what is the influence of export OMPI_MCA_pml_ucx_multi_send_nb=1? It is set to 0 by default in my configuration.
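To make the MPI_Alloc_mem question concrete, this is the kind of difference I have in mind (a minimal sketch with an illustrative buffer size):

```c
/* Sketch contrasting plain malloc vs MPI_Alloc_mem for memory exposed with
 * MPI_Win_create; the buffer size is illustrative. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    MPI_Aint bytes = 1 << 20;
    MPI_Win win;

    /* option A: plain malloc'd memory exposed through MPI_Win_create */
    void *buf_malloc = malloc(bytes);
    MPI_Win_create(buf_malloc, bytes, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    MPI_Win_free(&win);
    free(buf_malloc);

    /* option B: memory obtained from MPI_Alloc_mem, which the library may
     * pre-register with the network for RDMA */
    void *buf_mpi;
    MPI_Alloc_mem(bytes, MPI_INFO_NULL, &buf_mpi);
    MPI_Win_create(buf_mpi, bytes, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    MPI_Win_free(&win);
    MPI_Free_mem(buf_mpi);

    MPI_Finalize();
    return 0;
}
```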
~~At this stage it is not clear to me whether there is indeed a performance issue or whether this is the best the implementation can do. Maybe the configuration is also not appropriate for the way we use MPI-RMA.~~
I will be happy to try any suggestion you might have.
Thanks for your help!