I am actively looking into the root cause, but want to open this issue first to raise awareness.
For the test case I am running, good performance is about 2.00 seconds per iteration; with han enabled the number is about 2.16 seconds per iteration. (There are fluctuations between runs; these numbers are averages.)
A little update:
It looks like there is a tiny difference between the MPI_Allreduce results of MPI_DOUBLE type between han and tuned.
The difference is ~2e-19.
However small, this difference has a cumulative effect, causing Ansys Fluent to do more computation to converge, hence the regressed performance.
If I look at the binary values of the results, I can see that they are off by exactly 1 in the last bit.
For example, the han result is 3f54fd6251a350b4, while the tuned result is 3f54fd6251a350b5.
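For reference, decoding those two bit patterns confirms the gap is exactly one ULP (unit in the last place). A minimal standalone sketch (the hex constants below are just the two values quoted above) is:

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <inttypes.h>

int main(void)
{
    /* Bit patterns copied from the observed han/tuned results above. */
    uint64_t han_bits   = UINT64_C(0x3f54fd6251a350b4);
    uint64_t tuned_bits = UINT64_C(0x3f54fd6251a350b5);
    double han, tuned;

    memcpy(&han,   &han_bits,   sizeof han);
    memcpy(&tuned, &tuned_bits, sizeof tuned);

    printf("han   = %.17g (0x%016" PRIx64 ")\n", han,   han_bits);
    printf("tuned = %.17g (0x%016" PRIx64 ")\n", tuned, tuned_bits);
    printf("diff  = %g, ulp distance = %" PRIu64 "\n",
           tuned - han, tuned_bits - han_bits);
    return 0;
}
```

At this magnitude (~1.3e-3), one ULP is 2^-62 ≈ 2.2e-19, which is consistent with the ~2e-19 difference measured above.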
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
I tried both the 5.0.x and 4.1.x branches.
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Built from source.
If you are building/installing from a git clone, please copy-n-paste the output from
git submodule status

For the 5.0.x branch:
For the 4.1.x branch, there are no submodules.
Please describe the system on which you are running
Details of the problem
I was testing Ansys Fluent with Open MPI 5, using 8 nodes with 96 cores each.
I first noticed a performance difference of about 6% to 10% between the Open MPI 4.1.x and 5.0.x branches.
I later traced the performance difference to the usage of coll/han: when using the Open MPI 5.0.x branch, if I disable han by setting
--mca coll_han_priority 1
I got the same performance as Open MPI 4.1.x. Conversely, using the Open MPI 4.1.x branch, if I enable han, I got the bad performance.
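For anyone trying to isolate this outside Fluent, a hypothetical minimal reproducer along these lines (not the actual Fluent workload; the per-rank values are arbitrary) could be used to compare the allreduce bit patterns between the two components:

```c
#include <mpi.h>
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <inttypes.h>

int main(int argc, char **argv)
{
    int rank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Arbitrary per-rank contribution; anything that does not sum
       exactly in binary floating point will do. */
    double local = 1.0 / (3.0 * (rank + 1));
    double sum = 0.0;

    MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0) {
        uint64_t bits;
        memcpy(&bits, &sum, sizeof bits);
        printf("nprocs=%d sum=%.17g bits=0x%016" PRIx64 "\n",
               nprocs, sum, bits);
    }

    MPI_Finalize();
    return 0;
}
```

Running it once with the default component selection and once with --mca coll_han_priority 1 on the mpirun command line should show whether the one-ULP difference reproduces. A difference by itself would not be surprising, since different collective algorithms can apply the sum in different orders and floating-point addition is not associative.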