open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Usage of coll/han causes Ansys Fluent performance degradation #11540

Closed: wzamazon closed this issue 1 year ago

wzamazon commented 1 year ago

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

I tried both the 5.0.x branch and the 4.1.x branch.

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Built from source.

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

For 5.0.x branch

 8e71cf0e6753192b075edffc46fe09de597dc7c1 3rd-party/openpmix (v4.2.4rc1)
 088b45ad6f01960fc9037ea44b6d3500a519eb8e 3rd-party/prrte (v3.0.1rc2-3-g088b45ad6f)
 237ceff1a8ed996d855d69f372be9aaea44919ea config/oac (237ceff)

For the 4.1.x branch, there are no submodules.

Please describe the system on which you are running


Details of the problem

I was testing Ansys Fluent with Open MPI 5, using 8 nodes with 96 cores each.

I first noticed a performance difference of about 6% to 10% between the Open MPI 4.1.x and 5.0.x branches.

I later traced the performance difference to the use of coll/han: with the Open MPI 5.0.x branch, if I disable han by setting --mca coll_han_priority 1, I got the same performance as Open MPI 4.

With the Open MPI 4.1.x branch, if I enable han, I got the same degraded performance.
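
For reference, a rough sketch of the two launch configurations being compared. The process count and application name below are placeholders (8 nodes x 96 cores, one rank per core is an assumption); only the --mca coll_han_priority 1 setting is from the actual runs:

    # Open MPI 5.0.x default: coll/han is selected -> regressed performance
    mpirun -np 768 ./fluent_case

    # han's priority lowered below tuned, so tuned is selected
    # -> matches Open MPI 4 performance
    mpirun -np 768 --mca coll_han_priority 1 ./fluent_case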

wzamazon commented 1 year ago

I am actively looking into the root cause, but wanted to open this issue to raise awareness.

wzamazon commented 1 year ago

For the test case I am running, good performance is about 2.00 seconds per iteration; with han enabled, it is about 2.16 seconds per iteration. (There are fluctuations between runs; these numbers are averages.)

wzamazon commented 1 year ago

A little update:

It looks like there is a tiny difference in the allreduce result for MPI_DOUBLE between han and tuned.

The difference is ~2e-19.

However small, this difference has a cumulative effect, causing Ansys Fluent to do more computation to converge, hence the regressed performance.
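
To illustrate the mechanism with a generic floating-point sketch (the values below are arbitrary, not from the Fluent run): summing the same doubles in a different order can change the rounded result by one unit in the last place, and han's hierarchical (intra-node then inter-node) reduction generally combines contributions in a different order than tuned's:

    #include <stdio.h>

    int main(void)
    {
        /* hypothetical per-rank contributions, chosen only to make the
         * rounding difference visible */
        double v[4] = { 1.0, 1e-16, 1e-16, 1e-16 };

        /* flat left-to-right reduction, roughly what a linear-style
         * algorithm might compute */
        double flat = ((v[0] + v[1]) + v[2]) + v[3];

        /* pairwise/hierarchical reduction: (rank0+rank1) + (rank2+rank3) */
        double hier = (v[0] + v[1]) + (v[2] + v[3]);

        printf("flat = %.17g\n", flat);        /* prints 1 */
        printf("hier = %.17g\n", hier);        /* prints 1.0000000000000002 */
        printf("diff = %g\n", hier - flat);    /* ~2.2e-16, one ULP near 1.0 */
        return 0;
    }

Both orderings are valid MPI reductions; the standard does not require a particular evaluation order, so neither result is "wrong", they just differ in the last bit.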

wzamazon commented 1 year ago

If I look at the binary representation of the results, I can see that the values are off by exactly 1 in the last bit.

For example: the han result is 3f54fd6251a350b4, while the tuned result is 3f54fd6251a350b5.
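
A small sketch of how such bit patterns can be inspected, assuming the hex strings above are the raw IEEE-754 binary64 bit patterns of the two results. It also shows that the two quoted values differ by one ULP, which at this magnitude is about 2e-19, consistent with the difference reported earlier:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void)
    {
        /* the two bit patterns quoted above */
        uint64_t han_bits   = 0x3f54fd6251a350b4ULL;
        uint64_t tuned_bits = 0x3f54fd6251a350b5ULL;
        double han, tuned;

        /* memcpy is the well-defined way to reinterpret the 8 bytes as a double */
        memcpy(&han,   &han_bits,   sizeof han);
        memcpy(&tuned, &tuned_bits, sizeof tuned);

        printf("han   = %.17g (0x%016llx)\n", han,
               (unsigned long long)han_bits);
        printf("tuned = %.17g (0x%016llx)\n", tuned,
               (unsigned long long)tuned_bits);
        printf("diff  = %g\n", tuned - han);  /* ~2e-19: one ULP at this magnitude */
        return 0;
    }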