donners-atos opened this issue 2 years ago (status: Open)
@donners-atos I am not sure this is a bug:
Open MPI's philosophy is that a. it should work out of the box, and b. it should give users what they explicitly request, whether or not that request is a smart one.
In this case, the test
a. passes with the default algorithm
b. fails when an algorithm known not to support MPI_IN_PLACE is explicitly requested
@bosilca @jsquyres any thoughts?
ok, thanks. How can I know that these algorithms do not support MPI_IN_PLACE?
Unfortunately there is no way. In the OMPI design, each collective algorithm can refuse to complete the collective, in which case OMPI falls back to a different one (according to registration priorities). In the worst case it falls back to one of the basic algorithms in base or basic.
When you force a specific algorithm, no such fallback exists, but if the selected algorithm refuses to complete the operation we should return the error up to the MPI API. I was under the impression that the code was doing this, but apparently that is not the case (maybe only for the non-blocking collectives).
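To make the intended behaviour concrete, here is a minimal toy model in C of the selection logic described above. It is only a sketch: none of these names (`toy_allreduce`, `toy_ring`, `toy_basic`, `TOY_ERR_NOT_SUPPORTED`, `TOY_NO_FORCED_ALGORITHM`) exist in Open MPI, and the two-entry table is an assumption chosen for illustration. It models an algorithm that refuses in-place buffers, the priority-ordered fallback in the default path, and the error that should surface when that algorithm is forced.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: none of these names exist in Open MPI. */
#define TOY_ERR_NOT_SUPPORTED   (-1)
#define TOY_NO_FORCED_ALGORITHM (-1)

/* An algorithm returns 1 if it completed the operation, 0 if it refuses. */
typedef int (*toy_algorithm_fn)(int in_place);

/* Hypothetical algorithm that refuses in-place buffers. */
static int toy_ring(int in_place) { return !in_place; }

/* Hypothetical basic algorithm that always completes. */
static int toy_basic(int in_place) { (void)in_place; return 1; }

/* Ordered by descending registration priority; the last entry is the
 * basic fallback. */
static const toy_algorithm_fn toy_algorithms[] = { toy_ring, toy_basic };
static const int toy_count =
    (int)(sizeof toy_algorithms / sizeof toy_algorithms[0]);

/* Returns the index of the algorithm that completed the operation,
 * or TOY_ERR_NOT_SUPPORTED if none did. */
static int toy_allreduce(int forced, int in_place)
{
    assert(forced >= TOY_NO_FORCED_ALGORITHM && forced < toy_count);

    if (forced != TOY_NO_FORCED_ALGORITHM) {
        /* Forced selection: there is no fallback, so a refusal must be
         * reported to the caller instead of being silently ignored. */
        return toy_algorithms[forced](in_place) ? forced
                                                : TOY_ERR_NOT_SUPPORTED;
    }

    /* Default selection: try each algorithm in priority order and use
     * the first one that does not refuse. */
    for (int i = 0; i < toy_count; i++) {
        if (toy_algorithms[i](in_place)) {
            return i;
        }
    }
    return TOY_ERR_NOT_SUPPORTED;
}
```

In this model, `toy_allreduce(TOY_NO_FORCED_ALGORITHM, 1)` falls back to index 1 (the basic algorithm), while `toy_allreduce(0, 1)` returns `TOY_ERR_NOT_SUPPORTED` rather than producing a result. The behaviour reported in this issue corresponds to a forced algorithm neither refusing nor falling back, so incorrect results reach the caller.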
@bosilca I made PRs for v5.0.x and v4.1.x, but ff03f43 doesn't apply cleanly to the v4.0.x tree -- there's a conflict. Could you have a look?
@donners-atos This appears to now be fixed. Can you check the latest v4.0.x and/or 4.1.x nightly snapshots?
https://www.open-mpi.org/nightly/v4.0.x/ https://www.open-mpi.org/nightly/v4.1.x/
thank you for the fix. I'll test it in the coming weeks.
@donners-atos is this still an issue for you? If we don't hear back, I'll close the bug next week.
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
repo_rev=v4.1.1-30-g535358e937
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
installed from the hpcx-2.9.0 package
Please describe the system on which you are running
Details of the problem
OMPI_MCA_coll_libnbc_iallreduce_algorithm=1 or 3 gives incorrect results from 4 processes. Here's a small program that reproduces the problem.
The program prints the expected result and the actual result for each process. With OMPI_MCA_coll_libnbc_iallreduce_algorithm=1 or 3 this gives incorrect results from 4 processes: