Background information
I was able to reproduce the allreduce_in_place segfault with ompi 5.0.x without using EFA. The allreduce_in_place test passes on a single node but starts segfaulting as soon as the run spans multiple nodes with shared memory: 2 ranks on 1 node pass, 2 ranks on 2 nodes pass, but 3 ranks on 2 nodes fail.
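For context, a minimal sketch of what an in-place allreduce reproducer might look like, assuming the failing test does something equivalent (the element count and reduction op below are placeholders, not taken from the actual test):

```c
/* Minimal in-place allreduce sketch; buffer size and op are placeholders. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size, i;
    const int n = 1024;          /* placeholder element count */
    int *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    buf = malloc(n * sizeof(int));
    for (i = 0; i < n; i++)
        buf[i] = rank;

    /* MPI_IN_PLACE: each rank's send buffer is also its receive buffer */
    MPI_Allreduce(MPI_IN_PLACE, buf, n, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    /* every element should now be sum(0..size-1) */
    for (i = 0; i < n; i++) {
        if (buf[i] != size * (size - 1) / 2) {
            fprintf(stderr, "rank %d: wrong result at index %d\n", rank, i);
            MPI_Abort(MPI_COMM_WORLD, 1);
        }
    }
    if (rank == 0)
        printf("allreduce_in_place ok\n");

    free(buf);
    MPI_Finalize();
    return 0;
}
```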
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
v5.0.x, c5fe4aa9a623a86f2e1da9dfae3d2dbdffe0de40
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
git clone ompi
git checkout v5.0.x
git submodule update --init --recursive
./autogen.pl && ./configure --enable-debug --prefix=/home/ec2-user/ompi/install && make -j install
If you are building/installing from a git clone, please copy-n-paste the output from
git submodule status
git submodule status
 7f6f8db13b42916b27b690b8a3f9e2757ec1417f 3rd-party/openpmix (v4.2.3-8-g7f6f8db1)
 c7b2c715f92495637c298249deb5493e86864ac8 3rd-party/prrte (v3.0.1rc1-36-gc7b2c715)
 237ceff1a8ed996d855d69f372be9aaea44919ea config/oac (237ceff1)
Please describe the system on which you are running
Details of the problem
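A hypothetical set of launch commands matching the pass/fail pattern described above (the hostnames and binary name are placeholders, not from the actual runs):

```shell
# 2 ranks on 1 node: passes
mpirun -np 2 --host node1:2 ./allreduce_in_place
# 2 ranks on 2 nodes: passes
mpirun -np 2 --host node1:1,node2:1 ./allreduce_in_place
# 3 ranks on 2 nodes: segfaults
mpirun -np 3 --host node1:2,node2:1 ./allreduce_in_place
```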
Core 1:
Core 2: