Open drwootton opened 2 years ago
There are similar failures with OpenMPI v4.1.x for the test_alltoall_uniform_count, test_gather_uniform_count and test_scatter_uniform_count testcases when running with the --mca coll_adapt_priority 100 --mca coll adapt,basic,sm,self,inter,libnbc options.
There are similar failures for all four failing testcases with OpenMPI v4.1.x when using the --mca coll_han_priority 100 --mca coll han,basic,sm,self,inter,libnbc options.
There are similar failures for all four failing testcases with OpenMPI v4.1.x when using the --mca coll tuned,basic,sm,self,inter,libnbc option.
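For anyone trying to reproduce this, the collective configurations mentioned in this thread can be lined up side by side. The following is only a sketch: it prints the command lines instead of running them, and it assumes 3 ranks (taken from the `-np 3` in the BIGCOUNT_HOSTS setting of the report) and a test binary named after the testcase in the current directory. The helper name `print_cmds` is made up for illustration.

```shell
#!/bin/sh
# Print one mpirun command per collective configuration named in this issue.
# Assumptions (not confirmed by the report): the launcher is mpirun, the job
# uses 3 ranks, and the test binary sits in the current directory.
TESTCASE="test_alltoall_uniform_count"

print_cmds() {
    # Each heredoc line is one set of --mca options quoted from the issue.
    while IFS= read -r opts; do
        echo "mpirun -np 3 ${opts} ./${TESTCASE}"
    done <<'EOF'
--mca coll basic,sm,self,inter,libnbc
--mca coll_adapt_priority 100 --mca coll adapt,basic,sm,self,inter,libnbc
--mca coll_han_priority 100 --mca coll han,basic,sm,self,inter,libnbc
--mca coll tuned,basic,sm,self,inter,libnbc
EOF
}

print_cmds
```

Changing `TESTCASE` to another testcase name prints the same four configurations for that test.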
@drwootton Were any of these issues fixed on main and could be back-ported to the v4.0.x / v4.1.x branches?
@jsquyres Results on the main branch, per testcase:
test-allgather-uniform-count: I don't see the problem in main, but it could be masked by the other failure in this test on the main branch.
test-allreduce-uniform-count: I don't see the failure in any main branch run.
test-alltoall-uniform-count: I think I see the same failure on the main branch.
test-gather-uniform-count: I think I see the same failure on the main branch.
test-scatter-uniform-count: I see a failure on the main branch that could be the same failure, but without the read error or SIGSEGV.
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
OpenMPI v4.1.x branch
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Built from current v4.1.x branch (3/22/22)
If you are building/installing from a git clone, please copy-n-paste the output from
git submodule status
git submodule status does not display anything.
Please describe the system on which you are running
Details of the problem
I ran the set of self-checking tests from ompi-tests-public/collective-big-count with collective components specified as --mca coll basic,sm,self,inter,libnbc
The following testcases had failures. The remaining testcases were successful:
The tests were compiled by running make in the directory containing the source files
The following environment variables were set for all tests:
BIGCOUNT_HOSTS : -np 3
BIGCOUNT_MEMORY_PERCENT : 70
BIGCOUNT_MEMORY_DIFF : 10
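As a rough sketch of the setup described above, the environment and a command line could be assembled as follows. The variable values and the component list are taken from this report. How BIGCOUNT_HOSTS is consumed is an assumption: its value `-np 3` suggests it is passed to the launcher as arguments, and the sketch treats it that way. The launcher name `mpirun`, the `./` binary location, and the helper `bigcount_cmd` are likewise assumptions for illustration. The command is printed, not executed.

```shell
#!/bin/sh
# Environment variables as listed in the report.
export BIGCOUNT_HOSTS="-np 3"
export BIGCOUNT_MEMORY_PERCENT=70
export BIGCOUNT_MEMORY_DIFF=10

# Collective components used for the runs described in the report.
COLL="basic,sm,self,inter,libnbc"

# Hypothetical helper: build the command line for one testcase.
# Assumption: BIGCOUNT_HOSTS holds launcher arguments, and the test binary
# is named after the testcase and lives in the current directory.
bigcount_cmd() {
    echo "mpirun ${BIGCOUNT_HOSTS} --mca coll ${COLL} ./$1"
}

bigcount_cmd test_scatter_uniform_count
```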
This command failed with a self-check error message, a read error, and a SIGSEGV in MPI_Allgather.
These are the error messages and the traceback.
This command failed with a self-check error message and a SIGSEGV in MPI_Alltoall. This failure looks similar to the same testcase's failure in issue #10186 with the OpenMPI main branch. The second task reports a read error and the third task fails with a SIGSEGV in MPI_Alltoall and a different traceback.
This is the error message and traceback.
This command failed with a self-check error message, then a double free or storage corruption error. This failure looks similar to the failure in issue #10186 for the same testcase using the OpenMPI main branch.
This is the self-check error message and traceback.
This command failed with a self-check error message, then a SIGSEGV in MPI_Wait.
This is the error message and traceback.