open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Assertion `reserve > 0' failed running collective-big-count tests using v4.1.x branch and --mca coll adapt,basic,sm,self,inter,libnbc option #10221

Open drwootton opened 2 years ago

drwootton commented 2 years ago


Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

OpenMPI v4.1.x branch

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Built from current v4.1.x branch (3/22/22)

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

git submodule status does not display anything.

Please describe the system on which you are running


Details of the problem

I ran the set of self-checking tests from ompi-tests-public/collective-big-count with collective components specified as --mca coll_adapt_priority 100 --mca coll adapt,basic,sm,self,inter,libnbc

The following environment variables were set for all tests:

BIGCOUNT_HOSTS : -np 3
BIGCOUNT_MEMORY_PERCENT : 70
BIGCOUNT_MEMORY_DIFF : 10

For instance, I ran this command

mpirun ${BIGCOUNT_HOSTS} -x BIGCOUNT_MEMORY_PERCENT=${BIGCOUNT_MEMORY_PERCENT} -x BIGCOUNT_MEMORY_DIFF=${BIGCOUNT_MEMORY_DIFF} --mca btl ^openib --mca coll_adapt_priority 100 --mca coll adapt,basic,sm,self,inter,libnbc /u/dwootton/bigcount-41x/BigCountUpstream/ompi-tests-public/collective-big-count/./test_allgather_uniform_count

The command failed with this assertion and backtrace:

test_allgather_uniform_count: ../../../../ompi/mca/coll/base/coll_base_util.h:73: ompi_coll_base_nbc_reserve_tags: Assertion `reserve > 0' failed.
[c656f6n01:2537658] *** Process received signal ***
[c656f6n01:2537658] Signal: Aborted (6)
[c656f6n01:2537658] Signal code:  (-6)
[c656f6n01:2537658] [ 0] linux-vdso64.so.1(__kernel_sigtramp_rt64+0x0)[0x2000000604d8]
[c656f6n01:2537658] [ 1] /lib64/libc.so.6(gsignal+0xd8)[0x2000003c44d8]
[c656f6n01:2537658] [ 2] /lib64/libc.so.6(abort+0x164)[0x2000003a462c]
[c656f6n01:2537658] [ 3] /lib64/libc.so.6(+0x37c70)[0x2000003b7c70]
[c656f6n01:2537658] [ 4] /lib64/libc.so.6(__assert_fail+0x64)[0x2000003b7d14]
[c656f6n01:2537658] [ 5] /u/dwootton/ompi-4-1-x/lib/openmpi/mca_coll_adapt.so(+0x544c)[0x200002ef544c]
[c656f6n01:2537658] [ 6] /u/dwootton/ompi-4-1-x/lib/openmpi/mca_coll_adapt.so(+0x76a4)[0x200002ef76a4]
[c656f6n01:2537658] [ 7] /u/dwootton/ompi-4-1-x/lib/openmpi/mca_coll_adapt.so(ompi_coll_adapt_ibcast+0x12c)[0x200002ef7118]
[c656f6n01:2537658] [ 8] /u/dwootton/ompi-4-1-x/lib/openmpi/mca_coll_adapt.so(ompi_coll_adapt_bcast+0x70)[0x200002ef3a30]
[c656f6n01:2537658] [ 9] /u/dwootton/ompi-4-1-x/lib/libmpi.so.40(ompi_coll_base_allgather_intra_basic_linear+0x22c)[0x2000001fc32c]
[c656f6n01:2537658] [10] /u/dwootton/ompi-4-1-x/lib/libmpi.so.40(MPI_Allgather+0x3c0)[0x200000129ec4]
[c656f6n01:2537658] [11] /u/dwootton/bigcount-41x/BigCountUpstream/ompi-tests-public/collective-big-count/./test_allgather_uniform_count[0x10002fc4]
[c656f6n01:2537658] [12] /u/dwootton/bigcount-41x/BigCountUpstream/ompi-tests-public/collective-big-count/./test_allgather_uniform_count[0x10002814]
[c656f6n01:2537658] [13] /lib64/libc.so.6(+0x24c78)[0x2000003a4c78]
[c656f6n01:2537658] [14] /lib64/libc.so.6(__libc_start_main+0xb4)[0x2000003a4e64]
[c656f6n01:2537658] *** End of error message ***

The following testcases had this failure

The tests were compiled by running make in the directory containing the source files
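
For anyone trying to reproduce this, the invocation shown earlier can be wrapped in a small script. This is only a sketch under stated assumptions: `TEST_DIR` and the `build_cmd` helper are hypothetical names introduced here, and the defaults simply mirror the environment settings listed in this report. The script prints the `mpirun` command line for each test name given as an argument rather than launching anything itself.

```shell
#!/bin/sh
# Sketch only: print the mpirun command line used in this report for each
# test binary named on the command line.
# TEST_DIR and build_cmd are hypothetical names, not part of the test suite.

# Defaults mirror the settings listed in the report; override via the environment.
: "${BIGCOUNT_HOSTS:=-np 3}"
: "${BIGCOUNT_MEMORY_PERCENT:=70}"
: "${BIGCOUNT_MEMORY_DIFF:=10}"
: "${TEST_DIR:=.}"    # assumed: directory where the tests were built with make

# build_cmd TEST_NAME: write the full command line for one test to stdout.
build_cmd() {
    printf '%s %s %s %s %s %s %s\n' \
        "mpirun ${BIGCOUNT_HOSTS}" \
        "-x BIGCOUNT_MEMORY_PERCENT=${BIGCOUNT_MEMORY_PERCENT}" \
        "-x BIGCOUNT_MEMORY_DIFF=${BIGCOUNT_MEMORY_DIFF}" \
        "--mca btl ^openib" \
        "--mca coll_adapt_priority 100" \
        "--mca coll adapt,basic,sm,self,inter,libnbc" \
        "${TEST_DIR}/$1"
}

# Print one command line per test name passed as an argument.
for t in "$@"; do
    build_cmd "$t"
done
```

Saved under a name of your choosing (say `bigcount_cmd.sh`, also hypothetical), running `sh bigcount_cmd.sh test_allgather_uniform_count` prints the command so it can be inspected first, and piping that output to `sh` would launch it.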

jsquyres commented 2 years ago

@drwootton Were any of these issues fixed on main, and if so, could the fixes be back-ported to the v4.0.x / v4.1.x branches?

drwootton commented 2 years ago

@jsquyres With the main branch I saw this failure only once, in test-allgather-uniform-count, so the problem may already be fixed in main. I don't see any failures with the main branch for either test-bcast-uniform-count or test-reduce-uniform-count. I see the same (or a very similar) failure for test-reduce-uniform-count in issue #10186. For test-allgather-uniform-count, I can't tell whether the problem is fixed or whether the other failure that test hits in main occurs before the code reaches the point where this problem would show up.