open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Multiple failures running collective-big-count tests with OMPI main branch and 'adapt' collective component #10186

Open · drwootton opened 2 years ago

drwootton commented 2 years ago

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

OpenMPI main branch

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Built from current main branch (3/22/22)

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

 git submodule status
 1b86a35db2816ee9c0f3a41988005a2ba7d29adb 3rd-party/openpmix (v1.1.3-3481-g1b86a35d)
 91f791e209ccbdfb4b8647900d292ef51d52f37d 3rd-party/prrte (psrvr-v2.0.0rc1-4319-g91f791e2)

Please describe the system on which you are running


Details of the problem

I ran the set of self-checking tests from ompi-tests-public/collective-big-count with the collective components specified as --mca coll_adapt_priority 100 --mca coll adapt,basic,sm,self,inter,libnbc.

The following test cases, detailed below, had failures; the remaining test cases were successful.

The tests were compiled by running make in the directory containing the test source files.

The following environment variables were set for all tests:

BIGCOUNT_HOSTS          : -np 3
BIGCOUNT_MEMORY_PERCENT : 70
BIGCOUNT_MEMORY_DIFF    : 10
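
For context, each of these tests follows a self-checking pattern: fill a send buffer with a known per-rank value, run the collective with a count near INT_MAX (scaled down to fit in memory via BIGCOUNT_MEMORY_PERCENT), then verify every received slot. The sketch below is only a rough illustration of that pattern for the allgather case, not the actual test_allgather_uniform_count source; the count, the payload pattern, and the check/reporting logic are assumptions.

/*
 * Simplified illustration of the big-count self-checking pattern
 * (NOT the actual ompi-tests-public source).
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Count near INT_MAX, reduced to fit in memory; this value matches
     * the adjusted count shown in the output further below. */
    size_t count = 1288490188;

    int *sendbuf = malloc(count * sizeof(int));
    int *recvbuf = malloc((size_t)size * count * sizeof(int));
    if (sendbuf == NULL || recvbuf == NULL) {
        fprintf(stderr, "Rank %d: allocation failed\n", rank);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* Fill the send buffer with a simple per-rank pattern. */
    for (size_t i = 0; i < count; ++i) {
        sendbuf[i] = rank + 1;
    }

    MPI_Allgather(sendbuf, (int)count, MPI_INT,
                  recvbuf, (int)count, MPI_INT, MPI_COMM_WORLD);

    /* Self-check: every slot received from peer p must hold p + 1. */
    size_t wrong = 0;
    for (int p = 0; p < size; ++p) {
        for (size_t i = 0; i < count; ++i) {
            if (recvbuf[(size_t)p * count + i] != p + 1) {
                ++wrong;
            }
        }
    }
    printf("Rank %2d: %s (%zu wrong slots)\n",
           rank, wrong == 0 ? "PASSED" : "ERROR", wrong);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}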

The following command failed in an MPI_Allgather call:

mpirun ${BIGCOUNT_HOSTS} -x BIGCOUNT_MEMORY_PERCENT=${BIGCOUNT_MEMORY_PERCENT} -x BIGCOUNT_MEMORY_DIFF=${BIGCOUNT_MEMORY_DIFF} --mca btl ^openib --mca coll_adapt_priority 100 --mca coll adapt,basic,sm,self,inter,libnbc /u/dwootton/bigcount-master/BigCountUpstream/ompi-tests-public/collective-big-count/./test_allgather_uniform_count

The command failed with an assert and the following traceback:

 Assertion `reserve > 0' failed.
[c656f6n01:1739873] *** Process received signal ***
[c656f6n01:1739873] Signal: Aborted (6)
[c656f6n01:1739873] Signal code:  (-6)
[c656f6n01:1739873] [ 0] linux-vdso64.so.1(__kernel_sigtramp_rt64+0x0)[0x2000000604d8]
[c656f6n01:1739873] [ 1] /usr/lib64/libc.so.6(gsignal+0xd8)[0x2000006d44d8]
[c656f6n01:1739873] [ 2] /usr/lib64/libc.so.6(abort+0x164)[0x2000006b462c]
[c656f6n01:1739873] [ 3] /usr/lib64/libc.so.6(+0x37c70)[0x2000006c7c70]
[c656f6n01:1739873] [ 4] /usr/lib64/libc.so.6(__assert_fail+0x64)[0x2000006c7d14]
[c656f6n01:1739873] [ 5] /u/dwootton/ompi-master/lib/libmpi.so.0(+0x2e8108)[0x200000368108]
[c656f6n01:1739873] [ 6] /u/dwootton/ompi-master/lib/libmpi.so.0(+0x2ea478)[0x20000036a478]
[c656f6n01:1739873] [ 7] /u/dwootton/ompi-master/lib/libmpi.so.0(ompi_coll_adapt_ibcast+0x16c)[0x200000369ee8]
[c656f6n01:1739873] [ 8] /u/dwootton/ompi-master/lib/libmpi.so.0(ompi_coll_adapt_bcast+0x70)[0x2000003665ac]
[c656f6n01:1739873] [ 9] /u/dwootton/ompi-master/lib/libmpi.so.0(ompi_coll_base_allgather_intra_basic_linear+0x22c)[0x2000002aa010]
[c656f6n01:1739873] [10] /u/dwootton/ompi-master/lib/libmpi.so.0(PMPI_Allgather+0x41c)[0x20000018eb08]
[c656f6n01:1739873] [11] /u/dwootton/bigcount-master/BigCountUpstream/ompi-tests-public/collective-big-count/./test_allgather_uniform_count[0x10002fc4]
[c656f6n01:1739873] [12] /u/dwootton/bigcount-master/BigCountUpstream/ompi-tests-public/collective-big-count/./test_allgather_uniform_count[0x10002814]
[c656f6n01:1739873] [13] /usr/lib64/libc.so.6(+0x24c78)[0x2000006b4c78]
[c656f6n01:1739873] [14] /usr/lib64/libc.so.6(__libc_start_main+0xb4)[0x2000006b4e64]
[c656f6n01:1739873] *** End of error message ***
test_allgather_uniform_count: ../../../../ompi/mca/coll/base/coll_base_util.h:73: ompi_coll_base_nbc_reserve_tags: Assertion `reserve > 0' failed.
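
The assert fires in ompi_coll_base_nbc_reserve_tags, which reserves message tags for the underlying non-blocking collective. Purely as an illustrative guess (not a claim about the actual OMPI code), a reservation count derived from count times datatype size in 32-bit arithmetic wraps at these payload sizes (e.g. the adjusted count of 1288490188 elements of a 16-byte type shown in the output further below), which would trip an assert of exactly this form:

/*
 * Illustrative only (assumed arithmetic, not the actual OMPI code):
 * narrowing a big-count byte total to a 32-bit int wraps it negative,
 * so a value handed to a "reserve" helper can end up <= 0 and trip
 * an assert(reserve > 0).
 */
#include <assert.h>
#include <stdio.h>

int main(void)
{
    int count     = 1288490188;  /* adjusted big-count payload           */
    int type_size = 16;          /* e.g. double _Complex                 */
    int seg_size  = 4096;        /* hypothetical segment size, assumed   */

    long long bytes = (long long)count * type_size;  /* about 20.6 GB */

    /* If the byte total is ever narrowed to a 32-bit int (an assumed
     * failure mode for illustration only), the value wraps negative: */
    int bytes32   = (int)bytes;
    int reserve32 = bytes32 / seg_size;              /* comes out <= 0 */

    long long reserve64 = bytes / seg_size;          /* intended value */

    printf("32-bit reserve: %d\n", reserve32);
    printf("64-bit reserve: %lld\n", reserve64);

    /* Mirrors the failing check in the traceback above. */
    assert(reserve32 > 0);   /* aborts, just like `reserve > 0' failed */
    return 0;
}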

The following command failed in an MPI_Allreduce call:

mpirun ${BIGCOUNT_HOSTS} -x BIGCOUNT_MEMORY_PERCENT=${BIGCOUNT_MEMORY_PERCENT} -x BIGCOUNT_MEMORY_DIFF=${BIGCOUNT_MEMORY_DIFF} --mca btl ^openib --mca coll_adapt_priority 100 --mca coll adapt,basic,sm,self,inter,libnbc /u/dwootton/bigcount-master/BigCountUpstream/ompi-tests-public/collective-big-count/./test_allreduce_uniform_count

The assert and traceback look similar:

test_allreduce_uniform_count: ../../../../ompi/mca/coll/base/coll_base_util.h:73: ompi_coll_base_nbc_reserve_tags: Assertion `reserve > 0' failed.
[c656f6n01:1740468] *** Process received signal ***
[c656f6n01:1740468] Signal: Aborted (6)
[c656f6n01:1740468] Signal code:  (-6)
[c656f6n01:1740468] [ 0] linux-vdso64.so.1(__kernel_sigtramp_rt64+0x0)[0x2000000604d8]
[c656f6n01:1740468] [ 1] /usr/lib64/libc.so.6(gsignal+0xd8)[0x2000006d44d8]
[c656f6n01:1740468] [ 2] /usr/lib64/libc.so.6(abort+0x164)[0x2000006b462c]
[c656f6n01:1740468] [ 3] /usr/lib64/libc.so.6(+0x37c70)[0x2000006c7c70]
[c656f6n01:1740468] [ 4] /usr/lib64/libc.so.6(__assert_fail+0x64)[0x2000006c7d14]
[c656f6n01:1740468] [ 5] /u/dwootton/ompi-master/lib/libmpi.so.0(+0x2ed40c)[0x20000036d40c]
[c656f6n01:1740468] [ 6] /u/dwootton/ompi-master/lib/libmpi.so.0(+0x2effd0)[0x20000036ffd0]
[c656f6n01:1740468] [ 7] /u/dwootton/ompi-master/lib/libmpi.so.0(ompi_coll_adapt_ireduce+0x234)[0x20000036f938]
[c656f6n01:1740468] [ 8] /u/dwootton/ompi-master/lib/libmpi.so.0(ompi_coll_adapt_reduce+0x130)[0x20000036af18]
[c656f6n01:1740468] [ 9] /u/dwootton/ompi-master/lib/libmpi.so.0(mca_coll_basic_allreduce_intra+0xfc)[0x200000355ef0]
[c656f6n01:1740468] [10] /u/dwootton/ompi-master/lib/libmpi.so.0(PMPI_Allreduce+0x520)[0x200000192350]
[c656f6n01:1740468] [11] /u/dwootton/bigcount-master/BigCountUpstream/ompi-tests-public/collective-big-count/./test_allreduce_uniform_count[0x10002bf4]
[c656f6n01:1740468] [12] /u/dwootton/bigcount-master/BigCountUpstream/ompi-tests-public/collective-big-count/./test_allreduce_uniform_count[0x10002818]
[c656f6n01:1740468] [13] /usr/lib64/libc.so.6(+0x24c78)[0x2000006b4c78]
[c656f6n01:1740468] [14] /usr/lib64/libc.so.6(__libc_start_main+0xb4)[0x2000006b4e64]
[c656f6n01:1740468] *** End of error message ***

The following command failed with a self-check error reporting invalid results, followed by a SIGSEGV:

mpirun ${BIGCOUNT_HOSTS} -x BIGCOUNT_MEMORY_PERCENT=${BIGCOUNT_MEMORY_PERCENT} -x BIGCOUNT_MEMORY_DIFF=${BIGCOUNT_MEMORY_DIFF} --mca btl ^openib --mca coll_adapt_priority 100 --mca coll adapt,basic,sm,self,inter,libnbc /u/dwootton/bigcount-master/BigCountUpstream/ompi-tests-public/collective-big-count/./test_alltoall_uniform_count

The error message and traceback are:

Results from MPI_Alltoall(int x 6442450941 = 25769803764 or 24.0 GB): MPI_IN_PLACE
Rank  1: ERROR: DI in     2147483647 of     2147483647 slots ( 100.0 % wrong)
Rank  2: ERROR: DI in     4294967294 of     2147483647 slots ( 200.0 % wrong)
Rank  0: ERROR: DI in     4831821818 of     2147483647 slots ( 225.0 % wrong)
--------------------- Adjust count to fit in memory: 2147483647 x  60.0% = 1288490188
Root  : payload    61847529024  57.6 GB =  16 dt x 1288490188 count x   3 peers x   1.0 inflation
Peer  : payload    61847529024  57.6 GB =  16 dt x 1288490188 count x   3 peers x   1.0 inflation
Total : payload   185542587072 172.8 GB =  57.6 GB root +  57.6 GB x   2 local peers
[c656f6n01:1740537] *** Process received signal ***
[c656f6n01:1740537] Signal: Segmentation fault (11)
[c656f6n01:1740537] Signal code: Address not mapped (1)
[c656f6n01:1740537] Failing at address: 0x1ff9a2999990
[c656f6n01:1740537] [ 0] linux-vdso64.so.1(__kernel_sigtramp_rt64+0x0)[0x2000000604d8]
[c656f6n01:1740537] [ 1] /usr/lib64/libc.so.6(+0xb083c)[0x20000074083c]
[c656f6n01:1740537] [ 2] /u/dwootton/ompi-master/lib/libmpi.so.0(mca_coll_base_alltoall_intra_basic_inplace+0x22c)[0x2000002b3c94]
[c656f6n01:1740537] [ 3] /u/dwootton/ompi-master/lib/libmpi.so.0(ompi_coll_base_alltoall_intra_basic_linear+0x8c)[0x2000002b5684]
[c656f6n01:1740537] [ 4] /u/dwootton/ompi-master/lib/libmpi.so.0(PMPI_Alltoall+0x538)[0x200000193cd4]
[c656f6n01:1740537] [ 5] /u/dwootton/bigcount-master/BigCountUpstream/ompi-tests-public/collective-big-count/./test_alltoall_uniform_count[0x10002dd0]
[c656f6n01:1740537] [ 6] /u/dwootton/bigcount-master/BigCountUpstream/ompi-tests-public/collective-big-count/./test_alltoall_uniform_count[0x1000289c]
[c656f6n01:1740537] [ 7] /usr/lib64/libc.so.6(+0x24c78)[0x2000006b4c78]
[c656f6n01:1740537] [ 8] /usr/lib64/libc.so.6(__libc_start_main+0xb4)[0x2000006b4e64]
[c656f6n01:1740537] *** End of error message ***
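
The segfault is at an unmapped address inside the in-place alltoall path. As another illustrative assumption (not a claim about the actual mca_coll_base_alltoall_intra_basic_inplace code), a peer offset computed as peer * count * extent in 32-bit arithmetic wraps at these sizes, so the derived buffer address would land far outside the allocation:

/*
 * Illustrative only (assumed arithmetic, not the actual OMPI code):
 * a per-peer byte offset narrowed to a 32-bit int wraps for big
 * counts, producing the kind of wild address seen in the SIGSEGV.
 */
#include <stdio.h>

int main(void)
{
    /* Figures from the output above: 1288490188 elements of a
     * 16-byte type, three ranks. */
    int peer   = 2;
    int count  = 1288490188;
    int extent = 16;

    /* Offset computed in 64-bit arithmetic (the intended value)... */
    long long offset64 = (long long)peer * count * extent;

    /* ...versus the same product narrowed to a 32-bit int (an assumed
     * failure mode for illustration only), which wraps negative: */
    int offset32 = (int)offset64;

    printf("intended 64-bit offset: %lld bytes\n", offset64);
    printf("wrapped  32-bit offset: %d bytes\n", offset32);

    /* Indexing the receive buffer with the wrapped offset points far
     * outside the allocation, i.e. an unmapped address. */
    return 0;
}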

The following command failed with an assert and traceback similar to test_allreduce_uniform_count, except that the failing MPI call is MPI_Alltoall:

mpirun ${BIGCOUNT_HOSTS} -x BIGCOUNT_MEMORY_PERCENT=${BIGCOUNT_MEMORY_PERCENT} -x BIGCOUNT_MEMORY_DIFF=${BIGCOUNT_MEMORY_DIFF} --mca btl ^openib --mca coll_adapt_priority 100 --mca coll adapt,basic,sm,self,inter,libnbc /u/dwootton/bigcount-master/BigCountUpstream/ompi-tests-public/collective-big-count/./test_alltoall_uniform_count

The following command failed with an error message indicating a failed self-check, followed by a double free or storage corruption error:

mpirun ${BIGCOUNT_HOSTS} -x BIGCOUNT_MEMORY_PERCENT=${BIGCOUNT_MEMORY_PERCENT} -x BIGCOUNT_MEMORY_DIFF=${BIGCOUNT_MEMORY_DIFF} --mca btl ^openib --mca coll_adapt_priority 100 --mca coll adapt,basic,sm,self,inter,libnbc /u/dwootton/bigcount-master/BigCountUpstream/ompi-tests-public/collective-big-count/./test_gather_uniform_count
Results from MPI_Igather(int x 6442450941 = 25769803764 or 24.0 GB):
Rank  0: ERROR: DI in     4294967292 of     6442450941 slots (  66.7 % wrong)
---------------------
Results from MPI_Igather(double _Complex x 6442450941 = 103079215056 or 96.0 GB):
Rank  0: ERROR: DI in     4294967292 of     6442450941 slots (  66.7 % wrong)
double free or corruption (out)
[c656f6n01:1740837] *** Process received signal ***
[c656f6n01:1740837] Signal: Aborted (6)
[c656f6n01:1740837] Signal code:  (-6)
[c656f6n01:1740837] [ 0] linux-vdso64.so.1(__kernel_sigtramp_rt64+0x0)[0x2000000604d8]
[c656f6n01:1740837] [ 1] /usr/lib64/libc.so.6(gsignal+0xd8)[0x2000006d44d8]
[c656f6n01:1740837] [ 2] /usr/lib64/libc.so.6(abort+0x164)[0x2000006b462c]
[c656f6n01:1740837] [ 3] /usr/lib64/libc.so.6(+0x908bc)[0x2000007208bc]
[c656f6n01:1740837] [ 4] /usr/lib64/libc.so.6(+0x9b828)[0x20000072b828]
[c656f6n01:1740837] [ 5] /usr/lib64/libc.so.6(+0x9e0ec)[0x20000072e0ec]
[c656f6n01:1740837] [ 6] ./test_gather_uniform_count[0x100030b0]
[c656f6n01:1740837] [ 7] ./test_gather_uniform_count[0x10002920]
[c656f6n01:1740837] [ 8] /usr/lib64/libc.so.6(+0x24c78)[0x2000006b4c78]
[c656f6n01:1740837] [ 9] /usr/lib64/libc.so.6(__libc_start_main+0xb4)[0x2000006b4e64]
[c656f6n01:1740837] *** End of error message ***

The following command failed with an assert and traceback similar to test_allreduce_uniform_count, except that the failing MPI call is MPI_Reduce:

mpirun ${BIGCOUNT_HOSTS} -x BIGCOUNT_MEMORY_PERCENT=${BIGCOUNT_MEMORY_PERCENT} -x BIGCOUNT_MEMORY_DIFF=${BIGCOUNT_MEMORY_DIFF} --mca btl ^openib --mca coll_adapt_priority 100 --mca coll adapt,basic,sm,self,inter,libnbc /u/dwootton/bigcount-master/BigCountUpstream/ompi-tests-public/collective-big-count/./test_reduce_uniform_count

The following command failed with a self-check message indicating that the test case produced invalid results:

mpirun ${BIGCOUNT_HOSTS} -x BIGCOUNT_MEMORY_PERCENT=${BIGCOUNT_MEMORY_PERCENT} -x BIGCOUNT_MEMORY_DIFF=${BIGCOUNT_MEMORY_DIFF} --mca btl ^openib --mca coll_adapt_priority 100 --mca coll adapt,basic,sm,self,inter,libnbc /u/dwootton/bigcount-master/BigCountUpstream/ompi-tests-public/collective-big-count/./test_scatter_uniform_count
Results from MPI_Iscatter(int x 6442450941 = 25769803764 or 24.0 GB):
Rank  2: ERROR: DI in     2147483647 of     2147483647 slots ( 100.0 % wrong)
Rank  1: PASSED
Rank  0: PASSED

Results from MPI_Iscatter(double _Complex x 6442450941 = 103079215056 or 96.0 GB):
Rank  2: ERROR: DI in     2147483647 of     2147483647 slots ( 100.0 % wrong)
Rank  1: PASSED
Rank  0: PASSED
bosilca commented 2 years ago

I got some, maybe most, of them, but there are other issues that need a little bit more thinking. There are also a few corner cases where one of the processes gets killed by the OOM killer, and that's something you cannot trap in gdb. I'll push a PR soon for both #10186 and #10187.

wenduwan commented 4 months ago

Removing v5.0.x label - this will be a main-only change.