open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Multiple failures running collective-big-count tests with OMPI v4.1.x branch #10220

Open drwootton opened 2 years ago

drwootton commented 2 years ago

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

OpenMPI v4.1.x branch

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Built from current v4.1.x branch (3/22/22)

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

git submodule status does not display anything.

Please describe the system on which you are running


Details of the problem

I ran the set of self-checking tests from ompi-tests-public/collective-big-count with collective components specified as --mca coll basic,sm,self,inter,libnbc

The following testcases had failures. The remaining testcases were successful:

The tests were compiled by running make in the directory containing the source files.

The following environment variables were set for all tests:

BIGCOUNT_HOSTS : -np 3
BIGCOUNT_MEMORY_PERCENT : 70
BIGCOUNT_MEMORY_DIFF : 10

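This command failed with a self-check error message, a read error, and a SIGSEGV while running the MPI_Allgather test. The traceback shows the fault under PMPI_Wait, after the MPI_Iallgather results header.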

mpirun ${BIGCOUNT_HOSTS} -x BIGCOUNT_MEMORY_PERCENT=${BIGCOUNT_MEMORY_PERCENT} -x BIGCOUNT_MEMORY_DIFF=${BIGCOUNT_MEMORY_DIFF} --mca btl ^openib --mca coll basic,sm,self,inter,libnbc /u/dwootton/bigcount-41x/BigCountUpstream/ompi-tests-public/collective-big-count/./test_allgather_uniform_count
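
For anyone scripting the reproduction, the invocation above can be assembled from the three reported settings. This is only a minimal sketch: `build_command` and `TEST_DIR` are hypothetical names introduced here, the path is copied from the command above, and the script builds the argument list without launching anything.

```python
import shlex

# Settings as reported in this issue.
BIGCOUNT_HOSTS = "-np 3"
ENV = {"BIGCOUNT_MEMORY_PERCENT": "70", "BIGCOUNT_MEMORY_DIFF": "10"}
# Hypothetical constant: directory holding the compiled tests (from the command above).
TEST_DIR = "/u/dwootton/bigcount-41x/BigCountUpstream/ompi-tests-public/collective-big-count"


def build_command(test_name, coll="basic,sm,self,inter,libnbc"):
    """Return the mpirun argument list for one big-count test (hypothetical helper)."""
    cmd = ["mpirun", *shlex.split(BIGCOUNT_HOSTS)]
    for name, value in ENV.items():
        cmd += ["-x", f"{name}={value}"]  # export the variable to every rank
    cmd += ["--mca", "btl", "^openib", "--mca", "coll", coll]
    cmd.append(f"{TEST_DIR}/{test_name}")
    return cmd


if __name__ == "__main__":
    # Print a shell-quoted command line for inspection.
    print(shlex.join(build_command("test_allgather_uniform_count")))
```

Passing a different `coll` string is enough to try another component list with the same test.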

These are the error messages and the traceback.

Results from MPI_Allgather(double _Complex x 3865470564 = 61847529024 or 57.6 GB):  MPI_IN_PLACE
Rank  1: PASSED
Rank  2: PASSED
Rank  0: PASSED
---------------------
Results from MPI_Iallgather(int x 6442450941 = 25769803764 or 24.0 GB):  MPI_IN_PLACE
Rank  1: ERROR: DI in     4294967292 of     6442450941 slots (  66.7 % wrong)
Rank  0: ERROR: DI in     4294967292 of     6442450941 slots (  66.7 % wrong)
Rank  2: ERROR: DI in     2147483645 of     6442450941 slots (  33.3 % wrong)
--------------------- Adjust count to fit in memory: 2147483647 x  60.0% = 1288490188
Root  : payload    61847529024  57.6 GB =  16 dt x 1288490188 count x   3 peers x   1.0 inflation
Peer  : payload    61847529024  57.6 GB =  16 dt x 1288490188 count x   3 peers x   1.0 inflation
Total : payload   185542587072 172.8 GB =  57.6 GB root +  57.6 GB x   2 local peers
[c656f6n01:2532344] Read -1, expected 20615843008, errno = 14
---------------------
Results from MPI_Iallgather(double _Complex x 3865470564 = 61847529024 or 57.6 GB):  MPI_IN_PLACE
[c656f6n01:2532343] Read -1, expected 20615843008, errno = 14
[c656f6n01:2532345] *** Process received signal ***
[c656f6n01:2532345] Signal: Segmentation fault (11)
[c656f6n01:2532345] Signal code: Address not mapped (1)
[c656f6n01:2532345] Failing at address: 0x1ff9a5999990
[c656f6n01:2532345] [ 0] linux-vdso64.so.1(__kernel_sigtramp_rt64+0x0)[0x2000000604d8]
[c656f6n01:2532345] [ 1] /lib64/libc.so.6(+0xb083c)[0x20000043083c]
[c656f6n01:2532345] [ 2] /u/dwootton/ompi-4-1-x/lib/openmpi/mca_pml_ob1.so(+0x29df0)[0x200002de9df0]
[c656f6n01:2532345] [ 3] /u/dwootton/ompi-4-1-x/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_schedule_once+0x2a0)[0x200002ded00c]
[c656f6n01:2532345] [ 4] /u/dwootton/ompi-4-1-x/lib/openmpi/mca_pml_ob1.so(+0x1d41c)[0x200002ddd41c]
[c656f6n01:2532345] [ 5] /u/dwootton/ompi-4-1-x/lib/openmpi/mca_pml_ob1.so(+0x1d4dc)[0x200002ddd4dc]
[c656f6n01:2532345] [ 6] /u/dwootton/ompi-4-1-x/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_frag_callback_ack+0x1c4)[0x200002ddec5c]
[c656f6n01:2532345] [ 7] /u/dwootton/ompi-4-1-x/lib/openmpi/mca_btl_vader.so(mca_btl_vader_poll_handle_frag+0x1d8)[0x20000287a244]
[c656f6n01:2532345] [ 8] /u/dwootton/ompi-4-1-x/lib/openmpi/mca_btl_vader.so(+0xa368)[0x20000287a368]
[c656f6n01:2532345] [ 9] /u/dwootton/ompi-4-1-x/lib/openmpi/mca_btl_vader.so(+0xa6e0)[0x20000287a6e0]
[c656f6n01:2532345] [10] /u/dwootton/ompi-4-1-x/lib/libopen-pal.so.40(opal_progress+0x5c)[0x2000007533f4]
[c656f6n01:2532345] [11] /u/dwootton/ompi-4-1-x/lib/libmpi.so.40(+0x87e18)[0x200000107e18]
[c656f6n01:2532345] [12] /u/dwootton/ompi-4-1-x/lib/libmpi.so.40(ompi_request_default_wait+0x38)[0x200000107e88]
[c656f6n01:2532345] [13] /u/dwootton/ompi-4-1-x/lib/libmpi.so.40(PMPI_Wait+0x1d0)[0x2000001c43b4]
[c656f6n01:2532345] [14] /u/dwootton/bigcount-41x/BigCountUpstream/ompi-tests-public/collective-big-count/./test_allgather_uniform_count[0x100030d8]
[c656f6n01:2532345] [15] /u/dwootton/bigcount-41x/BigCountUpstream/ompi-tests-public/collective-big-count/./test_allgather_uniform_count[0x100029d0]
[c656f6n01:2532345] [16] /lib64/libc.so.6(+0x24c78)[0x2000003a4c78]
[c656f6n01:2532345] [17] /lib64/libc.so.6(__libc_start_main+0xb4)[0x2000003a4e64]
[c656f6n01:2532345] *** End of error message ***
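
The sizes printed in the log above can be cross-checked with plain integer arithmetic. The sketch below assumes the log's "GB" figures are binary gigabytes (2**30 bytes) and that the 60.0% adjustment truncates toward zero; both assumptions reproduce the printed values, but neither is confirmed by the test source.

```python
INT_MAX = 2**31 - 1          # largest value of a 32-bit signed C int
GIB = 2**30                  # assumption: the log's "GB" means 2**30 bytes

peers = 3                    # -np 3
dt_size = 16                 # "16 dt" in the log, bytes per double _Complex

# "Adjust count to fit in memory: 2147483647 x 60.0% = 1288490188"
count = INT_MAX * 60 // 100

payload = dt_size * count * peers   # "Root" and "Peer" payload lines
total = payload * peers             # root plus two local peers

# Figures from the MPI_Iallgather(int ...) self-check lines.
int_count = 6442450941       # elements in the int test
wrong_slots = 4294967292     # slots reported wrong on ranks 0 and 1

if __name__ == "__main__":
    print(count, payload, round(payload / GIB, 1))
    print(total, round(total / GIB, 1))
    print(int_count == peers * INT_MAX, wrong_slots == 2 * (INT_MAX - 1))
```

The element count is exactly 3 x INT_MAX and the wrong-slot count on ranks 0 and 1 is 2 x (INT_MAX - 1). That is suggestive of a 32-bit limit somewhere in the path, although the arithmetic alone does not establish the cause.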

This command failed with a self-check error message and a SIGSEGV in MPI_Alltoall. This failure looks similar to the failure of the same testcase reported in issue #10186 against the OpenMPI main branch. The second task reports a read error, and the third task fails with a SIGSEGV in MPI_Alltoall with a different traceback.

mpirun ${BIGCOUNT_HOSTS} -x BIGCOUNT_MEMORY_PERCENT=${BIGCOUNT_MEMORY_PERCENT} -x BIGCOUNT_MEMORY_DIFF=${BIGCOUNT_MEMORY_DIFF} --mca btl ^openib --mca coll basic,sm,self,inter,libnbc /u/dwootton/bigcount-41x/BigCountUpstream/ompi-tests-public/collective-big-count/./test_alltoall_uniform_count

This is the error message and traceback.

---------------------
Results from MPI_Alltoall(int x 6442450941 = 25769803764 or 24.0 GB): MPI_IN_PLACE
Rank  1: ERROR: DI in     2147483647 of     2147483647 slots ( 100.0 % wrong)
Rank  2: ERROR: DI in     4294967294 of     2147483647 slots ( 200.0 % wrong)
Rank  0: ERROR: DI in     4563402743 of     2147483647 slots ( 212.5 % wrong)
--------------------- Adjust count to fit in memory: 2147483647 x  60.0% = 1288490188
Root  : payload    61847529024  57.6 GB =  16 dt x 1288490188 count x   3 peers x   1.0 inflation
Peer  : payload    61847529024  57.6 GB =  16 dt x 1288490188 count x   3 peers x   1.0 inflation
Total : payload   185542587072 172.8 GB =  57.6 GB root +  57.6 GB x   2 local peers
[c656f6n01:2533656] *** Process received signal ***
[c656f6n01:2533656] Signal: Segmentation fault (11)
[c656f6n01:2533656] Signal code: Address not mapped (1)
[c656f6n01:2533656] [ 0] linux-vdso64.so.1(__kernel_sigtramp_rt64+0x0)[0x2000000604d8]
[c656f6n01:2533656] [ 1] /lib64/libc.so.6(+0xb083c)[0x20000043083c]
[c656f6n01:2533656] [ 2] /u/dwootton/ompi-4-1-x/lib/libmpi.so.40(mca_coll_base_alltoall_intra_basic_inplace+0x22c)[0x200000205660]
[c656f6n01:2533656] [ 3] /u/dwootton/ompi-4-1-x/lib/libmpi.so.40(ompi_coll_base_alltoall_intra_basic_linear+0x8c)[0x200000207050]
[c656f6n01:2533656] [ 4] /u/dwootton/ompi-4-1-x/lib/libmpi.so.40(MPI_Alltoall+0x4f4)[0x20000012e670]
[c656f6n01:2533656] [ 5] /u/dwootton/bigcount-41x/BigCountUpstream/ompi-tests-public/collective-big-count/./test_alltoall_uniform_count[0x10002dd0]
[c656f6n01:2533656] [ 6] /u/dwootton/bigcount-41x/BigCountUpstream/ompi-tests-public/collective-big-count/./test_alltoall_uniform_count[0x1000289c]
[c656f6n01:2533656] [ 7] /lib64/libc.so.6(+0x24c78)[0x2000003a4c78]
[c656f6n01:2533656] [ 8] /lib64/libc.so.6(__libc_start_main+0xb4)[0x2000003a4e64]
[c656f6n01:2533656] *** End of error message ***

---------------------
Results from MPI_Alltoall(double _Complex x 3865470564 = 61847529024 or 57.6 GB): MPI_IN_PLACE
[c656f6n01:2533657] Read -1, expected 20615843008, errno = 14
[c656f6n01:2533655] *** Process received signal ***
[c656f6n01:2533655] Signal: Segmentation fault (11)
[c656f6n01:2533655] Signal code: Address not mapped (1)
[c656f6n01:2533655] Failing at address: 0x1ff9a5999990
[c656f6n01:2533655] [ 0] linux-vdso64.so.1(__kernel_sigtramp_rt64+0x0)[0x2000000604d8]
[c656f6n01:2533655] [ 1] /lib64/libc.so.6(+0xb083c)[0x20000043083c]
[c656f6n01:2533655] [ 2] /u/dwootton/ompi-4-1-x/lib/openmpi/mca_pml_ob1.so(+0x29df0)[0x200002de9df0]
[c656f6n01:2533655] [ 3] /u/dwootton/ompi-4-1-x/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_schedule_once+0x2a0)[0x200002ded00c]
[c656f6n01:2533655] [ 4] /u/dwootton/ompi-4-1-x/lib/openmpi/mca_pml_ob1.so(+0x1d41c)[0x200002ddd41c]
[c656f6n01:2533655] [ 5] /u/dwootton/ompi-4-1-x/lib/openmpi/mca_pml_ob1.so(+0x1d4dc)[0x200002ddd4dc]
[c656f6n01:2533655] [ 6] /u/dwootton/ompi-4-1-x/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_frag_callback_ack+0x1c4)[0x200002ddec5c]
[c656f6n01:2533655] [ 7] /u/dwootton/ompi-4-1-x/lib/openmpi/mca_btl_vader.so(mca_btl_vader_poll_handle_frag+0x1d8)[0x20000287a244]
[c656f6n01:2533655] [ 8] /u/dwootton/ompi-4-1-x/lib/openmpi/mca_btl_vader.so(+0xa368)[0x20000287a368]
[c656f6n01:2533655] [ 9] /u/dwootton/ompi-4-1-x/lib/openmpi/mca_btl_vader.so(+0xa6e0)[0x20000287a6e0]
[c656f6n01:2533655] [10] /u/dwootton/ompi-4-1-x/lib/libopen-pal.so.40(opal_progress+0x5c)[0x2000007533f4]
[c656f6n01:2533655] [11] /u/dwootton/ompi-4-1-x/lib/openmpi/mca_pml_ob1.so(+0x14524)[0x200002dd4524]
[c656f6n01:2533655] [12] /u/dwootton/ompi-4-1-x/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x618)[0x200002dd72a8]
[c656f6n01:2533655] [13] /u/dwootton/ompi-4-1-x/lib/libmpi.so.40(mca_coll_base_alltoall_intra_basic_inplace+0x324)[0x200000205758]
[c656f6n01:2533655] [14] /u/dwootton/ompi-4-1-x/lib/libmpi.so.40(ompi_coll_base_alltoall_intra_basic_linear+0x8c)[0x200000207050]
[c656f6n01:2533655] [15] /u/dwootton/ompi-4-1-x/lib/libmpi.so.40(MPI_Alltoall+0x4f4)[0x20000012e670]
[c656f6n01:2533655] [16] /u/dwootton/bigcount-41x/BigCountUpstream/ompi-tests-public/collective-big-count/./test_alltoall_uniform_count[0x10002dd0]
[c656f6n01:2533655] [17] /u/dwootton/bigcount-41x/BigCountUpstream/ompi-tests-public/collective-big-count/./test_alltoall_uniform_count[0x1000289c]
[c656f6n01:2533655] [18] /lib64/libc.so.6(+0x24c78)[0x2000003a4c78]
[c656f6n01:2533655] [19] /lib64/libc.so.6(__libc_start_main+0xb4)[0x2000003a4e64]
[c656f6n01:2533655] *** End of error message ***

This command failed with a self-check error message, then a double free or storage corruption error. This failure looks similar to the failure of the same testcase reported in issue #10186 against the OpenMPI main branch.

mpirun ${BIGCOUNT_HOSTS} -x BIGCOUNT_MEMORY_PERCENT=${BIGCOUNT_MEMORY_PERCENT} -x BIGCOUNT_MEMORY_DIFF=${BIGCOUNT_MEMORY_DIFF} --mca btl ^openib --mca coll basic,sm,self,inter,libnbc /u/dwootton/bigcount-41x/BigCountUpstream/ompi-tests-public/collective-big-count/./test_gather_uniform_count

This is the self-check error message and traceback.

Results from MPI_Gather(double _Complex x 6442450941 = 103079215056 or 96.0 GB):
Rank  0: PASSED
---------------------
Results from MPI_Igather(int x 6442450941 = 25769803764 or 24.0 GB):
Rank  0: ERROR: DI in     4294967292 of     6442450941 slots (  66.7 % wrong)
---------------------
Results from MPI_Igather(double _Complex x 6442450941 = 103079215056 or 96.0 GB):
Rank  0: ERROR: DI in     4294967292 of     6442450941 slots (  66.7 % wrong)
double free or corruption (out)
[c656f6n01:2534488] *** Process received signal ***
[c656f6n01:2534488] Signal: Aborted (6)
[c656f6n01:2534488] Signal code:  (-6)
[c656f6n01:2534488] [ 0] linux-vdso64.so.1(__kernel_sigtramp_rt64+0x0)[0x2000000604d8]
[c656f6n01:2534488] [ 1] /lib64/libc.so.6(gsignal+0xd8)[0x2000003c44d8]
[c656f6n01:2534488] [ 2] /lib64/libc.so.6(abort+0x164)[0x2000003a462c]
[c656f6n01:2534488] [ 3] /lib64/libc.so.6(+0x908bc)[0x2000004108bc]
[c656f6n01:2534488] [ 4] /lib64/libc.so.6(+0x9b828)[0x20000041b828]
[c656f6n01:2534488] [ 5] /lib64/libc.so.6(+0x9e0ec)[0x20000041e0ec]
[c656f6n01:2534488] [ 6] /u/dwootton/bigcount-41x/BigCountUpstream/ompi-tests-public/collective-big-count/./test_gather_uniform_count[0x100030b0]
[c656f6n01:2534488] [ 7] /u/dwootton/bigcount-41x/BigCountUpstream/ompi-tests-public/collective-big-count/./test_gather_uniform_count[0x10002920]
[c656f6n01:2534488] [ 8] /lib64/libc.so.6(+0x24c78)[0x2000003a4c78]
[c656f6n01:2534488] [ 9] /lib64/libc.so.6(__libc_start_main+0xb4)[0x2000003a4e64]

This command failed with a self-check error message, then a SIGSEGV in MPI_Wait.

mpirun ${BIGCOUNT_HOSTS} -x BIGCOUNT_MEMORY_PERCENT=${BIGCOUNT_MEMORY_PERCENT} -x BIGCOUNT_MEMORY_DIFF=${BIGCOUNT_MEMORY_DIFF} --mca btl ^openib --mca coll basic,sm,self,inter,libnbc /u/dwootton/bigcount-41x/BigCountUpstream/ompi-tests-public/collective-big-count/./test_scatter_uniform_count

This is the error message and traceback.

Results from MPI_Iscatter(int x 6442450941 = 25769803764 or 24.0 GB):
Rank  2: ERROR: DI in     2147483647 of     2147483647 slots ( 100.0 % wrong)
Rank  1: PASSED
Rank  0: PASSED
---------------------
Results from MPI_Iscatter(double _Complex x 6442450941 = 103079215056 or 96.0 GB):
[c656f6n01:2536216] Read -1, expected 34359738352, errno = 14
[c656f6n01:2536214] *** Process received signal ***
[c656f6n01:2536214] Signal: Segmentation fault (11)
[c656f6n01:2536214] Signal code: Invalid permissions (2)
[c656f6n01:2536214] Failing at address: 0x20000bfffff0
[c656f6n01:2536214] [ 0] linux-vdso64.so.1(__kernel_sigtramp_rt64+0x0)[0x2000000604d8]
[c656f6n01:2536214] [ 1] /lib64/libc.so.6(+0xb083c)[0x20000043083c]
[c656f6n01:2536214] [ 2] /u/dwootton/ompi-4-1-x/lib/openmpi/mca_pml_ob1.so(+0x29df0)[0x200002de9df0]
[c656f6n01:2536214] [ 3] /u/dwootton/ompi-4-1-x/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_schedule_once+0x2a0)[0x200002ded00c]
[c656f6n01:2536214] [ 4] /u/dwootton/ompi-4-1-x/lib/openmpi/mca_pml_ob1.so(+0x1d41c)[0x200002ddd41c]
[c656f6n01:2536214] [ 5] /u/dwootton/ompi-4-1-x/lib/openmpi/mca_pml_ob1.so(+0x1d4dc)[0x200002ddd4dc]
[c656f6n01:2536214] [ 6] /u/dwootton/ompi-4-1-x/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_frag_callback_ack+0x1c4)[0x200002ddec5c]
[c656f6n01:2536214] [ 7] /u/dwootton/ompi-4-1-x/lib/openmpi/mca_btl_vader.so(mca_btl_vader_poll_handle_frag+0x1d8)[0x20000287a244]
[c656f6n01:2536214] [ 8] /u/dwootton/ompi-4-1-x/lib/openmpi/mca_btl_vader.so(+0xa368)[0x20000287a368]
[c656f6n01:2536214] [ 9] /u/dwootton/ompi-4-1-x/lib/openmpi/mca_btl_vader.so(+0xa6e0)[0x20000287a6e0]
[c656f6n01:2536214] [10] /u/dwootton/ompi-4-1-x/lib/libopen-pal.so.40(opal_progress+0x5c)[0x2000007533f4]
[c656f6n01:2536214] [11] /u/dwootton/ompi-4-1-x/lib/libmpi.so.40(+0x87e18)[0x200000107e18]
[c656f6n01:2536214] [12] /u/dwootton/ompi-4-1-x/lib/libmpi.so.40(ompi_request_default_wait+0x38)[0x200000107e88]
[c656f6n01:2536214] [13] /u/dwootton/ompi-4-1-x/lib/libmpi.so.40(PMPI_Wait+0x1d0)[0x2000001c43b4]
[c656f6n01:2536214] [14] /u/dwootton/bigcount-41x/BigCountUpstream/ompi-tests-public/collective-big-count/./test_scatter_uniform_count[0x10002f08]
[c656f6n01:2536214] [15] /u/dwootton/bigcount-41x/BigCountUpstream/ompi-tests-public/collective-big-count/./test_scatter_uniform_count[0x10002980]
[c656f6n01:2536214] [16] /lib64/libc.so.6(+0x24c78)[0x2000003a4c78]
[c656f6n01:2536214] [17] /lib64/libc.so.6(__libc_start_main+0xb4)[0x2000003a4e64]
[c656f6n01:2536214] *** End of error message ***
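
The "Read -1, expected ..., errno = 14" lines in this report can be decoded the same way. This is a small sketch, assuming a Linux errno table and the 16-byte element size shown in the payload lines; it only interprets the numbers and says nothing about which component issued the read.

```python
import errno
import os

INT_MAX = 2**31 - 1
DT_SIZE = 16  # bytes per double _Complex, per the "16 dt" payload lines

reported_errno = 14
# Sizes from the "Read -1, expected ..." lines in this report.
expected_allgather = 20615843008   # allgather and alltoall runs
expected_scatter = 34359738352     # scatter run

if __name__ == "__main__":
    # Symbolic name and message for the reported errno value.
    print(errno.errorcode[reported_errno], os.strerror(reported_errno))
    # Element counts implied by the expected read sizes.
    print(expected_allgather // DT_SIZE, expected_scatter // DT_SIZE)
```

On Linux, errno 14 is EFAULT (bad address). The expected sizes correspond to 1288490188 elements (the adjusted count) in the allgather and alltoall runs, and to 2147483647 elements (INT_MAX) in the scatter run.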

drwootton commented 2 years ago

There are similar failures with OpenMPI v4.1.x for the test_alltoall_uniform_count, test_gather_uniform_count and test_scatter_uniform_count testcases when running with the --mca coll_adapt_priority 100 --mca coll adapt,basic,sm,self,inter,libnbc options.

drwootton commented 2 years ago

There are similar failures for all four failing testcases with OpenMPI v4.1.x when using the --mca coll_han_priority 100 --mca coll han,basic,sm,self,inter,libnbc options.

drwootton commented 2 years ago

There are similar failures for all four failing testcases with OpenMPI v4.1.x when using the --mca coll tuned,basic,sm,self,inter,libnbc option.

jsquyres commented 2 years ago

@drwootton Were any of these issues fixed on main, and if so, could those fixes be back-ported to the v4.0.x / v4.1.x branches?

drwootton commented 2 years ago

@jsquyres I don't see the problem in main with test_allgather_uniform_count, but it could be masked by the other failure in this test on the main branch. I don't see the failure in test_allreduce_uniform_count in any main branch run. I think I see the same failure in test_alltoall_uniform_count and in test_gather_uniform_count on the main branch. In test_scatter_uniform_count on the main branch I see a failure that could be the same one, but without the read error or SIGSEGV.