Open drew-parsons opened 2 years ago
I have an idea of the problem. If MPICH fails and Open-MPI succeeds, then I suspect the MPICH datatypes code is broken.
Can you set the MPICH build to also use ARMCI_STRIDED_METHOD=IOV and ARMCI_IOV_METHOD=BATCHED on the s390x config?
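For anyone repeating this experiment, here is a minimal sketch of running the test suite with those two settings. It assumes the ARMCI variables are read from the environment when each test starts; `build_env` and `run_check` are hypothetical helper names, and the build directory is whatever you configured locally.

```python
import os
import subprocess

# Settings quoted in this thread (assumed to be read from the environment at run time).
ARMCI_SETTINGS = {
    "ARMCI_STRIDED_METHOD": "IOV",
    "ARMCI_IOV_METHOD": "BATCHED",
}


def build_env(settings, base=None):
    """Return a copy of the base environment with the given settings applied."""
    env = dict(os.environ if base is None else base)
    env.update(settings)
    return env


def run_check(build_dir, settings):
    """Run `make check` in build_dir with the settings exported; return the exit code."""
    result = subprocess.run(["make", "check"], cwd=build_dir, env=build_env(settings))
    return result.returncode
```

Calling `run_check("build-mpich", ARMCI_SETTINGS)` from the source tree would then be roughly equivalent to exporting both variables before `make check`.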
With ARMCI_STRIDED_METHOD=IOV and ARMCI_IOV_METHOD=BATCHED, the five mpi tests still fail with the same error message (including test_mpi_indexed_gets reporting the different symptom), but the other 11 tests that failed before now pass:
/usr/bin/make check-TESTS
make[3]: Entering directory '/home/dparsons/armci/armci-mpi-0.3.1~beta/build-mpich'
make[4]: Entering directory '/home/dparsons/armci/armci-mpi-0.3.1~beta/build-mpich'
PASS: benchmarks/ping-pong
PASS: benchmarks/ring-flood
PASS: benchmarks/contiguous-bench
PASS: benchmarks/strided-bench
PASS: benchmarks/rmw_perf
PASS: tests/test_onesided
PASS: tests/test_onesided_shared
PASS: tests/test_onesided_shared_dla
PASS: tests/test_mutex
PASS: tests/test_mutex_rmw
PASS: tests/test_mutex_trylock
PASS: tests/test_malloc_irreg
PASS: tests/ARMCI_PutS_latency
PASS: tests/ARMCI_AccS_latency
PASS: tests/test_groups
PASS: tests/test_group_split
PASS: tests/test_malloc_group
PASS: tests/test_accs
PASS: tests/test_accs_dla
PASS: tests/test_puts
PASS: tests/test_puts_gets
PASS: tests/test_puts_gets_dla
PASS: tests/test_putv
PASS: tests/test_igop
PASS: tests/test_rmw_fadd
PASS: tests/test_parmci
PASS: tests/mpi/test_mpi_accs
FAIL: tests/mpi/test_mpi_dim
FAIL: tests/mpi/test_mpi_indexed_accs
FAIL: tests/mpi/test_mpi_indexed_gets
FAIL: tests/mpi/test_mpi_indexed_puts_gets
FAIL: tests/mpi/test_mpi_subarray_accs
PASS: tests/mpi/test_win_create
PASS: tests/mpi/test_win_model
PASS: tests/ctree/ctree_test
PASS: tests/ctree/ctree_test_rand
PASS: tests/ctree/ctree_test_rand_interval
PASS: tests/contrib/armci-perf
PASS: tests/contrib/armci-test
PASS: tests/contrib/lu/lu-block
PASS: tests/contrib/lu/lu-b-bc
PASS: tests/contrib/transp1D/transp1D-c
PASS: tests/contrib/non-blocking/simple
============================================================================
Testsuite summary for armci 0.1
============================================================================
# TOTAL: 43
# PASS: 38
# SKIP: 0
# XFAIL: 0
# FAIL: 5
# XPASS: 0
# ERROR: 0
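To compare runs like this one against other configurations, the failing test names can be pulled out of a saved log. This is only a sketch: `failed_tests` is a hypothetical helper, and it assumes the automake-style `FAIL: <name>` lines shown above.

```python
import re


def failed_tests(log_text):
    """Return the names on `FAIL: <name>` lines, in order, without duplicates."""
    names = []
    for match in re.finditer(r"^FAIL: (\S+)[ \t]*$", log_text, flags=re.MULTILINE):
        name = match.group(1)
        if name not in names:
            names.append(name)
    return names
```

Taking `set(failed_tests(run_a)) - set(failed_tests(run_b))` then shows which tests fail only under one set of ARMCI settings.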
There's a small variation in the PMPI function triggering the error. test_mpi_dim references PMPI_Accumulate:
FAIL: tests/mpi/test_mpi_dim
============================
MPI test program (2 processes)
Testing strided gets and puts
(Only std output for process 0 is printed)
--------array[5]--------
local[1:3] -> remote[0:2] -> local[1:3]
Assertion failed in file src/mpi/datatype/typerep/dataloop/looputil.c at line 815: *lengthp > 0
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x2b3d76) [0x3ff7e2b3d76]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x1fc89e) [0x3ff7e1fc89e]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x1c6774) [0x3ff7e1c6774]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x1cce1c) [0x3ff7e1cce1c]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x256b2e) [0x3ff7e256b2e]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x2598e6) [0x3ff7e2598e6]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x25be40) [0x3ff7e25be40]
/usr/lib/s390x-linux-gnu/libmpich.so.12(PMPI_Accumulate+0xa94) [0x3ff7e0f9044]
./tests/mpi/test_mpi_dim(+0x2980) [0x2aa1bf02980]
./tests/mpi/test_mpi_dim(main+0x6a) [0x2aa1bf0123a]
/lib/s390x-linux-gnu/libc.so.6(__libc_start_main+0xe6) [0x3ff7de24c5e]
./tests/mpi/test_mpi_dim(+0x1314) [0x2aa1bf01314]
internal ABORT - process 0
FAIL tests/mpi/test_mpi_dim (exit status: 1)
while the other 3 (apart from test_mpi_indexed_gets) reference PMPI_Win_unlock, e.g.
FAIL: tests/mpi/test_mpi_indexed_accs
=====================================
MPI RMA Strided Accumulate Test:
Assertion failed in file src/mpi/datatype/typerep/dataloop/looputil.c at line 815: *lengthp > 0
Assertion failed in file src/mpi/datatype/typerep/dataloop/looputil.c at line 815: *lengthp > 0
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x2b3d76) [0x3ff870b3d76]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x1fc89e) [0x3ff86ffc89e]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x1c6774) [0x3ff86fc6774]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x1cce1c) [0x3ff86fcce1c]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x24dfde) [0x3ff8704dfde]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x270a40) [0x3ff87070a40]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x29125c) [0x3ff8709125c]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x24fd46) [0x3ff8704fd46]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x251b20) [0x3ff87051b20]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x25577a) [0x3ff8705577a]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x255ab6) [0x3ff87055ab6]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x238822) [0x3ff87038822]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x28c87e) [0x3ff8708c87e]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x2539e2) [0x3ff870539e2]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x26237c) [0x3ff8706237c]
/usr/lib/s390x-linux-gnu/libmpich.so.12(PMPI_Win_unlock+0x310) [0x3ff86f0f1c0]
./tests/mpi/test_mpi_indexed_accs(main+0x21e) [0x2aa0d180fa6]
/lib/s390x-linux-gnu/libc.so.6(__libc_start_main+0xe6) [0x3ff86c24c5e]
./tests/mpi/test_mpi_indexed_accs(+0x1314) [0x2aa0d181314]
internal ABORT - process 0
FAIL tests/mpi/test_mpi_indexed_accs (exit status: 1)
(Likewise test_mpi_indexed_puts_gets and test_mpi_subarray_accs.) In the original build log, test_mpi_indexed_accs referenced PMPI_Accumulate, not PMPI_Win_unlock, though the other 2 already referenced PMPI_Win_unlock.
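Since the named MPI entry point is the main thing that differs between these traces, a small sketch like the following can extract it from each saved backtrace. `mpi_entry_points` is a hypothetical helper, and it assumes frames are formatted as in the logs here, e.g. `libmpich.so.12(PMPI_Win_unlock+0x310)`.

```python
import re

# Matches symbolized frames such as "(PMPI_Accumulate+0xa94)".
_PMPI_FRAME = re.compile(r"\((PMPI_\w+)\+0x[0-9a-fA-F]+\)")


def mpi_entry_points(trace_text):
    """Return the PMPI_* symbols named in a backtrace, in order, without duplicates."""
    names = []
    for name in _PMPI_FRAME.findall(trace_text):
        if name not in names:
            names.append(name)
    return names
```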
Actually, I need to report that it might not be so straightforward. When I manually rebuild the original configuration on an s390x porterbox, without adding ARMCI_STRIDED_METHOD=IOV and ARMCI_IOV_METHOD=BATCHED, I get the same result: the five test_mpi_* tests fail for mpich and the other tests pass. Between the original build test errors and today's tests, our mpich was upgraded from 4.0 to 4.0.1, which might explain why the other tests now pass.
Without adding the extra flags, test_mpi_indexed_accs is triggered from PMPI_Accumulate, as before, not from PMPI_Win_unlock:
FAIL: tests/mpi/test_mpi_indexed_accs
=====================================
MPI RMA Strided Accumulate Test:
Assertion failed in file src/mpi/datatype/typerep/dataloop/looputil.c at line 815: *lengthp > 0
Assertion failed in file src/mpi/datatype/typerep/dataloop/looputil.c at line 815: *lengthp > 0
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x2b3d76) [0x3ffbbbb3d76]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x2b3d76) [0x3ff8b133d76]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x1fc89e) [0x3ff8b07c89e]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x1c6774) [0x3ff8b046774]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x1cce1c) [0x3ff8b04ce1c]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x24dfde) [0x3ff8b0cdfde]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x270a40) [0x3ff8b0f0a40]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x29125c) [0x3ff8b11125c]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x24fd46) [0x3ff8b0cfd46]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x251b20) [0x3ff8b0d1b20]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x25577a) [0x3ff8b0d577a]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x255ab6) [0x3ff8b0d5ab6]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x238822) [0x3ff8b0b8822]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x28c87e) [0x3ff8b10c87e]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x2539e2) [0x3ff8b0d39e2]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x25942e) [0x3ff8b0d942e]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x25be40) [0x3ff8b0dbe40]
/usr/lib/s390x-linux-gnu/libmpich.so.12(PMPI_Accumulate+0xa94) [0x3ff8af79044]
./tests/mpi/test_mpi_indexed_accs(main+0x20e) [0x2aa25d80f96]
/lib/s390x-linux-gnu/libc.so.6(__libc_start_main+0xe6) [0x3ff8aca4c5e]
./tests/mpi/test_mpi_indexed_accs(+0x1314) [0x2aa25d81314]
internal ABORT - process 0
FAIL tests/mpi/test_mpi_indexed_accs (exit status: 1)
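Most frames in these traces are bare offsets into libmpich.so.12. If debug symbols for that library are installed (an assumption; nothing in this thread confirms they are), `addr2line` could map the offsets to MPICH source lines. A sketch that builds the command from a saved trace, with `addr2line_commands` as a hypothetical helper name:

```python
import re
import shlex

# Matches unsymbolized frames such as "/path/libmpich.so.12(+0x2b3d76)".
_ANON_FRAME = re.compile(r"^(\S+)\(\+(0x[0-9a-fA-F]+)\)", re.MULTILINE)


def addr2line_commands(trace_text):
    """Build one `addr2line -f -C -e <object> <offsets...>` command per object file."""
    by_object = {}
    for obj, offset in _ANON_FRAME.findall(trace_text):
        offsets = by_object.setdefault(obj, [])
        if offset not in offsets:
            offsets.append(offset)
    return [
        shlex.join(["addr2line", "-f", "-C", "-e", obj, *offsets])
        for obj, offsets in by_object.items()
    ]
```

The commands are only printed or returned, not executed, so they can be reviewed before running them on the porterbox.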
Can you try again with ARMCI_IOV_METHOD=CONSRV, ARMCI_IOV_CHECKS=1, ARMCI_SHR_BUF_METHOD=COPY, ARMCI_RMA_NOCHECK=0, and ARMCI_NO_FLUSH_LOCAL=1? Those are the most conservative settings I can come up with, and might reveal something.
Hmm, with those settings (without ARMCI_STRIDED_METHOD=IOV) I'm back to 15 failures:
FAIL: benchmarks/strided-bench
FAIL: tests/ARMCI_PutS_latency
FAIL: tests/ARMCI_AccS_latency
FAIL: tests/test_accs
FAIL: tests/test_accs_dla
FAIL: tests/test_puts
FAIL: tests/test_puts_gets
FAIL: tests/test_puts_gets_dla
FAIL: tests/mpi/test_mpi_dim
FAIL: tests/mpi/test_mpi_indexed_accs
FAIL: tests/mpi/test_mpi_indexed_gets
FAIL: tests/mpi/test_mpi_indexed_puts_gets
FAIL: tests/mpi/test_mpi_subarray_accs
FAIL: tests/contrib/armci-perf
FAIL: tests/contrib/armci-test
with a touch more error output, just adding a short description of the test
FAIL: benchmarks/strided-bench
==============================
Starting one-sided strided performance test with 2 processes
Trg. Rank Xdim Ydim Get (usec) Put (usec) Acc (usec) Get (MiB/s) Put (MiB/s) Acc (MiB/s)
Assertion failed in file src/mpi/datatype/typerep/dataloop/looputil.c at line 815: *lengthp > 0
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x2b3d76) [0x3ff83333d76]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x1fc89e) [0x3ff8327c89e]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x1c6774) [0x3ff83246774]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x1cce1c) [0x3ff8324ce1c]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x256b2e) [0x3ff832d6b2e]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x2598e6) [0x3ff832d98e6]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x25be40) [0x3ff832dbe40]
/usr/lib/s390x-linux-gnu/libmpich.so.12(PMPI_Accumulate+0xa94) [0x3ff83179044]
./benchmarks/strided-bench(+0x43ee) [0x2aa37e843ee]
./benchmarks/strided-bench(+0x5828) [0x2aa37e85828]
./benchmarks/strided-bench(main+0x2ea) [0x2aa37e82f32]
/lib/s390x-linux-gnu/libc.so.6(__libc_start_main+0xe6) [0x3ff82e24c5e]
./benchmarks/strided-bench(+0x31f4) [0x2aa37e831f4]
internal ABORT - process 0
FAIL benchmarks/strided-bench (exit status: 1)
FAIL: tests/ARMCI_PutS_latency
==============================
ARMCI_PutS Latency - local and remote completions - in usec
Dimensions(array of doubles) Latency-LocalCompeltion Latency-RemoteCompletion
Assertion failed in file src/mpi/datatype/typerep/dataloop/looputil.c at line 815: *lengthp > 0
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x2b3d76) [0x3ffb38b3d76]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x1fc89e) [0x3ffb37fc89e]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x1c6774) [0x3ffb37c6774]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x1cce1c) [0x3ffb37cce1c]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x256b2e) [0x3ffb3856b2e]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x2598e6) [0x3ffb38598e6]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x25be40) [0x3ffb385be40]
/usr/lib/s390x-linux-gnu/libmpich.so.12(PMPI_Accumulate+0xa94) [0x3ffb36f9044]
./tests/ARMCI_PutS_latency(+0x45be) [0x2aa1e3045be]
./tests/ARMCI_PutS_latency(+0x59f8) [0x2aa1e3059f8]
./tests/ARMCI_PutS_latency(main+0x1ae) [0x2aa1e302e96]
/lib/s390x-linux-gnu/libc.so.6(__libc_start_main+0xe6) [0x3ffb33a4c5e]
./tests/ARMCI_PutS_latency(+0x33c4) [0x2aa1e3033c4]
internal ABORT - process 0
FAIL tests/ARMCI_PutS_latency (exit status: 1)
If I activate ARMCI_STRIDED_METHOD=IOV alongside ARMCI_IOV_METHOD=CONSRV, ARMCI_IOV_CHECKS=1, ARMCI_SHR_BUF_METHOD=COPY, ARMCI_RMA_NOCHECK=0, and ARMCI_NO_FLUSH_LOCAL=1 then I'm back to the 5 failures.
A build of armci-mpi with mpich 4.0 fails tests on s390x. Tests pass for Intel and ARM architectures (amd64 and arm64 and their lesser counterparts).
The build log is available at https://buildd.debian.org/status/fetch.php?pkg=armci-mpi&arch=s390x&ver=0.3.1%7Ebeta-5&stamp=1645753186&raw=0 .
Tests pass with openmpi but 16 tests fail with mpich:
Further details of the errors are listed in the build log.
There are essentially only two distinct test errors here. Most of these failures point to the same error,
e.g.
looputil.c is actually in mpich, not armci-mpi, so maybe this is an mpich bug? I'm not sure if it's relevant to looputil.c l.813 here, but we caught a bug caused by incorrect assumptions about how long double alignment is implemented on s390x, exposed in mpi4py; see https://github.com/mpi4py/mpi4py/issues/91
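As a quick way to see what the platform ABI does with long double, the size and alignment can be queried through ctypes. This is only a sketch for comparing architectures, with `long_double_layout` as a hypothetical helper name; it does not show that alignment is the cause of the assertion above.

```python
import ctypes


def long_double_layout():
    """Report the size and alignment of C `long double` as seen by this Python build."""
    return {
        "sizeof": ctypes.sizeof(ctypes.c_longdouble),
        "alignment": ctypes.alignment(ctypes.c_longdouble),
    }


if __name__ == "__main__":
    print(long_double_layout())
```

Running it on amd64 and on an s390x porterbox would show whether the two values differ on either platform, which is the kind of assumption the mpi4py issue was about.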
The other error is in test_mpi_indexed_gets:
I see an error like this if there is a mismatch in libmpich.so (e.g. on amd64, running armci-mpi tests with libarmci built against mpich 4.0 but then compiling tests using libmpich1.2 from mpich 3.4.1), but that kind of mismatch shouldn't apply to the s390x build-time test failure reported here.
For reference, various tests also fail at build time for other less common architectures, evidently for different reasons. Build logs are collected at https://buildd.debian.org/status/package.php?p=armci-mpi . On mips64el, test_mpi_indexed_gets fails with mpich while all tests pass with openmpi. On mipsel, tests pass with mpich but fail with openmpi.
CI runtime (installation) test logs are collected at https://ci.debian.net/packages/a/armci-mpi/ (the version building with mpich is 0.3.1~beta-5 or later), showing the same test failure on s390x.