Open junghans opened 2 weeks ago
I've never tested on s390 but looking at the log I suspect this is an MPI issue. The tests are passing when they don't use MPI or run only on a single rank; as soon as a test uses two or more ranks, it fails.
Is MPI configured correctly in the test environment?
The test environment is just the Fedora package, @keszybz would know the details.
The "mpi environment" is just what the test sets up. The build is done in a dedicated VM, the hw_info.log
file linked from the build describes the machine.
The test does this:
%check
# allow openmpi to oversubscribe, i.e., run tests with more
# cores than the builder has
export PRTE_MCA_rmaps_default_mapping_policy=:oversubscribe
for mpi in mpich openmpi; do
test -n "${mpi}" && module load mpi/${mpi}-%{_arch}
%ctest
test -n "${mpi}" && module unload mpi/${mpi}-%{_arch}
done
i.e.
export PRTE_MCA_rmaps_default_mapping_policy=:oversubscribe
for mpi in mpich openmpi; do
test -n "${mpi}" && module load mpi/${mpi}-x86_64
/usr/bin/ctest --test-dir "redhat-linux-build" \
--output-on-failure \
--force-new-ctest-process \
-j${RPM_BUILD_NCPUS}
test -n "${mpi}" && module unload mpi/${mpi}-x86_64
done
I can try to answer some general questions, but I know nothing about this package and about as much about s390x ;)
All MPI tests are failing, the non-MPI tests are passing. The log does not contain details, e.g., the output of ctest -V.
Also, for some reason, the tests are running in parallel, which further messes up the ctest output. CMake tells ctest to run everything in series; otherwise we can get really nasty over-subscription of resources. There are multiple tests that take 12 MPI ranks, and each rank may or may not be using multiple threads.
How can we get the details from at least one failing test, e.g., test_reshape3d running with 4 ranks?
--output-on-failure should be the same as -V on error, but let me add a -j1 to the ctest call.
@junghans Let me know if the PR helped or if you need to use -D Heffte_SEQUENTIAL_TESTING=ON.
I would like to close this issue before tagging the release.
I did another build, https://koji.fedoraproject.org/koji/taskinfo?taskID=125159590:
Test project /builddir/build/BUILD/heffte-2.4.0-build/heffte-master/s390x-redhat-linux-gnu-mpich
Start 1: heffte_fortran_fftw
Start 2: unit_tests_nompi
1/25 Test #1: heffte_fortran_fftw .............. Passed 0.08 sec
Start 3: unit_tests_stock
Start 7: heffte_fft3d_np1
2/25 Test #3: unit_tests_stock ................. Passed 0.00 sec
Start 16: heffte_fft3d_r2c_np1
3/25 Test #16: heffte_fft3d_r2c_np1 ............. Passed 0.08 sec
Start 22: test_cos_np1
4/25 Test #7: heffte_fft3d_np1 ................. Passed 0.13 sec
5/25 Test #22: test_cos_np1 ..................... Passed 0.08 sec
Start 8: heffte_fft3d_np2
6/25 Test #8: heffte_fft3d_np2 .................***Failed 0.08 sec
--------------------------------------------------------------------------------
heffte::fft class
--------------------------------------------------------------------------------
constructor heffte::fft3d<stock> pass
Abort(676410127) on node 1 (rank 1 in comm 0): Fatal error in internal_Barrier: Other MPI error, error stack:
internal_Barrier(7454).....................: MPI_Barrier(MPI_COMM_WORLD) failed
MPID_Barrier(167)..........................:
MPIDI_Barrier_allcomm_composition_json(132):
MPIDI_POSIX_mpi_bcast(238).................:
MPIDI_POSIX_mpi_release_gather_release(218): message sizes do not match across processes in the collective routine: Received 0 but expected 1
Start 17: heffte_fft3d_r2c_np2
7/25 Test #2: unit_tests_nompi ................. Passed 0.38 sec
8/25 Test #17: heffte_fft3d_r2c_np2 .............***Failed 0.14 sec
--------------------------------------------------------------------------------
heffte::fft_r2c class
--------------------------------------------------------------------------------
constructor heffte::fft3d_r2c<stock> pass
Abort(676410127) on node 1 (rank 1 in comm 0): Fatal error in internal_Barrier: Other MPI error, error stack:
internal_Barrier(7454).....................: MPI_Barrier(MPI_COMM_WORLD) failed
MPID_Barrier(167)..........................:
MPIDI_Barrier_allcomm_composition_json(132):
MPIDI_POSIX_mpi_bcast(238).................:
MPIDI_POSIX_mpi_release_gather_release(218): message sizes do not match across processes in the collective routine: Received 0 but expected 1
Start 4: heffte_reshape3d_np4
9/25 Test #4: heffte_reshape3d_np4 .............***Failed 0.75 sec
--------------------------------------------------------------------------------
heffte reshape methods
--------------------------------------------------------------------------------
heffte::mpi::gather_boxes pass
Start 5: heffte_reshape3d_np7
10/25 Test #5: heffte_reshape3d_np7 .............***Failed 1.70 sec
--------------------------------------------------------------------------------
heffte reshape methods
--------------------------------------------------------------------------------
heffte::mpi::gather_boxes pass
float -np 7 heffte::reshape3d_alltoall all-2-1 pass
Start 6: heffte_reshape3d_np12
11/25 Test #6: heffte_reshape3d_np12 ............***Failed 3.13 sec
--------------------------------------------------------------------------------
heffte reshape methods
--------------------------------------------------------------------------------
heffte::mpi::gather_boxes pass
Start 9: heffte_fft3d_np4
12/25 Test #9: heffte_fft3d_np4 .................***Failed 8.16 sec
--------------------------------------------------------------------------------
heffte::fft class
--------------------------------------------------------------------------------
ccomplex -np 4 test heffte::fft2d<stock> pass
Start 10: heffte_fft3d_np6
13/25 Test #10: heffte_fft3d_np6 .................***Failed 25.74 sec
--------------------------------------------------------------------------------
heffte::fft class
--------------------------------------------------------------------------------
float -np 6 test heffte::fft3d<stock> pass
Start 11: heffte_fft3d_np8
14/25 Test #11: heffte_fft3d_np8 .................***Failed 45.90 sec
--------------------------------------------------------------------------------
heffte::fft class
--------------------------------------------------------------------------------
float -np 8 test heffte::fft3d<stock> pass
Start 12: heffte_fft3d_np12
15/25 Test #12: heffte_fft3d_np12 ................***Failed 58.30 sec
--------------------------------------------------------------------------------
heffte::fft class
--------------------------------------------------------------------------------
float -np 12 test heffte::fft3d<stock> pass
Start 13: heffte_streams_np6
16/25 Test #13: heffte_streams_np6 ...............***Failed 17.86 sec
--------------------------------------------------------------------------------
heffte::fft streams
--------------------------------------------------------------------------------
ccomplex -np 6 test heffte::fft3d (stream)<stock> pass
Abort(676410127) on node 3 (rank 3 in comm 0): Fatal error in internal_Barrier: Other MPI error, error stack:
internal_Barrier(7454).....................: MPI_Barrier(MPI_COMM_WORLD) failed
MPID_Barrier(167)..........................:
MPIDI_Barrier_allcomm_composition_json(132):
MPIDI_POSIX_mpi_bcast(238).................:
MPIDI_POSIX_mpi_release_gather_release(218): message sizes do not match across processes in the collective routine: Received 0 but expected 1
Start 14: test_subcomm_np8
17/25 Test #14: test_subcomm_np8 .................***Failed 76.87 sec
--------------------------------------------------------------------------------
heffte::fft subcommunicators
--------------------------------------------------------------------------------
double -np 8 test subcommunicator<stock> pass
Start 15: test_subcomm_np12
18/25 Test #15: test_subcomm_np12 ................***Failed 116.44 sec
--------------------------------------------------------------------------------
heffte::fft subcommunicators
--------------------------------------------------------------------------------
double -np 12 test subcommunicator<stock> pass
Start 18: heffte_fft2d_r2c_np4
19/25 Test #18: heffte_fft2d_r2c_np4 .............***Failed 31.25 sec
--------------------------------------------------------------------------------
heffte::fft_r2c class
--------------------------------------------------------------------------------
float -np 4 test heffte::fft2d_r2c<stock> pass
Start 19: heffte_fft3d_r2c_np6
20/25 Test #19: heffte_fft3d_r2c_np6 .............***Failed 63.12 sec
--------------------------------------------------------------------------------
heffte::fft_r2c class
--------------------------------------------------------------------------------
float -np 6 test heffte::fft3d_r2c<stock> pass
Start 20: heffte_fft3d_r2c_np8
21/25 Test #20: heffte_fft3d_r2c_np8 .............***Failed 104.04 sec
--------------------------------------------------------------------------------
heffte::fft_r2c class
--------------------------------------------------------------------------------
float -np 8 test heffte::fft3d_r2c<stock> pass
Start 21: heffte_fft3d_r2c_np12
22/25 Test #21: heffte_fft3d_r2c_np12 ............***Failed 155.48 sec
--------------------------------------------------------------------------------
heffte::fft_r2c class
--------------------------------------------------------------------------------
float -np 12 test heffte::fft3d_r2c<stock> pass
Start 23: test_cos_np2
23/25 Test #23: test_cos_np2 .....................***Failed 1.13 sec
Start 24: test_cos_np4
24/25 Test #24: test_cos_np4 .....................***Failed 6.16 sec
--------------------------------------------------------------------------------
cosine transforms
--------------------------------------------------------------------------------
float -np 4 test cosine<stock-cos-type-II> pass
Start 25: heffte_longlong_np4
25/25 Test #25: heffte_longlong_np4 ..............***Failed 105.14 sec
--------------------------------------------------------------------------------
heffte::fft class
--------------------------------------------------------------------------------
float -np 4 test int/long long<stock> pass
24% tests passed, 19 tests failed out of 25
Total Test time (real) = 821.67 sec
The following tests FAILED:
4 - heffte_reshape3d_np4 (Failed)
5 - heffte_reshape3d_np7 (Failed)
6 - heffte_reshape3d_np12 (Failed)
8 - heffte_fft3d_np2 (Failed)
9 - heffte_fft3d_np4 (Failed)
10 - heffte_fft3d_np6 (Failed)
11 - heffte_fft3d_np8 (Failed)
12 - heffte_fft3d_np12 (Failed)
13 - heffte_streams_np6 (Failed)
14 - test_subcomm_np8 (Failed)
15 - test_subcomm_np12 (Failed)
17 - heffte_fft3d_r2c_np2 (Failed)
18 - heffte_fft2d_r2c_np4 (Failed)
19 - heffte_fft3d_r2c_np6 (Failed)
20 - heffte_fft3d_r2c_np8 (Failed)
21 - heffte_fft3d_r2c_np12 (Failed)
23 - test_cos_np2 (Failed)
24 - test_cos_np4 (Failed)
25 - heffte_longlong_np4 (Failed)
Errors while running CTest
Something is wrong with MPI; MPI_Barrier(MPI_COMM_WORLD) should always work. This is the first MPI method called after MPI_Init().
You can use -D Heffte_SEQUENTIAL_TESTING=ON to make sure different MPI processes don't try to sync across ranks of different tests. Hard to figure this out without hands on the hardware.
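For what it's worth, that call sequence can be checked outside of heffte with just a few lines; a minimal sketch (file name and launch line are illustrative):
/* barrier_check.c -- illustrative reproducer: MPI_Init followed immediately by
   MPI_Barrier on MPI_COMM_WORLD, the call that aborts in the heffte tests.
   Build/run (illustrative): mpicc barrier_check.c -o barrier_check && mpiexec -np 2 ./barrier_check */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    MPI_Barrier(MPI_COMM_WORLD);  /* should return on every rank */
    printf("barrier passed\n");
    MPI_Finalize();
    return 0;
}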
The sequential one, https://koji.fedoraproject.org/koji/taskinfo?taskID=125161367, fails as well:
4/25 Test #4: heffte_reshape3d_np4 .............***Failed 0.51 sec
--------------------------------------------------------------------------------
heffte reshape methods
--------------------------------------------------------------------------------
heffte::mpi::gather_boxes pass
Abort(676410127) on node 1 (rank 1 in comm 0): Fatal error in internal_Barrier: Other MPI error, error stack:
internal_Barrier(7454).....................: MPI_Barrier(MPI_COMM_WORLD) failed
MPID_Barrier(167)..........................:
MPIDI_Barrier_allcomm_composition_json(132):
MPIDI_POSIX_mpi_bcast(238).................:
MPIDI_POSIX_mpi_release_gather_release(218): message sizes do not match across processes in the collective routine: Received 0 but expected 1
Abort(676410127) on node 3 (rank 3 in comm 0): Fatal error in internal_Barrier: Other MPI error, error stack:
internal_Barrier(7454).....................: MPI_Barrier(MPI_COMM_WORLD) failed
MPID_Barrier(167)..........................:
MPIDI_Barrier_allcomm_composition_json(132):
MPIDI_POSIX_mpi_bcast(238).................:
MPIDI_POSIX_mpi_release_gather_release(218): message sizes do not match across processes in the collective routine: Received 0 but expected 1
But I think @mkstoyanov is right, that looks very much like a more fundamental issue in the mpich package of Fedora.
@keszybz, who is Fedora's mpich package maintainer?
In my experience, the Red Hat family is rather paranoid. There are a bunch of flags about "hardened" and "secure" that I have not used, and I don't know what those mean. I wouldn't put it past them to have something in the environment that blocks processes from communicating with MPI. The error message says that MPI_Barrier failed to send/recv even a single byte.
I don't think I can help here.
Let me add an MPI hello world to the build and see if that fails, too!
Hmm, https://koji.fedoraproject.org/koji/taskinfo?taskID=125191181, hello world worked:
+ mpicc /builddir/build/SOURCES/mpi_hello_world.c -o mpi_hello_world
+ mpiexec -np 12 ./mpi_hello_world
Hello world from processor db7a3174e80d4305b9dd4b1a288fb59b, rank 0 out of 12 processors
Hello world from processor db7a3174e80d4305b9dd4b1a288fb59b, rank 2 out of 12 processors
Hello world from processor db7a3174e80d4305b9dd4b1a288fb59b, rank 1 out of 12 processors
Hello world from processor db7a3174e80d4305b9dd4b1a288fb59b, rank 3 out of 12 processors
Hello world from processor db7a3174e80d4305b9dd4b1a288fb59b, rank 8 out of 12 processors
Hello world from processor db7a3174e80d4305b9dd4b1a288fb59b, rank 4 out of 12 processors
Hello world from processor db7a3174e80d4305b9dd4b1a288fb59b, rank 6 out of 12 processors
Hello world from processor db7a3174e80d4305b9dd4b1a288fb59b, rank 10 out of 12 processors
Hello world from processor db7a3174e80d4305b9dd4b1a288fb59b, rank 9 out of 12 processors
Hello world from processor db7a3174e80d4305b9dd4b1a288fb59b, rank 7 out of 12 processors
Hello world from processor db7a3174e80d4305b9dd4b1a288fb59b, rank 11 out of 12 processors
Hello world from processor db7a3174e80d4305b9dd4b1a288fb59b, rank 5 out of 12 processors
@keszybz, who is Fedora's mpich package maintainer?
That'd be me. But I only picked up mpich because nobody else wanted it. I'm not qualified to fix real issues.
@opoplawski any ideas? Otherwise, I will ping the developer mailing list.
Hmm, https://koji.fedoraproject.org/koji/taskinfo?taskID=125191181, hello world worked:
I can't find the source code within the build logs. In the hello-world example, do you have any code other than MPI init and the print statement? You should add at least an MPI_Barrier, i.e.:
int me, nranks;
MPI_Comm_rank(MPI_COMM_WORLD, &me);      // this rank's id
MPI_Comm_size(MPI_COMM_WORLD, &nranks);  // total number of ranks
for (int i = 0; i < nranks; i++) {
    if (me == i)
        std::cout << "hello from rank: " << me << std::endl;
    MPI_Barrier(MPI_COMM_WORLD);         // synchronize before the next rank prints
}
That will call the method that failed in the heffte logs, and the ranks should print in sequential order, i.e., 0, 1, 2, 3, ...
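For reference, a self-contained version of that check might look like this (a sketch in plain C mirroring the snippet above; the file name and exact prints are illustrative):
/* ordered_hello.c -- illustrative sketch: hello world plus the suggested
   MPI_Barrier loop, so the "hello from rank:" lines should come out in order. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int me, nranks, namelen;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
    MPI_Get_processor_name(name, &namelen);
    printf("Hello world from processor %s, rank %d out of %d processors\n", name, me, nranks);

    for (int i = 0; i < nranks; i++) {
        if (me == i)
            printf("hello from rank: %d\n", me);
        MPI_Barrier(MPI_COMM_WORLD);  /* everyone waits before the next rank prints */
    }

    MPI_Finalize();
    return 0;
}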
Ok, I made it print the source and added the suggested loop as well in https://koji.fedoraproject.org/koji/taskinfo?taskID=125198447
Hmm
Hello world from processor 9db9bf1ab4534e4e9507c05c42b72e39, rank 0 out of 12 processors
hello from rank: 0
Hello world from processor 9db9bf1ab4534e4e9507c05c42b72e39, rank 2 out of 12 processors
Hello world from processor 9db9bf1ab4534e4e9507c05c42b72e39, rank 4 out of 12 processors
Hello world from processor 9db9bf1ab4534e4e9507c05c42b72e39, rank 1 out of 12 processors
Hello world from processor 9db9bf1ab4534e4e9507c05c42b72e39, rank 5 out of 12 processors
Hello world from processor 9db9bf1ab4534e4e9507c05c42b72e39, rank 8 out of 12 processors
Hello world from processor 9db9bf1ab4534e4e9507c05c42b72e39, rank 10 out of 12 processors
Hello world from processor 9db9bf1ab4534e4e9507c05c42b72e39, rank 6 out of 12 processors
Hello world from processor 9db9bf1ab4534e4e9507c05c42b72e39, rank 11 out of 12 processors
Hello world from processor 9db9bf1ab4534e4e9507c05c42b72e39, rank 3 out of 12 processors
Hello world from processor 9db9bf1ab4534e4e9507c05c42b72e39, rank 9 out of 12 processors
Hello world from processor 9db9bf1ab4534e4e9507c05c42b72e39, rank 7 out of 12 processors
hello from rank: 1
hello from rank: 2
error: Bad exit status from /var/tmp/rpm-tmp.ttBmu8 (%build)
Bad exit status from /var/tmp/rpm-tmp.ttBmu8 (%build)
RPM build errors:
Child return code was: 1
The calls that read the process info, rank, comm size, etc. do not require actual communication, only on-node work. The log shows a crash on the second call to MPI_Barrier(), so the hello world is now failing here as well, while it works on the other systems.
You can play around with send/recv to see how those act and if they work properly, but there's something wrong with MPI in this environment.
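If useful, a minimal point-to-point check could look like the following (a sketch; the tag, payload, and file name are arbitrary):
/* sendrecv_check.c -- illustrative sketch: a simple round trip between ranks 0
   and 1 to see whether basic MPI_Send/MPI_Recv work in this environment. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int me, nranks, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    if (nranks < 2) {
        if (me == 0) printf("need at least 2 ranks\n");
        MPI_Finalize();
        return 0;
    }
    if (me == 0) {
        value = 42;  /* arbitrary payload */
        MPI_Send(&value, 1, MPI_INT, 1, 7, MPI_COMM_WORLD);
        MPI_Recv(&value, 1, MPI_INT, 1, 7, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 0 got back %d (expected 43)\n", value);
    } else if (me == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 7, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        value += 1;  /* change it so the round trip is visible */
        MPI_Send(&value, 1, MPI_INT, 0, 7, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}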
Yeah, that will need some deeper investigation.
I would just go ahead with v2.4.1 and not wait for this issue.
From https://koji.fedoraproject.org/koji/taskinfo?taskID=125093717:
Full build log: build_s390x.log.txt.zip
It says v2.4.0, but it is actually c7c8f69ce78395040a2690bcb6984299449176ce. Aarch64, ppc64le, and x86_64 work.