Closed: DanielBThomas closed this issue 5 months ago
Hi @tskisner, we've been seeing a TOAST 3 unit test failure; an excerpt is below:
test_observation (toast.tests.observation.ObservationTest) ... [6]fail [7]fail [2]fail [3]fail Abort(1) on node 6 (rank 6 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 6
Proc 6: Traceback (most recent call last):
Proc 6: File "/cvmfs/northgrid.gridpp.ac.uk/simonsobservatory/pmpm/so-pmpm-py310-mkl-x86-64-v3-mpich-latest/lib/python3.10/unittest/case.py", line 59, in testPartExecutor
yield
Proc 6: File "/cvmfs/northgrid.gridpp.ac.uk/simonsobservatory/pmpm/so-pmpm-py310-mkl-x86-64-v3-mpich-latest/lib/python3.10/unittest/case.py", line 591, in run
self._callTestMethod(testMethod)
Proc 6: File "/cvmfs/northgrid.gridpp.ac.uk/simonsobservatory/pmpm/so-pmpm-py310-mkl-x86-64-v3-mpich-latest/lib/python3.10/unittest/case.py", line 549, in _callTestMethod
method()
Proc 6: File "/cvmfs/northgrid.gridpp.ac.uk/simonsobservatory/pmpm/so-pmpm-py310-mkl-x86-64-v3-mpich-latest/lib/python3.10/site-packages/toast/tests/observation.py", line 217, in test_observation
np.testing.assert_equal(obs.shared["all_A"][:], all_common)
Proc 6: File "/cvmfs/northgrid.gridpp.ac.uk/simonsobservatory/pmpm/so-pmpm-py310-mkl-x86-64-v3-mpich-latest/lib/python3.10/site-packages/numpy/testing/_private/utils.py", line 282, in assert_equal
return assert_array_equal(actual, desired, err_msg, verbose)
Proc 6: File "/cvmfs/northgrid.gridpp.ac.uk/simonsobservatory/pmpm/so-pmpm-py310-mkl-x86-64-v3-mpich-latest/lib/python3.10/site-packages/numpy/testing/_private/utils.py", line 920, in assert_array_equal
assert_array_compare(operator.__eq__, x, y, err_msg=err_msg,
Proc 6: File "/cvmfs/northgrid.gridpp.ac.uk/simonsobservatory/pmpm/so-pmpm-py310-mkl-x86-64-v3-mpich-latest/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
Proc 6: File "/cvmfs/northgrid.gridpp.ac.uk/simonsobservatory/pmpm/so-pmpm-py310-mkl-x86-64-v3-mpich-latest/lib/python3.10/site-packages/numpy/testing/_private/utils.py", line 797, in assert_array_compare
raise AssertionError(msg)
Proc 6: AssertionError:
Arrays are not equal
Mismatched elements: 24 / 24 (100%)
Max absolute difference: 0.80042836
Max relative difference: 3.73915831
x: array([[[0.697879, 0.54187 , 0.043587, 0.074629],
[0.717019, 0.661852, 0.524683, 0.446253],
[0.066302, 0.744757, 0.361155, 0.205344]],...
y: array([[[0.518774, 0.991551, 0.715202, 0.875058],
[0.692766, 0.499167, 0.658846, 0.185652],
[0.51129 , 0.15715 , 0.308961, 0.07863 ]],...
Is this a known problem? If I remember correctly (I'm not entirely certain here), the failure does not appear when running with MPI on a single node, with either OpenMPI or MPICH, but it does occur on multi-node runs with either implementation.
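For illustration (this is a sketch, not TOAST's actual code): the symptom above, where the arrays have the same shape but all 24 of 24 elements mismatch, is what you would expect if a node regenerated the random shared data locally instead of receiving the single copy created on the originating rank. A minimal NumPy analogue:

```python
import numpy as np

# Hypothetical sketch (not TOAST code): the test expects every process to
# hold an identical copy of randomly generated shared data. If a node
# regenerates the data independently instead of receiving the original
# copy, the shapes agree but essentially every element differs.
shape = (2, 3, 4)  # 24 elements, matching the "24 / 24 (100%)" mismatch

root_copy = np.random.default_rng(0).random(shape)   # data created on "node 0"
stale_copy = np.random.default_rng(1).random(shape)  # independently regenerated

assert root_copy.shape == stale_copy.shape           # shapes agree...
try:
    np.testing.assert_equal(stale_copy, root_copy)   # ...but contents do not
except AssertionError:
    print("Arrays are not equal")
```

That pattern would also explain why single-node runs pass: within one node, shared memory gives every process a view of the same physical buffer.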
There are no known problems with that test case. The fact that it only shows up on jobs that span nodes makes it "feel" like an issue with the MPI or mpi4py installation / configuration. Do the mpi4py unit tests run successfully across multiple nodes?
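One quick way to answer that question (a sketch, assuming a Slurm-style launcher and that mpi4py is importable in the loaded environment) is mpi4py's built-in benchmark commands, which exercise basic communication across nodes independently of TOAST:

```shell
# Job-script fragment (sketch): run mpi4py's built-in checks across 2 nodes.
# If these fail or hang, the problem is in the MPI / mpi4py stack, not TOAST.
srun -N 2 -n 8 python -m mpi4py.bench helloworld
srun -N 2 -n 8 python -m mpi4py.bench ringtest
```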
@tskisner, one question: would the test fail if the number of MPI processes is larger than 4?
Background: I noticed that the MPI test example in the docs uses 4 MPI processes, and that there are only 4 detectors in the test case: https://github.com/hpc4cmb/toast/blob/toast3/src/toast/tests/observation.py#L35
I reran those tests with 4 MPI processes and they pass. The test was probably written with 4 or fewer MPI processes in mind.
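If the process count is indeed the trigger, one plausible mechanism (a sketch only; TOAST's actual distribution logic may differ) is that splitting 4 detectors over more than 4 processes leaves some ranks with an empty detector set:

```python
# Hypothetical round-robin split of detectors over MPI ranks. This is an
# illustration, not TOAST's actual distribution code.
DETECTORS = ["det_0", "det_1", "det_2", "det_3"]  # 4 detectors, as in the test

def round_robin(items, nprocs):
    """Return the sublist of items assigned to each of nprocs ranks."""
    return [items[rank::nprocs] for rank in range(nprocs)]

print(round_robin(DETECTORS, 4))  # each of 4 ranks gets one detector
print(round_robin(DETECTORS, 6))  # ranks 4 and 5 get an empty list
```

A test that implicitly assumes every rank holds at least one detector could then misbehave on the idle ranks.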
See attached files: the job runs, but many tests fail, including a number of assertion errors. The loaded CVMFS environment is MPICH, not OpenMPI.
mpi_singlenode.sh.txt mpi_singlenode.ini.txt mpi_singlenode.out.txt mpi_singlenode.err.txt mpi_singlenode.log.txt
Edit by @ickc: relevant page: https://docs.souk.ac.uk/en/latest/user/pipeline/3-MPI-applications/0-Vanilla-MPI/