simonsobs-uk / data-centre

This tracks the issues in the baseline design of the SO:UK Data Centre at Blackett
https://souk-data-centre.readthedocs.io
BSD 3-Clause "New" or "Revised" License

TOAST test failing with env. so-pmpm-py310-mkl-x86-64-v3-mpich-latest #43

Closed DanielBThomas closed 5 months ago

DanielBThomas commented 7 months ago

See attached files: the job seems to run, but many tests fail, including a number of assertion errors. The loaded CVMFS environment is MPICH, not OpenMPI.

mpi_singlenode.sh.txt mpi_singlenode.ini.txt mpi_singlenode.out.txt mpi_singlenode.err.txt mpi_singlenode.log.txt

Edit by @ickc: relevant page: https://docs.souk.ac.uk/en/latest/user/pipeline/3-MPI-applications/0-Vanilla-MPI/

ickc commented 6 months ago

Hi, @tskisner, we've been seeing a TOAST 3 unit test failure; an excerpt is below:

test_observation (toast.tests.observation.ObservationTest) ... [6]fail [7]fail [2]fail [3]fail Abort(1) on node 6 (rank 6 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 6

Proc 6: Traceback (most recent call last):
Proc 6:   File "/cvmfs/northgrid.gridpp.ac.uk/simonsobservatory/pmpm/so-pmpm-py310-mkl-x86-64-v3-mpich-latest/lib/python3.10/unittest/case.py", line 59, in testPartExecutor
    yield
Proc 6:   File "/cvmfs/northgrid.gridpp.ac.uk/simonsobservatory/pmpm/so-pmpm-py310-mkl-x86-64-v3-mpich-latest/lib/python3.10/unittest/case.py", line 591, in run
    self._callTestMethod(testMethod)
Proc 6:   File "/cvmfs/northgrid.gridpp.ac.uk/simonsobservatory/pmpm/so-pmpm-py310-mkl-x86-64-v3-mpich-latest/lib/python3.10/unittest/case.py", line 549, in _callTestMethod
    method()
Proc 6:   File "/cvmfs/northgrid.gridpp.ac.uk/simonsobservatory/pmpm/so-pmpm-py310-mkl-x86-64-v3-mpich-latest/lib/python3.10/site-packages/toast/tests/observation.py", line 217, in test_observation
    np.testing.assert_equal(obs.shared["all_A"][:], all_common)
Proc 6:   File "/cvmfs/northgrid.gridpp.ac.uk/simonsobservatory/pmpm/so-pmpm-py310-mkl-x86-64-v3-mpich-latest/lib/python3.10/site-packages/numpy/testing/_private/utils.py", line 282, in assert_equal
    return assert_array_equal(actual, desired, err_msg, verbose)
Proc 6:   File "/cvmfs/northgrid.gridpp.ac.uk/simonsobservatory/pmpm/so-pmpm-py310-mkl-x86-64-v3-mpich-latest/lib/python3.10/site-packages/numpy/testing/_private/utils.py", line 920, in assert_array_equal
    assert_array_compare(operator.__eq__, x, y, err_msg=err_msg,
Proc 6:   File "/cvmfs/northgrid.gridpp.ac.uk/simonsobservatory/pmpm/so-pmpm-py310-mkl-x86-64-v3-mpich-latest/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
Proc 6:   File "/cvmfs/northgrid.gridpp.ac.uk/simonsobservatory/pmpm/so-pmpm-py310-mkl-x86-64-v3-mpich-latest/lib/python3.10/site-packages/numpy/testing/_private/utils.py", line 797, in assert_array_compare
    raise AssertionError(msg)
Proc 6: AssertionError: 
Arrays are not equal

Mismatched elements: 24 / 24 (100%)
Max absolute difference: 0.80042836
Max relative difference: 3.73915831
 x: array([[[0.697879, 0.54187 , 0.043587, 0.074629],
        [0.717019, 0.661852, 0.524683, 0.446253],
        [0.066302, 0.744757, 0.361155, 0.205344]],...
 y: array([[[0.518774, 0.991551, 0.715202, 0.875058],
        [0.692766, 0.499167, 0.658846, 0.185652],
        [0.51129 , 0.15715 , 0.308961, 0.07863 ]],...

Is this a known problem? If I remember correctly (though I'm not entirely certain here), the failure does not appear when running MPI on a single node with either OpenMPI or MPICH, but it does occur across multiple nodes with either OpenMPI or MPICH.

tskisner commented 6 months ago

There are no known problems with that test case. The fact that it only shows up on jobs that span nodes makes it "feel" like an issue with the MPI or mpi4py installation / configuration. Do the mpi4py unit tests run successfully across multiple nodes?
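One quick way to check this is mpi4py's built-in benchmark module (`python -m mpi4py.bench`); the launcher name, flags, and process count below are assumptions and should be adapted to the site's batch system:

```shell
# Sanity-check mpi4py itself across nodes before suspecting TOAST.
# "mpirun -np 8" is an assumed invocation; adapt launcher and
# process count to the cluster's scheduler and node layout.
mpirun -np 8 python -m mpi4py.bench helloworld
mpirun -np 8 python -m mpi4py.bench ringtest
```

If `helloworld` or `ringtest` hangs or aborts when the ranks span more than one node, the problem is in the MPI/mpi4py layer rather than in TOAST.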

ickc commented 5 months ago

@tskisner, one question: would the test fail if the number of MPI processes is larger than 4?

Background: I realized that the MPI test example in the docs uses 4 MPI processes, and then noticed that there are only 4 detectors in the test case: https://github.com/hpc4cmb/toast/blob/toast3/src/toast/tests/observation.py#L35
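A tiny pure-Python sketch (a hypothetical helper, not TOAST code) of why a fixture with 4 detectors can behave differently once the communicator has more than 4 ranks: a balanced split of 4 detectors leaves some ranks with no detectors at all, and those empty ranks may exercise code paths the test was never written for.

```python
def distribute_evenly(n_items: int, n_ranks: int) -> list[int]:
    """Number of items assigned to each rank under a balanced split.

    Hypothetical illustration only; TOAST's actual detector
    distribution logic lives in its own distribution classes.
    """
    base, extra = divmod(n_items, n_ranks)
    return [base + (1 if rank < extra else 0) for rank in range(n_ranks)]

# With 4 detectors and 4 ranks, every rank owns exactly one detector.
print(distribute_evenly(4, 4))  # [1, 1, 1, 1]

# With 4 detectors and 6 ranks, two ranks own no detectors at all.
print(distribute_evenly(4, 6))  # [1, 1, 1, 1, 0, 0]
```

Ranks that own zero detectors still participate in collective operations such as shared-data broadcasts, which is consistent with only a subset of ranks failing the `obs.shared` assertion in the excerpt above.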

ickc commented 5 months ago

I reran those tests with 4 MPI processes and they pass. The test is probably only written for 4 or fewer MPI processes.
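For reference, one way to launch the TOAST unit test suite under MPI with a matching process count (the `toast.tests.run()` entry point is TOAST's documented test runner; the launcher and its flags are assumptions to be adapted to the site):

```shell
# Run the TOAST 3 unit tests with exactly 4 MPI processes,
# matching the 4 detectors used by the observation test fixture.
mpirun -np 4 python -c 'import toast.tests; toast.tests.run()'
```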