inducer / meshmode

High-order unstructured mesh representation and discrete function spaces
https://documen.tician.de/meshmode/

`libfabric=1.17.0-3` on Debian causes MPI tests to fail with `MPI_ERR_OTHER` #370

Open inducer opened 1 year ago

inducer commented 1 year ago

Sample CI failure: https://gitlab.tiker.net/inducer/meshmode/-/jobs/533461

Similar failure in grudge: https://gitlab.tiker.net/inducer/grudge/-/jobs/533485

Sample traceback:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/var/lib/gitlab-runner/builds/VFPjm48d/0/inducer/meshmode/.env/lib/python3.11/site-packages/mpi4py/run.py", line 208, in <module>
    main()
  File "/var/lib/gitlab-runner/builds/VFPjm48d/0/inducer/meshmode/.env/lib/python3.11/site-packages/mpi4py/run.py", line 198, in main
    run_command_line(args)
  File "/var/lib/gitlab-runner/builds/VFPjm48d/0/inducer/meshmode/.env/lib/python3.11/site-packages/mpi4py/run.py", line 47, in run_command_line
    run_path(sys.argv[0], run_name='__main__')
  File "<frozen runpy>", line 291, in run_path
  File "<frozen runpy>", line 98, in _run_module_code
  File "<frozen runpy>", line 88, in _run_code
  File "/var/lib/gitlab-runner/builds/VFPjm48d/0/inducer/meshmode/test/test_partition.py", line 609, in <module>
    _test_mpi_boundary_swap(dim, order, num_groups)
  File "/var/lib/gitlab-runner/builds/VFPjm48d/0/inducer/meshmode/test/test_partition.py", line 426, in _test_mpi_boundary_swap
    conns = bdry_setup_helper.complete_some()
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/gitlab-runner/builds/VFPjm48d/0/inducer/meshmode/meshmode/distributed.py", line 332, in complete_some
    data = [self._internal_mpi_comm.recv(status=status)]
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "mpi4py/MPI/Comm.pyx", line 1438, in mpi4py.MPI.Comm.recv
  File "mpi4py/MPI/msgpickle.pxi", line 341, in mpi4py.MPI.PyMPI_recv
  File "mpi4py/MPI/msgpickle.pxi", line 303, in mpi4py.MPI.PyMPI_recv_match
mpi4py.MPI.Exception: MPI_ERR_OTHER: known error not in list

Downgrading libfabric (see here) appears to resolve this.

This is the code in mpi4py that ultimately fails; it is a matched receive (mrecv).
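For context, here is a minimal standalone sketch (illustration only, not meshmode code) of the matched-probe receive pattern that `Comm.recv` uses internally when `recv_mprobe` is enabled; the failure above occurs inside this `MPI_Mprobe`/`MPI_Mrecv` path rather than in a plain `MPI_Recv`:

```python
# Illustration of the matched-probe path that mpi4py's Comm.recv takes when
# mpi4py.rc.recv_mprobe is enabled (MPI_Mprobe + MPI_Mrecv underneath).
# Run with e.g.: mpiexec -n 2 python mprobe_sketch.py
from mpi4py import MPI

comm = MPI.COMM_WORLD

if comm.rank == 0:
    comm.send({"payload": 42}, dest=1, tag=7)
elif comm.rank == 1:
    status = MPI.Status()
    # Matched probe: atomically finds an incoming message and returns a
    # Message handle, so no other receive can "steal" it.
    msg = comm.mprobe(source=MPI.ANY_SOURCE, tag=MPI.ANY_TAG, status=status)
    # Matched receive of exactly that message.
    data = msg.recv()
    print(f"rank 1 got {data} from rank {status.source}")
```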

@majosm Got any ideas? (Pinging you since the two of us last touched this code.)

matthiasdiener commented 1 year ago

Maybe this could be a workaround: we disable mpi4py's mprobe in mirgecom due to a similar crash (observed with Spectrum MPI, https://github.com/illinois-ceesd/mirgecom/issues/132):

https://github.com/illinois-ceesd/mirgecom/blob/babc6d2b9859719a3ba4a45dc91a6915583f175d/mirgecom/mpi.py#L183-L186
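For reference, a minimal sketch of that kind of setting (not necessarily identical to the mirgecom code at the permalink): mpi4py's documented `mpi4py.rc.recv_mprobe` option has to be flipped before `MPI` is imported and initialized:

```python
# Sketch of the mprobe-disabling workaround: configure mpi4py.rc before the
# MPI submodule is imported, so recv() falls back to MPI_Probe/MPI_Recv
# instead of MPI_Mprobe/MPI_Mrecv.
import mpi4py
mpi4py.rc.recv_mprobe = False

from mpi4py import MPI  # noqa: E402  (import only after configuring mpi4py.rc)

comm = MPI.COMM_WORLD
if comm.rank == 0:
    print("recv_mprobe disabled:", not mpi4py.rc.recv_mprobe)
```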

inducer commented 1 year ago

Thanks for the tip! Though it seems that setting `recv_mprobe = False` does not avoid this particular issue.

inducer commented 1 year ago

Exciting news: while I don't know exactly what the issue is, OpenMPI 4.1.5-1 seems to include a fix that makes it work properly with the previously offending version of libfabric1.
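For anyone hitting this later: one way to confirm which MPI implementation and version a given environment (e.g. a CI runner) actually uses is via mpi4py, something like:

```python
# Quick check of the MPI library in the current environment; useful for
# confirming whether a runner already has the fixed Open MPI release.
from mpi4py import MPI

print(MPI.Get_library_version())        # e.g. an "Open MPI v4.1.x ..." banner
print("MPI standard version:", MPI.Get_version())
```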