open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

armci-mpi fails in multi-node execution with openmpi #10026

Open drew-parsons opened 2 years ago

drew-parsons commented 2 years ago

Background information

What version of Open MPI are you using?

4.1.2

Describe how Open MPI was installed

Debian packages 4.1.2-1 (official .deb for Debian testing), https://packages.debian.org/bookworm/libopenmpi-dev

Please describe the system on which you are running

Debian testing nodes in an OpenStack-managed cluster, 16 CPUs per node.

Details of the problem

armci-mpi currently fails to run across multiple nodes when built with Open MPI. It runs fine on a single node. The nodes form an OpenStack-managed cluster with 16 CPUs per node.

The problem was reported against armci-mpi at https://github.com/pmodels/armci-mpi/issues/33. armci-mpi does not fail when built with MPICH, so the armci-mpi developers attribute the error to a bug in Open MPI's RMA support.

The armci-mpi tests pass when run on a single node (with MPIEXEC="mpiexec -n 2"). Most of them fail when run across two nodes (e.g. with MPIEXEC="mpiexec -H host-1:1,host-2:1 -n 2").

Running one of the failing tests manually (with or without ARMCI_USE_WIN_ALLOCATE) gives errors:

$ mpirun.openmpi -H node-1:1,node-2:1 -n 2   tests/contrib/non-blocking/simple
[node-1:53732] *** An error occurred in MPI_Win_allocate
[node-1:53732] *** reported by process [2077097985,0]
[node-1:53732] *** on communicator MPI COMMUNICATOR 3 DUP FROM 0
[node-1:53732] *** MPI_ERR_WIN: invalid window
[node-1:53732] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[node-1:53732] ***    and potentially your MPI job)
[node-1:53727] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[node-1:53727] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

or

$ ARMCI_USE_WIN_ALLOCATE=0  mpirun.openmpi -H node-1:1,node-2:1 -n 2   tests/contrib/non-blocking/simple
[node-1:53740] *** An error occurred in MPI_Win_create
[node-1:53740] *** reported by process [2079719425,0]
[node-1:53740] *** on communicator MPI COMMUNICATOR 3 DUP FROM 0
[node-1:53740] *** MPI_ERR_WIN: invalid window
[node-1:53740] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[node-1:53740] ***    and potentially your MPI job)
[node-1:53735] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[node-1:53735] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

Additionally, when the ARMCI_VERBOSE=1 environment variable is set, the test hangs after reporting its configuration values, before reaching the crash point.
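
For reference, the failing pattern reduces to creating an RMA window on a duplicate of MPI_COMM_WORLD, which is roughly what armci-mpi's gmr_create() does. Below is a minimal standalone sketch of that pattern (my own approximation, not the actual armci-mpi test; the USE_ALLOCATE macro and the file name are only meant to mimic the ARMCI_USE_WIN_ALLOCATE toggle):

/* win_repro.c - minimal sketch approximating armci-mpi's gmr_create():
 * create an RMA window on a duplicate of MPI_COMM_WORLD.
 * Not the actual armci-mpi test; USE_ALLOCATE mimics ARMCI_USE_WIN_ALLOCATE. */
#include <mpi.h>
#include <stdio.h>

#define USE_ALLOCATE 1   /* 0: use MPI_Alloc_mem + MPI_Win_create instead */

int main(int argc, char **argv)
{
    MPI_Comm dup_comm;
    MPI_Win  win;
    void    *base = NULL;
    MPI_Aint size = 1 << 20;   /* 1 MiB per rank */
    int      rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* armci-mpi allocates windows on duplicated communicators, hence the
     * "MPI COMMUNICATOR 3 DUP FROM 0" in the error output above. */
    MPI_Comm_dup(MPI_COMM_WORLD, &dup_comm);

#if USE_ALLOCATE
    MPI_Win_allocate(size, 1, MPI_INFO_NULL, dup_comm, &base, &win);
#else
    MPI_Alloc_mem(size, MPI_INFO_NULL, &base);
    MPI_Win_create(base, size, 1, MPI_INFO_NULL, dup_comm, &win);
#endif

    printf("rank %d: window created\n", rank);  /* only reached on success */

    MPI_Win_free(&win);
#if !USE_ALLOCATE
    MPI_Free_mem(base);
#endif
    MPI_Comm_free(&dup_comm);
    MPI_Finalize();
    return 0;
}

Built with mpicc and launched the same way as the tests above (e.g. mpirun -H node-1:1,node-2:1 -n 2 ./win_repro), this should exercise the same MPI_Win_allocate / MPI_Win_create path.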

Is it feasible to include armci-mpi in Open MPI's CI, with tests run across 2 nodes?

The problem was originally detected in the Debian build of nwchem (7.0.2-1 in Debian testing), which uses armci-mpi. Testing against the sample water script at https://nwchemgit.github.io/Sample.html, the nwchem error message is:

$ mpirun -H node-1:16,node-2:16 -N 16 nwchem water.nw 
[31] ARMCI assert fail in gmr_create() [src/gmr.c:109]: "alloc_slices[alloc_me].base != NULL"
[31] Backtrace:
[31]  10 - nwchem(+0x2836605) [0x55fe1ee26605]
[31]   9 - nwchem(+0x282cc1c) [0x55fe1ee1cc1c]
[31]   8 - nwchem(+0x282c358) [0x55fe1ee1c358]
[31]   7 - nwchem(+0x2819f68) [0x55fe1ee09f68]
[31]   6 - nwchem(+0x2819cba) [0x55fe1ee09cba]
[31]   5 - nwchem(+0x2819d76) [0x55fe1ee09d76]
[31]   4 - nwchem(+0x2818fe9) [0x55fe1ee08fe9]
[31]   3 - nwchem(+0x11b79) [0x55fe1c601b79]
[31]   2 - nwchem(+0x12659) [0x55fe1c602659]
[31]   1 - /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xcd) [0x7fb2c8ffa7ed]
[31]   0 - nwchem(+0x1069a) [0x55fe1c60069a]
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 31 in communicator MPI_COMM_WORLD
with errorcode -1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
An MPI communication peer process has unexpectedly disconnected.  This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).

Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate.  For
example, there may be a core file that you can examine.  More
generally: such peer hangups are frequently caused by application bugs
or other external events.

  Local host: node-1
  Local PID:  1264980
  Peer host:  node-2
--------------------------------------------------------------------------

I've tried a fresh rebuild of armci-mpi, ga and nwchem against Open MPI 4.1.2, but the failure persists. nwchem runs successfully over multiple nodes when built against MPICH.

drew-parsons commented 2 years ago

Following the discussion at https://github.com/open-mpi/ompi/issues/7813#issuecomment-644823020 , I find that my tests pass (both armci-mpi and nwchem) if I set the environment variable

OMPI_MCA_osc=ucx

(my system does not recognize OMPI_MCA_pml)

Is it expected that this setting should be required? If so, then it's just a runtime configuration issue, not a bug.

drew-parsons commented 2 years ago

It's also worth pointing out, however, that even though nwchem/Open MPI does run on 2 nodes with OMPI_MCA_osc=ucx, it is unusably slow.

A test case finishes on 1 node (16 CPUs) in 4000 s, around 1 hour. Running on 2 nodes (2×16 processes), I gave up and killed it after more than 122292 s, about a day and a half.

nwchem/armci-mpi built with MPICH runs the same job over 2 nodes in 2800 s.

devreal commented 2 years ago

@drew-parsons Thanks for the report! It seems there are multiple issues here: the non-UCX OSC component (osc/pt2pt, I guess) failing to allocate the window, and the UCX OSC either performing poorly or getting stuck. I'm not sure why the osc/pt2pt component failed to allocate the window; running with --mca osc_base_verbose 100 could tell us more. Also, could you perhaps try the Open MPI 5.0 release branch? There have been plenty of improvements to the implementations in the upcoming release.

drew-parsons commented 2 years ago

I tried running mpirun with --mca osc_base_verbose 100 (without setting OMPI_MCA_osc=ucx), but it doesn't provide any more information, just the same output:

[31] ARMCI assert fail in gmr_create() [src/gmr.c:109]: "alloc_slices[alloc_me].base != NULL"
[31] Backtrace:
[31]  10 - nwchem(+0x2836605) [0x55591b936605]
[31]   9 - nwchem(+0x282cc1c) [0x55591b92cc1c]
[31]   8 - nwchem(+0x282c358) [0x55591b92c358]
...

Time constraints make it difficult to test v5 for the time being.

ggouaillardet commented 2 years ago

@drew-parsons (most) debugging code is stripped out unless Open MPI is configured with --enable-debug, so I'm afraid you would have to rebuild Open MPI manually (and then your app, since debug and non-debug Open MPI builds might not be ABI compatible) in order to get useful logs.
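
If a rebuild is not practical straight away, one cheap intermediate check (a sketch of my own, not taken from armci-mpi) is to ask MPI to return error codes instead of aborting, and print the error string from the failing window call; that at least shows which error class the OSC component returns, even with a non-debug build:

/* Sketch: capture the error from MPI_Win_allocate instead of aborting. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Win  win;
    void    *base;
    char     msg[MPI_MAX_ERROR_STRING];
    int      rc, len;

    MPI_Init(&argc, &argv);

    /* Return error codes instead of invoking MPI_ERRORS_ARE_FATAL, so the
     * error class from the failing window allocation can be printed. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    rc = MPI_Win_allocate(1 << 20, 1, MPI_INFO_NULL, MPI_COMM_WORLD,
                          &base, &win);
    if (rc != MPI_SUCCESS) {
        MPI_Error_string(rc, msg, &len);
        fprintf(stderr, "MPI_Win_allocate failed: %s\n", msg);
    } else {
        MPI_Win_free(&win);
    }

    MPI_Finalize();
    return 0;
}

This is no substitute for the --enable-debug logs, but it can narrow down which error the window allocation actually returns.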

Meanwhile, what happens if you run

mpirun --mca pml ucx ...

or

mpirun --mca pml ob1 ...

@devreal isn't there a known issue where osc/pt2pt uses btl/openib, which might not be working?

drew-parsons commented 2 years ago

Unfortunately, the pml options don't shed more light (without rebuilding Open MPI yet):

$ mpirun.openmpi --mca pml ucx -H host-2:1,host-3:1 -n 2   tests/contrib/non-blocking/simple
--------------------------------------------------------------------------
No components were able to be opened in the pml framework.

This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.

  Host:      host-2
  Framework: pml
--------------------------------------------------------------------------
[host-2:43451] PML ucx cannot be selected
[host-3:26169] PML ucx cannot be selected
[host-2:43446] 1 more process has sent help message help-mca-base.txt / find-available:none found
[host-2:43446] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

or

$ mpirun.openmpi --mca pml ob1 -H host-2:1,host-3:1 -n 2   tests/contrib/non-blocking/simple
[host-2:43460] *** An error occurred in MPI_Win_allocate
[host-2:43460] *** reported by process [2976317441,0]
[host-2:43460] *** on communicator MPI COMMUNICATOR 3 DUP FROM 0
[host-2:43460] *** MPI_ERR_WIN: invalid window
[host-2:43460] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[host-2:43460] ***    and potentially your MPI job)
[host-2:43455] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[host-2:43455] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

ggouaillardet commented 2 years ago

This is very puzzling ... the osc/ucx component is there and usable (since OMPI_MCA_osc=ucx does not cause an error), but pml/ucx is either absent or unusable.

Can you

mpirun --mca pml_base_verbose 10 --mca pml ucx ...

and see if it sheds some light on why pml/ucx cannot be used?