open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

OpenMPI hangs during allocation of shared memory if done after allgather #12123

Open arunjose696 opened 11 months ago

arunjose696 commented 11 months ago

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

I am using v5.0.0

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Installed from the below tarball

```sh
curl -O https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-5.0.0.tar.bz2
tar -jxf openmpi-5.0.0.tar.bz2
export PATH=/localdisk/yigoshev/mpi/openmpi-5.0.0-built/bin:$PATH
cd openmpi-5.0.0/
./configure --prefix=<path_to_ompi>
make -j44 all
pip install sphinx_rtd_theme # for some reason openmpi requires this package to install
pip install recommonmark # for some reason openmpi requires this package to install
make -j44 all
make install
export PATH=<path_to_ompi>/bin:$PATH
pip install --no-cache-dir mpi4py
```
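
As a sanity check (not part of the original report), one way to confirm which Open MPI build mpi4py actually picked up at runtime is to print the library version string:

```python
# Hypothetical sanity check: print the MPI library mpi4py is linked against
# at runtime, to confirm the intended Open MPI build is on the PATH.
from mpi4py import MPI

print(MPI.Get_library_version())  # e.g. "Open MPI v5.0.0, ..."
print(MPI.Get_version())          # MPI standard version, e.g. (3, 1)
```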

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

Please describe the system on which you are running

-----------------------------

## Details of the problem

Open MPI hangs when allocating shared memory on **Intel(R) Xeon(R) Platinum 8468** when >= 128 processes are spawned and an allgather() is called before the shared-memory allocation.

Further triage:
1) The code works fine for nprocs_to_spawn < 128; the hang occurs only with a high number of CPUs.
2) The issue occurs when shared memory is allocated after a collective call (the allgather in the reproducer below); if that call is commented out, the hang does not occur.
3) The issue is absent on other CPUs (e.g., Intel(R) Xeon(R) Platinum 8276L).

```python
import sys

from mpi4py import MPI  # noqa: E402

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
parent_comm = MPI.Comm.Get_parent()
info = MPI.Info.Create()

if parent_comm == MPI.COMM_NULL:
    # Parent: spawn the workers and merge the resulting intercommunicator.
    nprocs_to_spawn = 128  # everything works on 127 and lower values
    args = ["reproducer.py"]

    intercomm = MPI.COMM_SELF.Spawn(
        sys.executable,
        args,
        maxprocs=nprocs_to_spawn,
        info=info,
        root=rank,
    )
    comm = intercomm.Merge(high=False)
else:
    # Child: merge with the parent.
    comm = parent_comm.Merge(high=True)

# The code works if the allgather below is commented out.
ranks = comm.allgather(comm.Get_rank())

win = MPI.Win.Allocate_shared(
    100 if rank == 1 else 0,
    MPI.BYTE.size,
    comm=comm,
    info=info,
)
```

To run:

```sh
mpiexec -n 1 --oversubscribe python reproducer.py
```
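
For readers unfamiliar with the shared-window API used in the reproducer, here is a minimal standalone sketch of the Allocate_shared / Shared_query pattern (independent of the spawn/merge setup above; the size, dtype, and file name are arbitrary illustrative choices):

```python
# Standalone illustration of MPI.Win.Allocate_shared / Win.Shared_query
# (not the reproducer; run with e.g. `mpiexec -n 4 python shared_demo.py`).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Only rank 0 contributes memory; the other ranks allocate zero-sized segments.
size = 100 if rank == 0 else 0
win = MPI.Win.Allocate_shared(size, MPI.BYTE.size, comm=comm)

# Every rank maps rank 0's segment into its own address space.
buf, itemsize = win.Shared_query(0)
shared = np.ndarray(buffer=buf, dtype=np.uint8, shape=(100,))

win.Fence()
if rank == 0:
    shared[:] = 42            # writer fills the shared segment
win.Fence()                   # synchronize so readers see the write
print(rank, int(shared[0]))

win.Free()
```

In the reproducer only one rank contributes a nonzero size, and the hang occurs inside Allocate_shared itself, i.e. before any query/use stage like the one above is reached.
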
wenduwan commented 10 months ago

@arunjose696 Curious if you have tried 4.1.4/5/6 in addition to 5.0.0? It would be very helpful to determine the impact.

YarShev commented 10 months ago

@wenduwan, it also hangs on my side with 4.1.5.

hppritcha commented 10 months ago

I'm not sure why, but all of these related tests error out for me in the dpm cleanup code. Example with this one:

```
[st-master][[19684,1],0][btl_tcp_proc.c:400:mca_btl_tcp_proc_create] opal_modex_recv: failed with return value=-46
[st-master][[19684,1],0][btl_tcp_proc.c:400:mca_btl_tcp_proc_create] opal_modex_recv: failed with return value=-46
1 more process has sent help message help-mca-bml-r2.txt / unreachable proc
[st-master][[19684,1],0][btl_tcp_proc.c:400:mca_btl_tcp_proc_create] opal_modex_recv: failed with return value=-46
[st-master:1498089] dpm_disconnect_init: error -12 in isend to process 3
[st-master:1498089] Error in comm_disconnect_waitall
[st-master:1498089:0:1498089] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10)
==== backtrace (tid:1498089) ====
 0 0x000000000007a6e0 ompi_dpm_dyn_finalize()  :0
 1 0x000000000006120c ompi_comm_finalize()  :0
 2 0x000000000002fff8 opal_finalize_cleanup_domain()  :0
 3 0x0000000000025ddc opal_finalize()  :0
 4 0x00000000000912c8 ompi_rte_finalize()  :0
 5 0x0000000000098a10 ompi_mpi_instance_finalize_common()  :0
 6 0x000000000009a320 ompi_mpi_instance_finalize()  :0
 7 0x000000000008d868 ompi_mpi_finalize()  :0
 8 0x00000000000cc6d8 __pyx_f_6mpi4py_3MPI_atexit()  /users/hpritchard/mpi4py_sandbox/mpi4py/src/mpi4py/MPI.c:22520
 9 0x0000000000208970 Py_FinalizeEx()  ???:0
10 0x000000000020a128 Py_Main()  ???:0
11 0x0000000000000d08 main()  ???:0
12 0x0000000000024384 __libc_start_main()  :0
13 0x0000000000000ea0 _start()  ???:0
=================================
--------------------------------------------------------------------------
```

One thing I noticed is that if your Open MPI build happened to find UCX and configure it in, I see a hang rather than an abort.

I set

```sh
export OMPI_MCA_btl=^uct
```

and got what I'm reporting above. In previous responses to these test cases I had explicitly disabled ucx support and hence only saw this abort.

The problem appears to be that the dpm cleanup code is assuming all-to-all connectivity during the stage of Open MPI finalization where it was invoked.
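
If it is the implicit teardown at MPI_Finalize that trips over the missing connectivity, one possible (untested) workaround on the mpi4py side would be to release the dynamically created objects explicitly before the interpreter exits, so the dpm cleanup path has less to do. A sketch, reusing the names from the reproducer above:

```python
# Hypothetical explicit teardown (untested against this reproducer): free the
# shared window and disconnect the spawn-related communicators ourselves
# before MPI_Finalize runs at interpreter exit.
win.Free()                        # collective over the merged communicator

comm.Disconnect()                 # the merged intracommunicator
if parent_comm == MPI.COMM_NULL:
    intercomm.Disconnect()        # parent side: communicator returned by Spawn()
else:
    parent_comm.Disconnect()      # child side: communicator to the parent
```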

hppritcha commented 10 months ago

The test does not hang for me, nor does it show this issue with the DPM cleanup code, on the 4.1.x branch (which is effectively 4.1.6).

arunjose696 commented 10 months ago

I tried 4.1.6 from conda and could observe the same hang.

Did you try this on an Intel(R) Xeon(R) Platinum 8468 machine? As mentioned earlier, the test code in the issue passes for me on other CPUs (e.g., Intel(R) Xeon(R) Platinum 8276L). Could this be a CPU-related issue?
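
One way to pin down the CPU hypothesis might be to have every rank report its exact CPU model, so runs on the 8468 and 8276L machines can be compared directly. A small, Linux-specific sketch (a hypothetical helper, not part of the reproducer; it assumes the `comm` from the script above):

```python
# Hypothetical helper (not part of the reproducer): report each rank's CPU
# model so runs on different machines can be compared. Linux-specific.
import platform


def cpu_model():
    try:
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("model name"):
                    return line.split(":", 1)[1].strip()
    except OSError:
        pass
    return platform.processor()  # fallback on non-Linux systems


models = comm.gather((comm.Get_rank(), cpu_model()), root=0)
if comm.Get_rank() == 0:
    for r, m in sorted(models):
        print(r, m)
```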