arunjose696 opened 11 months ago
@arunjose696 Curious if you have tried 4.1.4/5/6 in addition to 5.0.0? It would be very helpful to determine the impact.
@wenduwan, it also hangs on my side with 4.1.5.
I'm not sure why, but all of these related tests error out for me in the dpm cleanup code. Example with this one:
[st-master][[19684,1],0][btl_tcp_proc.c:400:mca_btl_tcp_proc_create] opal_modex_recv: failed with return value=-46
[st-master][[19684,1],0][btl_tcp_proc.c:400:mca_btl_tcp_proc_create] opal_modex_recv: failed with return value=-46
1 more process has sent help message help-mca-bml-r2.txt / unreachable proc
[st-master][[19684,1],0][btl_tcp_proc.c:400:mca_btl_tcp_proc_create] opal_modex_recv: failed with return value=-46
[st-master:1498089] dpm_disconnect_init: error -12 in isend to process 3
[st-master:1498089] Error in comm_disconnect_waitall
[st-master:1498089:0:1498089] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10)
==== backtrace (tid:1498089) ====
0 0x000000000007a6e0 ompi_dpm_dyn_finalize() :0
1 0x000000000006120c ompi_comm_finalize() :0
2 0x000000000002fff8 opal_finalize_cleanup_domain() :0
3 0x0000000000025ddc opal_finalize() :0
4 0x00000000000912c8 ompi_rte_finalize() :0
5 0x0000000000098a10 ompi_mpi_instance_finalize_common() :0
6 0x000000000009a320 ompi_mpi_instance_finalize() :0
7 0x000000000008d868 ompi_mpi_finalize() :0
8 0x00000000000cc6d8 __pyx_f_6mpi4py_3MPI_atexit() /users/hpritchard/mpi4py_sandbox/mpi4py/src/mpi4py/MPI.c:22520
9 0x0000000000208970 Py_FinalizeEx() ???:0
10 0x000000000020a128 Py_Main() ???:0
11 0x0000000000000d08 main() ???:0
12 0x0000000000024384 __libc_start_main() :0
13 0x0000000000000ea0 _start() ???:0
=================================
--------------------------------------------------------------------------
One thing I noticed: if your Open MPI build happened to find UCX and configure it in, I'm seeing a hang rather than an abort.
I set
export OMPI_MCA_btl=^uct
and got what I'm reporting above. In previous responses to these test cases I had explicitly disabled UCX support, and hence only saw the abort.
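For reference, a minimal sketch of the workaround described above. `OMPI_MCA_btl` and the `^` exclusion syntax are standard Open MPI MCA conventions; the script name in the comment is a placeholder, not the actual reproducer from this issue:

```shell
# Exclude the uct BTL (the UCX-based byte-transfer layer) so Open MPI
# falls back to other transports such as tcp/self.
export OMPI_MCA_btl=^uct

# The same exclusion can also be passed per run on the command line
# (placeholder script name):
#   mpirun --mca btl ^uct -np 2 python your_test.py

echo "OMPI_MCA_btl=$OMPI_MCA_btl"
```

Note that `^uct` excludes only that one component; everything else remains eligible for selection.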
The problem appears to be that the dpm cleanup code is assuming all-to-all connectivity during the stage of Open MPI finalization where it was invoked.
The test does not hang for me, nor show this issue with the DPM cleanup code, on the 4.1.x branch (which is effectively 4.1.6).
I tried with 4.1.6 from conda and could observe the same hang.
Did you try this on an Intel(R) Xeon(R) Platinum 8468 machine? As mentioned earlier, the test code in the issue passes for me on other CPUs (e.g., Intel(R) Xeon(R) Platinum 8276L). Could this be a CPU-related issue?
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
I am using v5.0.0
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Installed from the below tarball
If you are building/installing from a git clone, please copy-n-paste the output from
git submodule status
Please describe the system on which you are running