open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

OMPI 4.0.1 TCP connection errors beyond 86 nodes #6786

Closed (dfaraj closed this issue 5 years ago)

dfaraj commented 5 years ago

Thank you for taking the time to submit an issue!

Background information

We have an OPA cluster of 288 nodes. All nodes run the same OS image, have passwordless ssh set up, and the firewall is disabled. We run basic OSU osu_mbw_mr tests on 2, 4, ... 86 nodes and the tests complete successfully. Once we hit 88+ nodes we get:

ORTE has lost communication with a remote daemon.

  HNP daemon   : [[63011,0],0] on node r1i2n13
  Remote daemon: [[63011,0],40] on node r1i3n17

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:

  my node:   r1i2n13
  target node:  r1i2n14

This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
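
For reference, the workaround mentioned in the help text can be applied either on the command line or through the environment, e.g. (illustrative only; whether it actually helps depends on the underlying launch failure):

mpirun --mca routed direct -x PATH -x LD_LIBRARY_PATH -np $((n)) -map-by ppr:1:node -hostfile myhosts ./osu_mbw_mr.ompi
# or equivalently, exported before the run:
export OMPI_MCA_routed=direct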

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

4.0.1

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Downloaded 4.0.1 from the Open MPI site:

./configure --prefix=/store/dfaraj/SW/packages/ompi/4.0.1 CC=icc CXX=icpc FC=ifort --enable-orterun-prefix-by-default --enable-mpirun-prefix-by-default --with-psm2=/usr --without-verbs --without-psm --without-knem --without-slurm --without-ucx

Please describe the system on which you are running


Details of the problem

Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.

When we run with n=86:

mpirun -x PATH -x LD_LIBRARY_PATH -np $((n)) -map-by ppr:1:node -hostfile myhosts ./osu_mbw_mr.ompi

it works fine. With n=88:

mpirun -x PATH -x LD_LIBRARY_PATH -np $((n)) -map-by ppr:1:node -hostfile myhosts ./osu_mbw_mr.ompi

we get the TCP error described earlier. If I instead run n=88 with tree spawn disabled:

mpirun -x PATH -x LD_LIBRARY_PATH --mca plm_rsh_no_tree_spawn 1 -np $((n)) -map-by ppr:1:node -hostfile myhosts ./osu_mbw_mr.ompi

it works. If I set n=160:

mpirun -x PATH -x LD_LIBRARY_PATH --mca plm_rsh_no_tree_spawn 1 -np $((n)) -map-by ppr:1:node -hostfile myhosts ./osu_mbw_mr.ompi

it hangs. I don't think it is actually hung, though; it is likely just ssh-ing to every node and going very slowly.
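
As a data point on the slow launch: with tree spawn disabled, mpirun starts every orted itself over ssh, so raising the launcher's ssh concurrency and turning on launcher verbosity can help confirm that it is merely slow rather than hung. A sketch (plm_rsh_num_concurrent and plm_base_verbose are standard MCA parameters, but the values here are only examples):

mpirun --mca plm_rsh_no_tree_spawn 1 --mca plm_rsh_num_concurrent 128 --mca plm_base_verbose 5 \
    -x PATH -x LD_LIBRARY_PATH -np $((n)) -map-by ppr:1:node -hostfile myhosts ./osu_mbw_mr.ompi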

EDIT: Put in proper verbatim markup

jsquyres commented 5 years ago

Can you try the latest 4.0.x nightly snapshot from https://www.open-mpi.org/nightly/v4.0.x/ ?
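
(For anyone reproducing this, a typical way to test a nightly snapshot is sketched below; the tarball name is a placeholder for whatever the page above currently lists, and the prefix/compilers simply mirror the original build.)

tar xjf openmpi-v4.0.x-YYYYMMDD.tar.bz2      # placeholder filename
cd openmpi-v4.0.x-YYYYMMDD
./configure --prefix=$HOME/ompi-4.0.x-nightly CC=icc CXX=icpc FC=ifort \
    --with-psm2=/usr --without-verbs --without-ucx --enable-mpirun-prefix-by-default
make -j 16 && make install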

dfaraj commented 5 years ago

Jeff, unfortunately it did not work. I downloaded the latest nightly (Jun 29) and built it: mpirun (Open MPI) 4.0.2a1. I get the output below (with and without -x OMPI_MCA_routed=direct):

-bash-4.2$ n=88;cat $PBS_NODEFILE|uniq|head -n$n > myhosts; mpirun -v -x PATH -x LD_LIBRARY_PATH -np $((n)) -map-by ppr:1:node -hostfile myhosts ./osu_mbw_mr.ompi4
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.

  HNP daemon   : [[10127,0],0] on node r1i0n3
  Remote daemon: [[10127,0],24] on node r1i0n27

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:

  my node:   r1i0n3
  target node:  r1i0n24

This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.

EDIT: Added proper verbatim quoting

jsquyres commented 5 years ago

This seems to be related to #6198

ggouaillardet commented 5 years ago

@dfaraj did you build Open MPI with tm support? If yes, you do not need the -hostfile ... option when invoking mpirun from a PBS script.

Also, can you run dmesg on r1i0n27 and see whether the orted daemon was killed or crashed?
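
A quick way to check both things might look like this (the PBS/Torque install path and the build prefix are assumptions; adjust them to your site):

./configure --prefix=$HOME/ompi-tm --with-tm=/opt/pbs CC=icc CXX=icpc FC=ifort \
    --with-psm2=/usr --without-verbs --without-ucx
ompi_info | grep -i ' tm'                                    # tm plm/ras components should be listed if tm support was built
ssh r1i0n27 "dmesg -T | egrep -i 'orted|oom|kill' | tail"    # look for an OOM kill or a crash of orted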

dfaraj commented 5 years ago

Using the nightly build from July 8, I can now run with 120 nodes, but I get many "PSM Endpoint is closed or does not exist" messages at the end...

mpirun  -x PATH -x LD_LIBRARY_PATH  -np 120 -map-by ppr:1:node -hostfile myhosts ./osu_mbw_mr.ompi4
# OSU MPI Multiple Uni Bandwidth / Message Rate Test
# [ pairs: 60 ] [ window size: 64 ]
# Size                  MB/s        Messages/s
1                     201.15      201150585.24
2                     402.90      201452499.81
4                     804.60      201150585.24
8                    1611.02      201376936.23
16                   2692.21      168262926.87
32                   5447.01      170219058.97
64                  11116.06      173688421.87
128                 21903.79      171123325.12
256                 42120.43      164532918.17
512                 77898.52      152145544.68
1024               138210.63      134971317.86
2048               216297.37      105613949.90
4096               233120.24       56914121.91
8192               227438.11       27763440.94
16384              224695.84       13714345.50
32768              223378.66        6816975.58
65536              223170.85        3405316.90
131072             223532.40        1705416.86
262144             224219.72         855330.36
524288             224360.32         427933.35
1048576            224607.08         214202.01
2097152            224046.23         106833.57
4194304            224024.17          53411.52
8388608            223812.17          26680.49
16777216           222752.49          13277.08
All processes entering MPI_Finalize
r1i1n0.369647PSM Endpoint is closed or does not exist
r1i1n11.172666PSM Endpoint is closed or does not exist
dfaraj commented 5 years ago

I have built the same OMPI using OFI instead of PSM2 directly, and that endpoint error is now gone. So I guess this serves as a workaround. I would like to run this installation on 100+ nodes before I close this issue. Thanx guys so far.
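
For completeness, a build along those lines can be configured roughly as follows (the libfabric location and the explicit component selection at run time are assumptions, shown only to make the workaround concrete):

./configure --prefix=$HOME/ompi-ofi CC=icc CXX=icpc FC=ifort \
    --with-libfabric=/usr --without-verbs --without-ucx
mpirun --mca pml cm --mca mtl ofi --mca mtl_ofi_provider_include psm2 \
    -x PATH -x LD_LIBRARY_PATH -np 120 -map-by ppr:1:node -hostfile myhosts ./osu_mbw_mr.ompi4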

nitinpatil1985 commented 5 years ago

Using Intel compiler 2018 and the Open MPI Jul 09, 2019 Nightly Tarball, I am getting the following link errors:

undefined reference to `mpi_typeextent'
undefined reference to `mpi_typestruct'

From the previous posts it seems that "--enable-mpi1-compatibility" solves the problem, but this option is no longer supported in the recent version. Is there any option to get rid of this error?

ggouaillardet commented 5 years ago

I guess you are using the master branch. In this case you only have two options

dfaraj commented 5 years ago

More updates:

I have 2 Xeon nodes (2 sockets, each socket has 20 cores), and each node has one HFI. I have built the OMPI nightly build from July 8 with OFI.

When I run with just 20 ranks per node, things work; the moment I go beyond one socket, I get errors (see the diagnostic sketch after the output below):

hpeopa1:~ dfaraj$ c=20; mpirun -mca orte_base_help_aggregate 0 -np $((2*c)) -npernode $c ./osu_mbw_mr.ompi
# OSU MPI Multiple Uni Bandwidth / Message Rate Test
# [ pairs: 20 ] [ window size: 64 ]
# Size                  MB/s        Messages/s
1                      52.87       52872849.32
2                     105.39       52696398.90
4                     211.47       52867642.74
8                     421.24       52655052.18
16                    761.92       47620268.94
32                   1530.09       47815364.45
64                   3082.14       48158495.87
128                  5612.96       43851254.76
256                  7964.71       31112129.81
512                  8999.70       17577543.53
1024                 9613.64        9388317.08
2048                 9975.88        4871035.43
4096                10112.03        2468757.93
8192                10243.00        1250366.03
16384               11394.42         695460.17
32768               11380.27         347298.19
65536               11315.76         172664.78
131072              11382.10          86838.52
262144              11360.35          43336.28
^C^Chpeopa1:~ dfaraj$ ^C
hpeopa1:~ dfaraj$ ^C
hpeopa1:~ dfaraj$ ^C
hpeopa1:~ dfaraj$ ^C
hpeopa1:~ dfaraj$ ^C
hpeopa1:~ dfaraj$ c=21; mpirun -mca orte_base_help_aggregate 0 -np $((2*c)) -npernode $c ./osu_mbw_mr.ompi
hpeopa2.21415hfi_userinit: assign_context command failed: Device or resource busy
hpeopa2.21415hfp_gen1_context_open: hfi_userinit: failed, trying again (1/3)
hpeopa2.21415hfi_userinit: assign_context command failed: Device or resource busy
hpeopa2.21415hfp_gen1_context_open: hfi_userinit: failed, trying again (2/3)
hpeopa2.21415hfi_userinit: assign_context command failed: Device or resource busy
hpeopa2.21415hfp_gen1_context_open: hfi_userinit: failed, trying again (3/3)
hpeopa2.21415hfi_userinit: assign_context command failed: Device or resource busy
hpeopa2.21415PSM2 can't open hfi unit: -1 (err=23)
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_endpoint).  This is highly
unusual; your job may behave unpredictably (and/or abort) after this.

  Local host: hpeopa2
  Location: mtl_ofi_component.c:566
  Error: Invalid argument (22)
--------------------------------------------------------------------------
hpeopa2.21429hfi_userinit: assign_context command failed: Device or resource busy
hpeopa2.21429hfp_gen1_context_open: hfi_userinit: failed, trying again (1/3)
hpeopa2.21429hfi_userinit: assign_context command failed: Device or resource busy
hpeopa2.21429hfp_gen1_context_open: hfi_userinit: failed, trying again (2/3)
hpeopa2.21429hfi_userinit: assign_context command failed: Device or resource busy
hpeopa2.21429hfp_gen1_context_open: hfi_userinit: failed, trying again (3/3)
hpeopa2.21429hfi_userinit: assign_context command failed: Device or resource busy
hpeopa2.21429PSM2 can't open hfi unit: -1 (err=23)
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_endpoint).  This is highly
unusual; your job may behave unpredictably (and/or abort) after this.

  Local host: hpeopa2
  Location: mtl_ofi_component.c:566
  Error: Invalid argument (22)
--------------------------------------------------------------------------
hpeopa1.12797hfi_userinit: assign_context command failed: Device or resource busy
hpeopa1.12797hfp_gen1_context_open: hfi_userinit: failed, trying again (1/3)
hpeopa1.12797hfi_userinit: assign_context command failed: Device or resource busy
hpeopa1.12797hfp_gen1_context_open: hfi_userinit: failed, trying again (2/3)
hpeopa1.12797hfi_userinit: assign_context command failed: Device or resource busy
hpeopa1.12797hfp_gen1_context_open: hfi_userinit: failed, trying again (3/3)
hpeopa1.12797hfi_userinit: assign_context command failed: Device or resource busy
hpeopa1.12797PSM2 can't open hfi unit: -1 (err=23)
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_endpoint).  This is highly
unusual; your job may behave unpredictably (and/or abort) after this.

  Local host: hpeopa1
  Location: mtl_ofi_component.c:566
  Error: Invalid argument (22)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
No components were able to be opened in the pml framework.

This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.

  Host:      hpeopa2
  Framework: pml
--------------------------------------------------------------------------
--------------------------------------------------------------------------
No components were able to be opened in the pml framework.

This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.

  Host:      hpeopa2
  Framework: pml
--------------------------------------------------------------------------
hpeopa1.12772hfi_userinit: assign_context command failed: Device or resource busy
hpeopa1.12772hfp_gen1_context_open: hfi_userinit: failed, trying again (1/3)
hpeopa1.12772hfi_userinit: assign_context command failed: Device or resource busy
hpeopa1.12772hfp_gen1_context_open: hfi_userinit: failed, trying again (2/3)
hpeopa1.12772hfi_userinit: assign_context command failed: Device or resource busy
hpeopa1.12772hfp_gen1_context_open: hfi_userinit: failed, trying again (3/3)
hpeopa1.12772hfi_userinit: assign_context command failed: Device or resource busy
hpeopa1.12772PSM2 can't open hfi unit: -1 (err=23)
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_endpoint).  This is highly
unusual; your job may behave unpredictably (and/or abort) after this.

  Local host: hpeopa1
  Location: mtl_ofi_component.c:566
  Error: Invalid argument (22)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
No components were able to be opened in the pml framework.

This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.

  Host:      hpeopa1
  Framework: pml
--------------------------------------------------------------------------
[hpeopa2:21415] PML cm cannot be selected
[hpeopa2:21429] PML cm cannot be selected
--------------------------------------------------------------------------
No components were able to be opened in the pml framework.

This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.

  Host:      hpeopa1
  Framework: pml
--------------------------------------------------------------------------
[hpeopa1:12797] PML cm cannot be selected
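
The repeated "assign_context command failed: Device or resource busy" lines indicate that PSM2 could not obtain a free HFI hardware context once more than 20 ranks per node tried to open the single HFI. Two things that may help narrow it down, assuming the stock hfi1 driver and a recent PSM2 (both the sysfs path and the PSM2 context-sharing variable are stated here as assumptions, not verified on this system):

cat /sys/module/hfi1/parameters/num_user_contexts    # how many user contexts the hfi1 driver exposes
mpirun -x PSM2_SHAREDCONTEXTS=1 -mca orte_base_help_aggregate 0 \
    -np $((2*c)) -npernode $c ./osu_mbw_mr.ompi       # ask PSM2 to share contexts between ranks
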
nitinpatil1985 commented 5 years ago

From the above discussion, Open MPI is not working for 87+ nodes.

Is there any way to run Open MPI 4.0.1 with -mca pml ucx on more than 100 nodes?

Can we use the recent Nightly Master Tarball with the --enable-mpi1-compatibility option?

ggouaillardet commented 5 years ago

no

dfaraj commented 5 years ago

Any updates?

jsquyres commented 5 years ago

I'm lost in the conversation here. Is the problem being discussed the TCP/SSH 86 process issue, or some PSM issue?

If this is no longer the TCP/SSH 86 process issue, this issue should be closed and a new issue should be opened to discuss the new question (please don't mix multiple topics on a single github issue -- thanks).

nitinpatil1985 commented 5 years ago

Is there any update to Open MPI with which we can run our application on 87+ nodes and that still supports the --enable-mpi1-compatibility option?

jsquyres commented 5 years ago

A lot of fixes have gone into the v4.0.x branch in the run-time area. Can you try the latest v4.0.x nightly snapshot?

For the MPI-1 compatibility, you should talk to your upstream application providers and (strongly) encourage them to upgrade their source code: those APIs were deprecated in 1996 and finally removed in 2012. It's time to stop using them.
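
(If it helps to locate the offending calls, the removed MPI-1 routines can be grepped for in the application source; this list is illustrative and covers the commonly hit ones, not necessarily all of them:)

grep -rniE 'mpi_(address|type_extent|type_struct|type_hindexed|type_hvector|type_lb|type_ub|errhandler_(create|get|set))' path/to/app/src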

FWIW: We're likely (but not guaranteed) to continue the MPI-1 compatibility in Open MPI v5.0 -- beyond that, I can't promise anything.

nitinpatil1985 commented 5 years ago

Thanks, Jeff for the information!

jjhursey commented 5 years ago

I posted a fix to the plm/rsh component that resolves a mismatch between the tree spawn and the remote routed component (see Issue #6618 for details). PR #6944 fixes the issue for the v4.0.x branch. Can you give that a try to see if it resolves this issue? I think it might help with the launch issue that was originally reported (probably not the PSM issue).
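
(If you want to try the PR before it lands in a nightly, one way to build it is sketched below; the prefix is a placeholder and autogen.pl requires reasonably recent autotools:)

git clone https://github.com/open-mpi/ompi.git && cd ompi
git fetch origin pull/6944/head:pr-6944 && git checkout pr-6944
./autogen.pl && ./configure --prefix=$HOME/ompi-pr6944 && make -j 16 install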

dfaraj commented 5 years ago

I just tested the latest nightly build and I no longer see this problem on OPA or EDR fabrics. Thanx guys for the fixes.

gpaulsen commented 5 years ago

Thanks for verifying @dfaraj!