Closed: dfaraj closed this issue 5 years ago.
Can you try the latest 4.0.x nightly snapshot from https://www.open-mpi.org/nightly/v4.0.x/ ?
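For reference, a minimal build sketch (the tarball name below is a placeholder, pick the newest file listed on that page; the install prefix is only an example):

wget https://www.open-mpi.org/nightly/v4.0.x/openmpi-v4.0.x-YYYYMMDDHHMM-xxxxxxx.tar.bz2
tar xjf openmpi-v4.0.x-*.tar.bz2 && cd openmpi-v4.0.x-*/
./configure --prefix=$HOME/ompi-v4.0.x-nightly && make -j install
export PATH=$HOME/ompi-v4.0.x-nightly/bin:$PATH
export LD_LIBRARY_PATH=$HOME/ompi-v4.0.x-nightly/lib:$LD_LIBRARY_PATH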
Jeff, unfortunately it did not work. I downloaded the latest nightly (Jun 29) and built it: mpirun (Open MPI) 4.0.2a1. I get the output below (with or without -x OMPI_MCA_routed=direct):
-bash-4.2$ n=88;cat $PBS_NODEFILE|uniq|head -n$n > myhosts; mpirun -v -x PATH -x LD_LIBRARY_PATH -np $((n)) -map-by ppr:1:node -hostfile myhosts ./osu_mbw_mr.ompi4
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.
HNP daemon : [[10127,0],0] on node r1i0n3
Remote daemon: [[10127,0],24] on node r1i0n27
This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:
my node: r1i0n3
target node: r1i0n24
This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
EDIT: Added proper verbatim quoting
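For reference, the workaround named in the help message can be applied either on the command line or via the environment (a sketch, reusing the benchmark command from above):

mpirun --mca routed direct -np $n -map-by ppr:1:node -hostfile myhosts ./osu_mbw_mr.ompi4
# or, equivalently
export OMPI_MCA_routed=direct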
This seems to be related to #6198
@dfaraj did you build Open MPI with tm support? If yes, you do not need the -host ... option when invoking mpirun from a PBS script.
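For reference, a quick way to check for tm support and, if needed, rebuild with it (a sketch; the PBS install path is an assumption):

ompi_info | grep -i ' tm'            # should list the plm/ras tm components if tm support was built
./configure --with-tm=/opt/pbs ...   # rebuild pointing at your PBS/Torque installation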
Can you run dmesg on r1i0n27 and see if the orted daemon was killed or crashed?
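Something along these lines (run against the failing node; the grep patterns are only suggestions) should show whether orted crashed or was killed, e.g. by the OOM killer:

ssh r1i0n27 'dmesg -T | grep -iE "orted|oom|segfault|killed"'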
Using the nightly build from July 8 I can now run with 120 nodes, but I get a lot of "PSM Endpoint is closed or does not exist" messages at the end...
mpirun -x PATH -x LD_LIBRARY_PATH -np 120 -map-by ppr:1:node -hostfile myhosts ./osu_mbw_mr.ompi4
# OSU MPI Multiple Uni Bandwidth / Message Rate Test
# [ pairs: 60 ] [ window size: 64 ]
# Size MB/s Messages/s
1 201.15 201150585.24
2 402.90 201452499.81
4 804.60 201150585.24
8 1611.02 201376936.23
16 2692.21 168262926.87
32 5447.01 170219058.97
64 11116.06 173688421.87
128 21903.79 171123325.12
256 42120.43 164532918.17
512 77898.52 152145544.68
1024 138210.63 134971317.86
2048 216297.37 105613949.90
4096 233120.24 56914121.91
8192 227438.11 27763440.94
16384 224695.84 13714345.50
32768 223378.66 6816975.58
65536 223170.85 3405316.90
131072 223532.40 1705416.86
262144 224219.72 855330.36
524288 224360.32 427933.35
1048576 224607.08 214202.01
2097152 224046.23 106833.57
4194304 224024.17 53411.52
8388608 223812.17 26680.49
16777216 222752.49 13277.08
All processes entering MPI_Finalize
r1i1n0.369647PSM Endpoint is closed or does not exist
r1i1n11.172666PSM Endpoint is closed or does not exist
I have built the same OMPI using OFI instead of PSM2 directly, and that endpoint error is now gone, so I guess this serves as a workaround. I would like to run this installation on 100+ nodes before I close this issue. Thanx guys so far.
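For anyone following along, a rough sketch of what building/running with OFI instead of PSM2 directly can look like (the paths, provider name, and flags here are assumptions, not the exact commands used):

./configure --with-ofi=/usr --without-psm2 ...
mpirun --mca pml cm --mca mtl ofi --mca mtl_ofi_provider_include psm2 -np 120 -map-by ppr:1:node -hostfile myhosts ./osu_mbw_mr.ompi4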
Using the Intel 2018 compiler and the Open MPI Jul 09, 2019 nightly tarball, I get the following errors:
undefined reference to `mpi_type_extent_'
undefined reference to `mpi_type_struct_'
From the previous posts it seems that "--enable-mpi1-compatibility" solves the problem, but that option is no longer supported in the recent version. Is there any option to get rid of this error?
I guess you are using the master branch. In that case you only have two options: build v4.0.x with the option you mentioned, or modernize your code. Code modernization is by far the best way.
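Concretely, something like this against a v4.0.x tarball should bring the removed MPI-1 symbols back (a sketch; the prefix is only an example):

./configure --prefix=$HOME/ompi-4.0.x --enable-mpi1-compatibility
make -j install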
More updates: I have 2 Xeon nodes (2 sockets, each socket has 20 cores) and each node has one HFI. I built the OMPI nightly from July 8 with OFI.
When I run with just 20 cores things work; the moment I go beyond one socket, I get errors:
hpeopa1:~ dfaraj$ c=20; mpirun -mca orte_base_help_aggregate 0 -np $((2*c)) -npernode $c ./osu_mbw_mr.ompi
# OSU MPI Multiple Uni Bandwidth / Message Rate Test
# [ pairs: 20 ] [ window size: 64 ]
# Size MB/s Messages/s
1 52.87 52872849.32
2 105.39 52696398.90
4 211.47 52867642.74
8 421.24 52655052.18
16 761.92 47620268.94
32 1530.09 47815364.45
64 3082.14 48158495.87
128 5612.96 43851254.76
256 7964.71 31112129.81
512 8999.70 17577543.53
1024 9613.64 9388317.08
2048 9975.88 4871035.43
4096 10112.03 2468757.93
8192 10243.00 1250366.03
16384 11394.42 695460.17
32768 11380.27 347298.19
65536 11315.76 172664.78
131072 11382.10 86838.52
262144 11360.35 43336.28
^C^Chpeopa1:~ dfaraj$ ^C
hpeopa1:~ dfaraj$ c=21; mpirun -mca orte_base_help_aggregate 0 -np $((2*c)) -npernode $c ./osu_mbw_mr.ompi
hpeopa2.21415hfi_userinit: assign_context command failed: Device or resource busy
hpeopa2.21415hfp_gen1_context_open: hfi_userinit: failed, trying again (1/3)
hpeopa2.21415hfi_userinit: assign_context command failed: Device or resource busy
hpeopa2.21415hfp_gen1_context_open: hfi_userinit: failed, trying again (2/3)
hpeopa2.21415hfi_userinit: assign_context command failed: Device or resource busy
hpeopa2.21415hfp_gen1_context_open: hfi_userinit: failed, trying again (3/3)
hpeopa2.21415hfi_userinit: assign_context command failed: Device or resource busy
hpeopa2.21415PSM2 can't open hfi unit: -1 (err=23)
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_endpoint). This is highly
unusual; your job may behave unpredictably (and/or abort) after this.
Local host: hpeopa2
Location: mtl_ofi_component.c:566
Error: Invalid argument (22)
--------------------------------------------------------------------------
hpeopa2.21429hfi_userinit: assign_context command failed: Device or resource busy
hpeopa2.21429hfp_gen1_context_open: hfi_userinit: failed, trying again (1/3)
hpeopa2.21429hfi_userinit: assign_context command failed: Device or resource busy
hpeopa2.21429hfp_gen1_context_open: hfi_userinit: failed, trying again (2/3)
hpeopa2.21429hfi_userinit: assign_context command failed: Device or resource busy
hpeopa2.21429hfp_gen1_context_open: hfi_userinit: failed, trying again (3/3)
hpeopa2.21429hfi_userinit: assign_context command failed: Device or resource busy
hpeopa2.21429PSM2 can't open hfi unit: -1 (err=23)
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_endpoint). This is highly
unusual; your job may behave unpredictably (and/or abort) after this.
Local host: hpeopa2
Location: mtl_ofi_component.c:566
Error: Invalid argument (22)
--------------------------------------------------------------------------
hpeopa1.12797hfi_userinit: assign_context command failed: Device or resource busy
hpeopa1.12797hfp_gen1_context_open: hfi_userinit: failed, trying again (1/3)
hpeopa1.12797hfi_userinit: assign_context command failed: Device or resource busy
hpeopa1.12797hfp_gen1_context_open: hfi_userinit: failed, trying again (2/3)
hpeopa1.12797hfi_userinit: assign_context command failed: Device or resource busy
hpeopa1.12797hfp_gen1_context_open: hfi_userinit: failed, trying again (3/3)
hpeopa1.12797hfi_userinit: assign_context command failed: Device or resource busy
hpeopa1.12797PSM2 can't open hfi unit: -1 (err=23)
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_endpoint). This is highly
unusual; your job may behave unpredictably (and/or abort) after this.
Local host: hpeopa1
Location: mtl_ofi_component.c:566
Error: Invalid argument (22)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
No components were able to be opened in the pml framework.
This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.
Host: hpeopa2
Framework: pml
--------------------------------------------------------------------------
--------------------------------------------------------------------------
No components were able to be opened in the pml framework.
This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.
Host: hpeopa2
Framework: pml
--------------------------------------------------------------------------
hpeopa1.12772hfi_userinit: assign_context command failed: Device or resource busy
hpeopa1.12772hfp_gen1_context_open: hfi_userinit: failed, trying again (1/3)
hpeopa1.12772hfi_userinit: assign_context command failed: Device or resource busy
hpeopa1.12772hfp_gen1_context_open: hfi_userinit: failed, trying again (2/3)
hpeopa1.12772hfi_userinit: assign_context command failed: Device or resource busy
hpeopa1.12772hfp_gen1_context_open: hfi_userinit: failed, trying again (3/3)
hpeopa1.12772hfi_userinit: assign_context command failed: Device or resource busy
hpeopa1.12772PSM2 can't open hfi unit: -1 (err=23)
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_endpoint). This is highly
unusual; your job may behave unpredictably (and/or abort) after this.
Local host: hpeopa1
Location: mtl_ofi_component.c:566
Error: Invalid argument (22)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
No components were able to be opened in the pml framework.
This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.
Host: hpeopa1
Framework: pml
--------------------------------------------------------------------------
[hpeopa2:21415] PML cm cannot be selected
[hpeopa2:21429] PML cm cannot be selected
--------------------------------------------------------------------------
No components were able to be opened in the pml framework.
This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.
Host: hpeopa1
Framework: pml
--------------------------------------------------------------------------
[hpeopa1:12797] PML cm cannot be selected
From the above discussion, Open MPI is not working for more than 87 nodes.
Is there any way to run Open MPI 4.0.1 with -mca pml ucx on more than 100 nodes?
Can we use a recent nightly master tarball with the --enable-mpi1-compatibility option?
no
any updates?
I'm lost in the conversation here. Is the problem being discussed the TCP/SSH 86 process issue, or some PSM issue?
If this is no longer the TCP/SSH 86 process issue, this issue should be closed and a new issue should be opened to discuss the new question (please don't mix multiple topics on a single github issue -- thanks).
Is there any update on Open MPI with which we can run our application on more than 87 nodes and with the --enable-mpi1-compatibility option?
A lot of fixes have gone into the v4.0.x branch in the run-time area. Can you try the latest v4.0.x nightly snapshot?
For the MPI-1 compatibility, you should talk to your upstream application providers and (strongly) encourage them to upgrade their source code -- those APIs were deprecated in 1996, and were finally removed in 2012. It's time for them to stop being used.
FWIW: We're likely (but not guaranteed) to continue the MPI-1 compatibility in Open MPI v5.0 -- beyond that, I can't promise anything.
Thanks, Jeff for the information!
I posted a fix to the plm/rsh component that resolves a mismatch between the tree spawn and the remote routed component (see issue #6618 for details). PR #6944 fixes the issue for the v4.0.x branch. Can you give that a try to see if it resolves this issue? I think it might help with the launch issue that was originally reported (probably not the PSM issue).
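If rebuilding with the PR is not convenient right away, one related diagnostic (a sketch, not a substitute for the fix) is to disable the rsh tree spawn and see whether the large-node launch then succeeds:

mpirun --mca plm_rsh_no_tree_spawn 1 -np 88 -map-by ppr:1:node -hostfile myhosts ./osu_mbw_mr.ompi4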
I just tested the latest nightly build and I no longer see this problem on OPA or EDR fabrics. Thanx guys for the fixes.
Thanks for verifying @dfaraj!
Thank you for taking the time to submit an issue!
Background information
We have an OPA cluster of 288 nodes. All nodes run the same OS image, have passwordless ssh set up, and the firewall is disabled. We run basic OSU osu_mbw_mr tests on 2, 4, ... 86 nodes and the tests complete successfully. Once we hit 88+ nodes we get the ORTE/TCP error quoted earlier in this thread.
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
4.0.1
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Downloaded 4.0.1 from the Open MPI site.
Please describe the system on which you are running
Details of the problem
Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.
When we run:
n=86
it works fine. With:
n=88
we get the tcp error described earlier. If I do:
n=88
it works. If I set:
n=160
it hangs. I don't think it is actually hanging, though; it is likely doing ssh to every node and going very slowly.
EDIT: Put in proper verbatim markup
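For completeness, a sketch of the node-count sweep that reproduces this, reusing the mpirun invocation from the first comment (the exact flags are the same assumptions as above):

for n in 2 4 86 88 160; do
  cat $PBS_NODEFILE | uniq | head -n$n > myhosts
  mpirun -x PATH -x LD_LIBRARY_PATH -np $n -map-by ppr:1:node -hostfile myhosts ./osu_mbw_mr.ompi4
done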