open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Error on cluster over WAN: ORTE does not know how to route a message to the specified daemon #7630

Open langfield opened 4 years ago

langfield commented 4 years ago

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

v4.0.3

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

From tarball.

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

Please describe the system on which you are running


Details of the problem

Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.

I'm trying to test a 2-node cluster over a WAN. Passwordless SSH with RSA keys is set up between the two nodes, in both directions. The hostfile is as follows:

localhost
cc-1

SSH access is on non-standard ports, so hostnames are aliased in .ssh/config a bit like:

Host cc-1
    HostName <hostname>
    Port <not_22>
    User root

The error message below seems a bit cryptic. I'm not sure why manual SSH works between the nodes but OMPI is unable to connect. Any help appreciated.

# mpirun --allow-run-as-root -np 2 -hostfile .mpi_hostfile --mca routed direct a.out
Warning: Permanently added the ECDSA host key for IP address '[3.17.117.250]:19972' to the list of known hosts.
orted: Error: unknown option "--tree-spawn"
Type 'orted --help' for usage.
Usage: orted [OPTION]...
   -am <arg0>            Aggregate MCA parameter set file list
-d|--debug               Debug the OpenRTE
   --daemonize           Daemonize the orted into the background
   --debug-daemons       Enable debugging of OpenRTE daemons
   --debug-daemons-file  Enable debugging of OpenRTE daemons, storing output
                         in files
   -gmca|--gmca <arg0> <arg1>
                         Pass global MCA parameters that are applicable to
                         all contexts (arg0 is the parameter name; arg1 is
                         the parameter value)
-h|--help                This help message
   --hetero-nodes        Nodes in cluster may differ in topology, so send
                         the topology back from each node [Default = false]
   --hnp                 Direct the orted to act as the HNP
   --hnp-topo-sig <arg0>
                         Topology signature of HNP
   --hnp-uri <arg0>      URI for the HNP
   -mapreduce|--mapreduce
                         Whether to report process bindings to stderr
   -mca|--mca <arg0> <arg1>
                         Pass context-specific MCA parameters; they are
                         considered global if --gmca is not used and only
                         one context is specified (arg0 is the parameter
                         name; arg1 is the parameter value)
   -nodes|--nodes <arg0>
                         Regular expression defining nodes in system
   -output-filename|--output-filename <arg0>
                         Redirect output from application processes into
                         filename.rank
   --parent-uri <arg0>   URI for the parent if tree launch is enabled.
   -report-bindings|--report-bindings
                         Whether to report process bindings to stderr
   --report-uri <arg0>   Report this process' uri on indicated pipe
-s|--spin                Have the orted spin until we can connect a debugger
                         to it
   --set-sid             Direct the orted to separate from the current
                         session
   --singleton-died-pipe <arg0>
                         Watch on indicated pipe for singleton termination
   --test-suicide <arg0>
                         Suicide instead of clean abort after delay
   --tmpdir <arg0>       Set the root for the session directory tree
   -tune <arg0>          Application profile options file list
   -xterm|--xterm <arg0>
                         Create a new xterm window and display output from
                         the specified ranks there

--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:

  my node:   cc-0
  target node:  cc-1

This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
--------------------------------------------------------------------------
[cc-0:189803] 1 more process has sent help message help-errmgr-base.txt / no-path
[cc-0:189803] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

rhc54 commented 4 years ago

Looks like you have a mismatched version on one of your nodes:

orted: Error: unknown option "--tree-spawn"

Only an older version of Open MPI would fail to recognize that command-line option.
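
One way to check for such a mismatch is to ask each host, over the same kind of non-interactive SSH session the launcher uses, which orted it finds and which Open MPI version it reports. The following is only a rough sketch: the function names list_hosts and remote_version are made up for illustration, and it assumes the hostfile has one host per line (optionally followed by fields such as slots=N) and that every host is reachable with plain `ssh <host>`, localhost included.

```shell
#!/bin/sh
# Sketch: show which orted and which Open MPI version each host in a
# hostfile sees from a non-interactive SSH session.
# list_hosts and remote_version are hypothetical helper names.

list_hosts() {
    # Drop comments and blank lines; keep only the first field of each
    # line so that entries like "cc-1 slots=2" yield "cc-1".
    sed -e 's/#.*//' "$1" | awk 'NF { print $1 }'
}

remote_version() {
    # Print the path of the orted found on the remote PATH, then the
    # "Open MPI:" version line from ompi_info. A missing binary shows
    # up as a shell error, which is itself a useful hint.
    ssh -n "$1" 'command -v orted; ompi_info | grep "Open MPI:"'
}

# Only contact the hosts when a hostfile is given as the first argument.
if [ -n "${1:-}" ]; then
    for host in $(list_hosts "$1"); do
        printf '== %s ==\n' "$host"
        remote_version "$host"
    done
fi
```

Saved under some name of your choosing and run with the hostfile as its argument (for example `.mpi_hostfile`), this prints one block per host. If the orted paths or version lines differ between hosts, the remote shell is probably picking up a different installation than the one on the head node.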

langfield commented 4 years ago

Thanks for replying. I've reinstalled on all nodes, and the unknown-option error is gone, but the rest of the output is unchanged: ORTE still can't route a message from the head node to the target node.