satishpasumarthi opened this issue 2 years ago
- Does the ORTEd process stay alive until the mpirun finishes the job ?
Yes, it does
- What happens if the ORTEd process on the non-leader machine dies in the middle ? Can the mpi process on leader still continue running ?
I'm afraid not - the death of any ORTEd process will cause automatic termination of all application processes.
- How are ssh and ORTEd linked? Does mpi rely on ssh daemon running on the machines?
It depends. If you are on a managed machine (e.g., Slurm), then we use the resource manager's native launcher (e.g., srun) to start the ORTEd processes. If not, then we rely on ssh for that purpose.
@satishpasumarthi Did this answer your question?
Hi @rhc54 @jsquyres , Thanks for your reply. I have some follow up questions on the ssh/ORTE process.
- On the non-leader nodes, what is the recommended way to know if mpirun has finished? Can we rely on the ORTEd process?
- Does the ORTE process have anything to do with ssh timeouts or max connections (default 10)? If mpirun is running on > 10 nodes, will there be any issues because of ssh's default setting of max connections?
- We are seeing `ORTE has lost communication with a remote daemon` issues while using mpirun to run on a large cluster of nodes. We tried various debug flags but found nothing informative/conclusive.
- When encountering the above issue, if we add the debug flag `-mca orte_debug 1` to mpirun, the issue doesn't occur. Is there any correlation with this?
@jsquyres : I've read some blogs of yours on openmpi and they were very informative. You should write more and more of them :-)
- On the non-leader nodes, what is the recommended way to know if the mpirun has finished ? Can we rely on the ORTEd process ?
I'm lost - your processes will terminate and then mpirun terminates, having seen all the application procs terminate. Am I missing something?
* Does the ORTE process have anything to do with ssh timeouts or max connections (default 10)? If mpirun is running on > 10 nodes, will there be any issues because of ssh's default setting of max connections?
You control the ssh fanout with the routed framework by setting `OMPI_MCA_routed_radix=N`, where N is the desired fanout. I believe it defaults to 64, so you may need to dial it down. Each daemon that is launched "daemonizes" itself into the background, thus closing its ssh connection - and freeing it to be used by the next daemon to be launched.
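For illustration, a sketch of how the fanout could be set (the hostfile name, process count, and application are placeholders):

```sh
# Limit the launch fanout to 8 ssh connections per daemon (value is illustrative)
export OMPI_MCA_routed_radix=8
mpirun --hostfile myhosts -np 128 ./my_mpi_app

# Equivalent: pass the MCA parameter on the command line instead
mpirun -mca routed_radix 8 --hostfile myhosts -np 128 ./my_mpi_app
```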
* We are seeing `ORTE has lost communication with a remote daemon` issues while using mpirun to run on a large cluster of nodes. Tried various debug flags but nothing informative/conclusive.
Usually that means you had a TCP communication failure between the nodes. You can add `OMPI_MCA_oob_base_verbose=100` to get detailed output from the inter-daemon messaging system. Be prepared to be swamped!
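For example (hostfile and application names are placeholders), the verbosity can be set for a single run and captured to a file:

```sh
# Enable maximum OOB verbosity for this run only and capture the (very large) output
OMPI_MCA_oob_base_verbose=100 mpirun --hostfile myhosts -np 128 ./my_mpi_app 2>&1 | tee oob-debug.log
```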
* When encountering above issue, if we issue the debug flag for mpirun `-mca orte_debug 1` the issue doesn't occur. Is there any correlation with this?
When you set the debug flag, you also turn off the ssh fanout since we don't allow the daemons to "daemonize", which means that ssh remains connected. So you are limited in terms of how many daemons you can launch with debug turned on, constrained by the number of concurrent ssh sessions your OS will allow a process to have.
- On the non-leader nodes, what is the recommended way to know if the mpirun has finished ? Can we rely on the ORTEd process ?
As Ralph mentioned, on the non-mpirun nodes, when all of your individual MPI processes terminate, the local `orted` will terminate. `mpirun` should not terminate until all `orted`s have terminated.
I'm not entirely clear on your question, either: by definition, `mpirun` should be the last thing to terminate. Are you asking about some process monitoring the MPI job from outside of the MPI job?
- Does the ORTE process have anything to do with ssh timeouts or max connections (default 10)?
Yes.
If mpirun is running on > 10 nodes, will there be any issues because of ssh's default setting of max connections?
In the Open MPI v4.1.x series, ORTE should automatically use a tree-based ssh approach. Check out https://blogs.cisco.com/performance/tree-based-launch-in-open-mpi and https://blogs.cisco.com/performance/tree-based-launch-in-open-mpi-part-2 (which, per below, you may have seen already...!). The bottom line is that a max number of SSH connections from any one host shouldn't be a problem.
That being said, know that SSH is only used to launch ORTE daemons (`orted`) across nodes. Once the peer `orted` is launched, the SSH session closes, and the `orted` continues to run in the background. Separate TCP sockets are opened between `mpirun` and the `orted`s for communication, command, and control (note: it's not a fully-connected network -- skipping all those details here).
- We are seeing `ORTE has lost communication with a remote daemon` issues while using mpirun to run on a large cluster of nodes. Tried various debug flags but nothing informative/conclusive.
This usually means that the `orted` process was either killed (and therefore closed a TCP socket), or the TCP socket was otherwise closed (e.g., via a late firewall rule or somesuch).
Does this happen immediately when you launch? Or at some point randomly in the middle of the run? Or near the end of the run (e.g., during the shutdown/finalization sequence)?
- When encountering the above issue, if we issue the debug flag for mpirun `-mca orte_debug 1` the issue doesn't occur. Is there any correlation with this?
Ralph answered this one.
@jsquyres : I've read some blogs of yours on openmpi and they were very informative. You should write more and more of them :-)
Thank you! I wish I had the time. 😄
Thank you @jsquyres and @rhc54 for your replies.
I'm not entirely clear on your question, either: by definition, `mpirun` should be the last thing to terminate. Are you asking about some process monitoring the MPI job from outside of the MPI job?
Yes, I am talking about a monitoring process which performs a task on the non-leader nodes once mpirun has finished. To achieve this, is there a way to know that mpirun has completed (assuming the monitoring process runs on the non-leader node)?
That being said, know that SSH is only used to launch ORTE daemons (`orted`) across nodes. Once the peer `orted` is launched, the SSH session closes, and the `orted` continues to run in the background. Separate TCP sockets are opened between `mpirun` and the `orted`s for communication, command, and control (note: it's not a fully-connected network -- skipping all those details here).
But we are using the `-mca plm_rsh_no_tree_spawn 1` option; that means the radix tree spawning is disabled, right?
Does this happen immediately when you launch? Or at some point randomly in the middle of the run? Or near the end of the run (e.g., during the shutdown/finalization sequence)?
It happens randomly at some point in the middle of the run.
Yes, I am talking about a monitoring process which performs a task on the non-leader nodes once mpirun has finished. To achieve this, is there a way to know that mpirun has completed (assuming the monitoring process runs on the non-leader node)?
You can check for `mpirun` to exit - it doesn't hang around once things are done.
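For illustration, one way a wrapper on the leader node could notify the other nodes once `mpirun` has exited (a sketch; the hostfile, marker file, and application are all hypothetical):

```sh
#!/bin/sh
# Launch the job; this blocks until mpirun (and therefore the whole job) has finished.
mpirun --hostfile myhosts -np 128 ./my_mpi_app
status=$?

# Notify the monitoring process on each non-leader node, e.g. by dropping a
# marker file (the hostfile format and marker path are purely illustrative).
for host in $(awk '{print $1}' myhosts); do
    ssh "$host" "touch /tmp/mpirun-finished"
done

exit $status
```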
But we are using -mca plm_rsh_no_tree_spawn 1 option that means the radix tree spawning is disabled, right?
Yes, that is correct. Note, however, that this limits the size of the job to the number of simultaneous ssh connections the system allows you to have.
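For reference, a sketch of how that flag is passed (hostfile and application names are placeholders):

```sh
# Disable tree spawn: mpirun itself opens one ssh session per remote node
mpirun -mca plm_rsh_no_tree_spawn 1 --hostfile myhosts -np 128 ./my_mpi_app
```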
It happens randomly at some point in the middle of the run.
Sounds like you have a flaky TCP connection. There are OMPI MCA params you can use that sometimes help with that situation. You can find them with `ompi_info` (the following is from OMPI v4.0.x):
$ ompi_info --param oob all --level 9
MCA oob: tcp (MCA v2.1.0, API v2.0.0, Component v4.0.5)
MCA oob base: ---------------------------------------------------
MCA oob base: parameter "oob" (current value: "", data source:
default, level: 2 user/detail, type: string)
Default selection set of components for the oob
framework (<none> means use all components that can
be found)
MCA oob base: ---------------------------------------------------
MCA oob base: parameter "oob_base_verbose" (current value:
"error", data source: default, level: 8 dev/detail,
type: int)
Verbosity level for the oob framework (default: 0)
Valid values: -1:"none", 0:"error", 10:"component",
20:"warn", 40:"info", 60:"trace", 80:"debug",
100:"max", 0 - 100
MCA oob tcp: ---------------------------------------------------
MCA oob tcp: parameter "oob_tcp_peer_limit" (current value:
"-1", data source: default, level: 5 tuner/detail,
type: int)
Maximum number of peer connections to
simultaneously maintain (-1 = infinite)
MCA oob tcp: parameter "oob_tcp_peer_retries" (current value:
"2", data source: default, level: 5 tuner/detail,
type: int)
Number of times to try shutting down a connection
before giving up
MCA oob tcp: parameter "oob_tcp_sndbuf" (current value: "0",
data source: default, level: 4 tuner/basic, type:
int)
TCP socket send buffering size (in bytes, 0 =>
leave system default)
MCA oob tcp: parameter "oob_tcp_rcvbuf" (current value: "0",
data source: default, level: 4 tuner/basic, type:
int)
TCP socket receive buffering size (in bytes, 0 =>
leave system default)
MCA oob tcp: parameter "oob_tcp_if_include" (current value: "",
data source: default, level: 2 user/detail, type:
string, synonyms: oob_tcp_include)
Comma-delimited list of devices and/or CIDR
notation of TCP networks to use for Open MPI
bootstrap communication (e.g.,
"eth0,192.168.0.0/16"). Mutually exclusive with
oob_tcp_if_exclude.
MCA oob tcp: parameter "oob_tcp_if_exclude" (current value: "",
data source: default, level: 2 user/detail, type:
string, synonyms: oob_tcp_exclude)
Comma-delimited list of devices and/or CIDR
notation of TCP networks to NOT use for Open MPI
bootstrap communication -- all devices not matching
these specifications will be used (e.g.,
"eth0,192.168.0.0/16"). If set to a non-default
value, it is mutually exclusive with
oob_tcp_if_include.
MCA oob tcp: parameter "oob_tcp_static_ipv4_ports" (current
value: "", data source: default, level: 2
user/detail, type: string)
Static ports for daemons and procs (IPv4)
MCA oob tcp: parameter "oob_tcp_dynamic_ipv4_ports" (current
value: "", data source: default, level: 4
tuner/basic, type: string)
Range of ports to be dynamically used by daemons
and procs (IPv4)
MCA oob tcp: parameter "oob_tcp_disable_ipv4_family" (current
value: "false", data source: default, level: 4
tuner/basic, type: bool)
Disable the IPv4 interfaces
Valid values: 0: f|false|disabled|no|n, 1:
t|true|enabled|yes|y
MCA oob tcp: parameter "oob_tcp_keepalive_time" (current value:
"300", data source: default, level: 5 tuner/detail,
type: int)
Idle time in seconds before starting to send
keepalives (keepalive_time <= 0 disables keepalive
functionality)
MCA oob tcp: parameter "oob_tcp_keepalive_intvl" (current value:
"20", data source: default, level: 5 tuner/detail,
type: int)
Time between successive keepalive pings when peer
has not responded, in seconds (ignored if
keepalive_time <= 0)
MCA oob tcp: parameter "oob_tcp_keepalive_probes" (current
value: "9", data source: default, level: 5
tuner/detail, type: int)
Number of keepalives that can be missed before
declaring error (ignored if keepalive_time <= 0)
MCA oob tcp: parameter "oob_tcp_retry_delay" (current value:
"0", data source: default, level: 4 tuner/basic,
type: int)
Time (in sec) to wait before trying to connect to
peer again
MCA oob tcp: parameter "oob_tcp_max_recon_attempts" (current
value: "10", data source: default, level: 4
tuner/basic, type: int)
Max number of times to attempt connection before
giving up (-1 -> never give up)
The ones that usually help the most are:
MCA oob tcp: parameter "oob_tcp_keepalive_time" (current value:
"300", data source: default, level: 5 tuner/detail,
type: int)
Idle time in seconds before starting to send
keepalives (keepalive_time <= 0 disables keepalive
functionality)
MCA oob tcp: parameter "oob_tcp_keepalive_intvl" (current value:
"20", data source: default, level: 5 tuner/detail,
type: int)
Time between successive keepalive pings when peer
has not responded, in seconds (ignored if
keepalive_time <= 0)
MCA oob tcp: parameter "oob_tcp_keepalive_probes" (current
value: "9", data source: default, level: 5
tuner/detail, type: int)
Number of keepalives that can be missed before
declaring error (ignored if keepalive_time <= 0)
MCA oob tcp: parameter "oob_tcp_retry_delay" (current value:
"0", data source: default, level: 4 tuner/basic,
type: int)
Time (in sec) to wait before trying to connect to
peer again
MCA oob tcp: parameter "oob_tcp_max_recon_attempts" (current
value: "10", data source: default, level: 4
tuner/basic, type: int)
Max number of times to attempt connection before
giving up (-1 -> never give up)
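For illustration, a sketch of passing those parameters on the `mpirun` command line; the values shown are placeholders, not recommendations:

```sh
mpirun --hostfile myhosts -np 128 \
    -mca oob_tcp_keepalive_time 60 \
    -mca oob_tcp_keepalive_intvl 5 \
    -mca oob_tcp_keepalive_probes 20 \
    -mca oob_tcp_retry_delay 5 \
    -mca oob_tcp_max_recon_attempts 20 \
    ./my_mpi_app
```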
You can check for mpirun to exit - it doesn't hang around once things are done.
Thanks for the suggestion. Is waiting for the `ORTEd` process on the non-leader nodes the same as checking for `mpirun` on the leader node?
Yes, that is correct. Note, however, that this limits the size of the job to the number of simultaneous ssh connections the system allows you to have.
If there is no limitation on the number of simultaneous `ssh` connections to the system and `-mca plm_rsh_no_tree_spawn 1` is specified, then it would increase the start-up time of the run, as the establishment of the `ssh` connections is sequential. Am I right in my understanding?
Also, is there any recommendation as to when to use the tree-based `ssh` launch vs the sequential one (`plm_rsh_no_tree_spawn` 0 vs 1)?
Thanks for the suggestion. Is waiting for the ORTEd process on the non-leader nodes the same as checking for mpirun on the leader node?
Not exactly. Each ORTEd notifies `mpirun` when its local procs terminate (normally or with an error). Once all ORTEds have notified, `mpirun` broadcasts a "die" message to the ORTEds. When received, each ORTEd relays the message down the routing tree (if one exists) and then exits.
The key difference is that only `mpirun` knows the exit status of the job you ran, so only `mpirun` will exit with that status.
If there is no limitation on the number of simultaneous ssh connections to the system and if -mca plm_rsh_no_tree_spawn 1 is specified, then it would increase the start up time of the run as the establishment of ssh connections is sequential. Am I right in my understanding ?
Correct - it is the drawback of that approach.
Also, is there any recommendation as to when to use the tree based ssh vs using sequential (plm_rsh_no_tree_spawn 0 vs 1)
Generally, it is always best to use the tree spawn - we only turn it off when trying to debug daemons as we don't want any debug messages lost if a daemon dies. In that case, we aren't concerned about scale or launch speed.
Thank you @rhc54 for your prompt reply.
Generally, it is always best to use the tree spawn - we only turn it off when trying to debug daemons as we don't want any debug messages lost if a daemon dies. In that case, we aren't concerned about scale or launch speed.
So apart from debugging, it is always recommended to use the tree spawn. Thanks !
If an ORTEd process in the tree at a non-leaf level terminates, does the main `mpirun` process know and log the exact termination message, as it would do in a non-tree-spawn model?
If an ORTEd process in the tree at a non-leaf level terminates, does the main `mpirun` process know and log the exact termination message, as it would do in a non-tree-spawn model?
`mpirun` will be told about it, regardless of whether or not the ORTEd is on a leaf. `mpirun` does get the basic error message (e.g., "lost contact"), but I'm not sure what you mean by an "exact" termination message. We rarely have any way of knowing why the ORTEd vanished beyond "lost connection".
If you are looking to see if the ORTEd actually output an error message before terminating, then you need to hold the ssh session open - usually you do that by adding `--leave-session-attached` to the `mpirun` cmd line. Again, that will limit your job size as it turns off tree spawn by holding the ssh sessions open throughout the job execution. Note that you may or may not see anything, depending upon what causes the ORTEd to terminate.
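For illustration, a sketch of capturing any daemon output this way (small node count, since tree spawn is disabled; names are placeholders):

```sh
# Keep the ssh sessions open so any output from the orteds reaches mpirun's terminal
mpirun --leave-session-attached --hostfile myhosts -np 16 ./my_mpi_app 2>&1 | tee run-debug.log
```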
You may want to look in the system logs of machines where you lost contact with an `orted`. Sometimes an HPC job will suck up too much memory, and the OOM killer will start killing processes. This can sometimes randomly kill the `orted`, which can result in what you're seeing.
Looking through the system logs may provide some insight on whether an external influence (e.g., the OOM killer) caused an `orted` or MPI process to die unexpectedly.
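Assuming Linux nodes, a couple of common ways to check for OOM-killer activity in the system logs (run on the node that lost its `orted`):

```sh
# Kernel ring buffer (human-readable timestamps)
dmesg -T | grep -i -E 'out of memory|oom-killer|killed process'

# Or, on systemd-based systems, search the kernel journal
journalctl -k | grep -i oom
```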
Hi @jsquyres and @rhc54 , Sorry for delayed response on this.
We made an interesting observation: the leader node continues execution even though the `ORTEd` on the non-leader node dies. We have also seen that a file is being written into the `/tmp` directory on that node with the name prefix `<node-name>`. Trying to dig up more details and the contents of that file; will share here as soon as I get to them.
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
v4.1.1
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
It was installed from source using `make install`.
Details of the problem
I have a few questions regarding the ORTE daemon process that gets created on the non-leader machines during an MPI run:
1) Does the ORTEd process stay alive until mpirun finishes the job?
2) What happens if the ORTEd process on the non-leader machine dies in the middle? Can the MPI process on the leader still continue running?
3) How are ssh and ORTEd linked? Does MPI rely on an ssh daemon running on the machines?