open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

[Question] ORTE daemon behavior on non-leader machines #10735

satishpasumarthi opened this issue 2 years ago

satishpasumarthi commented 2 years ago

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

v4.1.1

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

It was installed from source using make install.

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

Please describe the system on which you are running


Details of the problem

Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.

Note: If you include verbatim output (or a code block), please use a GitHub Markdown code block like below:

shell$ mpirun -n 2 ./hello_world

I have a few questions regarding the ORTE daemon (ORTEd) process that gets created on the non-leader machines during an MPI run:

  • Does the ORTEd process stay alive until mpirun finishes the job?
  • What happens if the ORTEd process on a non-leader machine dies in the middle? Can the MPI processes on the leader still continue running?
  • How are ssh and ORTEd linked? Does MPI rely on an ssh daemon running on the machines?

rhc54 commented 2 years ago
  • Does the ORTEd process stay alive until mpirun finishes the job?

Yes, it does

  • What happens if the ORTEd process on a non-leader machine dies in the middle? Can the MPI processes on the leader still continue running?

I'm afraid not - the death of any ORTEd process will cause automatic termination of all application processes.

  • How are ssh and ORTEd linked? Does MPI rely on an ssh daemon running on the machines?

It depends. If you are on a managed machine (e.g., Slurm), then we use the resource manager's native launcher (e.g., srun) to start the ORTEd processes. If not, then we rely on ssh for that purpose.
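As an illustrative sketch (the hostfile name, process counts, and hello_world binary are placeholders, not from this issue), the two launch paths look like:

# Under Slurm: run mpirun from inside an allocation (e.g., an sbatch script or salloc shell);
# mpirun then uses srun to start the ORTEd daemons -- no ssh involved.
shell$ mpirun -n 8 ./hello_world

# Without a resource manager: give mpirun a hostfile; it starts an ORTEd on each remote node over ssh.
shell$ mpirun --hostfile my_hosts -n 8 ./hello_world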

jsquyres commented 2 years ago

@satishpasumarthi Did this answer your question?

satishpasumarthi commented 2 years ago

Hi @rhc54 @jsquyres, thanks for your reply. I have some follow-up questions on the ssh/ORTE process.

@jsquyres : I've read some blogs of yours on openmpi and they were very informative. You should write more and more of them :-)

rhc54 commented 2 years ago
  • On the non-leader nodes, what is the recommended way to know if mpirun has finished? Can we rely on the ORTEd process?

I'm lost - your processes will terminate and then mpirun terminates, having seen all the application procs terminate. Am I missing something?

  • Does the ORTE process have anything to do with ssh timeouts or max connections (default 10)? If mpirun is running on more than 10 nodes, will there be any issues because of ssh's default max-connections setting?

You control the ssh fanout with the routed framework by setting OMPI_MCA_routed_radix=N, where N is the desired fanout. I believe it defaults to 64, so you may need to dial it down. Each daemon that is launched "daemonizes" itself into the background, thus closing its ssh connection - and freeing it to be used by the next daemon to be launched.
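For example (an illustrative sketch; the fanout value 32 and the hostfile/binary names are placeholders), the radix can be set through the environment or directly on the command line:

shell$ export OMPI_MCA_routed_radix=32
shell$ mpirun --hostfile my_hosts -n 64 ./hello_world

# equivalently, as an MCA parameter on the mpirun command line
shell$ mpirun --mca routed_radix 32 --hostfile my_hosts -n 64 ./hello_world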

  • We are seeing `ORTE has lost communication with a remote daemon` issues while using mpirun on a large cluster of nodes. We tried various debug flags, but nothing was informative/conclusive.

Usually that means you had a TCP communication failure between the nodes. You can add OMPI_MCA_oob_base_verbose=100 to get detailed output from the inter-daemon messaging system. Be prepared to be swamped!
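For example (illustrative; it is usually worth capturing the output to a file, given the volume):

shell$ mpirun --mca oob_base_verbose 100 -n 2 ./hello_world 2>&1 | tee oob_debug.log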

  • When encountering the above issue, if we pass the mpirun debug flag `-mca orte_debug 1`, the issue doesn't occur. Is there any correlation here?

When you set the debug flag, you also turn off the ssh fanout since we don't allow the daemons to "daemonize", which means that ssh remains connected. So you are limited in terms of how many daemons you can launch with debug turned on, constrained by the number of concurrent ssh sessions your OS will allow a process to have.

jsquyres commented 2 years ago
  • On the non-leader nodes, what is the recommended way to know if mpirun has finished? Can we rely on the ORTEd process?

As Ralph mentioned, on the non-mpirun nodes, when all of your individual MPI processes terminate, the local orted will terminate. mpirun should not terminate until all orteds have terminated.

I'm not entirely clear on your question, either: by definition, mpirun should be the last thing to terminate. Are you asking about some process monitoring the MPI job from outside of the MPI job?

  • Does the ORTE process have anything to do with ssh timeouts or max connections (default 10)?

Yes.

If mpirun is running on more than 10 nodes, will there be any issues because of ssh's default max-connections setting?

In the Open MPI v4.1.x series, ORTE should automatically use a tree-based ssh approach. Check out https://blogs.cisco.com/performance/tree-based-launch-in-open-mpi and https://blogs.cisco.com/performance/tree-based-launch-in-open-mpi-part-2 (which, per below, you may have seen already...!). The bottom line is that a max number of SSH connections from any one host shouldn't be a problem.

That being said, know that SSH is only used to launch ORTE daemons (orted) across nodes. Once the peer orted is launched, the SSH session closes, and the orted continues to run in the background. Separate TCP sockets are opened between mpirun and orteds for communication, command, and control (note: it's not a fully-connected network -- skipping all those details here).

  • We are seeing ORTE has lost communication with a remote daemon issues while using mpirun on a large cluster of nodes. We tried various debug flags, but nothing was informative/conclusive.

This usually means that the orted process was either killed (and therefore closed a TCP socket), or the TCP socket was otherwise closed (e.g., via a late firewall rule or somesuch).

Does this happen immediately when you launch? Or at some point randomly in the middle of the run? Or near the end of the run (e.g., during the shutdown/finalization sequence)?

  • When encountering the above issue, if we pass the mpirun debug flag -mca orte_debug 1, the issue doesn't occur. Is there any correlation here?

Ralph answered this one.

@jsquyres : I've read some blogs of yours on openmpi and they were very informative. You should write more and more of them :-)

Thank you! I wish I had the time. 😄

satishpasumarthi commented 2 years ago

Thank you @jsquyres and @rhc54 for your replies.

I'm not entirely clear on your question, either: by definition, mpirun should be the last thing to terminate. Are you asking about some process monitoring the MPI job from outside of the MPI job?

Yes, I am talking about a monitoring process that performs a task on the non-leader nodes once mpirun has finished. To achieve this, is there a way for the monitoring process, running on a non-leader node, to know that mpirun has completed?

That being said, know that SSH is only used to launch ORTE daemons (orted) across nodes. Once the peer orted is launched, the SSH session closes, and the orted continues to run in the background. Separate TCP sockets are opened between mpirun and orteds for communication, command, and control (note: it's not a fully-connected network -- skipping all those details here).

But we are using the -mca plm_rsh_no_tree_spawn 1 option; that means radix tree spawning is disabled, right?

Does this happen immediately when you launch? Or at some point randomly in the middle of the run? Or near the end of the run (e.g., during the shutdown/finalization sequence)?

It happens randomly at some point in the middle of the run.

rhc54 commented 2 years ago

Yes, I am talking about a monitoring process that performs a task on the non-leader nodes once mpirun has finished. To achieve this, is there a way for the monitoring process, running on a non-leader node, to know that mpirun has completed?

You can check for mpirun to exit - it doesn't hang around once things are done.

But we are using the -mca plm_rsh_no_tree_spawn 1 option; that means radix tree spawning is disabled, right?

Yes, that is correct. Note, however, that this limits the size of the job to the number of simultaneous ssh connections the system allows you to have.

It happens randomly at some point in the middle of the run.

Sounds like you have a flaky TCP connection. There are OMPI MCA params you can use that sometimes help with that situation. You can find them with ompi_info (the following is from OMPI v4.0.x):

$  ompi_info --param oob all --level 9
                 MCA oob: tcp (MCA v2.1.0, API v2.0.0, Component v4.0.5)
            MCA oob base: ---------------------------------------------------
            MCA oob base: parameter "oob" (current value: "", data source:
                          default, level: 2 user/detail, type: string)
                          Default selection set of components for the oob
                          framework (<none> means use all components that can
                          be found)
            MCA oob base: ---------------------------------------------------
            MCA oob base: parameter "oob_base_verbose" (current value:
                          "error", data source: default, level: 8 dev/detail,
                          type: int)
                          Verbosity level for the oob framework (default: 0)
                          Valid values: -1:"none", 0:"error", 10:"component",
                          20:"warn", 40:"info", 60:"trace", 80:"debug",
                          100:"max", 0 - 100
             MCA oob tcp: ---------------------------------------------------
             MCA oob tcp: parameter "oob_tcp_peer_limit" (current value:
                          "-1", data source: default, level: 5 tuner/detail,
                          type: int)
                          Maximum number of peer connections to
                          simultaneously maintain (-1 = infinite)
             MCA oob tcp: parameter "oob_tcp_peer_retries" (current value:
                          "2", data source: default, level: 5 tuner/detail,
                          type: int)
                          Number of times to try shutting down a connection
                          before giving up
             MCA oob tcp: parameter "oob_tcp_sndbuf" (current value: "0",
                          data source: default, level: 4 tuner/basic, type:
                          int)
                          TCP socket send buffering size (in bytes, 0 =>
                          leave system default)
             MCA oob tcp: parameter "oob_tcp_rcvbuf" (current value: "0",
                          data source: default, level: 4 tuner/basic, type:
                          int)
                          TCP socket receive buffering size (in bytes, 0 =>
                          leave system default)
             MCA oob tcp: parameter "oob_tcp_if_include" (current value: "",
                          data source: default, level: 2 user/detail, type:
                          string, synonyms: oob_tcp_include)
                          Comma-delimited list of devices and/or CIDR
                          notation of TCP networks to use for Open MPI
                          bootstrap communication (e.g.,
                          "eth0,192.168.0.0/16").  Mutually exclusive with
                          oob_tcp_if_exclude.
             MCA oob tcp: parameter "oob_tcp_if_exclude" (current value: "",
                          data source: default, level: 2 user/detail, type:
                          string, synonyms: oob_tcp_exclude)
                          Comma-delimited list of devices and/or CIDR
                          notation of TCP networks to NOT use for Open MPI
                          bootstrap communication -- all devices not matching
                          these specifications will be used (e.g.,
                          "eth0,192.168.0.0/16").  If set to a non-default
                          value, it is mutually exclusive with
                          oob_tcp_if_include.
             MCA oob tcp: parameter "oob_tcp_static_ipv4_ports" (current
                          value: "", data source: default, level: 2
                          user/detail, type: string)
                          Static ports for daemons and procs (IPv4)
             MCA oob tcp: parameter "oob_tcp_dynamic_ipv4_ports" (current
                          value: "", data source: default, level: 4
                          tuner/basic, type: string)
                          Range of ports to be dynamically used by daemons
                          and procs (IPv4)
             MCA oob tcp: parameter "oob_tcp_disable_ipv4_family" (current
                          value: "false", data source: default, level: 4
                          tuner/basic, type: bool)
                          Disable the IPv4 interfaces
                          Valid values: 0: f|false|disabled|no|n, 1:
                          t|true|enabled|yes|y
             MCA oob tcp: parameter "oob_tcp_keepalive_time" (current value:
                          "300", data source: default, level: 5 tuner/detail,
                          type: int)
                          Idle time in seconds before starting to send
                          keepalives (keepalive_time <= 0 disables keepalive
                          functionality)
             MCA oob tcp: parameter "oob_tcp_keepalive_intvl" (current value:
                          "20", data source: default, level: 5 tuner/detail,
                          type: int)
                          Time between successive keepalive pings when peer
                          has not responded, in seconds (ignored if
                          keepalive_time <= 0)
             MCA oob tcp: parameter "oob_tcp_keepalive_probes" (current
                          value: "9", data source: default, level: 5
                          tuner/detail, type: int)
                          Number of keepalives that can be missed before
                          declaring error (ignored if keepalive_time <= 0)
             MCA oob tcp: parameter "oob_tcp_retry_delay" (current value:
                          "0", data source: default, level: 4 tuner/basic,
                          type: int)
                          Time (in sec) to wait before trying to connect to
                          peer again
             MCA oob tcp: parameter "oob_tcp_max_recon_attempts" (current
                          value: "10", data source: default, level: 4
                          tuner/basic, type: int)
                          Max number of times to attempt connection before
                          giving up (-1 -> never give up)

The ones that usually help the most are:

             MCA oob tcp: parameter "oob_tcp_keepalive_time" (current value:
                          "300", data source: default, level: 5 tuner/detail,
                          type: int)
                          Idle time in seconds before starting to send
                          keepalives (keepalive_time <= 0 disables keepalive
                          functionality)
             MCA oob tcp: parameter "oob_tcp_keepalive_intvl" (current value:
                          "20", data source: default, level: 5 tuner/detail,
                          type: int)
                          Time between successive keepalive pings when peer
                          has not responded, in seconds (ignored if
                          keepalive_time <= 0)
             MCA oob tcp: parameter "oob_tcp_keepalive_probes" (current
                          value: "9", data source: default, level: 5
                          tuner/detail, type: int)
                          Number of keepalives that can be missed before
                          declaring error (ignored if keepalive_time <= 0)
             MCA oob tcp: parameter "oob_tcp_retry_delay" (current value:
                          "0", data source: default, level: 4 tuner/basic,
                          type: int)
                          Time (in sec) to wait before trying to connect to
                          peer again
             MCA oob tcp: parameter "oob_tcp_max_recon_attempts" (current
                          value: "10", data source: default, level: 4
                          tuner/basic, type: int)
                          Max number of times to attempt connection before
                          giving up (-1 -> never give up)
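As an illustrative sketch (the values shown are placeholders, not tuning recommendations), these can be set on the mpirun command line:

shell$ mpirun --mca oob_tcp_keepalive_time 60 \
              --mca oob_tcp_keepalive_intvl 10 \
              --mca oob_tcp_keepalive_probes 9 \
              --mca oob_tcp_retry_delay 5 \
              --mca oob_tcp_max_recon_attempts 20 \
              -n 2 ./hello_world
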
satishpasumarthi commented 2 years ago

You can check for mpirun to exit - it doesn't hang around once things are done.

Thanks for the suggestion. Is waiting for the ORTEd process on the non-leader nodes the same as checking for mpirun on the leader node?

Yes, that is correct. Note, however, that this limits the size of the job to the number of simultaneous ssh connections the system allows you to have.

If there is no limit on the number of simultaneous ssh connections to the system and -mca plm_rsh_no_tree_spawn 1 is specified, then it would increase the startup time of the run, as the ssh connections are established sequentially. Am I right in my understanding?

Also, is there any recommendation as to when to use tree-based ssh vs. sequential launch (plm_rsh_no_tree_spawn 0 vs 1)?

rhc54 commented 2 years ago

Thanks for the suggestion. Is waiting for the ORTEd process on the non-leader nodes the same as checking for mpirun on the leader node?

Not exactly. Each ORTEd notifies mpirun when its local procs terminate (normally or error). Once all ORTEds have notified, mpirun broadcasts a "die" message to the ORTEds. When received, each ORTEd relays the message down the routing tree (if one exists) and then exits.

Key difference is that only mpirun knows the exit status of the job you ran, so only mpirun will exit with that status.
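As a rough approximation only (a sketch: it assumes pgrep is available, treats the local orted's exit as the end-of-job signal, and cannot report the job's exit status; post_job_task.sh is a hypothetical follow-up script):

# on a non-leader node: wait until the local orted has exited, then run the follow-up task
shell$ while pgrep -x orted > /dev/null; do sleep 5; done
shell$ ./post_job_task.sh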

If there is no limit on the number of simultaneous ssh connections to the system and -mca plm_rsh_no_tree_spawn 1 is specified, then it would increase the startup time of the run, as the ssh connections are established sequentially. Am I right in my understanding?

Correct - it is the drawback of that approach.

Also, is there any recommendation as to when to use tree-based ssh vs. sequential launch (plm_rsh_no_tree_spawn 0 vs 1)?

Generally, it is always best to use the tree spawn - we only turn it off when trying to debug daemons as we don't want any debug messages lost if a daemon dies. In that case, we aren't concerned about scale or launch speed.

satishpasumarthi commented 2 years ago

Thank you @rhc54 for your prompt reply.

Generally, it is always best to use the tree spawn - we only turn it off when trying to debug daemons as we don't want any debug messages lost if a daemon dies. In that case, we aren't concerned about scale or launch speed.

So apart from debugging, it is always recommended to use the tree spawn. Thanks!

If an ORTEd process at a non-leaf level of the tree terminates, does the main mpirun process know about it and log the exact termination message, as it would in a non-tree-spawn model?

rhc54 commented 2 years ago

If an ORTEd process at a non-leaf level of the tree terminates, does the main mpirun process know about it and log the exact termination message, as it would in a non-tree-spawn model?

mpirun will be told about it, regardless of whether or not the ORTEd is on a leaf. mpirun does get the basic error message (e.g., "lost contact"), but I'm not sure what you mean by an "exact" termination message. We rarely have any way of knowing why the ORTEd vanished beyond "lost connection".

If you are looking to see if the ORTEd actually output an error message before terminating, then you need to hold the ssh session open - usually you do that by adding --leave-session-attached to the mpirun cmd line. Again, that will limit your job size as it turns off tree spawn by holding the ssh sessions open throughout the job execution. Note that you may or may not see anything depending upon what causes the ORTEd to terminate.
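For example (illustrative; the hostfile, binary, and log-file names are placeholders):

shell$ mpirun --leave-session-attached --hostfile my_hosts -n 8 ./hello_world 2>&1 | tee orted_errors.log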

jsquyres commented 2 years ago

You may want to look in the system logs of machines where you lost contact with an orted. Sometimes an HPC job will suck up too much memory, and the OOM killer will start killing processes. This can sometimes randomly kill the orted, which can result in what you're seeing.

Looking through the system logs may provide some insight on if an external influence (e.g., the OOM killer) caused an orted or MPI process to die unexpectedly.
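For example (a sketch; log locations and message wording vary by distribution):

# look for OOM-killer activity around the time the orted disappeared
shell$ dmesg -T | grep -i -e 'out of memory' -e oom
shell$ sudo grep -i 'killed process' /var/log/syslog /var/log/messages 2>/dev/null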

satishpasumarthi commented 2 years ago

Hi @jsquyres and @rhc54, sorry for the delayed response on this. We made an interesting observation: the leader node continues execution even though the ORTEd on the non-leader node dies. We have also seen that a file is being written into the /tmp directory on that node with the name prefix <node-name>. We are trying to dig up more details and the contents of that file, and will share them here as soon as we do.