open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

[4.1.5] ORTE has lost communication with a remote daemon #11830

wenduwan opened this issue 1 year ago (status: Open)

wenduwan commented 1 year ago

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

4.1.5 release tarball, configured with:

configure_options --with-sge --without-verbs --disable-builtin-atomics --with-libfabric=/opt/amazon/efa --enable-orterun-prefix-by-default
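For reference, the full configure invocation looks roughly like the sketch below (the install prefix shown here is just an example; the options are the ones listed above):

./configure --prefix=/opt/openmpi-4.1.5 \
    --with-sge --without-verbs --disable-builtin-atomics \
    --with-libfabric=/opt/amazon/efa --enable-orterun-prefix-by-default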

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

4.1.5 release tarball

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

N/A

Please describe the system on which you are running


Details of the problem

I hit this ORTE issue when running the command below from the head node (36 cores). The same command works when launched from a compute node.

mpirun --hostfile host_file_with_8_hosts --map-by ppr:1:node hostname
...
ORTE has lost communication with a remote daemon.

  HNP daemon   : [[32679,0],0] on node ip-172-31-45-184
  Remote daemon: [[32679,0],1] on node queue-g4dn12xlarge-st-g4dn12xlarge-1

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.

I manually verified two-way TCP connectivity by ssh'ing from each compute node to the head node; I did not see any issues.

However, I was able to work around the problem by adding --bind-to core or --bind-to socket.
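For example, this invocation (the original command plus the binding flag) completes successfully:

mpirun --hostfile host_file_with_8_hosts --map-by ppr:1:node --bind-to core hostname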

I got hints from this post and followed @rhc54's suggestion to add debugging output, and finally found something interesting: this line failed.

I had to provide these flags:

 --mca orte_debug_daemons 1 --mca orte_odls_base_verbose 100 --mca orte_state_base_verbose 100 --mca oob_base_verbose 100 --mca rml_base_verbose 100
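
i.e. the full debug invocation, roughly (capturing the output with tee is optional):

mpirun --hostfile host_file_with_8_hosts --map-by ppr:1:node \
    --mca orte_debug_daemons 1 --mca orte_odls_base_verbose 100 \
    --mca orte_state_base_verbose 100 --mca oob_base_verbose 100 \
    --mca rml_base_verbose 100 hostname 2>&1 | tee orte_debug.log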

In the log, the problematic daemon seemingly died around here:

...
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17503] [[23399,0],6] Message posted at grpcomm_direct.c:628 for tag 1
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17503] [[23399,0],6] orted_cmd: received add_local_procs
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17503] [[23399,0],6] rml:base:send_buffer_nb() to peer [[23399,0],0] through conduit 0
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17503] [[23399,0],6] OOB_SEND: rml_oob_send.c:265
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17503] [[23399,0],6] ACTIVATE JOB NULL STATE NEVER LAUNCHED AT base/odls_base_default_fns.c:827
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17503] [[23399,0],6] ACTIVATE JOB NULL STATE FORCED EXIT AT errmgr_default_orted.c:256
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17503] [[23399,0],6] rml_recv_cancel for peer [[WILDCARD],WILDCARD] tag 50
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17503] [[23399,0],6] rml_recv_cancel for peer [[WILDCARD],WILDCARD] tag 51
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17503] [[23399,0],6] rml_recv_cancel for peer [[WILDCARD],WILDCARD] tag 6
...

However, on a successful run (with --bind-to core) the log is different:

...
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17454] [[17660,0],6] Message posted at grpcomm_direct.c:628 for tag 1
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17454] [[17660,0],6] orted_cmd: received add_local_procs
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17454] [[17660,0],6] local:launch
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17454] [[17660,0],6] odls:dispatch [[17660,1],5] to thread 0
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17454] [[17660,0],6] odls:launch spawning child [[17660,1],5]
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17454] 
    Env[62]: OMPI_MCA_orte_top_session_dir=/tmp/ompi.queue-g4dn12xlarge-st-g4dn12xlarge-6.1000
    Env[63]: OMPI_MCA_orte_jobfam_session_dir=/tmp/ompi.queue-g4dn12xlarge-st-g4dn12xlarge-6.1000/jf.17660
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17454] [[17660,0],6] ACTIVATE PROC [[17660,1],5] STATE RUNNING AT base/odls_base_default_fns.c:1052
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17454] [[17660,0],6] ACTIVATE JOB [17660,1] STATE LOCAL LAUNCH COMPLETE AT state_orted.c:297
[queue-g4dn12xlarge-st-g4dn12xlarge-6:17454] [[17660,0],6] rml:base:send_buffer_nb() to peer [[17660,0],0] through conduit 0
...

I'm not familiar with the ORTE internals, so I would appreciate some pointers to understand the exact problem.

TIA!

wenduwan commented 1 year ago

I haven't had the capacity to work on this yet. The issue still exists; I will take a look in the next few weeks.

rhc54 commented 1 year ago

FWICT, every error path in the local proc launch includes an ORTE_ERROR_LOG - yet I'm not seeing that in your error case. You might want to ensure that you configured with --enable-debug so you get the full debug log. You can also turn off the verbose settings except for odls_base_verbose and set that one to 100.
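
Something along these lines (a sketch only; reuse whatever configure options and install prefix you already have):

# rebuild with debug enabled, keeping your existing options
./configure --enable-debug --with-sge --without-verbs --disable-builtin-atomics \
    --with-libfabric=/opt/amazon/efa --enable-orterun-prefix-by-default
make -j all install

# rerun with only the odls framework verbosity raised
mpirun --hostfile host_file_with_8_hosts --map-by ppr:1:node \
    --mca orte_odls_base_verbose 100 hostname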

For some reason, you are getting an error when trying to locally start a process on one of the nodes. The only place the binding comes into play would be here (starting at line 767):

        /* compute and save bindings of local children */
        if (ORTE_SUCCESS != (rc = orte_rmaps_base_compute_bindings(jdata))) {
            ORTE_ERROR_LOG(rc);
            goto REPORT_ERROR;
        }

Again, you should see that error log - but maybe it is a "silent" error that expected orte_rmaps_base_compute_bindings to print out an error. You might need to put an explicit output line there to see if you are getting an error.
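
As a minimal sketch, such an explicit output line could go right next to the existing error log (the message text is arbitrary; opal_output stream 0 goes to stderr by default):

        /* compute and save bindings of local children */
        if (ORTE_SUCCESS != (rc = orte_rmaps_base_compute_bindings(jdata))) {
            /* explicit output so a "silent" failure still shows up in the daemon log */
            opal_output(0, "odls: orte_rmaps_base_compute_bindings failed, rc = %d", rc);
            ORTE_ERROR_LOG(rc);
            goto REPORT_ERROR;
        }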

jsquyres commented 8 months ago

@wenduwan Is this still an issue?

wenduwan commented 8 months ago

I will try again today/tomorrow and report back.

wenduwan commented 8 months ago

@jsquyres I checked 4.1.6 and this problem still exists.