open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

v5.x failure beyond 94 nodes #12872


tonycurtis commented 6 hours ago

Thank you for taking the time to submit an issue!

Background information

Running installation tests on a cluster: v5 (release tarball or built from GitHub) works on up to 94 nodes, then fails instantly beyond that. v4 works fine. N.B. this is running a job from a login node via SLURM (salloc + mpiexec); a rough sketch of the launch pattern is shown below.
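For reference, the launch pattern looks roughly like this (the partition name matches the one used later in this thread; the test binary name is only a placeholder for the actual installation tests):

salloc -p all-nodes -N 95 mpiexec ./mpi_install_test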

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

main @ 448c3ba2d1b8dced090e5aefb7ccb07588613bcd

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

source / git

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

 e32e0179bc6bd1637f92690511ce6091719fa046 3rd-party/openpmix (v1.1.3-4036-ge32e0179)
 0f0a90006cbc880d499b2356d6076e785e7868ba 3rd-party/prrte (psrvr-v2.0.0rc1-4819-g0f0a90006c)
 dfff67569fb72dbf8d73a1dcf74d091dad93f71b config/oac (heads/main-1-gdfff675)

Please describe the system on which you are running


Details of the problem

Beyond 94 nodes, the job fails immediately with:

--------------------------------------------------------------------------
PRTE has lost communication with a remote daemon.

  HNP daemon   : [prterun-login2-2232463@0,0] on node login2
  Remote daemon: [prterun-login2-2232463@0,28] on node fj094

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
hppritcha commented 5 hours ago

Could you try running

mpirun --debug-daemons -np 95 -N 1 hostname

That may help provide some more information for triaging.

tonycurtis commented 5 hours ago

Er, Howard, do you mean run 95 ranks on 1 node, or run hostname with 1 rank on each of 95 nodes?

Anyway:

salloc -p all-nodes -N 95  mpirun --debug-daemons hostname

Output attached: out.txt

rhc54 commented 3 hours ago

The mpirun command line will execute one instance of hostname on each of the 95 nodes. The --debug-daemons flag holds the stdout/stderr connections between the daemons open so you can see any error messages. I'm not sure whether that salloc command will do the same thing, but I will take a gander at the output. A sketch of the combined invocation is below.
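For reference, hppritcha's flags wrapped in the same allocation would look roughly like this (untested sketch, reusing the partition name from the earlier command):

salloc -p all-nodes -N 95 mpirun --debug-daemons -np 95 -N 1 hostname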

rhc54 commented 1 hour ago

Looks like the daemons are unable to send a message back to mpirun; maybe there is an issue with their choice of transport? You might need to specify the network they should use, along the lines of the sketch below.
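A hedged sketch of pinning both the runtime (daemon) traffic and the MPI-level TCP traffic to one interface; the interface name eno1 is a placeholder, and the exact parameter names (oob_tcp_if_include for the PRRTE daemons, btl_tcp_if_include for the MPI layer) should be verified against this build with ompi_info --all and prte_info:

mpirun --prtemca oob_tcp_if_include eno1 \
       --mca btl_tcp_if_include eno1 \
       --debug-daemons -np 95 -N 1 hostname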