tonycurtis opened this issue 6 hours ago
Could you try running:
mpirun --debug-daemons -np 95 -N 1 hostname
That may help provide more information for triaging.
Er, Howard, do you mean run 95 ranks on 1 node? Or run hostname on 1 rank on 95 nodes?
Anyway:
salloc -p all-nodes -N 95 mpirun --debug-daemons hostname
The mpirun command line will execute one instance of hostname on each of 95 nodes. The --debug-daemons flag holds the stdout/stderr connections between the daemons open so you can see any error messages. I'm not sure whether that salloc command does the same thing, but I'll take a gander at the output.
It looks like the daemons are unable to send a message back to mpirun - maybe there is an issue with their choice of transport? You might need to tell them explicitly which network to use.
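For example, something along these lines pins the runtime's traffic to one interface (a sketch, not a confirmed fix; eth0 is a placeholder for whatever management network your nodes share, and on v5 the oob framework lives in the PRRTE runtime, so it is set with --prtemca):

# Open MPI v4.x: restrict the runtime (OOB) and MPI (BTL) TCP traffic to one interface
mpirun --mca oob_tcp_if_include eth0 --mca btl_tcp_if_include eth0 -np 95 -N 1 hostname

# Open MPI v5.x: the daemon-to-mpirun channel belongs to PRRTE, so pass it as a PRRTE MCA parameter
mpirun --prtemca oob_tcp_if_include eth0 -np 95 -N 1 hostname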
Thank you for taking the time to submit an issue!
Background information
Running installation tests on a cluster: v5 (release tarball or from GitHub) works on up to 94 nodes, then fails instantly at 95 or more. v4 works fine. N.B. this is running a job from a login node via SLURM (salloc + mpiexec).
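For reference, the failing case reduces to something like this (a sketch using the partition name from the command above; 94 nodes succeed, 95 or more fail instantly):

salloc -p all-nodes -N 95    # allocate 95 nodes from a login node
mpiexec hostname             # instant failure under v5; works under v4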
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
main @ 448c3ba2d1b8dced090e5aefb7ccb07588613bcd
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
source / git
If you are building/installing from a git clone, please copy-n-paste the output from
git submodule status
Please describe the system on which you are running
Details of the problem
Beyond 94 nodes, the job fails instantly.