Closed yasushi-saito closed 3 years ago
What happens if you specify the number of slots with the hostname: --host <hostname>:<number of slots>? E.g.:
% mpirun --mca plm_rsh_agent ssh --mca plm_rsh_args "-o StrictHostKeyChecking=no" --mca orte_keep_fqdn_hostnames t --host saito-test0-cpu-qql9.us-west1-a.c.luminarycloud-internal.internal:1 echo foo
Adding ":1" doesn't make a difference.
I confirm there is an error that happens only on the master branch when the hostname has the form <abc><123>-<xyz>.<something>
@rhc54 could prrte be involved here?
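To check quickly whether a given hostname has the shape described above, a small regular expression can help. This is only a sketch: the exact character classes behind <abc>, <123>, <xyz> and <something> are not spelled out in this thread, so the pattern below (letters and hyphens, then digits, then a hyphen-separated suffix, then a dot and a domain) is an assumption, and the name matches_reported_form is hypothetical.

```python
import re

# Assumed reading of <abc><123>-<xyz>.<something>:
#   <abc>       letters (hyphens allowed after the first letter)
#   <123>       one or more digits
#   -<xyz>      a hyphen followed by letters, digits, or hyphens
#   .<something> a dot followed by the rest of the domain
REPORTED_FORM = re.compile(r"^[A-Za-z][A-Za-z-]*\d+-[A-Za-z0-9-]+\.\S+$")


def matches_reported_form(hostname: str) -> bool:
    """Return True if hostname fits the assumed pattern above."""
    return REPORTED_FORM.match(hostname) is not None


if __name__ == "__main__":
    for name in (
        "saito-test0-cpu-qql9.us-west1-a.c.luminarycloud-internal.internal",
        "node0.example.com",
        "10.138.0.72",
    ):
        print(name, matches_reported_form(name))
```

The hostname from the original report fits this pattern, while a plain node0.example.com or an IP address does not.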
@rhc54 the root cause could be a different one. I also found another issue (not sure whether it is related, though).
Let's say I run on node0 with IP 1.2.3.4, and it has another interface with name node0-ib and IP 11.12.13.14, and then I run mpirun --host node0-ib ...
Under the debugger, I saw that prte_process_info.aliases ends up containing:
node0
1.2.3.4
11.12.13.14
but no node0-ib.
Ultimately, mpirun ends up running ssh node0-ib ... prted ... (read: it spawns prted on the node that invoked mpirun).
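The mismatch described above can be illustrated with a small sketch. It is not PRRTE's actual logic; it only shows why a name-only comparison against the alias list misses node0-ib, while also comparing the addresses the name resolves to would recognise it as local. The function names and the injectable resolve parameter are hypothetical and exist only for this illustration.

```python
import socket
from typing import Callable, Iterable, List


def resolve_ipv4(name: str) -> List[str]:
    """Resolve a hostname to its IPv4 addresses, or [] if lookup fails."""
    try:
        return socket.gethostbyname_ex(name)[2]
    except OSError:
        return []


def is_local_host(
    target: str,
    aliases: Iterable[str],
    resolve: Callable[[str], List[str]] = resolve_ipv4,
) -> bool:
    """Return True if target names this node.

    A plain membership test misses names such as node0-ib that are
    absent from the alias list, so the addresses the name resolves to
    are checked against the list as well.
    """
    known = set(aliases)
    if target in known:
        return True
    return any(addr in known for addr in resolve(target))


# Alias list as observed in the debugger; the DNS table is a stand-in
# so the example runs without network access.
aliases = ["node0", "1.2.3.4", "11.12.13.14"]
fake_dns = {"node0-ib": ["11.12.13.14"]}


def lookup(name: str) -> List[str]:
    return fake_dns.get(name, [])


print("node0-ib" in aliases)                       # False: name-only check
print(is_local_host("node0-ib", aliases, lookup))  # True: address check
```

With the name-only check, node0-ib looks like a remote node, which is consistent with mpirun trying to reach it over ssh.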
This is likely an issue with PRTE. @ggouaillardet Can you file a corresponding issue over there and link it back to this one?
I believe this has been fixed - please see https://github.com/openpmix/prrte/issues/965
@ggouaillardet has updated PRRTE in OMPI master per the referenced PR.
Thank you for taking the time to submit an issue!
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
https://github.com/open-mpi/ompi/commit/b33b29466f95a60696f9da69f21a61b3fe3e95f4
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.
Please describe the system on which you are running
Details of the problem
I'm trying to run an MPI job on a machine in Google Cloud. When I start MPI using the IP address that refers to the local machine, it works as expected.
But when I use the DNS name of the host instead of the IP address, it fails:
Address 10.138.0.72 is the address of the local machine, and saito-test0-cpu-qql9.us-west1-a.c.luminarycloud-internal.internal is the name of the same machine. I can run "ssh" to that address without a problem.