open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Trouble running mpi using a DNS address #8962

Closed. yasushi-saito closed this issue 3 years ago.

yasushi-saito commented 3 years ago

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

https://github.com/open-mpi/ompi/commit/b33b29466f95a60696f9da69f21a61b3fe3e95f4

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

$src_dir/configure \
    --with-cuda \
    --with-hwloc=internal \
    --disable-pty-support \
    --disable-per-user-config-files \
    --disable-mpi-fortran \
    --disable-libompitrace \
    --disable-mpi-cxx \
    --disable-mpi-cxx-seek \
    --enable-shared \
    --disable-static

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

 24360562b169e7ab0cbf57466130f2d95a83163d 3rd-party/openpmix (v1.1.3-2939-g24360562)
 0a346874f639ce61f7f3acbbc3019e6ade97b623 3rd-party/prrte (dev-31118-g0a346874f6)

Please describe the system on which you are running


Details of the problem

I'm trying to run an MPI job on a machine in Google Cloud. When I start MPI using the IP address of the local machine, it works as expected.

% mpirun --mca plm_rsh_agent ssh --mca plm_rsh_args "-o StrictHostKeyChecking=no" --mca orte_keep_fqdn_hostnames t --host 10.138.0.72 echo foo
foo

But when I use the DNS name of the host instead of the IP address, it fails:

% mpirun --mca plm_rsh_agent ssh --mca plm_rsh_args "-o StrictHostKeyChecking=no" --mca orte_keep_fqdn_hostnames t --host saito-test0-cpu-qql9.us-west1-a.c.luminarycloud-internal.internal echo foo

--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 1
slots that were requested by the application:

  echo

Either request fewer slots for your application, or make more slots
available for use.

A "slot" is the PRTE term for an allocatable unit where we can
launch a process.  The number of slots available are defined by the
environment in which PRTE processes are run:

  1. Hostfile, via "slots=N" clauses (N defaults to number of
     processor cores if not provided)
  2. The --host command line parameter, via a ":N" suffix on the
     hostname (N defaults to 1 if not provided)
  3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
  4. If none of a hostfile, the --host command line parameter, or an
     RM is present, PRTE defaults to the number of processor cores

In all the above cases, if you want PRTE to default to the number
of hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.

Alternatively, you can use the --map-by :OVERSUBSCRIBE option to ignore the
number of available slots when deciding the number of processes to
launch.
--------------------------------------------------------------------------
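
For reference, the slot-specification mechanisms the message lists look like the following; the hostfile name and slot counts here are purely illustrative:

% cat myhosts                  # hypothetical hostfile; "slots=N" sets the slot count per host
10.138.0.72 slots=2
% mpirun --hostfile myhosts -np 2 echo foo
% mpirun --host 10.138.0.72:2 -np 2 echo foo                           # ":N" suffix on --host
% mpirun --host 10.138.0.72:1 --map-by :OVERSUBSCRIBE -np 2 echo foo   # ignore the slot limit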

10.138.0.72 is the address of the local machine, and saito-test0-cpu-qql9.us-west1-a.c.luminarycloud-internal.internal is the DNS name of that same machine. I can ssh to that name without a problem.

% ssh saito-test0-cpu-qql9.us-west1-a.c.luminarycloud-internal.internal echo bar
bar

% dig saito-test0-cpu-qql9.us-west1-a.c.luminarycloud-internal.internal

; <<>> DiG 9.16.1-Ubuntu <<>> saito-test0-cpu-qql9.us-west1-a.c.luminarycloud-internal.internal
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 52277
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;saito-test0-cpu-qql9.us-west1-a.c.luminarycloud-internal.internal. IN A

;; ANSWER SECTION:
saito-test0-cpu-qql9.us-west1-a.c.luminarycloud-internal.internal. 30 IN A 10.138.0.72

;; Query time: 4 msec
;; SERVER: 127.0.0.53#53(127.0.0.53)
;; WHEN: Thu May 13 22:04:07 UTC 2021
;; MSG SIZE  rcvd: 110

% ip address

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc mq state UP group default qlen 1000
    link/ether 42:01:0a:8a:00:48 brd ff:ff:ff:ff:ff:ff
    inet 10.138.0.72/32 scope global dynamic ens4
       valid_lft 2996sec preferred_lft 2996sec
    inet6 fe80::4001:aff:fe8a:48/64 scope link
       valid_lft forever preferred_lft forever
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    link/ether 02:42:73:31:9a:e5 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:73ff:fe31:9ae5/64 scope link
       valid_lft forever preferred_lft forever
hkuno commented 3 years ago

What happens if you specify the number of slots with the hostname: --host <hostname>:<number of slots> ?

E.g.: % mpirun --mca plm_rsh_agent ssh --mca plm_rsh_args "-o StrictHostKeyChecking=no" --mca orte_keep_fqdn_hostnames t --host saito-test0-cpu-qql9.us-west1-a.c.luminarycloud-internal.internal:1 echo foo

yasushi-saito commented 3 years ago

Adding ":1" doesn't make a difference.

ggouaillardet commented 3 years ago

I confirm there is an error that happens only on the master branch, when the hostname has the form <abc><123>-<xyz>.<something>.

@rhc54 could prrte be involved here?
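
For reference, a hypothetical hostname matching that pattern would be something like host1-gpu.example.internal; a single-node run against such a name should be enough to exercise the failing path, assuming the name resolves to the local machine:

% mpirun --host host1-gpu.example.internal echo foo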

ggouaillardet commented 3 years ago

@rhc54 the root cause could be a different one. I also found another issue (not sure whether it is related, though).

Let's say I run on node0 with IP 1.2.3.4, and it has another interface named node0-ib with IP 11.12.13.14.

I then run mpirun --host node0-ib ...

Under the debugger, I checked what ends up in prte_process_info.aliases: node0-ib is not among the entries.

Ultimately, mpirun ends up running ssh node0-ib ... prted ... (read: it spawns prted via ssh on the very node that invoked mpirun).
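
A rough way to check this by hand, using the hypothetical node0-ib / 11.12.13.14 names from above (outputs are illustrative): compare what the resolver returns for the alias with the addresses actually configured on the node. If the resolved address belongs to a local interface, mpirun should treat the host as local instead of ssh-ing to it.

% getent hosts node0-ib            # what the resolver returns for the alias
11.12.13.14     node0-ib
% ip -o -4 addr show               # one line per configured IPv4 address
1: lo          inet 127.0.0.1/8 ...
2: eth0        inet 1.2.3.4/24 ...
3: node0-ib    inet 11.12.13.14/24 ...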

jsquyres commented 3 years ago

This is likely an issue with PRTE. @ggouaillardet Can you file a corresponding issue over there and link it back to this one?

rhc54 commented 3 years ago

I believe this has been fixed - please see https://github.com/openpmix/prrte/issues/965

@ggouaillardet has updated PRRTE in OMPI master per the referenced PR.
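
For anyone tracking this on an existing master checkout, a sketch of one way to pick up the fix once the submodule pointer has been bumped (configure options are whatever you normally use):

% git pull --rebase
% git submodule update --init --recursive
% ./autogen.pl && ./configure <your usual options> && make -j install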