shinechou closed this issue 6 years ago.
This looks like version confusion between the two nodes. Note that the `-x` option only forwards that variable to the application processes, not to the daemons launched by mpirun. I suspect the path on the remote node is picking up a different OMPI install. Can you check?
@rhc54: Thanks a lot. I don't think that is the case, because I compiled and installed them with exactly the same configuration; you can check the ompi_info output below. Thanks again.
ompi_info of the local node:

```
Open MPI: 3.0.0
Open MPI repo revision: v3.0.0
Open MPI release date: Sep 12, 2017
Open RTE: 3.0.0
Open RTE repo revision: v3.0.0
Open RTE release date: Sep 12, 2017
OPAL: 3.0.0
OPAL repo revision: v3.0.0
OPAL release date: Sep 12, 2017
MPI API: 3.1.0
Ident string: 3.0.0
Prefix: /usr/local
Configured architecture: x86_64-unknown-linux-gnu
Configure host: ryan-z820
Configured by: root
Configured on: Fri Oct 27 16:11:22 CEST 2017
Configure host: ryan-z820
Configure command line: '--enable-orterun-prefix-by-default' '--with-cuda'
```
The ompi_info of the remote node:

```
Open MPI repo revision: v3.0.0
Open MPI release date: Sep 12, 2017
Open RTE: 3.0.0
Open RTE repo revision: v3.0.0
Open RTE release date: Sep 12, 2017
OPAL: 3.0.0
OPAL repo revision: v3.0.0
OPAL release date: Sep 12, 2017
MPI API: 3.1.0
Ident string: 3.0.0
Prefix: /usr/local
Configured architecture: x86_64-unknown-linux-gnu
Configure host: brs-dualG
Configured by: brs
Configured on: Thu Oct 26 09:47:52 CEST 2017
Configure host: brs-dualG
Configure command line: '--enable-orterun-prefix-by-default' '--with-cuda'
```
Thanks for providing that output. I gather you built/installed them on each node separately, yes? That is a rather unusual way of doing it and generally not recommended - it is much safer to install on a shared file system directory.
Try configuring with `--enable-debug`, and then run the following:

```
$ mpirun -npernode 1 -mca plm_base_verbose 5 hostname
```
@shinechou You might also want to check that there's not some OS/distro-installed Open MPI on your nodes that is being found and used (e.g., earlier in the PATH than your hand-installed Open MPI installations).
@rhc54: You are right, I installed them separately. What is the proper way to do it? Do you have any guidance for that? After installing OMPI I have to install another library on top of it. I'll try compiling with the --enable-debug option and running the command you mentioned.
@jsquyres: Thank you for your comments. But I'm sure there is no other OMPI installed on either of my nodes.
You could install Open MPI on one node, and then tar up the installation tree on that node, and then untar it on the other node. Then you'd know for sure that you have exactly the same binary installation on both nodes. Something like this:
```
$ ./configure --prefix=/opt/openmpi-3.0.0
$ make -j 32 install
...
$ cd /opt
$ tar jcf ~/ompi-install-3.0.0.tar.bz2 openmpi-3.0.0
$ scp ~/ompi-install-3.0.0.tar.bz2 othernode:
$ ssh othernode
...login to othernode...
$ cd /opt
$ sudo rm -rf openmpi-3.0.0
$ sudo tar xf ~/ompi-install-3.0.0.tar.bz2
```
Usually, people install Open MPI either via package (e.g., RPM) on each node, or they install Open MPI on a network filesystem (such as NFS) so that the one, single installation is available on all nodes.
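If you go the network-filesystem route, a minimal sketch of sharing one install over NFS might look like the following. This is an illustration, not a definitive recipe: the Ubuntu package names, the export options, and the `headnode` hostname are assumptions to adapt to your site.

```shell
# On the node holding the install: export /opt/openmpi-3.0.0 read-only.
sudo apt-get install nfs-kernel-server
echo "/opt/openmpi-3.0.0 *(ro,sync,no_subtree_check)" | sudo tee -a /etc/exports
sudo exportfs -ra

# On every other node: mount the export at the SAME path, since the
# configured prefix is baked into the build.
sudo apt-get install nfs-common
sudo mkdir -p /opt/openmpi-3.0.0
sudo mount headnode:/opt/openmpi-3.0.0 /opt/openmpi-3.0.0
```

With this layout, every node sees one and the same installation, which rules out the per-node build drift discussed in this thread.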
Note that I mentioned the multiple Open MPI installation issue because the majority of time people run into this error, it's because users are accidentally / unknowingly using multiple different versions of Open MPI (note that Open MPI currently only supports running exactly the same version of Open MPI on all nodes in a single job). This kind of error almost always indicates that version X of Open MPI is trying to read more data than was sent by Open MPI version Y.
Try this exercise:

```
$ ompi_info | head
$ ssh othernode ompi_info | head
```

Doing the 2nd line non-interactively is important (i.e., a single command -- not `ssh`'ing to the other node and then entering another command to run `ompi_info`).

Make sure that both `ompi_info` outputs report the same version. If they do, then there's something configured differently between the two (but which might still be a bug, because "same version but configured differently" should still usually work).
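The comparison above can be scripted. Here is a hedged sketch: `compare_ompi_info` is a hypothetical helper (not part of Open MPI), and `othernode` in the usage comment is a placeholder hostname. It ignores the `Package:` line, which embeds the build user/host and therefore always differs between nodes.

```shell
#!/bin/sh
# compare_ompi_info FILE1 FILE2: diff two captured ompi_info outputs,
# skipping the Package: line (it embeds the build user@host and always
# differs even for identical builds). Returns 0 iff the rest matches.
compare_ompi_info() {
  grep -v 'Package:' "$1" > /tmp/.cmp_a.$$
  grep -v 'Package:' "$2" > /tmp/.cmp_b.$$
  diff /tmp/.cmp_a.$$ /tmp/.cmp_b.$$
  rc=$?
  rm -f /tmp/.cmp_a.$$ /tmp/.cmp_b.$$
  return $rc
}

# Typical usage ("othernode" is a placeholder hostname):
#   ompi_info | head -20 > local.txt
#   ssh othernode ompi_info | head -20 > remote.txt
#   compare_ompi_info local.txt remote.txt && echo match || echo MISMATCH
```

A non-zero exit (and the diff it prints) points at exactly which line -- version, release date, configure options -- differs between the two installs.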
@jsquyres: Thanks again for your guidance. I tried your exercise; both return the same version, as below:
```
$ ompi_info | head
Package: Open MPI root@ryan-z820 Distribution
Open MPI: 3.0.0
Open MPI repo revision: v3.0.0
Open MPI release date: Sep 12, 2017
Open RTE: 3.0.0
Open RTE repo revision: v3.0.0
Open RTE release date: Sep 12, 2017
OPAL: 3.0.0
OPAL repo revision: v3.0.0
OPAL release date: Sep 12, 2017
```

```
$ ssh mpiuser@client ompi_info | head
mpiuser@client's password:
Package: Open MPI root@brs-dualG Distribution
Open MPI: 3.0.0
Open MPI repo revision: v3.0.0
Open MPI release date: Sep 12, 2017
Open RTE: 3.0.0
Open RTE repo revision: v3.0.0
Open RTE release date: Sep 12, 2017
OPAL: 3.0.0
OPAL repo revision: v3.0.0
OPAL release date: Sep 12, 2017
```
@rhc54: Thanks. I've tried your suggestion to run `$ mpirun -npernode 1 -mca plm_base_verbose 5 hostname`, but I got an error message like:
```
$ mpirun -npernode 1 -mca plm_base_verbose 5 master
[ryan-z820:21968] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[ryan-z820:21968] plm:base:set_hnp_name: initial bias 21968 nodename hash 974627533
[ryan-z820:21968] plm:base:set_hnp_name: final jobfam 52490
[ryan-z820:21968] [[52490,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[ryan-z820:21968] [[52490,0],0] plm:base:receive start comm
[ryan-z820:21968] [[52490,0],0] plm:base:setup_job
[ryan-z820:21968] [[52490,0],0] plm:base:setup_vm
[ryan-z820:21968] [[52490,0],0] plm:base:setup_vm creating map
[ryan-z820:21968] [[52490,0],0] setup:vm: working unmanaged allocation
[ryan-z820:21968] [[52490,0],0] using default hostfile /usr/local/etc/openmpi-default-hostfile
[ryan-z820:21968] [[52490,0],0] plm:base:setup_vm only HNP in allocation
[ryan-z820:21968] [[52490,0],0] plm:base:setting slots for node ryan-z820 by cores
[ryan-z820:21968] [[52490,0],0] complete_setup on job [52490,1]
[ryan-z820:21968] [[52490,0],0] plm:base:launch_apps for job [52490,1]
[ryan-z820:21968] [[52490,0],0] plm:base:orted_cmd sending orted_exit commands
--------------------------------------------------------------------------
mpirun was unable to find the specified executable file, and therefore
did not launch the job. This error was first reported for process
rank 0; it may have occurred for other processes as well.

NOTE: A common cause for this error is misspelling a mpirun command
line parameter option (remember that mpirun interprets the first
unrecognized command line token as the executable).

Node:       ryan-z820
Executable: master
--------------------------------------------------------------------------
[ryan-z820:21968] [[52490,0],0] plm:base:receive stop comm
```
or
```
$ mpirun -npernode 1 -mca plm_base_verbose 1 master python fcn_horovod.py
--------------------------------------------------------------------------
mpirun was unable to find the specified executable file, and therefore
did not launch the job. This error was first reported for process
rank 0; it may have occurred for other processes as well.

NOTE: A common cause for this error is misspelling a mpirun command
line parameter option (remember that mpirun interprets the first
unrecognized command line token as the executable).

Node:       ryan-z820
Executable: master
```
GitHub pro tip: use three backticks to denote verbatim regions. See https://guides.github.com/features/mastering-markdown/.
Ok, good, so you have the same Open MPI v3.0.0 installed on both sides. But something must be different between them, or you wouldn't be getting these errors.
Are both machines the same hardware? Are they running the same version of your Linux distro configured generally the same way? (e.g., they're both 64 bit, etc.)
Also, I think @rhc54 meant for you to run the `hostname` executable -- not `master`. `hostname(1)` is a Linux command that tells you what host you are on. He did not mean for you to replace `hostname` with the actual hostname of the machine (assumedly `master`).
@jsquyres: Thanks a lot. Sorry, I'm new to Linux and Open MPI, so I didn't realize that `hostname` is a command rather than the host's name (which is indeed `master`). I am using the same version of Ubuntu on both nodes (Ubuntu 16.04, 64-bit desktop), but the hardware is different: the "master" node is an HP Z820 workstation (Xeon E5-2670, 64 GB ECC RAM, ASUS GTX 1080), and the "client" node is a DIY PC (i3-7100, 32 GB DDR4 RAM, ASUS GTX 1080 Ti). Could the difference in hardware configuration between the two nodes cause this error?
```
$ mpirun -npernode 1 -mca plm_base_verbose 5 hostname
[ryan-z820:22613] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[ryan-z820:22613] plm:base:set_hnp_name: initial bias 22613 nodename hash 974627533
[ryan-z820:22613] plm:base:set_hnp_name: final jobfam 49295
[ryan-z820:22613] [[49295,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[ryan-z820:22613] [[49295,0],0] plm:base:receive start comm
[ryan-z820:22613] [[49295,0],0] plm:base:setup_job
[ryan-z820:22613] [[49295,0],0] plm:base:setup_vm
[ryan-z820:22613] [[49295,0],0] plm:base:setup_vm creating map
[ryan-z820:22613] [[49295,0],0] setup:vm: working unmanaged allocation
[ryan-z820:22613] [[49295,0],0] using default hostfile /usr/local/etc/openmpi-default-hostfile
[ryan-z820:22613] [[49295,0],0] plm:base:setup_vm only HNP in allocation
[ryan-z820:22613] [[49295,0],0] plm:base:setting slots for node ryan-z820 by cores
[ryan-z820:22613] [[49295,0],0] complete_setup on job [49295,1]
[ryan-z820:22613] [[49295,0],0] plm:base:launch_apps for job [49295,1]
[ryan-z820:22613] [[49295,0],0] plm:base:launch wiring up iof for job [49295,1]
[ryan-z820:22613] [[49295,0],0] plm:base:launch job [49295,1] is not a dynamic spawn
ryan-z820
[ryan-z820:22613] [[49295,0],0] plm:base:orted_cmd sending orted_exit commands
[ryan-z820:22613] [[49295,0],0] plm:base:receive stop comm
```
Sorry for the confusion - I expected you to retain the `-hostfile machinefile` option.
@rhc54: Thanks. Could you please help me figure it out? Please check the output below:
```
$ mpirun -npernode 1 -hostfile machinefile -mca plm_base_verbose 5 hostname
[ryan-z820:15616] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[ryan-z820:15616] plm:base:set_hnp_name: initial bias 15616 nodename hash 974627533
[ryan-z820:15616] plm:base:set_hnp_name: final jobfam 42458
[ryan-z820:15616] [[42458,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[ryan-z820:15616] [[42458,0],0] plm:base:receive start comm
[ryan-z820:15616] [[42458,0],0] plm:base:setup_job
[ryan-z820:15616] [[42458,0],0] plm:base:setup_vm
[ryan-z820:15616] [[42458,0],0] plm:base:setup_vm creating map
[ryan-z820:15616] [[42458,0],0] setup:vm: working unmanaged allocation
[ryan-z820:15616] [[42458,0],0] using hostfile machinefile
[ryan-z820:15616] [[42458,0],0] checking node ryan-z820
[ryan-z820:15616] [[42458,0],0] ignoring myself
[ryan-z820:15616] [[42458,0],0] checking node client
[ryan-z820:15616] [[42458,0],0] plm:base:setup_vm add new daemon [[42458,0],1]
[ryan-z820:15616] [[42458,0],0] plm:base:setup_vm assigning new daemon [[42458,0],1] to node client
[ryan-z820:15616] [[42458,0],0] plm:rsh: launching vm
[ryan-z820:15616] [[42458,0],0] plm:rsh: local shell: 0 (bash)
[ryan-z820:15616] [[42458,0],0] plm:rsh: assuming same remote shell as local shell
[ryan-z820:15616] [[42458,0],0] plm:rsh: remote shell: 0 (bash)
[ryan-z820:15616] [[42458,0],0] plm:rsh: final template argv: /usr/bin/ssh PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "2782527488" -mca ess_base_vpid "" -mca ess_base_num_procs "2" -mca orte_node_regex "ryan-z820,client@0(2)" -mca orte_hnp_uri "2782527488.0;tcp://192.168.1.1:50882" -mca plm_base_verbose "5" -mca plm "rsh" -mca rmaps_ppr_n_pernode "1" -mca pmix "^s1,s2,cray,isolated"
[ryan-z820:15616] [[42458,0],0] plm:rsh:launch daemon 0 not a child of mine
[ryan-z820:15616] [[42458,0],0] plm:rsh: adding node client to launch list
[ryan-z820:15616] [[42458,0],0] plm:rsh: activating launch event
[ryan-z820:15616] [[42458,0],0] plm:rsh: recording launch of daemon [[42458,0],1]
[ryan-z820:15616] [[42458,0],0] plm:rsh: executing: (/usr/bin/ssh) [/usr/bin/ssh client PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "2782527488" -mca ess_base_vpid 1 -mca ess_base_num_procs "2" -mca orte_node_regex "ryan-z820,client@0(2)" -mca orte_hnp_uri "2782527488.0;tcp://192.168.1.1:50882" -mca plm_base_verbose "5" -mca plm "rsh" -mca rmaps_ppr_n_pernode "1" -mca pmix "^s1,s2,cray,isolated"]
[brs-dualG:03278] [[42458,0],1] plm:rsh_lookup on agent ssh : rsh path NULL
[brs-dualG:03278] [[42458,0],1] plm:rsh_setup on agent ssh : rsh path NULL
[brs-dualG:03278] [[42458,0],1] plm:base:receive start comm
[ryan-z820:15616] [[42458,0],0] plm:base:orted_report_launch from daemon [[42458,0],1]
[ryan-z820:15616] [[42458,0],0] plm:base:orted_report_launch from daemon [[42458,0],1] on node brs-dualG
[ryan-z820:15616] [[42458,0],0] RECEIVED TOPOLOGY SIG 0N:1S:1L3:2L2:2L1:2C:4H:x86_64:le FROM NODE brs-dualG
[ryan-z820:15616] [[42458,0],0] NEW TOPOLOGY - ADDING
[ryan-z820:15616] [[42458,0],0] plm:base:orted_report_launch completed for daemon [[42458,0],1] at contact 2782527488.1;tcp://192.168.1.6:33197
[ryan-z820:15616] [[42458,0],0] plm:base:orted_report_launch recvd 2 of 2 reported daemons
[ryan-z820:15616] [[42458,0],0] plm:base:setting slots for node ryan-z820 by cores
[ryan-z820:15616] [[42458,0],0] plm:base:setting slots for node client by cores
[ryan-z820:15616] [[42458,0],0] complete_setup on job [42458,1]
[ryan-z820:15616] [[42458,0],0] plm:base:launch_apps for job [42458,1]
[brs-dualG:03278] [[42458,0],1] plm:rsh: remote spawn called
[brs-dualG:03278] [[42458,0],1] plm:rsh: remote spawn - have no children!
[brs-dualG:03278] [[42458,0],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file grpcomm_direct.c at line 351
An internal error has occurred in ORTE:
[[42458,0],1] FORCE-TERMINATE AT Data unpack would read past end of buffer:-26 - error grpcomm_direct.c(355)
This is something that should be reported to the developers.
[ryan-z820:15616] [[42458,0],0] plm:base:receive processing msg
[ryan-z820:15616] [[42458,0],0] plm:base:receive update proc state command from [[42458,0],1]
[ryan-z820:15616] [[42458,0],0] plm:base:receive got update_proc_state for job [42458,0]
[ryan-z820:15616] [[42458,0],0] plm:base:receive got update_proc_state for vpid 1 state CALLED ABORT exit_code -26
[ryan-z820:15616] [[42458,0],0] plm:base:receive done processing commands
[ryan-z820:15616] [[42458,0],0] plm:base:orted_cmd sending orted_exit commands
[brs-dualG:03278] [[42458,0],1] plm:base:receive stop comm
[ryan-z820:15616] [[42458,0],0] plm:base:receive stop comm
```
@rhc54: I've provided the log from your debug command; could you please help me check it? Thanks a lot in advance.
I honestly am stumped - it looks like you basically received an empty buffer, and I have no idea why. I can't replicate it. Perhaps you might try with the nightly snapshot of the 3.0.x branch to see if something has been fixed that might have caused the problem?
@rhc54: thank you. I'll try the nightly version to see what comes out.
@rhc54: The problem has been resolved. It seems it was caused by the different hardware architectures: one node uses a Xeon and the other an i3. Now that I've swapped the Xeon node for an i3 it works fine. Another possible factor is that the Xeon workstation has two network adapters, one of which is used for AMT; I'm not sure whether that affects OMPI though.
@shinechou I have the same problem, and I have checked the Open MPI versions; they are the same. Could you tell me how you found the hardware problem? I have checked my network and its adapters; they are the same.
@zhanglistar: In my case, one of my nodes was an HP workstation with a Xeon CPU and a different hardware configuration than the master node (a regular PC). So I stopped using the HP workstation and used another regular PC instead.
For the benefit of others running into this error or "ORTE_ERROR_LOG: Data unpack had inadequate space": in my case the issue was resolved by switching to the internal hwloc.
I had compiled OpenMPI 3.0.0 on two different Ubuntu releases (16.04 and 17.10), both configured identically, and with `--with-hwloc=/usr`, thus using the Ubuntu-provided `libhwloc-dev` package. The version of `libhwloc` was 1.11.2 on xenial (16.04) and 1.11.5 on artful (17.10).
Running `mpirun -H xenial,artful` from artful worked fine, but running it from xenial consistently failed at `if (OPAL_SUCCESS != (rc = opal_dss.unpack(data, &topo, &idx, OPAL_HWLOC_TOPO)))` in `./orte/mca/plm/base/plm_base_launch_support.c`.
Removing `--with-hwloc=/usr` from the configure step, thus switching to OpenMPI's internal hwloc (and uninstalling `libhwloc-dev` on both machines, though this shouldn't be necessary), resolved the issue.
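For anyone applying the same fix, the rebuild might look like the sketch below. The prefix and `-j` job count are illustrative, not prescriptive, and the final `grep` is just a convenience for eyeballing which hwloc the build reports in its component list.

```shell
# Reconfigure using Open MPI's bundled hwloc: simply omit --with-hwloc=/usr.
./configure --prefix=/opt/openmpi-3.0.0
make -j 8 all install

# Then inspect what the installed build reports about hwloc.
/opt/openmpi-3.0.0/bin/ompi_info | grep -i hwloc
```

Repeat on every node (or propagate one build, as suggested earlier in this thread) so all nodes agree on the hwloc topology encoding.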
I got this same problem with Open MPI 4.0.1, when built locally on each machine (having machines with different generations of Intel CPUs).
A Sandybridge machine would not be able to communicate with Skylake nodes. Copying the Sandybridge binaries over to the Skylake nodes fixed the issue.
So we have a problem where different architectures produce different binaries (structures ?) which are not compatible protocol-wise.
Do you think that should be fixed or just documented (don't build Open MPI locally on each machine) @rhc54 ?
@sjeaugey It sounds like you built with a different hwloc version on the two types of nodes?
That was not my impression as I could not find any trace of hwloc anywhere on the nodes (so I assume both were compiled with the internal hwloc).
Reading the whole issue, I could not determine whether the fix came from the hwloc change or the fact that the binary was propagated from one machine to the others as suggested by Jeff in https://github.com/open-mpi/ompi/issues/4437#issuecomment-341456301
Now, I did not compile myself the libraries that weren't working properly, nor did I try to re-compile the working version on each node to confirm it would break, so I'm not 100% sure yet. I'll update the bug if I can reproduce it better.
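One way to rule out this kind of binary drift is to check that the installed daemons are byte-identical across nodes. This is a hedged sketch: `same_build` and `remote` are hypothetical helpers (not Open MPI tools), and `/usr/local/bin/orted` in the usage comment is an assumed install path.

```shell
#!/bin/sh
# remote() wraps ssh so the comparison logic below stays readable.
remote() { ssh "$@"; }

# same_build HOST PATH: succeed only if PATH has the same md5sum here and
# on HOST. Differing checksums mean the installs are not byte-identical,
# even when ompi_info reports the same version on both nodes.
same_build() {
  here=$(md5sum "$2" | awk '{print $1}')
  there=$(remote "$1" md5sum "$2" | awk '{print $1}')
  [ -n "$here" ] && [ "$here" = "$there" ]
}

# e.g.: same_build othernode /usr/local/bin/orted && echo identical
```

If the checksums differ, propagating one build to all nodes (tar/scp, as suggested above) is a quick way to test whether identical binaries make the error go away.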
Open MPI Version: v4.0.0
Output of `ompi_info | head` on the two machines:
```
mpiuser@s2:~$ ssh s1 ompi_info | head
Package: Open MPI mpiuser@s1 Distribution
Open MPI: 4.0.0
Open MPI repo revision: v4.0.0
Open MPI release date: Nov 12, 2018
Open RTE: 4.0.0
Open RTE repo revision: v4.0.0
Open RTE release date: Nov 12, 2018
OPAL: 4.0.0
OPAL repo revision: v4.0.0
OPAL release date: Nov 12, 2018
```

```
mpiuser@s2:~$ ompi_info | head
Package: Open MPI mpiuser@s2 Distribution
Open MPI: 4.0.0
Open MPI repo revision: v4.0.0
Open MPI release date: Nov 12, 2018
Open RTE: 4.0.0
Open RTE repo revision: v4.0.0
Open RTE release date: Nov 12, 2018
OPAL: 4.0.0
OPAL repo revision: v4.0.0
OPAL release date: Nov 12, 2018
```
Both are installed on a common shared network filesystem.
While running the command on s1 (master):

```
mpiuser@s1:/disk3/cloud/openmpi-4.0.0/examples$ mpirun -n 2 ./hello
Hello, world, I am 1 of 2, (Open MPI v4.0.0, package: Open MPI mpiuser@s1 Distribution, ident: 4.0.0, repo rev: v4.0.0, Nov 12, 2018, 112)
Hello, world, I am 0 of 2, (Open MPI v4.0.0, package: Open MPI mpiuser@s1 Distribution, ident: 4.0.0, repo rev: v4.0.0, Nov 12, 2018, 112)
```
While running the command separately on s2 (slave):

```
mpiuser@s2:~/cloud$ mpirun -n 2 ./hello
Hello, world, I am 0 of 2, (Open MPI v4.0.0, package: Open MPI mpiuser@s2 Distribution, ident: 4.0.0, repo rev: v4.0.0, Nov 12, 2018, 113)
Hello, world, I am 1 of 2, (Open MPI v4.0.0, package: Open MPI mpiuser@s2 Distribution, ident: 4.0.0, repo rev: v4.0.0, Nov 12, 2018, 113)
```
Output of `dpkg -l | grep hwloc` on s2 (empty, i.e., no distro hwloc package installed):

```
mpiuser@s2:~/cloud/openmpi-4.0.0$ dpkg -l | grep hwloc
mpiuser@s2:~/cloud/openmpi-4.0.0$
```
Output of `dpkg -l | grep hwloc` on s1 (also empty):

```
mpiuser@s1:/disk3/cloud/openmpi-4.0.0/examples$ dpkg -l | grep hwloc
mpiuser@s1:/disk3/cloud/openmpi-4.0.0/examples$
```
Both machines are running on Ubuntu 16.04.5 LTS
But running the command across both nodes gives the following error:

```
mpiuser@s1:/disk3/cloud/openmpi-4.0.0/examples$ mpirun -host s1,s2 ./hello
[s2:26283] [[40517,0],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file grpcomm_direct.c at line 355
--------------------------------------------------------------------------
An internal error has occurred in ORTE:
[[40517,0],1] FORCE-TERMINATE AT Data unpack would read past end of buffer:-26 - error grpcomm_direct.c(359)
This is something that should be reported to the developers.
--------------------------------------------------------------------------
```
@RahulKulhari Please do not add new issues to a closed issue; thanks.
@RahulKulhari were you able to resolve the issue? Facing same problem!
Thank you for taking the time to submit an issue!
Background information
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
Open MPI v3.0.0
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Following the installation guidance in the FAQ.
Please describe the system on which you are running
Details of the problem
I got the error below:

```
An internal error has occurred in ORTE:
This is something that should be reported to the developers.
```

Thanks a lot in advance.