open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

ORTE_ERROR_LOG: Data unpack would read past end of buffer in file grpcomm_direct.c #4437

Closed. shinechou closed this issue 6 years ago.

shinechou commented 6 years ago

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

Open MPI v3.0.0

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Following the installation guidance in the FAQ:

shell$ gunzip -c openmpi-3.0.0.tar.gz | tar xf -
shell$ cd openmpi-3.0.0 
shell$ ./configure --enable-orterun-prefix-by-default --with-cuda
shell$ make all install

Please describe the system on which you are running


Details of the problem

shell$ mpirun -np 2 -x LD_LIBRARY_PATH -hostfile machinefile python fcn_horovod.py 

I got the error as below,

[brs-dualG:09057] [[48500,0],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file grpcomm_direct.c at line 351

An internal error has occurred in ORTE:

[[48500,0],1] FORCE-TERMINATE AT Data unpack would read past end of buffer:-26 - error grpcomm_direct.c(355)

This is something that should be reported to the developers.

Thanks a lot in advance.

rhc54 commented 6 years ago

This looks like version confusion between the two nodes. Note that the -x option only forwards that variable to the application processes, not the daemons launched by mpirun. I suspect the path on the remote node is picking up a different OMPI install. Can you check?
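A quick way to check what the remote node's non-interactive shell picks up (a sketch, assuming the remote node is reachable as client, or mpiuser@client as used later in this thread):

# which Open MPI binaries a non-interactive remote shell finds first
$ ssh client which mpirun orted
# and which version that installation reports
$ ssh client 'ompi_info | head -3'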

shinechou commented 6 years ago

@rhc54: thanks a lot. I don't think that is the case, because I compiled/installed them using exactly the same configuration; you can check the ompi_info output below. Thanks again.

ompi_info of the local node:

Open MPI: 3.0.0
Open MPI repo revision: v3.0.0
Open MPI release date: Sep 12, 2017
Open RTE: 3.0.0
Open RTE repo revision: v3.0.0
Open RTE release date: Sep 12, 2017
OPAL: 3.0.0
OPAL repo revision: v3.0.0
OPAL release date: Sep 12, 2017
MPI API: 3.1.0
Ident string: 3.0.0
Prefix: /usr/local
Configured architecture: x86_64-unknown-linux-gnu
Configure host: ryan-z820
Configured by: root
Configured on: Fri Oct 27 16:11:22 CEST 2017
Configure host: ryan-z820
Configure command line: '--enable-orterun-prefix-by-default' '--with-cuda'

ompi_info of the remote node:

Open MPI repo revision: v3.0.0
Open MPI release date: Sep 12, 2017
Open RTE: 3.0.0
Open RTE repo revision: v3.0.0
Open RTE release date: Sep 12, 2017
OPAL: 3.0.0
OPAL repo revision: v3.0.0
OPAL release date: Sep 12, 2017
MPI API: 3.1.0
Ident string: 3.0.0
Prefix: /usr/local
Configured architecture: x86_64-unknown-linux-gnu
Configure host: brs-dualG
Configured by: brs
Configured on: Thu Oct 26 09:47:52 CEST 2017
Configure host: brs-dualG
Configure command line: '--enable-orterun-prefix-by-default' '--with-cuda'

rhc54 commented 6 years ago

Thanks for providing that output. I gather you built/installed them on each node separately, yes? That is a rather unusual way of doing it and generally not recommended - it is much safer to install on a shared file system directory.

Try configuring with --enable-debug, and then run the following:

$ mpirun -npernode 1 -mca plm_base_verbose 5 hostname
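For reference, that would mean rebuilding with --enable-debug added to the original configure line, roughly:

# same options as the original build, plus debug support
$ ./configure --enable-orterun-prefix-by-default --with-cuda --enable-debug
$ make all install
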
jsquyres commented 6 years ago

@shinechou You might also want to check that there's not some OS/distro-installed Open MPI on your nodes that is being found and used (e.g., earlier in the PATH than your hand-installed Open MPI installations).
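On Ubuntu (which both nodes run, per later comments), two quick ways to spot a distro-provided Open MPI (a sketch; package names can vary):

# list every mpirun/mpicc on the PATH, not just the first hit
$ which -a mpirun mpicc
# list any Debian/Ubuntu Open MPI packages that are installed
$ dpkg -l | grep -i openmpi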

shinechou commented 6 years ago

@rhc54: you are right, I installed them separately. What is the proper way to do it? Do you have any guidance for that? I ask because after installing Open MPI I have to install another library on top of it. I'll try to rebuild with the --enable-debug option and run the command you mentioned.

@jsquyres: thank you for your comments. But I'm sure there is no other Open MPI installed on either of my nodes.

jsquyres commented 6 years ago

You could install Open MPI on one node, and then tar up the installation tree on that node, and then untar it on the other node. Then you'd know for sure that you have exactly the same binary installation on both nodes. Something like this:

$ ./configure --prefix=/opt/openmpi-3.0.0
$ make -j 32 install
...
$ cd /opt
$ tar jcf ~/ompi-install-3.0.0.tar.bz2 openmpi-3.0.0
$ scp ~/ompi-install-3.0.0.tar.bz2 othernode:

$ ssh othernode
...login to othernode...
$ cd /opt
$ rm -rf openmpi-3.0.0
$ sudo tar xf ~/ompi-install-3.0.0.tar.bz2

Usually, people install Open MPI either via package (e.g., RPM) on each node, or they install Open MPI on a network filesystem (such as NFS) so that the one, single installation is available on all nodes.
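For example, with a directory that both nodes mount (the /shared path below is only a placeholder), a single build can serve every node:

# install into a shared prefix so there is exactly one copy of Open MPI
$ ./configure --prefix=/shared/openmpi-3.0.0 --enable-orterun-prefix-by-default --with-cuda
$ make all install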


Note that I mentioned the multiple Open MPI installation issue because the majority of the time when people run into this error, it's because they are accidentally / unknowingly using multiple different versions of Open MPI (note that Open MPI currently only supports running exactly the same version of Open MPI on all nodes in a single job). This kind of error almost always indicates that version X of Open MPI is trying to read more data than was sent by Open MPI version Y.

Try this exercise:

$ ompi_info | head
$ ssh othernode ompi_info | head

Doing the 2nd line non-interactively is important (i.e., a single command -- not ssh'ing to the other node and then entering another command to run ompi_info).

Make sure that both ompi_info outputs return the same version.

If they do, then there's something configured differently between the two (but which might still be a bug, because "same version but configured differently" should still usually work).
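One more comparison that can help is what a non-interactive shell on each node puts on the PATH (a sketch, reusing the othernode placeholder from above):

# PATH and ompi_info location as seen locally...
$ echo $PATH; which ompi_info
# ...and as seen by a non-interactive shell on the other node
$ ssh othernode 'echo $PATH; which ompi_info'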

shinechou commented 6 years ago

@jsquyres: thanks again for your guidance. I tried your exercise; both nodes return the same version, as shown below.

$ ompi_info | head
Package: Open MPI root@ryan-z820 Distribution
Open MPI: 3.0.0
Open MPI repo revision: v3.0.0
Open MPI release date: Sep 12, 2017
Open RTE: 3.0.0
Open RTE repo revision: v3.0.0
Open RTE release date: Sep 12, 2017
OPAL: 3.0.0
OPAL repo revision: v3.0.0
OPAL release date: Sep 12, 2017

$ ssh mpiuser@client ompi_info | head
mpiuser@client's password:
Package: Open MPI root@brs-dualG Distribution
Open MPI: 3.0.0
Open MPI repo revision: v3.0.0
Open MPI release date: Sep 12, 2017
Open RTE: 3.0.0
Open RTE repo revision: v3.0.0
Open RTE release date: Sep 12, 2017
OPAL: 3.0.0
OPAL repo revision: v3.0.0
OPAL release date: Sep 12, 2017

shinechou commented 6 years ago

@rhc54: thanks. I tried your suggestion to run $ mpirun -npernode 1 -mca plm_base_verbose 5 hostname, but I got the error message below,

$ mpirun -npernode 1 -mca plm_base_verbose 5 master
[ryan-z820:21968] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[ryan-z820:21968] plm:base:set_hnp_name: initial bias 21968 nodename hash 974627533
[ryan-z820:21968] plm:base:set_hnp_name: final jobfam 52490
[ryan-z820:21968] [[52490,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[ryan-z820:21968] [[52490,0],0] plm:base:receive start comm
[ryan-z820:21968] [[52490,0],0] plm:base:setup_job
[ryan-z820:21968] [[52490,0],0] plm:base:setup_vm
[ryan-z820:21968] [[52490,0],0] plm:base:setup_vm creating map
[ryan-z820:21968] [[52490,0],0] setup:vm: working unmanaged allocation
[ryan-z820:21968] [[52490,0],0] using default hostfile /usr/local/etc/openmpi-default-hostfile
[ryan-z820:21968] [[52490,0],0] plm:base:setup_vm only HNP in allocation
[ryan-z820:21968] [[52490,0],0] plm:base:setting slots for node ryan-z820 by cores
[ryan-z820:21968] [[52490,0],0] complete_setup on job [52490,1]
[ryan-z820:21968] [[52490,0],0] plm:base:launch_apps for job [52490,1]
[ryan-z820:21968] [[52490,0],0] plm:base:orted_cmd sending orted_exit commands
--------------------------------------------------------------------------
mpirun was unable to find the specified executable file, and therefore
did not launch the job.  This error was first reported for process
rank 0; it may have occurred for other processes as well.

NOTE: A common cause for this error is misspelling a mpirun command
      line parameter option (remember that mpirun interprets the first
      unrecognized command line token as the executable).

Node:       ryan-z820
Executable: master
--------------------------------------------------------------------------
[ryan-z820:21968] [[52490,0],0] plm:base:receive stop comm

or

$ mpirun -npernode 1 -mca plm_base_verbose 1 master python fcn_horovod.py 
--------------------------------------------------------------------------
mpirun was unable to find the specified executable file, and therefore
did not launch the job.  This error was first reported for process
rank 0; it may have occurred for other processes as well.

NOTE: A common cause for this error is misspelling a mpirun command
      line parameter option (remember that mpirun interprets the first
      unrecognized command line token as the executable).

Node:       ryan-z820
Executable: master
jsquyres commented 6 years ago

GitHub pro tip: use three backticks (```) to fence verbatim regions. See https://guides.github.com/features/mastering-markdown/.

Ok, good, so you have the same Open MPI v3.0.0 installed on both sides. But something must be different between them, or you wouldn't be getting these errors.

Are both machines the same hardware? Are they running the same version of your Linux distro configured generally the same way? (e.g., they're both 64 bit, etc.)

Also, I think @rhc54 meant for you to run the hostname executable -- not master. hostname(1) is a Linux command that tells you what host you are on. He did not mean for you to replace hostname with the actual hostname of the machine (assumedly master).
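That is, the command is typed literally; on the head node it should simply print the machine's name (ryan-z820, judging from the logs above):

$ hostname
ryan-z820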

shinechou commented 6 years ago

@jsquyres: thanks a lot. Sorry, I'm a newbie with Linux and Open MPI, so I didn't realize that hostname is a command rather than the machine's hostname (which is indeed master). I am using the same version of Ubuntu on both nodes (Ubuntu 16.04 64-bit desktop), but the hardware is different: the "master" node is an HP Z820 workstation (Xeon E5-2670, 64 GB ECC RAM, ASUS GTX 1080), and the "client" node is a DIY PC (i3-7100, 32 GB DDR4 RAM, ASUS GTX 1080Ti). Could the difference in hardware configuration between the two nodes cause this error?

'''
$ mpirun -npernode 1 -mca plm_base_verbose 5 hostname
[ryan-z820:22613] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[ryan-z820:22613] plm:base:set_hnp_name: initial bias 22613 nodename hash 974627533
[ryan-z820:22613] plm:base:set_hnp_name: final jobfam 49295
[ryan-z820:22613] [[49295,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[ryan-z820:22613] [[49295,0],0] plm:base:receive start comm
[ryan-z820:22613] [[49295,0],0] plm:base:setup_job
[ryan-z820:22613] [[49295,0],0] plm:base:setup_vm
[ryan-z820:22613] [[49295,0],0] plm:base:setup_vm creating map
[ryan-z820:22613] [[49295,0],0] setup:vm: working unmanaged allocation
[ryan-z820:22613] [[49295,0],0] using default hostfile /usr/local/etc/openmpi-default-hostfile
[ryan-z820:22613] [[49295,0],0] plm:base:setup_vm only HNP in allocation
[ryan-z820:22613] [[49295,0],0] plm:base:setting slots for node ryan-z820 by cores
[ryan-z820:22613] [[49295,0],0] complete_setup on job [49295,1]
[ryan-z820:22613] [[49295,0],0] plm:base:launch_apps for job [49295,1]
[ryan-z820:22613] [[49295,0],0] plm:base:launch wiring up iof for job [49295,1]
[ryan-z820:22613] [[49295,0],0] plm:base:launch job [49295,1] is not a dynamic spawn
ryan-z820
[ryan-z820:22613] [[49295,0],0] plm:base:orted_cmd sending orted_exit commands
[ryan-z820:22613] [[49295,0],0] plm:base:receive stop comm
'''

rhc54 commented 6 years ago

Sorry for the confusion - I expected you to retain the -hostfile machinefile option
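That is, the same verbose run, but spanning both nodes from the hostfile:

$ mpirun -npernode 1 -hostfile machinefile -mca plm_base_verbose 5 hostname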

shinechou commented 6 years ago

@rhc54: thanks. Could you please help me figure it out? Please check the output below,

'''
$ mpirun -npernode 1 -hostfile machinefile -mca plm_base_verbose 5 hostname
[ryan-z820:15616] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[ryan-z820:15616] plm:base:set_hnp_name: initial bias 15616 nodename hash 974627533
[ryan-z820:15616] plm:base:set_hnp_name: final jobfam 42458
[ryan-z820:15616] [[42458,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[ryan-z820:15616] [[42458,0],0] plm:base:receive start comm
[ryan-z820:15616] [[42458,0],0] plm:base:setup_job
[ryan-z820:15616] [[42458,0],0] plm:base:setup_vm
[ryan-z820:15616] [[42458,0],0] plm:base:setup_vm creating map
[ryan-z820:15616] [[42458,0],0] setup:vm: working unmanaged allocation
[ryan-z820:15616] [[42458,0],0] using hostfile machinefile
[ryan-z820:15616] [[42458,0],0] checking node ryan-z820
[ryan-z820:15616] [[42458,0],0] ignoring myself
[ryan-z820:15616] [[42458,0],0] checking node client
[ryan-z820:15616] [[42458,0],0] plm:base:setup_vm add new daemon [[42458,0],1]
[ryan-z820:15616] [[42458,0],0] plm:base:setup_vm assigning new daemon [[42458,0],1] to node client
[ryan-z820:15616] [[42458,0],0] plm:rsh: launching vm
[ryan-z820:15616] [[42458,0],0] plm:rsh: local shell: 0 (bash)
[ryan-z820:15616] [[42458,0],0] plm:rsh: assuming same remote shell as local shell
[ryan-z820:15616] [[42458,0],0] plm:rsh: remote shell: 0 (bash)
[ryan-z820:15616] [[42458,0],0] plm:rsh: final template argv: /usr/bin/ssh