open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

openmpi version 1.8 and above crash #3807

Closed: gruel closed this issue 7 years ago

gruel commented 7 years ago

Background information

I am trying to compile and use the finite element software ParaFEM. I discovered a problem in Open MPI that appears between versions 1.6.5 and 1.8. I should mention that I also tested with MPICH 3.2, and there it works as expected.

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

Version: 1.6 (last one working properly), 1.8, 2.1.1

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

I used different computers and systems: mainly my laptop (Intel based) running Arch Linux with Open MPI from the distribution; I also tested in a Docker image (Ubuntu), again with the Open MPI provided by the distribution; and I also tested on an HPC system at the University of Manchester with different Open MPI versions available. This is how I isolated the problem as appearing after version 1.6.5, which was later confirmed by manual compilation.

Please describe the system on which you are running


Details of the problem

Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.

Running the software with Open MPI crashes with the following error message:

mpirun -n 4 ../bin/p121 p121_demo

PE no:        1 nels_pp:     2000
PE no:        3 nels_pp:     2000
PE no:        2 nels_pp:     2000
PE no:        4 nels_pp:     2000
PE no:        3 neq_pp:    24590
PE no:        1 neq_pp:    24590
PE no:        2 neq_pp:    24590
PE no:        4 neq_pp:    24590
Average accesses ratio - remote/local:     0.11
Total remote accesses                :    10920
Average remote accesses per PE       :  1820.00
[29de3eacedef:2122] *** An error occurred in MPI_Recv
[29de3eacedef:2122] *** reported by process [2024210433,2]
[29de3eacedef:2122] *** on communicator MPI_COMM_WORLD
[29de3eacedef:2122] *** MPI_ERR_TRUNCATE: message truncated
[29de3eacedef:2122] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[29de3eacedef:2122] ***    and potentially your MPI job)
[29de3eacedef:02115] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[29de3eacedef:02115] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

The expected output is something like:

mpirun -n 4 ../bin/p121 p121_demo

PE no:        1 nels_pp:     2000
PE no:        2 nels_pp:     2000
PE no:        3 nels_pp:     2000
PE no:        4 nels_pp:     2000
PE no:        1 neq_pp:    24590
PE no:        2 neq_pp:    24590
PE no:        3 neq_pp:    24590
PE no:        4 neq_pp:    24590
Average accesses ratio - remote/local:     0.11
Total remote accesses                :    10920
Average remote accesses per PE       :  1820.00
STOP ParaFEM: shutdown: the program terminated successfully
STOP ParaFEM: shutdown: the program terminated successfully
STOP ParaFEM: shutdown: the program terminated successfully
STOP ParaFEM: shutdown: the program terminated successfully

Here is a script to download and compile the software:

svn co https://svn.code.sf.net/p/parafem/code/trunk parafem-code
cd parafem-code/parafem

MACHINE=linuxdesktop ./make-parafem 
mkdir test

cp examples/5th_ed/p121/demo/p121_demo.mg test/
cd test

echo "Test parafem without mpi" 
../bin/p12meshgen p121_demo
../bin/p121 p121_demo

echo "Test parafem with mpi"
mpirun -np 4 ../bin/p121 p121_demo
ggouaillardet commented 7 years ago

Thanks for the report, I will take a crack at it. Meanwhile, can you please try running again with mpirun --mca coll ^tuned ...? There is a known issue with coll/tuned when the matching signatures are built differently (e.g. one big datatype on one hand, and many small datatypes on the other hand), so I'd like to figure out whether this is that known issue or not.

gruel commented 7 years ago

This is the output of the command you asked for. I hope it helps. Thanks.

mpirun --mca coll ^tuned  ../bin/p121 p121_demo

PE no:        2 nels_pp:     2000
PE no:        3 nels_pp:     2000
PE no:        1 nels_pp:     2000
PE no:        4 nels_pp:     2000
PE no:        3 neq_pp:    24590
PE no:        2 neq_pp:    24590
PE no:        1 neq_pp:    24590
PE no:        4 neq_pp:    24590
Average accesses ratio - remote/local:     0.11
Total remote accesses                :    10920
Average remote accesses per PE       :  1820.00
[ruth:24241] *** An error occurred in MPI_Recv
[ruth:24241] *** reported by process [233635841,1]
[ruth:24241] *** on communicator MPI_COMM_WORLD
[ruth:24241] *** MPI_ERR_TRUNCATE: message truncated
[ruth:24241] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[ruth:24241] ***    and potentially your MPI job)
[ruth:24235] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[ruth:24235] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
ggouaillardet commented 7 years ago

Thanks for the log. This is not the issue I had in mind, so unfortunately I have no workaround to provide yet.

I will investigate this issue this week.

ggouaillardet commented 7 years ago

To me, that looks like a bug in the app.

Basically, it does

      CALL MPI_PROBE(MPI_ANY_SOURCE,MPI_ANY_TAG,MPI_COMM_WORLD,  &
          vstatus(1,numpesput+1),ier)
! ...
      CALL MPI_RECV (toput_temp(1,ii),lenput(pe_number),MPI_INTEGER,        &
           MPI_ANY_SOURCE,MPI_ANY_TAG,MPI_COMM_WORLD,vstatus(1,numpesput),ier)

and I am not sure there is any guarantee that MPI_RECV will indeed receive the message that was previously MPI_PROBE'd.

@bosilca @jsquyres @hjelmn can you please comment on that?

The inline patch below fixes that:

Index: src/modules/mpi/gather_scatter.f90
===================================================================
--- src/modules/mpi/gather_scatter.f90  (revision 2267)
+++ src/modules/mpi/gather_scatter.f90  (working copy)
@@ -1703,7 +1703,7 @@
       pesput(ii) = pe_number
       putpes(pe_number) = ii
       CALL MPI_RECV (toput_temp(1,ii),lenput(pe_number),MPI_INTEGER,        &
-           MPI_ANY_SOURCE,MPI_ANY_TAG,MPI_COMM_WORLD,vstatus(1,numpesput),ier)
+           vstatus(MPI_SOURCE,numpesput),pe_number,MPI_COMM_WORLD,vstatus(1,numpesput),ier)
       IF (ier .NE. MPI_SUCCESS) THEN
          CALL MPERROR('Error in (A5) receive',ier)
       END IF
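For illustration, here is a minimal, self-contained sketch of the pattern the patch applies (probe with wildcards, then direct the receive at the source and tag reported in the probe status). This is not ParaFEM code; the program and variable names (probe_then_recv, buf, payload) are made up for this example.

PROGRAM probe_then_recv
  USE mpi
  IMPLICIT NONE
  INTEGER :: ier, rank, nprocs, i, src, tag, count, payload
  INTEGER :: status(MPI_STATUS_SIZE)
  INTEGER, ALLOCATABLE :: buf(:)

  CALL MPI_INIT(ier)
  CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank, ier)
  CALL MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ier)

  IF (rank == 0) THEN
    DO i = 1, nprocs - 1
      ! Probe for any incoming message, then pin the receive to the
      ! peer and tag that the probe actually matched.
      CALL MPI_PROBE(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, status, ier)
      src = status(MPI_SOURCE)
      tag = status(MPI_TAG)
      CALL MPI_GET_COUNT(status, MPI_INTEGER, count, ier)
      ALLOCATE(buf(count))
      CALL MPI_RECV(buf, count, MPI_INTEGER, src, tag, MPI_COMM_WORLD, status, ier)
      DEALLOCATE(buf)
    END DO
  ELSE
    ! Every other rank sends one message to rank 0, tagged with its rank.
    payload = rank
    CALL MPI_SEND(payload, 1, MPI_INTEGER, 0, rank, MPI_COMM_WORLD, ier)
  END IF

  CALL MPI_FINALIZE(ier)
END PROGRAM probe_then_recv

With a single thread receiving on the communicator, the directed receive is guaranteed to match the very message the probe reported, because MPI preserves message order per peer.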
gruel commented 7 years ago

Thank you very much. Indeed, the patch fixes the problem. Take care.

jsquyres commented 7 years ago

@ggouaillardet is correct: just because you PROBE successfully for a message, a subsequent RECV with ANY_SOURCE and/or ANY_TAG is not guaranteed to get the same message that you PROBEd for.

Another way to get the effect @ggouaillardet showed is to use MPROBE (which, if it returns success, removes the message in question from matching, so that a subsequent MRECV receives exactly that message).
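For illustration, a minimal sketch of that MPROBE/MRECV pattern, assuming an MPI-3 library and the Fortran mpi module; the program and variable names (mprobe_then_mrecv, buf, message) are made up for this example and are not from ParaFEM.

PROGRAM mprobe_then_mrecv
  USE mpi
  IMPLICIT NONE
  INTEGER :: ier, rank, nprocs, i, count, payload, message
  INTEGER :: status(MPI_STATUS_SIZE)
  INTEGER, ALLOCATABLE :: buf(:)

  CALL MPI_INIT(ier)
  CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank, ier)
  CALL MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ier)

  IF (rank == 0) THEN
    DO i = 1, nprocs - 1
      ! MPI_MPROBE removes the matched message from further matching and
      ! returns a message handle describing it.
      CALL MPI_MPROBE(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, message, status, ier)
      CALL MPI_GET_COUNT(status, MPI_INTEGER, count, ier)
      ALLOCATE(buf(count))
      ! MPI_MRECV receives exactly the message identified by the handle,
      ! so no other incoming message can steal the match.
      CALL MPI_MRECV(buf, count, MPI_INTEGER, message, status, ier)
      DEALLOCATE(buf)
    END DO
  ELSE
    payload = rank
    CALL MPI_SEND(payload, 1, MPI_INTEGER, 0, rank, MPI_COMM_WORLD, ier)
  END IF

  CALL MPI_FINALIZE(ier)
END PROGRAM mprobe_then_mrecv

Because the matched message is taken out of the matching queue by MPI_MPROBE, this pattern stays correct even when several threads receive on the same communicator.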

bosilca commented 7 years ago

In MPI the order of reception between multiple peers is not deterministic. Basically, during the PROBE, OMPI tries to find a message from all potential peers in some order and returns the first matching message. However, between that moment and the receive from ANY_SOURCE, a message might have been received from another peer that is placed earlier in the tested list of peers, and because of the loose receive requirements (ANY_SOURCE) this new message might match instead. @ggouaillardet's patch removes the peer non-determinism by forcing the receive to be issued on the peer returned by the PROBE. MPI guarantees FIFO ordering of messages on a communicator for a given peer, so the fact that you use ANY_TAG does not introduce any non-determinism.

As long as your application only receives from a particular communicator in a single thread, the approach proposed by Gilles is correct. @jsquyres' solution is more generic and is the only sensible one in the multi-threaded case.
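As a small illustration of the per-peer FIFO guarantee mentioned above (again not ParaFEM code; the program name fifo_per_peer and the values 111/222 are made up): if rank 1 posts two sends to rank 0 on the same communicator, rank 0 matches them in send order even when receiving with MPI_ANY_TAG, as long as the source is fixed.

PROGRAM fifo_per_peer
  USE mpi
  IMPLICIT NONE
  INTEGER :: ier, rank, first, second
  INTEGER :: status(MPI_STATUS_SIZE)

  CALL MPI_INIT(ier)
  CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank, ier)

  IF (rank == 1) THEN
    ! Two sends from the same peer on the same communicator.
    first  = 111
    second = 222
    CALL MPI_SEND(first,  1, MPI_INTEGER, 0, 10, MPI_COMM_WORLD, ier)
    CALL MPI_SEND(second, 1, MPI_INTEGER, 0, 20, MPI_COMM_WORLD, ier)
  ELSE IF (rank == 0) THEN
    ! Fixed source, wildcard tag: per-peer FIFO ordering guarantees the
    ! first receive matches the first send (first == 111, second == 222).
    CALL MPI_RECV(first,  1, MPI_INTEGER, 1, MPI_ANY_TAG, MPI_COMM_WORLD, status, ier)
    CALL MPI_RECV(second, 1, MPI_INTEGER, 1, MPI_ANY_TAG, MPI_COMM_WORLD, status, ier)
  END IF

  CALL MPI_FINALIZE(ier)
END PROGRAM fifo_per_peer

Run with at least two processes (e.g. mpirun -n 2); the first receive always yields 111 and the second 222.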