That is strange indeed. I've tried it (r1352) on different systems:
1) Windows laptop, gcc 4.8.1, MPICH2 1.4.1p1 (here I used 'mpiexec -n 2'
instead of 'mpirun -np 2').
2) Unix cluster, gcc 4.7.0, openmpi 1.4.5
3) Unix cluster, gcc 4.3.2, openmpi 1.4.3
and could not reproduce the issue (everything is working fine).
Therefore, I can only suggest a number of (crazy) tests/ideas at the moment.
- make sure that you do not have modified files in your source (run svn diff).
- run './adda_mpi -V' and show the output (down to GPL text).
- further let's simplify the test runs a little bit. Attached is sphere4.geom
(obtained by 'adda -grid 4 -save_geom'). I propose the command line:
... -shape read sphere4.geom -sym enf
- play with compile flags. Remove USE_SSE3 (test) then additionally add
DEBUGFULL (test).
- Then use the latter debug version to save as much as possible, like:
... -shape read sphere4.geom -sym enf -store_beam > out
and show here all the output (including stdout).
- If that is possible, please also try different (earlier) versions of gcc
and/or openmpi.
- Just in case (should not make any difference) try using 'mpiexec -n 2'
instead of 'mpirun -np 2'.
Original comment by yurkin
on 14 Jul 2014 at 6:14
Your suggestions were not crazy at all.
I found the problem.
After many tests (with different gcc versions and compiler options) I tried to
compile an updated version of openmpi (1.8.1).
Then I was able to get the right results using only the updated mpirun (without
recompiling adda).
This was the most annoying part of the tests because the official Ubuntu 14.04
repository only has the 1.6 version of openmpi, so I had to compile the
updated version from source.
I think that in the end this issue may be useful.
The current Ubuntu 14.04 LTS repository only has openmpi-1.6, which seems to be
the reason for my crazy results.
Even if sparse mode is not the most frequently used option for adda, Ubuntu
14.04 users should be aware that they should prefer mpich or a self-compiled
openmpi library.
Of course, I am available for further investigations on this problem.
Original comment by davide.o...@gmail.com
on 14 Jul 2014 at 5:16
OK, we have a similar problem with gcc 4.6.2 (also comes with Ubuntu LTS, but
12.04) - issue 194, which was unsolvable on our side. Now we have another with
openmpi...
Davide, can you please localize the bug to a particular version (or range of
versions) of openmpi? Ideally, it would be great if we could connect it to a
certain bug in the openmpi issue tracker, but that may be hard to pinpoint.
Also, please run the debug version with storing incident beam, as I mentioned
earlier, and report the output. I want to localize the problem in ADDA source
as well. Maybe (though unlikely) it would be possible to arrange some MPI code
to make it operational under faulty openmpi as well...
Original comment by yurkin
on 14 Jul 2014 at 6:47
Ok I have run some tests.
Find all results in the attached ompi_check.zip
In the zipped file you'll find openmpi_check.txt, which lists my openmpi tests
with various library versions (all libraries and adda are compiled with gcc-4.8).
Basically I found that the problem emerges with openmpi-1.5.4 and lasts until
openmpi-1.7.2.
The first openmpi version NOT affected by the bug is 1.7.3.
I've also run the DEBUGFULL build with 1 and 2 MPI processes for openmpi versions
1.6.5 (affected) and 1.8.1 (not affected).
All executables were compiled with gcc-4.8 and DEBUGFULL, and run with
... -shape read sphere4.geom -sym enf -store_beam > out...
You'll find all the results in the corresponding directories
run-OMPI_VERSION-np#/
Inside each directory you also find the corresponding output
out-OMPI_VERSION-np#
Original comment by davide.o...@gmail.com
on 17 Jul 2014 at 10:14
I don't know how the problem appeared in OpenMPI, but it seems to be fixed by
this revision - https://svn.open-mpi.org/trac/ompi/changeset/29187 , and then
it was incorporated into 1.7.3 - https://svn.open-mpi.org/trac/ompi/ticket/3772 .
In ADDA the error probably appears during the call to
MPI_Allgatherv(MPI_IN_PLACE,0,...), which in turn happens only during calls of
AllGather(NULL,...). The latter is only called in the Sparse MPI code in two
places - matvec.c (for the matrix-vector product) and make_particle.c (to make
position_full).
Interestingly, both these calls should become irrelevant if issue 160 is
implemented. But for now we need some workaround, or at least a meaningful
warning/error message when a faulty openmpi is used.
Original comment by yurkin
on 18 Jul 2014 at 9:14
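Note on the MPI_Allgatherv(MPI_IN_PLACE,0,...) call mentioned above: per the MPI
standard, when MPI_IN_PLACE is passed as the send buffer, the send count and
type are ignored and each process's contribution is taken from its own slot of
the receive buffer. Below is a minimal serial model of that buffer contract in
plain C, with no MPI calls. The function name allgatherv_in_place_model and the
array-of-buffers layout are hypothetical, chosen only for illustration; this is
not ADDA's AllGather wrapper.

```c
#include <stddef.h>
#include <string.h>

/* Serial model of an in-place all-gather over nprocs simulated "ranks".
 * bufs[r] is the full receive buffer of rank r.
 * On entry, rank r has filled only its own block:
 *   bufs[r][displs[r]] .. bufs[r][displs[r] + counts[r] - 1]
 * On exit, every buffer holds the blocks of all ranks.
 * Blocks are assumed not to overlap. */
void allgatherv_in_place_model(double **bufs, int nprocs,
                               const int *counts, const int *displs)
{
    for (int src = 0; src < nprocs; src++) {
        for (int dst = 0; dst < nprocs; dst++) {
            if (dst == src)
                continue; /* the rank's own block is already in place */
            memcpy(bufs[dst] + displs[src], bufs[src] + displs[src],
                   (size_t)counts[src] * sizeof(double));
        }
    }
}
```

In a real MPI program the counts and displacements would be passed as the
recvcounts and displs arguments, and a library bug in this path would show up
as wrong data in the blocks received from other ranks.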
Maybe I am missing something, but those parts of the code are unchanged
from adda-1.2 to the current 1.3 revision.
If the problem appears during MPI_Allgatherv, it should also be present in
past releases of adda. Am I correct?
On the contrary, adda-1.2 is not affected by this problem.
After some tests I have localized the problem to adda-r1253 (at
least with ompi-1.6.5, I did not test with other versions of ompi).
Original comment by davide.o...@gmail.com
on 18 Jul 2014 at 12:12
Davide, that is an important comment. I forgot about the dependence on ADDA
version. Then it seems that MPI_Allgatherv only has problems with complex
built-in datatypes like MPI_C_DOUBLE_COMPLEX. Let's check it.
Please, change in parbas.h
#ifdef MPI_C_DOUBLE_COMPLEX
to
#if 0
Then recompile and test with ompi-1.6.5.
Original comment by yurkin
on 21 Jul 2014 at 4:18
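Note on the test above: switching the #ifdef to #if 0 stops the code from using
the built-in MPI_C_DOUBLE_COMPLEX datatype. The thread does not show what
parbas.h does in the other branch. A common fallback, assumed here, is to
transfer each complex value as two consecutive doubles. C99 guarantees that a
double complex has the same representation as an array of two doubles (real
part first), so such a transfer is well defined. A minimal sketch in plain C,
without MPI, with hypothetical function names:

```c
#include <complex.h>
#include <stddef.h>
#include <string.h>

/* Number of plain doubles needed to carry n complex doubles. */
size_t doubles_for_complex(size_t n)
{
    return 2 * n;
}

/* Copy n complex values into out as re0, im0, re1, im1, ...
 * out must have room for doubles_for_complex(n) elements.
 * Relies on the C99 layout guarantee for complex types. */
void complex_to_doubles(const double complex *in, size_t n, double *out)
{
    memcpy(out, in, n * sizeof(double complex));
}
```

With this layout, a message of n complex numbers would be described to MPI as
2*n elements of a plain double type instead of n elements of the complex type.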
Checked!
Everything is fine with this change and ompi-1.6.5
Maybe it is possible to avoid the problem by adding an ompi version check in
parbas.h?
Original comment by davide.o...@gmail.com
on 21 Jul 2014 at 9:15
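Note on the version check suggested above: a minimal sketch of what such a check
could look like, assuming the affected range is exactly the one reported
earlier in this thread (openmpi 1.5.4 up to and including 1.7.2). The function
names are hypothetical and this is not the code that went into ADDA. When
building against Open MPI, the three numbers would typically come from the
OMPI_MAJOR_VERSION, OMPI_MINOR_VERSION and OMPI_RELEASE_VERSION macros in
mpi.h.

```c
/* Pack a release triple into one comparable integer.
 * Assumes each component is below 1000. */
long pack_version(int major, int minor, int release)
{
    return ((long)major * 1000 + minor) * 1000 + release;
}

/* Returns 1 if the Open MPI release lies in the range reported as affected
 * in this thread (1.5.4 <= version < 1.7.3), otherwise 0. */
int ompi_release_affected(int major, int minor, int release)
{
    long v = pack_version(major, minor, release);
    return v >= pack_version(1, 5, 4) && v < pack_version(1, 7, 3);
}
```

A check based on header macros only sees the version ADDA was compiled against,
not the library that mpirun loads at run time, so it cannot catch a mismatch
between the two.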
It seems it was partly my fault after all. Davide, please test the recent
r1355, it should fix the problem. It would be great if you can test the
boundary cases (1.7.2 and 1.7.3), and also compiling ADDA using one version of
OpenMPI then executing with another. ADDA should either work correctly or
produce a meaningful error message.
Original comment by yurkin
on 21 Jul 2014 at 10:58
Everything works fine, except for mpirun-1.7.2 running adda compiled with 1.7.3,
which, as expected, produces the following message:
ERROR: (../comm.c:296) MPI library version (2.1) is too old for current ADDA
executable. Version 2.2 or newer is required. Alternatively, you may recompile
ADDA using this version of the library.
Original comment by davide.o...@gmail.com
on 21 Jul 2014 at 2:51
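Note on the error message above: it compares the MPI standard version reported
by the library at run time with the one the executable requires. A minimal
sketch of that comparison in plain C follows. The function names and the
message wording are hypothetical; this is not the code at comm.c:296. In an MPI
program the run-time pair would come from MPI_Get_version(), and the required
pair could be taken from the MPI_VERSION and MPI_SUBVERSION macros seen at
compile time.

```c
#include <stddef.h>
#include <stdio.h>

/* Returns 1 if version ver.sub is older than req_ver.req_sub, else 0. */
int mpi_std_too_old(int ver, int sub, int req_ver, int req_sub)
{
    return ver < req_ver || (ver == req_ver && sub < req_sub);
}

/* Returns 1 and writes a diagnostic into msg (at most len bytes) when the
 * run-time library is too old; returns 0 and writes an empty string
 * otherwise. */
int check_mpi_std(int ver, int sub, int req_ver, int req_sub,
                  char *msg, size_t len)
{
    if (!mpi_std_too_old(ver, sub, req_ver, req_sub)) {
        if (len > 0)
            msg[0] = '\0';
        return 0;
    }
    snprintf(msg, len,
             "MPI library version (%d.%d) is too old; "
             "version %d.%d or newer is required",
             ver, sub, req_ver, req_sub);
    return 1;
}
```

Called with a run-time version of 2.1 and a required version of 2.2, the check
reports an error, which matches the case of mpirun-1.7.2 running an executable
built with 1.7.3.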
Great, thanks for your efforts.
Original comment by yurkin
on 21 Jul 2014 at 3:05
Original issue reported on code.google.com by
davide.o...@gmail.com
on 13 Jul 2014 at 8:32