Some tests fail with mpich

dschwoerer commented 4 years ago

Expected behavior

All tests pass

Actual behavior

90/90 Test #90: icb_parpack_cpp_tst .............. Passed 0.16 sec

91% tests passed, 8 tests failed out of 90

Total Test time (real) = 1.54 sec

The following tests FAILED:
     72 - pcndrv1_ex (Failed)
     73 - pdndrv1_ex (Failed)
     74 - pdndrv3_ex (Failed)
     75 - pdsdrv1_ex (Failed)
     76 - psndrv1_ex (Failed)
     77 - psndrv3_ex (Failed)
     78 - pssdrv1_ex (Failed)
     79 - pzndrv1_ex (Failed)
Errors while running CTest

Where/how to reproduce the problem

arpack-ng: master (6b04aa4bcce47cbf51579314cb099062cc716c86)
OS: fedora 30, fedora 33
compiler: gcc 9 and gcc 10 (with fixes for gcc 10)
environment: module load mpi/mpich-x86_64
configure: cmake -DEXAMPLES=ON -DMPI=ON -DICB=ON .. && make -j 4 && make test

Steps to reproduce the problem

docker run fedora
dnf -y install emacs git autoconf libtool make gcc-c++ gfortran openblas-devel environment-modules mpich-devel
. /etc/profile.d/modules.sh
module load mpi/mpich-x86_64
cmake -DEXAMPLES=ON -DMPI=ON -DICB=ON .. && make -j 4 && make test

Error message

see above

Traces

 --------------------------------------------------------
    1 -    2: (-5.04608E+01,-3.56467E+02)  ( 9.82758E+01, 3.43965E+02)
    3 -    3: ( 7.62495E-04, 0.00000E+00)

 _naup2: no. of "converged" Ritz values at this iter.
 ----------------------------------------------------
    1 -    1:       0

 _napps: matrix splitting at row/column no.
 ------------------------------------------
    1 -    1:       2

 _napps: matrix splitting with shift number.
 -------------------------------------------
    1 -    1:       2

 _napps: off diagonal element.
 -----------------------------
    1 -    1: (-6.10352E-05, 3.05176E-05)

 _naup2: **** Start of major iteration number ****
 -------------------------------------------------
    1 -    1:    1015

 _ngets: KEV is
 --------------
    1 -    1:       1

 _ngets: NP is
 -------------
    1 -    1:       2

 _ngets: Eigenvalues of current H matrix
 ----------------------------------------
    1 -    2: ( 1.43426E+02, 1.43426E+02)  ( 8.61754E+02, 8.61754E+02)
    3 -    3: ( 1.00004E+03, 1.00004E+03)

 _ngets: Ritz estimates of the current KEV+NP Ritz values
 --------------------------------------------------------
    1 -    2: (-1.00379E+02,-3.32978E+02)  (-3.44731E+02, 5.06945E-10)
    3 -    3: ( 8.00448E-04, 0.00000E+00)

 _naup2: no. of "converged" Ritz values at this iter.
 ----------------------------------------------------
    1 -    1:       0

 _napps: matrix splitting at row/column no.
 ------------------------------------------
    1 -    1:       2

 _napps: matrix splitting with shift number.
 -------------------------------------------
    1 -    1:       2

 _napps: off diagonal element.
 -----------------------------
    1 -    1: ( 1.06812E-04, 2.48815E-05)

 _naup2: **** Start of major iteration number ****
 -------------------------------------------------
    1 -    1:    1016

 _ngets: KEV is
 --------------
    1 -    1:       1

 _ngets: NP is
 -------------
    1 -    1:       2

 _ngets: Eigenvalues of current H matrix
 ----------------------------------------
    1 -    2: ( 8.43747E+02, 8.43747E+02)  ( 1.50688E+02, 1.50688E+02)
    3 -    3: ( 1.00004E+03, 1.00004E+03)

 _ngets: Ritz estimates of the current KEV+NP Ritz values
 --------------------------------------------------------
    1 -    2: (-5.04610E+01,-3.56466E+02)  (-1.98433E+02,-2.97650E+02)
    3 -    3: ( 7.49836E-04, 0.00000E+00)

 _naup2: no. of "converged" Ritz values at this iter.
 ----------------------------------------------------
    1 -    1:       0

 _napps: matrix splitting at row/column no.
 ------------------------------------------
    1 -    1:       2

 _napps: matrix splitting with shift number.
 -------------------------------------------
    1 -    1:       2

 _napps: off diagonal element.
 -----------------------------
    1 -    1: (-6.10352E-05, 3.05176E-05)

 _naup2: **** Start of major iteration number ****
 -------------------------------------------------
    1 -    1:    1017

 _ngets: KEV is
 --------------
    1 -    1:       1

 _ngets: NP is
 -------------
    1 -    1:       2

 _ngets: Eigenvalues of current H matrix
 ----------------------------------------
    1 -    2: ( 1.43426E+02, 1.43426E+02)  ( 8.61754E+02, 8.61754E+02)
    3 -    3: ( 1.00004E+03, 1.00004E+03)

 _ngets: Ritz estimates of the current KEV+NP Ritz values
 --------------------------------------------------------
    1 -    2: (-2.77857E+02,-2.09155E+02)  (-4.92123E-10,-3.44731E+02)
    3 -    3: ( 7.87159E-04, 0.00000E+00)

 _naup2: no. of "converged" Ritz values at this iter.
 ----------------------------------------------------
    1 -    1:       0

 _napps: matrix splitting at row/column no.
 ------------------------------------------
    1 -    1:       2

 _napps: matrix splitting with shift number.
 -------------------------------------------
    1 -    1:       2

 _napps: off diagonal element.
 -----------------------------
    1 -    1: ( 0.00000E+00, 4.01403E-05)

 _naup2: **** Start of major iteration number ****
 -------------------------------------------------
    1 -    1:    1018

 _ngets: KEV is
 --------------
    1 -    1:       1

 _ngets: NP is
 -------------
    1 -    1:       2

 _ngets: Eigenvalues of current H matrix
 ----------------------------------------
    1 -    2: ( 8.43746E+02, 8.43746E+02)  ( 1.50689E+02, 1.50689E+02)
    3 -    3: ( 1.00004E+03, 1.00004E+03)

 _ngets: Ritz estimates of the current KEV+NP Ritz values
 --------------------------------------------------------
    1 -    2: ( 2.25260E+02,-2.80842E+02)  ( 2.97651E+02,-1.98434E+02)
    3 -    3: ( 7.37387E-04, 1.87069E-10)

 _naup2: no. of "converged" Ritz values at this iter.
 ----------------------------------------------------
    1 -    1:       0

 _napps: matrix splitting at row/column no.
 ------------------------------------------
    1 -    1:       2

 _napps: matrix splitting with shift number.
 -------------------------------------------
    1 -    1:       2

 _napps: off diagonal element.
 -----------------------------
    1 -    1: ( 0.00000E+00, 6.91544E-06)

 _naup2: **** Start of major iteration number ****
 -------------------------------------------------
    1 -    1:    1019

 _ngets: KEV is
 --------------
    1 -    1:       1

 _ngets: NP is
 -------------
    1 -    1:       2

 _ngets: Eigenvalues of current H matrix
 ----------------------------------------
    1 -    2: ( 1.43426E+02, 1.43426E+02)  ( 8.61754E+02, 8.61754E+02)
    3 -    3: ( 1.00004E+03, 1.00004E+03)

 _ngets: Ritz estimates of the current KEV+NP Ritz values
 --------------------------------------------------------
    1 -    2: (-9.85221E+01,-3.33534E+02)  (-3.44731E+02, 0.00000E+00)
    3 -    3: ( 0.00000E+00, 0.00000E+00)

 _naupd: Number of update iterations taken
 -----------------------------------------
    1 -    1:    1019

 _naupd: Number of wanted "converged" Ritz values
 ------------------------------------------------
    1 -    1:       1

 _naupd: The final Ritz values
 -----------------------------
    1 -    1: ( 1.00004E+03, 1.00004E+03)

 _naupd: Associated Ritz estimates
 ---------------------------------
    1 -    1: ( 0.00000E+00, 0.00000E+00)

     =============================================
     = Complex implicit Arnoldi update code      =
     = Version Number:  2.1                      =
     = Version Date:    3/19/97                 =
     =============================================
     = Summary of timing statistics              =
     =============================================

     Total number update iterations             =  1019
     Total number of OP*x operations            =  2039
     Total number of B*x operations             =     0
     Total number of reorthogonalization steps  =  2039
     Total number of iterative refinement steps =     0
     Total number of restart steps              =     0
     Total time in user OP*x operation          =     0.000000
     Total time in user B*x operation           =     0.000000
     Total time in Arnoldi update routine       =     0.000000
     Total time in p_naup2 routine              =     0.000000
     Total time in basic Arnoldi iteration loop =     0.000000
     Total time in reorthogonalization phase    =     0.000000
     Total time in (re)start vector generation  =     0.000000
     Total time in Hessenberg eig. subproblem   =     0.000000
     Total time in getting the shifts           =     0.000000
     Total time in applying the shifts          =     0.000000
     Total time in convergence testing          =     0.000000
     Total time in computing final Ritz vectors =     0.000000

rank 0rank 1 -  - 1000.04 1000.041000.04 1000.04

<end of output>
Test time =   0.16 sec
----------------------------------------------------------
Test Passed.
"icb_parpack_cpp_tst" end time: Feb 20 00:01 GMT
"icb_parpack_cpp_tst" time elapsed: 00:00:00
----------------------------------------------------------

End testing: Feb 20 00:01 GMT

Callstack

n.a.

Notes, remarks

switching to openmpi resolves the issue
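
Note that the trace above is the tail of the log, so it shows the output of the last test (icb_parpack_cpp_tst, which passed) rather than of the eight failing examples. A minimal sketch for pulling the failing test names out of the ctest output so they can be rerun with their own output; the helper name `failed_tests` is hypothetical, and `grep -o` is assumed to be available (GNU and BSD grep both have it):

```sh
#!/bin/sh
# Print the names of the failed tests found in ctest output read from stdin.
# Matches entries such as "72 - pcndrv1_ex (Failed)".
failed_tests() {
    grep -o '[0-9][0-9]* - [^ ]* (Failed)' |
        sed -e 's/^[0-9]* - //' -e 's/ (Failed)$//'
}

# Usage, from a build directory where the tests have already been built:
#   ctest 2>&1 | failed_tests
# ctest can also rerun only the failures and print their output:
#   ctest --rerun-failed --output-on-failure
```

The full output of every test from the last run should also be in Testing/Temporary/LastTest.log inside the build directory.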

fghoussen commented 4 years ago

I guess you did, but can you confirm that the cmake "summary" says what's expected (i.e., cmake finds the mpich libs you would have used in a hand-made Makefile)?

Like so, but with mpich (if you have an environment problem [bashrc, several MPI installs, ...], you may not get what you expect):

>> cmake .
...
   -- MPICC:
      -- compile: /usr/lib/x86_64-linux-gnu/openmpi/include/openmpi
      -- compile: /usr/lib/x86_64-linux-gnu/openmpi/include
      -- link:    /usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi.so
   -- MPICXX:
      -- compile: /usr/lib/x86_64-linux-gnu/openmpi/include/openmpi
      -- compile: /usr/lib/x86_64-linux-gnu/openmpi/include
      -- link:    /usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi_cxx.so
      -- link:    /usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi.so
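
This check can be scripted. A rough sketch, assuming the summary keeps the "-- compile:" / "-- link:" layout shown above and that the install paths contain the flavour name (as they do on Debian and Fedora); the helper name `summary_mismatches` is hypothetical:

```sh
#!/bin/sh
# Read a cmake summary on stdin and print every "-- compile:" / "-- link:"
# line that does not mention the expected MPI flavour ($1).
# Returns non-zero if at least one such line was found.
summary_mismatches() {
    flavour=$1
    ! grep -E -- '-- (compile|link):' | grep -v -- "$flavour"
}

# Usage, from the build directory:
#   cmake .. | summary_mismatches mpich && echo "summary matches mpich"
```

Any line it prints points at an include directory or library from a different MPI installation than the one expected.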

dschwoerer commented 4 years ago

with mpich:

   -- MPIFC:
      -- compile: /usr/include/mpich-x86_64
      -- compile: /usr/lib64/gfortran/modules/mpich
      -- link:    /usr/lib64/mpich/lib/libmpifort.so
      -- link:    /usr/lib64/mpich/lib/libmpi.so
   -- MPICC:
      -- compile: /usr/include/mpich-x86_64
      -- link:    /usr/lib64/mpich/lib/libmpi.so
   -- MPICXX:
      -- compile: /usr/include/mpich-x86_64
      -- link:    /usr/lib64/mpich/lib/libmpicxx.so
      -- link:    /usr/lib64/mpich/lib/libmpi.so

and with openmpi:

$ module switch mpi/mpich-x86_64 mpi/openmpi-x86_64
$ cmake -DEXAMPLES=ON -DMPI=ON -DICB=ON .. && make -j 4 && make test
...
   -- MPIFC:
      -- compile: /usr/include/mpich-x86_64
      -- compile: /usr/lib64/gfortran/modules/mpich
      -- link:    /usr/lib64/mpich/lib/libmpifort.so
      -- link:    /usr/lib64/mpich/lib/libmpi.so
   -- MPICC:
      -- compile: /usr/include/mpich-x86_64
      -- link:    /usr/lib64/mpich/lib/libmpi.so
   -- MPICXX:
      -- compile: /usr/include/mpich-x86_64
      -- link:    /usr/lib64/mpich/lib/libmpicxx.so
      -- link:    /usr/lib64/mpich/lib/libmpi.so

(all tests pass)

$ rm -rf .* *
$ cmake -DEXAMPLES=ON -DMPI=ON -DICB=ON .. && make -j 4 && make test
   -- MPIFC:
      -- compile: /usr/include/openmpi-x86_64
      -- compile: /usr/lib64/openmpi/lib
      -- link:    /usr/lib64/openmpi/lib/libmpi_usempif08.so
      -- link:    /usr/lib64/openmpi/lib/libmpi_usempi_ignore_tkr.so
      -- link:    /usr/lib64/openmpi/lib/libmpi_mpifh.so
      -- link:    /usr/lib64/openmpi/lib/libmpi.so
   -- MPICC:
      -- compile: /usr/include/openmpi-x86_64
      -- link:    /usr/lib64/openmpi/lib/libmpi.so
   -- MPICXX:
      -- compile: /usr/include/openmpi-x86_64
      -- link:    /usr/lib64/openmpi/lib/libmpi_cxx.so
      -- link:    /usr/lib64/openmpi/lib/libmpi.so

(all tests pass)

So the summary printed by cmake is wrong :-) Either way, the tests pass with openmpi and fail with mpich. Running rm -rf .* * only changes the summary from cmake, not the test results.

This is the default Fedora setup; I have never had any issues with MPI ...

fghoussen commented 4 years ago

Just to make sure: if you rm CMakeCache.txt after module switch mpi/mpich, does the mpich job succeed? CMake keeps track of what it found previously in the cache, which may break the build (module switched, but cache not removed). If you always build from scratch, this cannot be your problem.
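
A small sketch of that cache check, assuming the results of CMake's MPI detection are stored in cache variables whose names start with MPI_ (e.g. MPI_C_COMPILER); the two helper names are hypothetical:

```sh
#!/bin/sh
# List the MPI-related entries cached in a build directory ($1, default ".").
mpi_cache_entries() {
    build_dir=${1:-.}
    [ -f "$build_dir/CMakeCache.txt" ] || return 0
    grep '^MPI_' "$build_dir/CMakeCache.txt"
}

# Drop the cache so that the next cmake run searches for MPI again.
reset_cmake_cache() {
    build_dir=${1:-.}
    rm -rf "$build_dir/CMakeCache.txt" "$build_dir/CMakeFiles"
}

# Usage, after "module switch mpi/mpich-x86_64 mpi/openmpi-x86_64":
#   mpi_cache_entries build    # still shows mpich paths if the cache is stale
#   reset_cmake_cache build
```

This is narrower than rm -rf .* *, since it only touches the cache file and the CMakeFiles directory.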

dschwoerer commented 4 years ago

Sorry, I should have been clearer. I tried rm -rf .* * (which should delete any cache), and the only thing that changes is the summary that is printed, not the result of the tests. They are the same: mpich fails and openmpi passes.

dschwoerer commented 4 years ago

Sorry, I should have looked at the log instead of just attaching part of it -.-

It was rather trivial to fix (5c2a80e)

dschwoerer commented 4 years ago

Maybe we should suggest using grep Fail -B 100 rather than tail -n 300?

fghoussen commented 4 years ago

> They are the same: mpich fails and openmpi passes.

OK, so mpich fails. If you can fix that, it would be good to also open a PR that adds an mpich job to the CI along with the fix.

> Maybe we should suggest using grep Fail -B 100 rather than tail -n 300?

Pros: smaller logs. Cons: when problems show up on CI, having some context is really helpful.

If you "only" grep Fail -B 100, you could miss what grep [Ee]rror -B 100 would catch, for instance. Maybe grep the last 50 lines for [Ff]ail, also grep the last 50 lines for [Ee]rror, and keep tail but with only the last 100 or 50 lines? If this is done, it would be good to do it for all jobs in .travis.yml.
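
One possible reading of that suggestion, as a sketch only: the function name `condense_log` and the section banners are made up, and `grep -B` is an extension available in GNU and BSD grep.

```sh
#!/bin/sh
# Print a condensed view of a log file:
#   - the last N lines of context leading up to [Ff]ail matches,
#   - the last N lines of context leading up to [Ee]rror matches,
#   - the plain last N lines of the log.
# $1 = log file, $2 = N (default 50).
condense_log() {
    log=$1
    n=${2:-50}
    echo "== context before [Ff]ail (last $n lines) =="
    grep -B 100 '[Ff]ail' "$log" | tail -n "$n"
    echo "== context before [Ee]rror (last $n lines) =="
    grep -B 100 '[Ee]rror' "$log" | tail -n "$n"
    echo "== last $n lines of the log =="
    tail -n "$n" "$log"
}

# Usage, e.g. in an after_failure step:
#   condense_log Testing/Temporary/LastTest.log 50
```

Keeping the plain tail as a third section preserves some context even when neither pattern matches.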
