trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/

Belos_gcrodr_hb_MPI_4 failing in ATDM builds on mutrino #3497

Closed fryeguy52 closed 5 years ago

fryeguy52 commented 5 years ago

CC: @trilinos/belos , @srajama1 (Trilinos Linear Solvers Product Lead), @bartlettroscoe

Next Action Status

PR #3951, merged to 'develop' on 11/28/2018, resulted in this test passing in the Intel 18.0.2 builds on 'mutrino' and the 'cee-rhel6' builds on 12/1/2018, and in all builds for several days as of 12/3/2018.

Description

As shown in this query, the test:

is failing in the builds:

Some test output:

*** Error in `/lscratch1/jenkins/mutrino-slave/workspace/Trilinos-atdm-mutrino-intel-opt-openmp-HSW/SRC_AND_BUILD/BUILD/packages/belos/epetra/test/GCRODR/Belos_gcrodr_hb.exe': free(): invalid pointer: 0x000001000011bba0 ***
*** Error in `/lscratch1/jenkins/mutrino-slave/workspace/Trilinos-atdm-mutrino-intel-opt-openmp-HSW/SRC_AND_BUILD/BUILD/packages/belos/epetra/test/GCRODR/Belos_gcrodr_hb.exe': free(): invalid pointer: 0x00000100004b4980 ***
*** Error in `/lscratch1/jenkins/mutrino-slave/workspace/Trilinos-atdm-mutrino-intel-opt-openmp-HSW/SRC_AND_BUILD/BUILD/packages/belos/epetra/test/GCRODR/Belos_gcrodr_hb.exe': free(): invalid pointer: 0x00000100004b4980 ***
*** Error in `/lscratch1/jenkins/mutrino-slave/workspace/Trilinos-atdm-mutrino-intel-opt-openmp-HSW/SRC_AND_BUILD/BUILD/packages/belos/epetra/test/GCRODR/Belos_gcrodr_hb.exe': free(): invalid pointer: 0x00000100004b4980 ***

Steps to Reproduce

One should be able to reproduce this failure on the machine mutrino as described in:

$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh intel-opt-openmp-HSW

$ cmake \
    -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
    -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_MueLu=ON \
    $TRILINOS_DIR

$ make -j16

$ salloc -N 1 -p standard -J $JOB_NAME ctest -j16

srajama1 commented 5 years ago

Is this a new failure or a new configuration?

@hkthorn : FYI

bartlettroscoe commented 5 years ago

Is this a new failure or a new configuration?

CC: @mhoemmen

@srajama1

Looking at the CDash history, it is pretty clear this was due to commits to Trilinos:

Looking over these commits, it seems likely that this was caused by the merge of PR #3467 by @mhoemmen on 9/21/2018, which had changes to Belos.

mhoemmen commented 5 years ago

@bartlettroscoe Not quite sure how that's possible, given that #3467 doesn't touch that code path at all.

mhoemmen commented 5 years ago

In any case, https://github.com/trilinos/Trilinos/issues/3493 is a higher priority. It looks like I'll need to work on that first.

bartlettroscoe commented 5 years ago

@bartlettroscoe Not quite sure how that's possible, given that #3467 doesn't touch that code path at all.

@mhoemmen, those were the only changes I could see that were pulled that day (shown here) that could impact the Belos and Anasazi tests. Do you see any other changes pulled that day that might account for this? We could look to see if there were any env changes, but that seems unlikely. (That is, Belos is hardly ever updated, and the one day out of many that it is updated, these failures start.)

mhoemmen commented 5 years ago

@bartlettroscoe I'll look at it; I just wanted to express my informed opinion that my changes did not touch that code path and likely are not relevant.

bartlettroscoe commented 5 years ago

@bartlettroscoe I'll look at it; I just wanted to express my informed opinion that my changes did not touch that code path and likely are not relevant.

@mhoemmen, it seems it should be easy to rule out an env change by simply building the version of Trilinos from just before this PR was merged and seeing if the Belos and Anasazi tests fail or not. Then one could (manually) bisect the commits in this PR to find the commit (or range of commits) that triggered the failures. If that version of Trilinos shows the same failures, then I will owe you lunch next time I am in town :-)

mhoemmen commented 5 years ago

awww @bartlettroscoe you don't need to owe me lunch :-) I'd like to see if I can replicate this first. I just need to work on #3493 first.

mhoemmen commented 5 years ago

I submitted a fix for #3493: https://github.com/trilinos/Trilinos/pull/3538

Now I can work on this. I am building on mutrino now. @bartlettroscoe did an excellent job with that script -- it works with no trouble :-D

mhoemmen commented 5 years ago

@bartlettroscoe Hm, perhaps not... the script didn't appear to build Belos' Epetra tests.

$ salloc -N 1 -p standard -J Trilinos-Issue-3497 ctest -V -R gcrodr
salloc: Granted job allocation 12033715
salloc: Waiting for resource configuration
salloc: Nodes nid00106 are ready for job
UpdateCTestConfiguration  from :/home/mhoemme/prj/Trilinos/DartConfiguration.tcl
Parse Config file:/home/mhoemme/prj/Trilinos/DartConfiguration.tcl
 Add coverage exclude regular expressions.
SetCTestConfiguration:CMakeCommand:/projects/netpub/atdm/cmake-3.11.4/bin/cmake
UpdateCTestConfiguration  from :/home/mhoemme/prj/Trilinos/DartConfiguration.tcl
Parse Config file:/home/mhoemme/prj/Trilinos/DartConfiguration.tcl
Test project /home/mhoemme/prj/Trilinos
Constructing a list of tests
Done constructing a list of tests
Updating test list for fixtures
Added 0 tests to meet fixture requirements
Checking test dependency graph...
Checking test dependency graph end
No tests were found!!!
salloc: Relinquishing job allocation 12033715
mhoemmen commented 5 years ago

@bartlettroscoe I think I see what happened. The "reproducer" script isn't actually a full reproducer -- we have to change it to make it build the packages we want. Let me try that.

mhoemmen commented 5 years ago

I set Trilinos_ENABLE_Belos=ON and Tpetra_ENABLE_Epetra=ON, and was able to get the test built and to replicate the test failure:
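
(A sketch of the corresponding reconfigure, assuming the same env and source/build layout as in "Steps to Reproduce" above; the two extra cache variables are the ones named in this comment:)

$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh intel-opt-openmp-HSW
$ cmake \
    -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
    -DTrilinos_ENABLE_TESTS=ON \
    -DTrilinos_ENABLE_Belos=ON \
    -DTpetra_ENABLE_Epetra=ON \
    $TRILINOS_DIR
$ make -j16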

test 93
    Start 93: Belos_gcrodr_hb_MPI_4

93: Test command: /opt/slurm/bin/srun "--mpi=pmi2" "--ntasks-per-node" "36" "--ntasks" "4" "-c 4" "/home/mhoemme/prj/Trilinos/packages/belos/epetra/test/GCRODR/Belos_gcrodr_hb.exe" "--debug" "--verbose" "--filename=sherman5.hb" "--tol=1e-4" "--num-rhs=2" "--max-subspace=61" "--recycle=23" "--max-cycles=75"
93: Test timeout computed to be: 600
93: Reading matrix info from sherman5.hb...
93: ***************************************************************
93: Matrix in file sherman5.hb is 3312 x 3312,
93: with 20793 nonzeros with type RUA;
93: ***************************************************************
93: Title: 1U FULLY IMPLICIT BLACK OIL SIMULATOR   16 BY 23 BY  3 GRID, THREE UNK
93: ***************************************************************
93: 1 right-hand-side(s) available.
93: Reading the matrix from sherman5.hb...
93: Setting  random exact solution  vector
93:
93:
93: Max norm of residual        =    2.294e-13
93: Two norm of residual        =    8.339e-13
93: Scaled two norm of residual =    1.941e-16
93: The residual using CSC format and exact solution is    1.941e-16
93: Norm of computed b = 4295.55
93: Norm of given b    = 4295.55
93: Norm of difference between computed b and given b for xexact = 8.22367e-13
93:
93:
93: Dimension of matrix: 3312
93: Number of right-hand sides: 2
93: Max number of restarts allowed: 75
93: Max number of iterations per restart cycle: 3311
93: Relative residual tolerance: 0.0001
93:
93:  No recycled subspace available for RHS index 0
93:
93: *** Error in `/home/mhoemme/prj/Trilinos/packages/belos/epetra/test/GCRODR/Belos_gcrodr_hb.exe': free(): invalid pointer: 0x0000010000118c80 ***
93: *** Error in `/home/mhoemme/prj/Trilinos/packages/belos/epetra/test/GCRODR/Belos_gcrodr_hb.exe': free(): invalid pointer: 0x00000100004b4850 ***
93: *** Error in `/home/mhoemme/prj/Trilinos/packages/belos/epetra/test/GCRODR/Belos_gcrodr_hb.exe': free(): invalid pointer: 0x00000100004b4850 ***
93: *** Error in `/home/mhoemme/prj/Trilinos/packages/belos/epetra/test/GCRODR/Belos_gcrodr_hb.exe': free(): invalid pointer: 0x00000100004b4850 ***
mhoemmen commented 5 years ago

@bartlettroscoe ok, so now what? Can I change this build so that it's a debug build? It looks like I can add CMake options to enable RCP Node tracing etc.

bartlettroscoe commented 5 years ago

@bartlettroscoe ok, so now what? Can I change this build so that it's a debug build? It looks like I can add CMake options to enable RCP Node tracing etc.

@mhoemmen, we have to be careful about changing settings because they can break downstream ATDM APPs. For example, EMPIRE reports that they can't set Trilinos_ENABLE_DEBUG=ON because of some issue that I can't remember. But one could argue that we should be able to have a more debug-enabled build that submits to CDash, and then the build that the APPs use can turn some of this debug checking off.

mhoemmen commented 5 years ago

@bartlettroscoe Oh, don't worry, I don't want to change the Dashboard build settings. I just want to change the local mutrino settings for my own testing. Can I just append CMake options (in the same way that I appended -DTrilinos_ENABLE_Belos=ON) to turn this into a debug build?

bartlettroscoe commented 5 years ago

Yes. Any options you set on the cmake command line should override those set internally.
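
(For example, something along these lines should work; this is a sketch only -- CMAKE_BUILD_TYPE and Trilinos_ENABLE_DEBUG are standard Trilinos CMake options, while the RCP node tracing option name is from memory and should be double-checked against the Teuchos CMake files:)

$ cmake \
    -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
    -DTrilinos_ENABLE_TESTS=ON \
    -DTrilinos_ENABLE_Belos=ON \
    -DCMAKE_BUILD_TYPE=DEBUG \
    -DTrilinos_ENABLE_DEBUG=ON \
    -DTeuchos_ENABLE_DEBUG_RCP_NODE_TRACING=ON \
    $TRILINOS_DIR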

mhoemmen commented 5 years ago

ok be patient y'all, I had to take a couple days off for something so I won't be working on this now

fryeguy52 commented 5 years ago

This is also failing in the intel-opt build on KNL; see here.

hkthorn commented 5 years ago

@fryeguy52 Good to know, thanks!

bartlettroscoe commented 5 years ago

FYI: This has been failing every day since 9/22/2018

hkthorn commented 5 years ago

I think that the integration of the Tpetra-specific solvers in Belos is not the issue, but a red herring. If I go back in the ACES log for Mutrino, there was a realignment in the software stack during the transition to ATCC6. This platform change also coincides with these test failures and given the tests failing in Anasazi and Belos, it might be more relevant.

bartlettroscoe commented 5 years ago

I think that the integration of the Tpetra-specific solvers in Belos is not the issue, but a red herring. If I go back in the ACES log for Mutrino, there was a realignment in the software stack during the transition to ATCC6. This platform change also coincides with these test failures and given the tests failing in Anasazi and Belos, it might be more relevant.

@hkthorn, okay, I found an email sent to mutrino-users on 9/21/2018 saying that the modules were updated. That could impact the builds on 9/22/2018.

The question is: is this a defect in the env with nothing wrong with the Trilinos code or tests, or did a change in an otherwise valid updated env trigger a latent defect in the Trilinos code or tests?

I added a new label "ATDM Env Issue" for ATDM Trilinos GitHub issues like this that may be caused by the env (or at least is triggered by a change in the env).

hkthorn commented 5 years ago

@bartlettroscoe Very good question. I just couldn't see why committing those solvers would be the cause of this. If it is a result of the environment, that gives me a different approach to take to find out what the issue is.

hkthorn commented 5 years ago

@bartlettroscoe From what I am able to track down, there is an issue with the destructor of a Teuchos::SerialDenseMatrix in the GCRODR solver. However, it is a matrix that is valid and should have no such problem. Is there a debug environment for Mutrino? All I see are optimized builds but that doesn't allow me to dig too deep into the symbols.

bartlettroscoe commented 5 years ago

All I see are optimized builds but that doesn't allow me to dig too deep into the symbols.

@hkthorn, we don't run a full debug build just because the builds and tests on 'mutrino' are so expensive. But you should be able to do a full debug build using:

$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh intel-debug-openmp-HSW

Please let me know if that does not work.

(NOTE: If debug builds on 'mutrino' are broken, we might do a reduced full debug build of, say, just the faster Panzer test suite, just to make sure we maintain the ability to create full debug builds.)

fryeguy52 commented 5 years ago

This is also failing in the build Trilinos-atdm-cee-rhel6-intel-18.0.2-mpich2-3.2-serial-static-opt shown here

bartlettroscoe commented 5 years ago

CC: @srajama1

@hkthorn

FYI: This is not a bug with the env on 'mutrino'. Instead, this is a problem with the code with Intel 18. What changed in the env on 'mutrino' was the Intel compiler version! In fact, the compiler on 'mutrino' changed from Intel 17.0.2 on 9/21/2018 as shown here to Intel 18.0.2 on 9/22/2018 as shown here (see test history). So this could be a defect in Intel 18, but if it is, it is the same defect impacting the Intel 18 builds on the CEE RHEL6 machines and the Intel 18 builds on 'mutrino' (a very different arch).

Therefore, this is not an env issue, so we removed the "ATDM Env Issue" label.

srajama1 commented 5 years ago

Can I ask how we decided to update the Intel version from 17 to 18? What tests were used to make this decision?

bartlettroscoe commented 5 years ago

Can I ask how we decided to update the Intel version from 17 to 18? What tests were used to make this decision?

Apparently Intel 18 became the new default Intel compiler on 'mutrino' on 9/21/2018 (and I would assume on Trinity as well). It looks like it is the new default for Sierra too, as Intel 18 is the new default for ATDM SPARC as well.

bartlettroscoe commented 5 years ago

@srajama1, as Micah H. explains in https://sems-atlassian-son.sandia.gov/jira/browse/TRIL-212?focusedCommentId=24475&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-24475, Intel 18.0.2 is used on ATS-1 ('Trinity'). Frankly, Trilinos needs to be testing with updated compilers long before our customers do. We should have been testing with Intel 18 a long time ago.

hkthorn commented 5 years ago

For the GCRODR seg fault, it looks like something might be going on in the Intel 18 MKL library; there are several invalid reads, invalid writes, and conditional jumps on uninitialised values.

==241580== Conditional jump or move depends on uninitialised value(s)
==241580==    at 0x88A1AE0: mkl_lapack_dgehrd (in /projects/global/toss3/compilers/intel/intel_2018/compilers_and_libraries_2018.2.199/linux/mkl/lib/intel64_lin/libmkl_core.so)
==241580==    by 0x889AF8C: mkl_lapack_dgeev (in /projects/global/toss3/compilers/intel/intel_2018/compilers_and_libraries_2018.2.199/linux/mkl/lib/intel64_lin/libmkl_core.so)
==241580==    by 0x5679E1D: DGEEV (in /projects/global/toss3/compilers/intel/intel_2018/compilers_and_libraries_2018.2.199/linux/mkl/lib/intel64_lin/libmkl_intel_lp64.so)
==241580==    by 0x46E1FC: Belos::GCRODRSolMgr<double, Epetra_MultiVector, Epetra_Operator, true>::getHarmonicVecs1(int, Teuchos::SerialDenseMatrix<int, double> const&, Teuchos::SerialDenseMatrix<int, double>&) (BelosGCRODRSolMgr.hpp:2116)
==241580==    by 0x468131: Belos::GCRODRSolMgr<double, Epetra_MultiVector, Epetra_Operator, true>::solve() (BelosGCRODRSolMgr.hpp:1529)
==241580==    by 0x41A924: main (test_gcrodr_hb.cpp:205)
==241580==
==241580== Invalid write of size 4
==241580==    at 0x1AA0DB64: mkl_blas_avx_xdgemv (in /projects/global/toss3/compilers/intel/intel_2018/compilers_and_libraries_2018.2.199/linux/mkl/lib/intel64_lin/libmkl_avx.so)
==241580==    by 0x606036B: mkl_blas_dgemv (in /projects/global/toss3/compilers/intel/intel_2018/compilers_and_libraries_2018.2.199/linux/mkl/lib/intel64_lin/libmkl_intel_thread.so)
==241580==    by 0x8917000: mkl_lapack_dlahr2 (in /projects/global/toss3/compilers/intel/intel_2018/compilers_and_libraries_2018.2.199/linux/mkl/lib/intel64_lin/libmkl_core.so)
==241580==    by 0x88A154D: mkl_lapack_dgehrd (in /projects/global/toss3/compilers/intel/intel_2018/compilers_and_libraries_2018.2.199/linux/mkl/lib/intel64_lin/libmkl_core.so)
==241580==    by 0x889AF8C: mkl_lapack_dgeev (in /projects/global/toss3/compilers/intel/intel_2018/compilers_and_libraries_2018.2.199/linux/mkl/lib/intel64_lin/libmkl_core.so)
==241580==    by 0x5679E1D: DGEEV (in /projects/global/toss3/compilers/intel/intel_2018/compilers_and_libraries_2018.2.199/linux/mkl/lib/intel64_lin/libmkl_intel_lp64.so)
==241580==    by 0x46E1FC: Belos::GCRODRSolMgr<double, Epetra_MultiVector, Epetra_Operator, true>::getHarmonicVecs1(int, Teuchos::SerialDenseMatrix<int, double> const&, Teuchos::SerialDenseMatrix<int, double>&) (BelosGCRODRSolMgr.hpp:2116)
==241580==    by 0x468131: Belos::GCRODRSolMgr<double, Epetra_MultiVector, Epetra_Operator, true>::solve() (BelosGCRODRSolMgr.hpp:1529)
==241580==    by 0x41A924: main (test_gcrodr_hb.cpp:205)
==241580==  Address 0x1f477da8 is 8 bytes after a block of size 8,000 alloc'd
==241580==    at 0x4C2A1E3: operator new(unsigned long) (vg_replace_malloc.c:334)
==241580==    by 0x46F5AA: allocate (new_allocator.h:104)
==241580==    by 0x46F5AA: allocate (alloc_traits.h:357)
==241580==    by 0x46F5AA: _M_allocate (stl_vector.h:170)
==241580==    by 0x46F5AA: _M_create_storage (stl_vector.h:185)
==241580==    by 0x46F5AA: _Vector_base (stl_vector.h:136)
==241580==    by 0x46F5AA: _Vector_base (stl_vector.h:134)
==241580==    by 0x46F5AA: vector (stl_vector.h:278)
==241580==    by 0x46F5AA: Belos::GCRODRSolMgr<double, Epetra_MultiVector, Epetra_Operator, true>::getHarmonicVecs1(int, Teuchos::SerialDenseMatrix<int, double> const&, Teuchos::SerialDenseMatrix<int, double>&) (BelosGCRODRSolMgr.hpp:2083)
==241580==    by 0x468131: Belos::GCRODRSolMgr<double, Epetra_MultiVector, Epetra_Operator, true>::solve() (BelosGCRODRSolMgr.hpp:1529)
==241580==    by 0x41A924: main (test_gcrodr_hb.cpp:205)
==241580==
==241580== Invalid read of size 16
==241580==    at 0x1B05764F: ??? (in /projects/global/toss3/compilers/intel/intel_2018/compilers_and_libraries_2018.2.199/linux/mkl/lib/intel64_lin/libmkl_avx.so)
==241580==    by 0xF8: ???
==241580==    by 0xF6: ???
==241580==    by 0x5C29AAF: ??? (in /projects/global/toss3/compilers/intel/intel_2018/compilers_and_libraries_2018.2.199/linux/mkl/lib/intel64_lin/libmkl_intel_lp64.so)
==241580==    by 0x822B5D1: ??? (in /projects/global/toss3/compilers/intel/intel_2018/compilers_and_libraries_2018.2.199/linux/mkl/lib/intel64_lin/libmkl_core.so)
==241580==    by 0x9B7F792: mkl_pds_lp64_pds_slv_fwd_sym_pos_single_real (in /projects/global/toss3/compilers/intel/intel_2018/compilers_and_libraries_2018.2.199/linux/mkl/lib/intel64_lin/libmkl_core.so)
==241580==    by 0x1FFEFFB037: ???
==241580==    by 0x1FFEFFAC0F: ???
==241580==    by 0x1D237AEF: ??? (in /projects/global/toss3/compilers/intel/intel_2018/compilers_and_libraries_2018.2.199/linux/mkl/lib/intel64_lin/libmkl_avx.so)
==241580==    by 0x1AA0D749: mkl_blas_avx_xdgemv (in /projects/global/toss3/compilers/intel/intel_2018/compilers_and_libraries_2018.2.199/linux/mkl/lib/intel64_lin/libmkl_avx.so)
==241580==    by 0x1F477DDF: ???
==241580==    by 0x1AA0C536: mkl_blas_avx_xdgemv (in /projects/global/toss3/compilers/intel/intel_2018/compilers_and_libraries_2018.2.199/linux/mkl/lib/intel64_lin/libmkl_avx.so)
==241580==  Address 0x1f477da8 is 8 bytes after a block of size 8,000 alloc'd
==241580==    at 0x4C2A1E3: operator new(unsigned long) (vg_replace_malloc.c:334)
==241580==    by 0x46F5AA: allocate (new_allocator.h:104)
==241580==    by 0x46F5AA: allocate (alloc_traits.h:357)
==241580==    by 0x46F5AA: _M_allocate (stl_vector.h:170)
==241580==    by 0x46F5AA: _M_create_storage (stl_vector.h:185)
==241580==    by 0x46F5AA: _Vector_base (stl_vector.h:136)
==241580==    by 0x46F5AA: _Vector_base (stl_vector.h:134)
==241580==    by 0x46F5AA: vector (stl_vector.h:278)
==241580==    by 0x46F5AA: Belos::GCRODRSolMgr<double, Epetra_MultiVector, Epetra_Operator, true>::getHarmonicVecs1(int, Teuchos::SerialDenseMatrix<int, double> const&, Teuchos::SerialDenseMatrix<int, double>&) (BelosGCRODRSolMgr.hpp:2083)
==241580==    by 0x468131: Belos::GCRODRSolMgr<double, Epetra_MultiVector, Epetra_Operator, true>::solve() (BelosGCRODRSolMgr.hpp:1529)
==241580==    by 0x41A924: main (test_gcrodr_hb.cpp:205)
==241580==

I am not seeing errors from valgrind when the same test is built with GCC 4.9.3 or 7.2 and a non-MKL LAPACK. It does appear that the choice of 'lwork' is not optimal, as it is hardcoded to 4*N, which is the lower bound for the work vector size. Usually, if the LAPACK library does not like 'lwork' it will just return an error and not perform invalid memory accesses.
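
(For context, the standard LAPACK workspace-query idiom that avoids guessing 'lwork' looks roughly like the sketch below. This is illustrative only -- it uses Teuchos::LAPACK with made-up variable and function names, not the actual BelosGCRODRSolMgr.hpp code:)

#include <algorithm>
#include <vector>
#include "Teuchos_LAPACK.hpp"

// Sketch of the GEEV workspace-query idiom (illustrative only).  Calling
// GEEV with lwork = -1 asks the library to report its preferred workspace
// size in work[0]; the caller then allocates that much and calls GEEV again.
void eigsWithWorkspaceQuery (int n, double* H, int ldh,
                             double* wr, double* wi, double* vr, int ldvr)
{
  Teuchos::LAPACK<int, double> lapack;
  int info = 0;

  // Workspace query: no eigenvalues are computed; only the size is returned.
  double lworkQuery = 0.0;
  lapack.GEEV ('N', 'V', n, H, ldh, wr, wi,
               nullptr, 1, vr, ldvr, &lworkQuery, -1, &info);

  // Use the reported optimal size, but never less than the documented 4*n minimum.
  const int lwork = std::max (static_cast<int> (lworkQuery), 4 * n);
  std::vector<double> work (lwork);

  // Actual eigenvalue computation with the (at least) optimal workspace.
  lapack.GEEV ('N', 'V', n, H, ldh, wr, wi,
               nullptr, 1, vr, ldvr, work.data (), lwork, &info);
}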

hkthorn commented 5 years ago

If I use the suggested optimal 'lwork', the number of memory errors in MKL is reduced to one conditional jump in DGEEV:

==211606== Conditional jump or move depends on uninitialised value(s)
==211606==    at 0x88A1AE0: mkl_lapack_dgehrd (in /projects/global/toss3/compilers/intel/intel_2018/compilers_and_libraries_2018.2.199/linux/mkl/lib/intel64_lin/libmkl_core.so)
==211606==    by 0x889AF8C: mkl_lapack_dgeev (in /projects/global/toss3/compilers/intel/intel_2018/compilers_and_libraries_2018.2.199/linux/mkl/lib/intel64_lin/libmkl_core.so)
==211606==    by 0x5679E1D: DGEEV (in /projects/global/toss3/compilers/intel/intel_2018/compilers_and_libraries_2018.2.199/linux/mkl/lib/intel64_lin/libmkl_intel_lp64.so)
==211606==    by 0x46E223: Belos::GCRODRSolMgr<double, Epetra_MultiVector, Epetra_Operator, true>::getHarmonicVecs1(int, Teuchos::SerialDenseMatrix<int, double> const&, Teuchos::SerialDenseMatrix<int, double>&) (BelosGCRODRSolMgr.hpp:2119)
==211606==    by 0x468131: Belos::GCRODRSolMgr<double, Epetra_MultiVector, Epetra_Operator, true>::solve() (BelosGCRODRSolMgr.hpp:1529)
==211606==    by 0x41A924: main (test_gcrodr_hb.cpp:205)
==211606==

I am performing these experiments on Chama, so I will try this modification on Mutrino to see if it fixes the failing test.

mhoemmen commented 5 years ago

@hkthorn Woah! Could this possibly be a LAPACK bug? It would be surprising for Intel to rewrite an eigensolver driver routine.

hkthorn commented 5 years ago

With the modifications to GCRO-DR such that it gets the optimal storage size and then uses it to allocate the work vector for GEEV, the test passes on mutrino using the above-mentioned "Steps to Reproduce":

      Start 33: Belos_gcrodr_hb_MPI_4
33/96 Test #33: Belos_gcrodr_hb_MPI_4 .................................................................................. Passed 2.74 sec
      Start 34: Belos_prec_gcrodr_hb_0_MPI_4
34/96 Test #34: Belos_prec_gcrodr_hb_0_MPI_4 ........................................................................... Passed 1.76 sec
      Start 35: Belos_prec_gcrodr_hb_1_MPI_4
35/96 Test #35: Belos_prec_gcrodr_hb_1_MPI_4 ........................................................................... Passed 1.16 sec

dridzal commented 5 years ago

This is clearly an MKL bug.

@hkthorn, out of curiosity, is the optimal storage size significantly larger than 4*N (which is discussed in GEEV's documentation)?

hkthorn commented 5 years ago

@dridzal For the GCRO-DR test that was failing with a seg fault due to GEEV, the difference between the suggested 'lwork' and the optimal 'lwork' was

lwork is 244 and the optimal size is 2074

Yes, that is nearly an order of magnitude difference.
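
(For reference, 244 = 4*61, which appears to match the test's --max-subspace=61 setting shown earlier, while the reported optimal size of 2074 is roughly 34*61 -- presumably to give MKL room for its blocked Hessenberg reduction.)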

dridzal commented 5 years ago

Straight from Intel's documentation, https://software.intel.com/en-us/mkl-developer-reference-fortran-geev:

If you choose the first option and set any of admissible lwork sizes, which is no less than the minimal value described, the routine completes the task.

The lwork size for computations without eigenvectors must satisfy:

lwork ≥ max(1, 3n)

With eigenvectors:

lwork ≥ max(1, 4n)

In ROL (@trilinos/rol) we always use 4n, to be "safe"; see issue #3914. MKL 18 apparently violates this contract.

dridzal commented 5 years ago

Oh, this is good ... look at the very first entry on the Intel MKL bug fixes page, MKLD-3796:

https://software.intel.com/en-us/articles/intel-math-kernel-library-intel-mkl-2018-bug-fixes-list

We obviously need Intel(R) MKL 2018 Update 4 (September 2018).

bartlettroscoe commented 5 years ago

@hkthorn said:

With the modifications to GCRO-DR such that it gets the optimal storage size and then uses it to allocate the work vector for GEEV, the test passes on mutrino using the above-mentioned "Steps to Reproduce".

@hkthorn, can you post a PR to make this change? Is there any downside to making this change for all platforms? If so, can you expose a CMake cache var so that we can switch to this implementation for just these Intel 18 builds to address this issue?
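
(Purely as a sketch of what such a toggle could look like -- the option and macro names below are hypothetical, nothing like this exists in Belos today:)

# Hypothetical TriBITS cache option in belos/CMakeLists.txt (names are made up):
TRIBITS_ADD_OPTION_AND_DEFINE(
  Belos_ENABLE_GEEV_WORKSPACE_QUERY     # hypothetical cache variable
  HAVE_BELOS_GEEV_WORKSPACE_QUERY       # hypothetical config macro
  "Query GEEV for its optimal lwork instead of using the 4*N lower bound."
  ON
  )

The solver code would then #ifdef on the (hypothetical) HAVE_BELOS_GEEV_WORKSPACE_QUERY macro to choose between the workspace-query path and the current 4*N allocation.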

Since I think this is the default production env on Trinity (which 'mutrino' is just trying to mirror), it would be good to fix this without requiring an upgrade of MKL (since that could take a long time) if that is a reasonable change to make.

Please let me know. These tests have been failing for a long time.

I am adding back the "SEMS Env Issue" label since this is due to a defect in the env (but in a production env, not a test bed).

hkthorn commented 5 years ago

@bartlettroscoe @srajama1 While I can make this change to Belos and Anasazi, there are other packages, like ROL, LOCA, and Stokhos, that may also use this method. The fact that Belos and Anasazi (and ROL) are the only ones to exhibit a seg fault on this one platform (Mutrino) is not to say they are the only ones to have issues with using this method. The bad memory accesses for GEEV/GEES from MKL 18.0.2 can clearly be seen on other platforms, like Chama, using Valgrind. Even though it doesn't result in a seg fault on Chama, an application using the GEES/GEEV methods directly or indirectly can observe unexplained numerical behavior. Merely changing how Belos and Anasazi use these methods just to get the tests to pass and clear the board feels like sweeping this issue under the rug. These two eigensolver routines are often used inside linear solvers, eigensolvers, and preconditioners (not just in Trilinos) to get local spectral information. We need to make it clear to anyone who has sway over the production MKL version for Trinity that it is imperative, for our applications, that they move to at least Update 4.

So, can you tell me who needs to be notified about this situation with MKL 18.0.2?

bartlettroscoe commented 5 years ago

@hkthorn, thanks for the full context for this.

So, can you tell me who needs to be notified about this situation with MKL 18.0.2?

We can start with 'mutrino-help'. I will send them an email right now.

bartlettroscoe commented 5 years ago

CC: @hkthorn, @dridzal

Here is the email I just sent:


From: Bartlett, Roscoe A
Sent: Monday, November 26, 2018 12:26 PM
To: Mutrino-Help ...
Cc: Thornquist, Heidi K ...; Ridzal, Denis ...
Subject: Need upgrade of MKL to avoid MKL defect

Hello Mutrino admins,

It has been discovered that a defect in the MKL routine GEEV causes undefined behavior and segfaults when it is used as documented. This is causing segfaults in important Trilinos solvers that are used by various SNL customers (and is causing failures in several Trilinos tests). The defect is listed as MKLD-3796 at:

https://software.intel.com/en-us/articles/intel-math-kernel-library-intel-mkl-2018-bug-fixes-list

and it appears to be fixed in “Intel(R) MKL 2018 Update 4 (September 2018)”.

Would it be possible to upgrade the MKL version used with Intel 18.0.2 on ‘mutrino’ to this MKL update?

Please let us know,

Thanks,

-Ross

hkthorn commented 5 years ago

Thanks Ross. I am stunned that Intel would make such an error in such a foundational LAPACK method.

I appreciate you advocating for an upgrade, and I hope that those maintaining the machine can understand the needs of the applications that the machines were built to support.

Thanks, Heidi

From: "Roscoe A. Bartlett" notifications@github.com Reply-To: trilinos/Trilinos reply@reply.github.com Date: Monday, November 26, 2018 at 10:27 AM To: trilinos/Trilinos Trilinos@noreply.github.com Cc: Heidi Thornquist hkthorn@sandia.gov, Mention mention@noreply.github.com Subject: [EXTERNAL] Re: [trilinos/Trilinos] Belos_gcrodr_hb_MPI_4 failing in ATDM builds on mutrino (#3497)

CC: @hkthornhttps://github.com/hkthorn, @dridzalhttps://github.com/dridzal

Here is the email I just sent:


From: Bartlett, Roscoe A Sent: Monday, November 26, 2018 12:26 PM To: Mutrino-Help ... Cc: Thornquist, Heidi K ...; Ridzal, Denis ... Subject: Need upgrade of MKL to avoid MKL defect

Hello Mutrino admins,

It has been discovered that a defect in the MKL rountine GEEV causes undefined behavior and segfaults when used as documented. This is causing segfaults in important Trilinos solvers that are used by various SNL customers (and are causing failures in several Trilinos tests). The defect is listed as MKLD-3796 at:

https://software.intel.com/en-us/articles/intel-math-kernel-library-intel-mkl-2018-bug-fixes-list

and it appears to be fixed in “Intel(R) MKL 2018 Update 4 ( September 2018)”.

Would it be possible to upgrade the MKL version with Intel 18.0.2 on ‘mutrino’ for this MLK update?

Please let us know,

Thanks,

-Ross

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/trilinos/Trilinos/issues/3497#issuecomment-441724957, or mute the threadhttps://github.com/notifications/unsubscribe-auth/APLX1p6KQ5nDxoe1hWNn60N2U2mmBnC2ks5uzCRigaJpZM4W5I1p.

bartlettroscoe commented 5 years ago

CC: @dridzal

I appreciate you advocating for an upgrade, and I hope that those maintaining the machine can understand the needs of the applications that the machines were built to support.

@hkthorn, let's see where this upgrade request goes. In the meantime, we will need to decide how to deal with these failing tests. First we need to see if any of the ATDM APPs are using functionality in these packages that might trigger this error, and then we can go from there. What specific solvers in Belos, Anasazi, and ROL call this LAPACK function GEEV? How can we determine whether any SPARC or EMPIRE use cases use these solvers?

bartlettroscoe commented 5 years ago

@hkthorn, @dridzal, @fryeguy52

I got word back from the 'mutrino' admins that they will be installing a "friendly user" installation of “Intel(R) MKL 2018 Update 4 (September 2018)” later this week that we can try out to see if it fixes this problem. They will also be doing a "friendly user" installation of “Intel(R) MKL 2019” soon (later this week?) for us to try out. They also said that they have to keep 'mutrino', 'trinitite', and 'trinity' (and 'voltrino') all in sync, so upgrades on 'mutrino' will require upgrades on those other machines as well.

bartlettroscoe commented 5 years ago

Looks like PR #3951 merged to 'develop' on 11/28/2018 resulted in this test passing in the Intel 18.0.2 builds on 'mutrino' and the 'cee-rhel6' builds shown in the table below.

NOTE: We have been experiencing problems with builds getting results submitted to CDash from 'mutrino' for the last several days, so this is the first day we have test results on 'mutrino' since the PR was merged. This test timed out for some reason in the build Trilinos-atdm-cee-rhel6-intel-18.0.2-mpich2-3.2-serial-static-opt the day before, which is why the table below shows only one consecutive passing day.

Because of that timeout, I think we should leave this issue open for a few more days to see if we see any more timeouts in the build Trilinos-atdm-cee-rhel6-intel-18.0.2-mpich2-3.2-serial-static-opt.

Tests with issue trackers Passed: twip=10 (2018-12-01)

Site | Build Name | Test Name | Status | Details | Consecutive Pass Days | Non-pass Last 30 Days | Pass Last 30 Days | Tracker
cee-rhel6 | Trilinos-atdm-cee-rhel6-intel-18.0.2-mpich2-3.2-serial-static-opt | Belos_gcrodr_hb_MPI_4 | Passed | Completed | 1 | 12 | 2 | #3497
mutrino | Trilinos-atdm-mutrino-intel-opt-openmp-HSW | Belos_gcrodr_hb_MPI_4 | Passed | Completed | 1 | 21 | 1 | #3497
mutrino | Trilinos-atdm-mutrino-intel-opt-openmp-KNL | Belos_gcrodr_hb_MPI_4 | Passed | Completed | 1 | 24 | 1 | #3497
bartlettroscoe commented 5 years ago

FYI: As shown in this query, this test passed the last three consecutive days and is still passing in all of the other builds, so we can now close this issue. The workaround pushed in PR #3951 fixed this test!

bartlettroscoe commented 5 years ago

With the pending revert PR #4031, I am reopening this issue :-(

@fryeguy52, please let us know what happens with testing of Intel 18.0.5 on 'mutrino' to see if the updated MKL fixes this problem.

bartlettroscoe commented 5 years ago

As I said, reopening ...

hkthorn commented 5 years ago

@bartlettroscoe @fryeguy52 I am reverting the part of the commit that affected the LAPACK GEES routine in Anasazi. This does not affect the modifications to the LAPACK GEEV routine that were committed to Belos. Why is this being reopened? The PR #4031 has NOTHING to do with GCRODR in Belos.