Anasazi_Epetra_BKS_norestart_test_MPI_4 failing in several ATDM builds.

fryeguy52 commented 5 years ago

CC: @trilinos/anasazi, @srajama1 (Trilinos Linear Solvers Product Lead), @bartlettroscoe, @fryeguy52

Next Action Status

Triggered by PR #3951, merged to 'develop' on 10/28/2018, which worked around the Intel 18.0.2 MKL GEEV defect. Next: Try the updated Intel MKL 18.0.5 on 'mutrino' (with a local revert of #3951) and see if all of these failures go away (@fryeguy52) ...

Description

As shown in this query the test:

Anasazi_Epetra_BKS_norestart_test_MPI_4

is failing in the builds:

Trilinos-atdm-mutrino-intel-opt-openmp-HSW (since ???)
Trilinos-atdm-mutrino-intel-opt-openmp-KNL (since ???)
Trilinos-atdm-cee-rhel6-intel-17.0.1-intelmpi-5.1.2-serial-static-opt (since 11/30/2018)
Trilinos-atdm-cee-rhel6-gnu-7.2.0-openmpi-1.10.2-serial-static-opt (11/29/2018 & 12/1/2018)
Trilinos-atdm-cee-rhel6-clang-5.0.1-openmpi-1.10.2-serial-static-opt (on 12/2/2018)
Trilinos-atdm-cee-rhel6-gnu-4.9.3-openmpi-1.10.2-serial-static-opt (on 12/10/2018)

Some of these failures appear to be random, as shown for the build Trilinos-atdm-cee-rhel6-clang-5.0.1-openmpi-1.10.2-serial-static-opt and the build Trilinos-atdm-cee-rhel6-gnu-7.2.0-openmpi-1.10.2-serial-static-opt.

The errors look like here for example:

Number of iterations performed in BlockKrylovSchur_test.exe: 30
Direct residual norms computed in BlockKrylovSchur_test.exe
          Eigenvalue            Residual
----------------------------------------
        1.199112e+05        1.296543e-07
        1.196455e+05        1.185550e-07
        1.192047e+05        4.530562e-04
        1.185918e+05        1.497329e-04
        1.178109e+05        4.552932e-04

End Result: TEST FAILED
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[25128,1],1]
  Exit code:    255
--------------------------------------------------------------------------
...
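
When comparing outputs like the one above across builds, it can help to pull out just the eigenpairs whose residuals stand out. The following is a minimal sketch, not part of the test itself: the function name `flag_large_residuals` is hypothetical, and the default threshold of 1e-6 is an assumption chosen only because it separates the ~1e-07 and ~1e-04 residuals shown here. It is not the tolerance the test actually uses.

```shell
# Print the "Eigenvalue Residual" rows whose residual exceeds a threshold.
# Reads the test output on stdin.
# $1: threshold (assumed default 1e-6, NOT the tolerance used by the test)
flag_large_residuals() {
  awk -v tol="${1:-1e-6}" '
    # Data rows have exactly two fields, both in exponent notation.
    NF == 2 && $1 ~ /^[0-9.]+e[+-][0-9]+$/ && $2 ~ /^[0-9.]+e[+-][0-9]+$/ {
      if ($2 + 0 > tol + 0) print $1, $2
    }'
}
```

For example, `flag_large_residuals < test_output.txt` (with `test_output.txt` standing in for a saved copy of the test output) would list only the rows with residuals above the assumed threshold.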

Current Status on CDash

The current status of these tests/builds for the current testing day can be found here

Steps to Reproduce

One should be able to reproduce this failure on a machine with a cee rhel6 environment as described in:

https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md

More specifically, the commands given for a machine with a cee rhel6 environment are provided at:

https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#cee-rhel6-environment

The exact commands to reproduce this issue should be:

$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh Trilinos-atdm-cee-rhel6-intel-17.0.1-intelmpi-5.1.2-serial-static-opt
$ cmake \
 -GNinja \
 -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
 -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Anasazi=ON \
 $TRILINOS_DIR
$ make NP=16
$ ctest -j16
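
Once the build is done, re-running only the failing test is usually faster than the full `ctest -j16`. A minimal sketch, assuming it is run from the build directory and that the installed CTest is recent enough to support `--repeat-until-fail`; the wrapper name `rerun_bks_test` is hypothetical:

```shell
# Run only the failing Anasazi test and show its output on failure.
# Extra arguments are passed through to ctest unchanged.
rerun_bks_test() {
  ctest -R '^Anasazi_Epetra_BKS_norestart_test_MPI_4$' --output-on-failure "$@"
}

# Usage (from the build directory):
#   rerun_bks_test                          # single run
#   rerun_bks_test --repeat-until-fail 20   # repeat to expose an intermittent failure
```

The repeated form is intended for the builds where the failure appears to be random. On machines that require a batch allocation, the underlying `ctest` command would need to be launched through the scheduler instead.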

srajama1 commented 5 years ago

I believe this is a new build, right ?

bartlettroscoe commented 5 years ago

> I believe this is a new build, right ?

@srajama1, no, it is also failing in the promoted mutrino build Trilinos-atdm-mutrino-intel-opt-openmp-HSW as shown in this query.

@srajama1, I want to make sure that "new builds" is not a code word for "we are not going to address this anytime soon". Again, if you look at the big picture in ATDM, these 'cee-rhel6' builds are more important than any of the other promoted "ATDM" builds (except for the cuda-9.2 builds) because they protect SPARC, which otherwise has no protection.

srajama1 commented 5 years ago

@bartlettroscoe Nope, it is not code for "we are not going to address this soon", just it is lower priority than "green that turned red recently or red in critical builds". We have to prioritize somehow. All of them can't be high priority. If you think these are higher priority than others, then I am really worried as we have been focusing on the wrong thing for the past few months. I appreciate your effort in helping Trilinos stability. I hope you can understand priorities are needed in a resource constrained environment.

hkthorn commented 5 years ago

@bartlettroscoe @srajama1 I don't see this error on my builds on Chama. Can we get an updated "Steps to Reproduce" that is Chama-related and not CEE-related? I don't have access to a CEE machine. If all else fails, the failing tests are the same ones that were plaguing Mutrino, for which I determined the issue was with MKL GEES/GEEV. You can back out those changes to the Generalized Davidson and BKS solvers and let Mutrino fail until the Intel MKL is updated to a newer version.

fryeguy52 commented 5 years ago

@hkthorn this should work on chama, please note the WCID bit in salloc

$ cd <some_build_dir>/

$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh Trilinos-atdm-chama-intel-opt-openmp

$ cmake \
  -GNinja \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
  -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Anasazi=ON \
  $TRILINOS_DIR

$ make NP=16

$ salloc -N1 --time=0:20:00 --account=<YOUR_WCID> ctest -j16

hkthorn commented 5 years ago

@fryeguy52 Thanks, I have a WCID, no problem.

hkthorn commented 5 years ago

According to the query (https://testing.sandia.gov/cdash/queryTests.php?project=Trilinos&filtercombine=and&filtercombine=&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercount=5&showfilters=1&filtercombine=and&field1=buildname&compare1=65&value1=Trilinos-atdm-&field2=testname&compare2=61&value2=Anasazi_Epetra_BKS_norestart_test_MPI_4&field3=buildstarttime&compare3=84&value3=2018-12-05T00%3A00%3A00&field4=buildstarttime&compare4=83&value4=2018-11-28T00%3A00%3A00&field5=status&compare5=61&value5=Failed) and the testing I'm performing on chama, this test does not fail on chama, just on the CEE machines.

bartlettroscoe commented 5 years ago

@hkthorn and @fryeguy52, sorry my bad. I read the summary email wrong. It is also failing in the 'mutrino' builds as described in #3499. If you look at this query, you can see everywhere it fails.

bartlettroscoe commented 5 years ago

@hkthorn, looks like the tests on the 'cee-rhel6' builds and the 'mutrino' builds are all segfaulting. We need to add the error message seen in the tests to the Issue description.

bartlettroscoe commented 5 years ago

@hkthorn, everyone at SNL should be able to get access to a CEE machine at this point. See:

https://snl-wiki.sandia.gov/display/TRIL/Development+and+Testing+on+CEE+LAN+Machines

You should be able to get to a shared machine. Ask Denis. He knows.

hkthorn commented 5 years ago

@bartlettroscoe Yes, but the 'mutrino' failures stopped after PR #3951, if you look at the dates. That is not an issue with 'mutrino' any longer, just the 'cee-rhel6' builds.

bartlettroscoe commented 5 years ago

> @bartlettroscoe Yes, but the 'mutrino' failures stopped after PR #3951, if you look at the dates. That is not an issue with 'mutrino' any longer, just the 'cee-rhel6' builds.

@hkthorn, okay, then let's get you access to a CEE LAN RHEL6 machine. It has been a long time since I tried to get access as I have had my CEE Blade for 2+ years now.

@fryeguy52, do you know the current strategy for getting on CEE LAN? Can we get @hkthorn access to my CEE machine 'ceerws1113' like you have?

fryeguy52 commented 5 years ago

@hkthorn, on nile you can subscribe to cee services. Under the Hardware/misc section, I have "SRN base" and "SRN Brokered Tier 1 Workstation" and can ssh to ceerws1113. I don't recall the turn around time for them to process requests but I think it is pretty fast

bartlettroscoe commented 5 years ago

> Nope, it is not code for "we are not going to address this soon", just it is lower priority than "green that turned red recently or red in critical builds". We have to prioritize somehow. All of them can't be high priority. If you think these are higher priority than others, then I am really worried as we have been focusing on the wrong thing for the past few months. I appreciate your effort in helping Trilinos stability. I hope you can understand priorities are needed in a resource constrained environment.

@srajama1, understood. I just wanted to stress that w.r.t. SPARC, the 'cee-rhel6' builds are more important than every other already promoted "ATDM" build except for perhaps the cuda-9.2 builds on 'waterman'. So, IMO, cleaning up the 'cee-rhel6' builds and promoting them is higher priority than cleaning up the existing promoted 'ATDM' builds.

hkthorn commented 5 years ago

@bartlettroscoe @srajama1 It looks like the LAPACK GEES method that both GeneralizedDavidson and BKS are using has memory issues when one uses the optimal workspace size, causing bad eigenvectors or seg faults. The way to fix this, so that tests stop failing, is to revert my recent changes to both those solvers. This means that Mutrino will start failing for those tests again. Thoughts?

bartlettroscoe commented 5 years ago

> It looks like the LAPACK GEES method that both GeneralizedDavidson and BKS are using has memory issues when one uses the optimal workspace size, causing bad eigenvectors or seg faults. The way to fix this, so that tests stop failing, is to revert my recent changes to both those solvers. This means that Mutrino will start failing for those tests again. Thoughts?

@hkthorn, wow, can't win, can we! Don't know what to do here. In any case, I have marked this with the label "ATDM Env Issue" to make it clear that this is caused by a problem in the env. Note that we are still being promised that they will put an updated Intel MKL on 'mutrino' that may fix these problems. But I am not sure if they would also update the Intel MKL on the CEE LAN, which is used by Sierra and SPARC.

FYI: It might be that ATDM APPs are not even using Anasazi (see https://sems-atlassian-son.sandia.gov/jira/browse/TRIL-238). If that is the case, we might just disable Anasazi and all of these tests in ATDM Trilinos builds and APP builds of Trilinos going forward. But we have to confirm that. (Other non-ATDM users of Trilinos may just have to fend for themselves if they are using Anasazi on these platforms with defective MKL.)

bartlettroscoe commented 5 years ago

@hkthorn, they report that they have Intel 18.0.5 installed on 'mutrino'. I will give this a try and see if it fixes this test.

hkthorn commented 5 years ago

@bartlettroscoe This test passes on 'mutrino' with the changes I made in the most recent PR. It is failing on the CEE machines. So, first you have to revert the changes to this solver to see if it passes with Intel 18.0.5. Then I can change the BKS and Generalized Davidson solver back to their previous state so they will not cause the CEE ATDM failures.

bartlettroscoe commented 5 years ago

@hkthorn, but we will still need an updated Intel 18.0.5 on the CEE machines to really make these errors go away, right?

hkthorn commented 5 years ago

@bartlettroscoe Sure, that would help. However, the failures on the CEE machines include Clang and GCC compilers. What LAPACK library are those builds compiled against?

bartlettroscoe commented 5 years ago

> What LAPACK library are those builds compiled against?

@hkthorn, they all use MKL as you can see at:

srajama1 commented 5 years ago

@hkthorn Thanks for your patience in dealing with these environment issues. To quote @maherou: "HPC is a bloody sport and we are in the front line" :).

@bartlettroscoe : Understood. We will keep this build as a higher priority.

mhoemmen commented 5 years ago

I just worked around this issue in AztecOO. We can do it one spot at a time.

hkthorn commented 5 years ago

@mhoemmen Unfortunately, the work-around is what is causing the issue now. I put in similar logic to what you have done for AztecOO in PR #3951, but there seems to be a memory issue with GEES when the 'optimal' size is used in Intel MKL 17.0.x. Now, if I don't use the optimal size and just use the lower bound given by LAPACK guidance, everything works hokie dokie.

mhoemmen commented 5 years ago

@hkthorn uh oh :(

bartlettroscoe commented 5 years ago

Adding the build Trilinos-atdm-cee-rhel6-gnu-4.9.3-openmpi-1.10.2-serial-static-opt to this issue as well, as shown yesterday on CDash:

...

Number of iterations performed in BlockKrylovSchur_test.exe: 30
Direct residual norms computed in BlockKrylovSchur_test.exe
          Eigenvalue            Residual
----------------------------------------
        1.199112e+05        1.350125e-07
        1.196455e+05        1.336396e-07
        1.192047e+05        1.090344e-07
        1.185918e+05        1.352981e-04
        1.178109e+05        1.232853e-07

End Result: TEST FAILED
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[15891,1],1]
  Exit code:    255
--------------------------------------------------------------------------

bartlettroscoe commented 5 years ago

FYI: Some of these failures appear to be random, as shown for the build Trilinos-atdm-cee-rhel6-clang-5.0.1-openmpi-1.10.2-serial-static-opt and the build Trilinos-atdm-cee-rhel6-gnu-7.2.0-openmpi-1.10.2-serial-static-opt.

bartlettroscoe commented 5 years ago

As per TRIL-238, all failing Anasazi tests are nonblockers for ATDM APP updates so I have changed the label from "ATDM Blocker" to "ATDM Nonblocker".

hkthorn commented 5 years ago

The issue on Mutrino is the focus of #3499. This bug, addressing all other CEE platforms, has been addressed by PR #4031. Marking closed.

bartlettroscoe commented 5 years ago

> The issue on Mutrino is the focus of #3499. This bug, addressing all other CEE platforms, has been addressed by PR #4031. Marking closed.

Okay, I updated the scope of #3499 to make it clear that it also includes the failures in the build Trilinos-atdm-cee-rhel6-intel-18.0.2-mpich2-3.2-serial-static-opt for the same reason as on 'mutrino' (i.e. bad MKL GEEV() function with intel-18.0.2).
