fryeguy52 closed this issue 5 years ago.
I believe this is a new build, right?
@srajama1, no, it is also failing in the promoted mutrino build Trilinos-atdm-mutrino-intel-opt-openmp-HSW
as shown in this query.
@srajama1, I want to make sure that "new builds" is not a code word for "we are not going to address this anytime soon". Again, if you look at the big picture in ATDM, these 'cee-rhel6' builds are more important than any other promoted "ATDM" build (except for the cuda-9.2 builds) because they protect SPARC, which otherwise has no protection.
@bartlettroscoe Nope, it is not code for "we are not going to address this soon"; it is just lower priority than "green that turned red recently or red in critical builds". We have to prioritize somehow; they can't all be high priority. If you think these are higher priority than the others, then I am really worried, as we have been focusing on the wrong thing for the past few months. I appreciate your effort in helping with Trilinos stability. I hope you can understand that priorities are needed in a resource-constrained environment.
@bartlettroscoe @srajama1 I don't see this error in my builds on Chama. Can we get an updated "Steps to Reproduce" that is Chama-related and not CEE-related? I don't have access to a CEE machine. If all else fails, both test failures are the ones that were plaguing Mutrino, for which I determined the issue to be with MKL GEES/GEEV. You can back out those changes to the Generalized Davidson and BKS solvers and let Mutrino fail until the Intel MKL is updated to a newer version.
@hkthorn this should work on chama; please note the WCID bit in the salloc command:
$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh Trilinos-atdm-chama-intel-opt-openmp
$ cmake \
-GNinja \
-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
-DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Anasazi=ON \
$TRILINOS_DIR
$ make NP=16
$ salloc -N1 --time=0:20:00 --account=<YOUR_WCID> ctest -j16
@fryeguy52 Thanks, I have a WCID, no problem.
According to the query (https://testing.sandia.gov/cdash/queryTests.php?project=Trilinos&filtercombine=and&filtercombine=&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercombine=and&filtercount=5&showfilters=1&filtercombine=and&field1=buildname&compare1=65&value1=Trilinos-atdm-&field2=testname&compare2=61&value2=Anasazi_Epetra_BKS_norestart_test_MPI_4&field3=buildstarttime&compare3=84&value3=2018-12-05T00%3A00%3A00&field4=buildstarttime&compare4=83&value4=2018-11-28T00%3A00%3A00&field5=status&compare5=61&value5=Failed) and the testing I'm performing on chama, this test does not fail on chama, just on the CEE machines.
@hkthorn and @fryeguy52, sorry my bad. I read the summary email wrong. It is also failing in the 'mutrino' builds as described in #3499. If you look at this query, you can see everywhere it fails.
@hkthorn, looks like the tests on the 'cee-rhel6' builds and the 'mutrino' builds are all segfaulting. We need to add the error message seen in the tests to the Issue description.
@hkthorn, everyone at SNL should be able to get access to a CEE machine at this point. See:
https://snl-wiki.sandia.gov/display/TRIL/Development+and+Testing+on+CEE+LAN+Machines
You should be able to get to a shared machine. Ask Denis. He knows.
@bartlettroscoe Yes, but the 'mutrino' failures stopped after PR #3951, if you look at the dates. That is no longer an issue on 'mutrino', just in the 'cee-rhel6' builds.
@hkthorn, okay, then let's get you access to a CEE LAN RHEL6 machine. It has been a long time since I tried to get access as I have had my CEE Blade for 2+ years now.
@fryeguy52, do you know the current strategy for getting on CEE LAN? Can we get @hkthorn access to my CEE machine 'ceerws1113' like you have?
@hkthorn, on nile you can subscribe to CEE services. Under the Hardware/misc section, I have "SRN base" and "SRN Brokered Tier 1 Workstation" and can ssh to ceerws1113. I don't recall the turnaround time for them to process requests, but I think it is pretty fast.
@srajama1, understood. I just wanted to stress that, w.r.t. SPARC, the 'cee-rhel6' builds are more important than any other already-promoted "ATDM" build, except for perhaps the cuda-9.2 builds on 'waterman', and that cleaning up the 'cee-rhel6' builds and promoting them is therefore a higher priority than cleaning up the existing promoted 'ATDM' builds, IMO.
@bartlettroscoe @srajama1 It looks like the LAPACK GEES method that both GeneralizedDavidson and BKS are using has memory issues when one uses the optimal workspace size, causing bad eigenvectors or seg faults. The way to fix this, so that tests stop failing, is to revert my recent changes to both those solvers. This means that Mutrino will start failing for those tests again. Thoughts?
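For context, the failure mode being described is the standard LAPACK "workspace query" pattern. Below is a minimal, hypothetical C++ sketch of that pattern against the raw Fortran dgees_ interface (the function name schur_with_optimal_workspace and the matrix setup are illustrative only, not the actual Anasazi code); the point is that sizing WORK from the queried "optimal" value is the code path that reportedly misbehaves with the Intel MKL versions discussed here.
// Hypothetical sketch (not the actual Anasazi code) of the LAPACK workspace
// query for DGEES, declared against the common Fortran calling convention.
#include <vector>

extern "C" void dgees_(const char* jobvs, const char* sort,
                       int (*select)(double*, double*), const int* n,
                       double* a, const int* lda, int* sdim,
                       double* wr, double* wi, double* vs, const int* ldvs,
                       double* work, const int* lwork, int* bwork, int* info);

void schur_with_optimal_workspace(std::vector<double>& A, int n) {
  int lda = n, ldvs = n, sdim = 0, info = 0;
  std::vector<double> wr(n), wi(n), vs(n * n);

  // Workspace query: LWORK = -1 asks LAPACK/MKL for the "optimal" size,
  // which is returned in work_query.
  double work_query = 0.0;
  int lwork = -1;
  dgees_("V", "N", nullptr, &n, A.data(), &lda, &sdim, wr.data(), wi.data(),
         vs.data(), &ldvs, &work_query, &lwork, nullptr, &info);

  // Sizing WORK from the queried value is the path that reportedly produces
  // bad eigenvectors or segfaults with the MKL versions discussed above.
  lwork = static_cast<int>(work_query);
  std::vector<double> work(lwork);
  dgees_("V", "N", nullptr, &n, A.data(), &lda, &sdim, wr.data(), wi.data(),
         vs.data(), &ldvs, work.data(), &lwork, nullptr, &info);
}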
@hkthorn, wow, can't win, can we! Don't know what to do here. In any case, I have marked this with the label "ATDM Env Issue" to make it clear that this is caused by a problem in the env. Note that we are still being promised that they will put an updated Intel MKL on 'mutrino' that may fix these problems. But I am not sure if they would also update the Intel MKL on the CEE LAN, which is used by Sierra and SPARC, as well.
FYI: So it might be that ATDM APPs are not even using Anasazi (see https://sems-atlassian-son.sandia.gov/jira/browse/TRIL-238). If that is the case, we might just disable Anasazi and all of these tests in ATDM Trilinos builds and APP builds of Trilinos going forward. But we have to confirm that. (Other non-ATDM users of Trilinos will just have to fend for themselves if they are using Anasazi on these platforms with a defective MKL.)
@hkthorn, they report that they have Intel 18.0.5 installed on 'mutrino'. I will give this a try and see if it fixes this test.
@bartlettroscoe This test passes on 'mutrino' with the changes I made in the most recent PR. It is failing on the CEE machines. So, first you have to revert the changes to this solver to see if it passes with Intel 18.0.5. Then I can change the BKS and Generalized Davidson solver back to their previous state so they will not cause the CEE ATDM failures.
@hkthorn, but we will still need an updated Intel 18.0.5 on the CEE machines to really make these errors go away, right?
@bartlettroscoe Sure, that would help. However, the failures on the CEE machines include Clang and GCC compilers. What LAPACK library are those builds compiled against?
@hkthorn, they all use MKL as you can see at:
@hkthorn Thanks for your patience in dealing with these environment issues. To quote @maherou: "HPC is a bloody sport and we are in the front line" :).
@bartlettroscoe : Understood. We will keep this build as a higher priority.
I just worked around this issue in AztecOO. We can do it one spot at a time.
@mhoemmen Unfortunately, the work-around is what is causing the issue now. I put in similar logic to what you have done for AztecOO in PR #3951, but there seems to be a memory issue with GEES when the 'optimal' size is used in Intel MKL 17.0.x. Now, if I don't use the optimal size and just use the lower bound given by LAPACK guidance, everything works hokie dokie.
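To make that workaround concrete, here is a minimal, hypothetical sketch (again, not the actual PR #3951 code) of the alternative sizing described above: skip the queried "optimal" size and use the lower bound that the LAPACK documentation guarantees for DGEES, LWORK >= max(1, 3*N).
// Hypothetical sketch of the lower-bound workspace sizing described above;
// uses the same dgees_ declaration as the earlier sketch.
#include <algorithm>
#include <vector>

extern "C" void dgees_(const char* jobvs, const char* sort,
                       int (*select)(double*, double*), const int* n,
                       double* a, const int* lda, int* sdim,
                       double* wr, double* wi, double* vs, const int* ldvs,
                       double* work, const int* lwork, int* bwork, int* info);

void schur_with_minimum_workspace(std::vector<double>& A, int n) {
  int lda = n, ldvs = n, sdim = 0, info = 0;
  std::vector<double> wr(n), wi(n), vs(n * n);

  // Documented minimum workspace for DGEES; avoids the MKL "optimal size"
  // code path entirely.
  int lwork = std::max(1, 3 * n);
  std::vector<double> work(lwork);
  dgees_("V", "N", nullptr, &n, A.data(), &lda, &sdim, wr.data(), wi.data(),
         vs.data(), &ldvs, work.data(), &lwork, nullptr, &info);
}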
@hkthorn uh oh :(
Adding the build Trilinos-atdm-cee-rhel6-gnu-4.9.3-openmpi-1.10.2-serial-static-opt
to this issue as well, as it showed up on CDash yesterday with:
...
Number of iterations performed in BlockKrylovSchur_test.exe: 30
Direct residual norms computed in BlockKrylovSchur_test.exe
Eigenvalue Residual
----------------------------------------
1.199112e+05 1.350125e-07
1.196455e+05 1.336396e-07
1.192047e+05 1.090344e-07
1.185918e+05 1.352981e-04
1.178109e+05 1.232853e-07
End Result: TEST FAILED
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[15891,1],1]
Exit code: 255
--------------------------------------------------------------------------
FYI: Looks like some of these failures are random, as shown for the build Trilinos-atdm-cee-rhel6-clang-5.0.1-openmpi-1.10.2-serial-static-opt and the build Trilinos-atdm-cee-rhel6-gnu-7.2.0-openmpi-1.10.2-serial-static-opt.
As per TRIL-238, all failing Anasazi tests are nonblockers for ATDM APP updates so I have changed the label from "ATDM Blocker" to "ATDM Nonblocker".
The issue on Mutrino is the focus of #3499. This issue, covering all of the other CEE platforms, has been addressed by PR #4031. Marking closed.
Okay, I updated the scope of #3499 to make it clear that it also includes the failures in the build Trilinos-atdm-cee-rhel6-intel-18.0.2-mpich2-3.2-serial-static-opt
for the same reason as on 'mutrino' (i.e. bad MKL GEEV() function with intel-18.0.2).
CC: @trilinos/anasazi, @srajama1 (Trilinos Linear Solvers Product Lead), @bartlettroscoe, @fryeguy52
Next Action Status
Triggered by PR #3951, merged to 'develop' on 10/28/2018, which worked around an Intel 18.0.2 MKL GEEV defect. Next: Try the updated Intel MKL 18.0.5 on 'mutrino' (with a local revert of #3951) and see if all of these failures go away (@fryeguy52) ...
Description
As shown in this query the test:
is failing in the builds:
Looks like some of these failures are random, as shown for the build Trilinos-atdm-cee-rhel6-clang-5.0.1-openmpi-1.10.2-serial-static-opt and the build Trilinos-atdm-cee-rhel6-gnu-7.2.0-openmpi-1.10.2-serial-static-opt.
The errors look like the following, for example:
Current Status on CDash
The current status of these tests/builds for the current testing day can be found here.
Steps to Reproduce
One should be able to reproduce this failure on a machine with a CEE RHEL6 environment as described in:
More specifically, the commands given for a machine with a CEE RHEL6 environment are provided at:
The exact commands to reproduce this issue should be: