Closed bartlettroscoe closed 6 years ago
This test Anasazi_Epetra_OrthoManagerGenTester_0_MPI_4
newly failed in the build Trilinos-atdm-white-ride-cuda-debug
on 'white' today as shown at:
showing:
Generating Y1,Y2 for project() : testing...
|| <Y1,Y1> - I || : 7.13673e-16
|| <Y2,Y2> - I || : 7.85286e-16
|| <X1,Y2> || : 1.71386e-16
|| <X1b,Y2> || : 7.10285e-15
p=1: *** Caught standard std::exception of type 'std::runtime_error' :
/home/rabartl/WHITE/ATDM_Driver/Trilinos-atdm-white-ride-cuda-debug/SRC_AND_BUILD/Trilinos/packages/anasazi/epetra/test/OrthoManager/cxx_gentest.cpp:274:
Throw number = 1
Throw test that evaluated to true: err > TOL
New X1 did not meet tolerance: orthog(X1,Y2) == 0.356233
...
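For readers unfamiliar with the check that threw here, the following is a minimal sketch of what an `err > TOL` orthogonality test looks like. It assumes the error is the Frobenius norm of the block inner product of X1 and Y2. The `Block` type, `orthogError`, and `violatesTolerance` are hypothetical stand-ins written for illustration; they are not the Anasazi or Epetra API.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a block of vectors, stored column by column.
struct Block {
  std::size_t rows;
  std::size_t cols;
  std::vector<double> data;  // size rows * cols, column-major

  double at(std::size_t i, std::size_t j) const { return data[i + j * rows]; }
};

// Frobenius norm of the block inner product X^T * Y (assumed error measure).
double orthogError(const Block& X, const Block& Y) {
  assert(X.rows == Y.rows);
  double sumSquares = 0.0;
  for (std::size_t i = 0; i < X.cols; ++i) {
    for (std::size_t j = 0; j < Y.cols; ++j) {
      double dot = 0.0;
      for (std::size_t k = 0; k < X.rows; ++k) dot += X.at(k, i) * Y.at(k, j);
      sumSquares += dot * dot;
    }
  }
  return std::sqrt(sumSquares);
}

// Mirrors the shape of the failing check: true when the error exceeds tol.
bool violatesTolerance(const Block& X, const Block& Y, double tol) {
  return orthogError(X, Y) > tol;
}
```

With this measure, mutually orthogonal blocks give an error near machine precision (as in the `1.71386e-16` line above), while a value such as 0.356233 indicates the blocks are far from orthogonal.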
It passed yesterday in the same build as shown at:
Looking at the history of this test on this build on 'white' in the query:
it failed three other times on various days going back to 3/12/2018. This suggests non-deterministic behavior causing the test to randomly fail.
Does this test expose some non-deterministic behavior in Anasazi or the underlying software being used? Could this be exposing a weakness in Trilinos software that could bite a user in a CUDA build?
In any case, I think this test should be disabled for now on these CUDA debug builds so that we can promote this build Trilinos-atdm-white-ride-cuda-debug
to the "ATDM" CDash Track/Group which opens the door to using it as an auto PR build for Trilinos (which will be huge for stabilizing Trilinos for ATDM customers). Then, someone can debug this test offline when they get some time.
@mhoemmen, what do you think about this? Is it okay to disable this test for now until someone can debug what is causing the non-deterministic behavior?
@bartlettroscoe wrote:
Is it okay to disable this test for now until someone can debug what is causing the non-deterministic behavior?
@hkthorn may have something to say, but I think it would be best to disable the test for now, as long as we don't "forget" (e.g., as long as we open a separate issue for the failing tests).
I think it would be best to disable the test for now, as long as we don't "forget" (e.g., as long as we open a separate issue for the failing tests).
@mhoemmen and @hkthorn,
One option is to leave these issues open with the label "Disabled Tests" and assign it to the Product Lead for the area. Who is the Product Lead for Anasazi? Is that @srajama1?
Anasazi is a problem child that got stuck with a (linear solvers) family where it may not belong :). Yes, I am the lead. Let us wait for what @hkthorn says.
I worry this might be exposing something non-deterministic underneath.
These randomly failing tests triggered the following CDash error email for the newly promoted build ??? this morning.
Can I go ahead and disable these randomly failing tests in these builds? The tests will only be disabled for these builds and not others where the tests are passing consistently.
From: CDash [mailto:trilinos-regression@sandia.gov]
Sent: Saturday, March 31, 2018 2:48 AM
To: Bartlett, Roscoe A rabartl@sandia.gov
Subject: FAILED (t=2): Trilinos/Anasazi - Trilinos-atdm-white-ride-gnu-debug-openmp - ATDM
A submission to CDash for the project Trilinos has failing tests. You have been identified as one of the authors who have checked in changes that are part of this submission or you are listed in the default contact list.
Details on the submission can be found at https://testing.sandia.gov/cdash/buildSummary.php?buildid=3474500
Project: Trilinos
SubProject: Anasazi
Site: white
Build Name: Trilinos-atdm-white-ride-gnu-debug-openmp
Build Time: 2018-03-31T06:45:53 UTC
Type: ATDM
Tests failing: 2
Tests failing
Anasazi_Epetra_ModalSolversTester_MPI_4 (https://testing.sandia.gov/cdash/testDetails.php?test=46065301&build=3474500)
Anasazi_Epetra_OrthoManagerGenTester_0_MPI_4 (https://testing.sandia.gov/cdash/testDetails.php?test=46065302&build=3474500)
-CDash on testing.sandia.gov
@bartlettroscoe Please do; thanks!
@bartlettroscoe @srajama1 @mhoemmen Go ahead and disable the failing tests for this platform, I have seen this issue before. Thanks!
From @hkthorn:
Go ahead and disable the failing tests for this platform, I have seen this issue before. Thanks!
Okay, I will disable these failing tests. However, also note that we saw two new failing Anasazi tests for this build today shown in the below email.
The first test Anasazi_Epetra_OrthoManagerGenTester_0_MPI_4
was a segfault. The last two look to be diffs.
Should we disable these tests as well? If not, does someone in the Linear Solvers area have time to triage these some more? We either need to fix the tests or disable them (and then leave this issue as a reminder to fix them, along with other approaches that we can consider to keep reminders of disabled tests).
From: CDash [mailto:trilinos-regression@sandia.gov]
Sent: Tuesday, April 03, 2018 1:32 AM
To: Bartlett, Roscoe A
Subject: FAILED (t=3): Trilinos/Anasazi - Trilinos-atdm-white-ride-gnu-debug-openmp - ATDM
A submission to CDash for the project Trilinos has failing tests. You have been identified as one of the authors who have checked in changes that are part of this submission or you are listed in the default contact list.
Details on the submission can be found at https://testing.sandia.gov/cdash/buildSummary.php?buildid=3480083
Project: Trilinos
SubProject: Anasazi
Site: ride
Build Name: Trilinos-atdm-white-ride-gnu-debug-openmp
Build Time: 2018-04-03T07:30:22 UTC
Type: ATDM
Tests failing: 3
Tests failing
Anasazi_Epetra_OrthoManagerGenTester_0_MPI_4 (https://testing.sandia.gov/cdash/testDetails.php?test=46173794&build=3480083)
Anasazi_Epetra_OrthoManagerGenTester_1_MPI_4 (https://testing.sandia.gov/cdash/testDetails.php?test=46173795&build=3480083)
Anasazi_Epetra_LOBPCG_solvertest_MPI_4 (https://testing.sandia.gov/cdash/testDetails.php?test=46173813&build=3480083)
-CDash on testing.sandia.gov
If you look at the query:
(which shows all of the failing Anasazi tests in the last two weeks that have not already been disabled (see #2455) or are not in the 'opt' builds on white/ride (see #2454)), you can see that the tests:
Anasazi_Epetra_ModalSolversTester_MPI_4
Anasazi_Epetra_OrthoManagerGenTester_0_MPI_4
Anasazi_Epetra_OrthoManagerGenTester_1_MPI_4
fail multiple times on various days in the two builds:
Trilinos-atdm-white-ride-cuda-debug
Trilinos-atdm-white-ride-gnu-debug-openmp
All three of these tests failed multiple days in the Trilinos-atdm-white-ride-cuda-debug
build which is being targeted for an auto PR testing build (see #2464). Therefore, these should be disabled (as @hkthorn noted above).
The test Anasazi_Epetra_LOBPCG_solvertest_MPI_4
only failed today in the build Trilinos-atdm-white-ride-gnu-debug-openmp
as shown in the above query. Therefore, this might have been a fluke so we should not disable this yet.
FYI: I created PR #2501 to disable these three randomly failing tests. I requested a review from @mhoemmen and/or @hkthorn.
Just realized that the @trilinos/framework team ran into these same randomly failing tests in #1393 and they resolved the issue by disabling those tests as well. So it looks like this is the right decision to disable these tests in the ATDM builds.
But it also suggests that perhaps the problems with these tests should be studied more carefully, or these tests just need to be disabled altogether. That way, other people and projects will not run into these randomly failing tests over and over again. And if these are the only real tests for "ModalSolvers" in Anasazi, then perhaps that feature is not ready to be used by people and should be disabled by default as experimental code or something? Then we could set up some build of Trilinos for all of this "Experimental" code so at least we know how it is doing.
PR #2501 was merged just now, merging the commit 2e9da0c. Therefore, we should see these three tests disabled for these builds on white/ride tomorrow.
Putting this issue in review
@bartlettroscoe @mhoemmen @srajama1 I have found the underlying issue in these tests. They use a Teuchos::SerialDenseMatrix, which is a serial object without MPI communication or implied synchronization of values. These matrices are randomized on each processor and then used to test the orthogonalization routines and modal solvers. Because there is no explicit synchronization of Teuchos SDM objects, when the randomization generates different matrices on different processors, the tests fail: the explicit expectations of the classes being tested (orthogonalization and modal solvers) are violated. I have a feeling this pattern might be in Belos as well. I will fix this today.
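To make the failure mode described in the comment above concrete, here is a minimal standalone sketch. It uses no MPI or Teuchos; ranks are simulated in a loop, and `DenseMatrix`, `randomizePerRank`, `randomizeAndBroadcast`, and `allRanksAgree` are hypothetical names invented for illustration. The "broadcast" variant is only one plausible remedy under these assumptions and is not necessarily what the actual fix in PR #2517 does.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical stand-in for a small serial dense matrix
// (this is NOT Teuchos::SerialDenseMatrix).
using DenseMatrix = std::vector<double>;

// Fill an n-by-n matrix with pseudo-random values from the given seed.
DenseMatrix randomMatrix(std::size_t n, unsigned seed) {
  std::mt19937 gen(seed);
  std::uniform_real_distribution<double> dist(-1.0, 1.0);
  DenseMatrix m(n * n);
  for (double& v : m) v = dist(gen);
  return m;
}

// Problematic pattern: every simulated rank randomizes its own copy
// with a different seed, so the copies are never synchronized.
std::vector<DenseMatrix> randomizePerRank(std::size_t n, std::size_t numRanks) {
  assert(numRanks > 0);
  std::vector<DenseMatrix> perRank;
  for (std::size_t rank = 0; rank < numRanks; ++rank)
    perRank.push_back(randomMatrix(n, 1000u + static_cast<unsigned>(rank)));
  return perRank;
}

// One possible remedy: a single rank generates the matrix and every
// other rank receives a copy (a simulated broadcast).
std::vector<DenseMatrix> randomizeAndBroadcast(std::size_t n, std::size_t numRanks) {
  assert(numRanks > 0);
  const DenseMatrix root = randomMatrix(n, 1000u);
  return std::vector<DenseMatrix>(numRanks, root);
}

// True when every rank holds exactly the same coefficients.
bool allRanksAgree(const std::vector<DenseMatrix>& perRank) {
  for (const auto& m : perRank)
    if (m != perRank.front()) return false;
  return true;
}
```

In this sketch, `allRanksAgree(randomizePerRank(3, 4))` is false while `allRanksAgree(randomizeAndBroadcast(3, 4))` is true. Any distributed operation that applies such coefficients to a multivector would see inconsistent inputs in the first case, which would fit the intermittent failures reported in this issue.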
@hkthorn Wow! Thanks for finding this; sounds tricky!
@hkthorn, so this is a defect in the tests not the library code that users depend on?
Let me know when you have merged the fix into the Trilinos 'develop' branch and then I will re-enable these tests and we will let them run in the ATDM builds of Trilinos.
@bartlettroscoe @mhoemmen Absolutely, this is a defect in the design of the test. I will let you know when the fix is in Trilinos 'develop' branch so we can re-enable the tests for ATDM builds.
It looks like the test Anasazi_Epetra_LOBPCG_solvertest_MPI_4
may also have some random failures. We saw the following failure for this test in the build Trilinos-atdm-white-ride-gnu-debug-openmp
on 'white' on 4/18/2018:
which showed:
Anasazi in Trilinos 12.13 (Dev)
Testing solver(default,default) with standard eigenproblem...
Testing solver(default,default) with generalized eigenproblem...
Testing solver(nev,false) with standard eigenproblem...
Testing solver(nev,true) with standard eigenproblem...
Testing solver(nev,false) with generalized eigenproblem...
Testing solver(nev,true) with generalized eigenproblem...
Testing solver(2*nev,false) with standard eigenproblem...
Testing solver(2*nev,true) with standard eigenproblem...
Testing solver(2*nev,false) with generalized eigenproblem...
Testing solver(2*nev,true) with generalized eigenproblem...
[white25:127665] *** Process received signal ***
[white25:127665] Signal: Segmentation fault (11)
[white25:127665] Signal code: Address not mapped (1)
[white25:127665] Failing at address: 0x10024850020
[white25:127665] [ 0] [0x100000050478]
[white25:127665] [ 1] [0x3ff0000000000000]
[white25:127665] *** End of error message ***
--------------------------------------------------------------------------
mpiexec noticed that process rank 3 with PID 127665 on node white25 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
Looking at the query:
it looks like this test also failed on 'ride' in the same build on 4/3/2018 with the output:
Anasazi in Trilinos 12.13 (Dev)
Testing solver(default,default) with standard eigenproblem...
Testing solver(default,default) with generalized eigenproblem...
Testing solver(nev,false) with standard eigenproblem...
Testing solver(nev,true) with standard eigenproblem...
Testing solver(nev,false) with generalized eigenproblem...
Testing solver(nev,true) with generalized eigenproblem...
Testing solver(2*nev,false) with standard eigenproblem...
Testing solver(2*nev,true) with standard eigenproblem...
[ride13:114533] *** Process received signal ***
[ride13:114533] Signal: Segmentation fault (11)
[ride13:114533] Signal code: Address not mapped (1)
[ride13:114533] Failing at address: 0x10036020010
[ride13:114533] [ 0] [0x100000050478]
[ride13:114533] [ 1] [0x3ff0000000000000]
[ride13:114533] *** End of error message ***
--------------------------------------------------------------------------
mpiexec noticed that process rank 1 with PID 114533 on node ride13 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
We can keep an eye out to see if this test fails again in this build or some other build. But if it does, we should likely disable this test for now.
I'll give the test a look to see if there are any bad patterns there. I have merged the PR that fixes the testing for the OrthoManager and ModalSolvers:
https://github.com/trilinos/Trilinos/pull/2517
Thanks!
PR #2621, which re-enables these tests, was merged. Now we wait and see how they run and whether they fail in the coming days and weeks. I am removing the "Disabled Tests" label.
NOTE: The test Anasazi_Epetra_LOBPCG_solvertest_MPI_4
that was randomly failing as described above is still randomly failing with a segfault, as recently as 2018-04-23. Therefore, since PR #2517 did not fix this test, we can assume it is unrelated to the other Anasazi tests covered in this issue. I created the new issue #2633 to address the issues with that test.
Therefore, all that is left for this current issue is to watch and see if we see any more random failures with the tests Anasazi_Epetra_ModalSolversTester_MPI_4
and Anasazi_Epetra_OrthoManagerGenTester_[0,1]_MPI_4
...
Looking at the recent history for these tests on CDash after 5/19/2018 (when the NETLIB BLAS and LAPACK got put back as described in https://github.com/trilinos/Trilinos/issues/2454#issuecomment-390451738) in the following queries:
We can see these tests did not fail a single time and it shows these tests running in the Trilinos-atdm-white-ride-gnu-debug-openmp
and Trilinos-atdm-white-ride-cuda-debug
builds.
Therefore, this issue appears to be resolved.
Closing as complete.
CC: @trilinos/anasazi, @mhoemmen
Next Action Status
PR #2621 merged on 4/24/2018 that re-enables the tests Anasazi_Epetra_ModalSolversTester_MPI_4 and Anasazi_Epetra_OrthoManagerGenTester_[0,1]_MPI_4. Tests ran and passed in all promoted ATDM Trilinos builds between 5/20/2018 and 6/7/2018.
Description
The tests:
Anasazi_Epetra_ModalSolversTester_MPI_4
Anasazi_Epetra_OrthoManagerGenTester_0_MPI_4
Anasazi_Epetra_OrthoManagerGenTester_1_MPI_4
failed in the Trilinos-atdm-hansen-shiller-cuda-debug build on 'ride' as shown at:
This build is targeted to be an auto PR build for Trilinos (see #2464) so we desire to clean up this build more quickly.
Interestingly, these tests did not fail in what should be the identical Trilinos-atdm-hansen-shiller-cuda-debug build on the identical machine 'white' as shown at:
Strangely, those tests did fail on the Trilinos-atdm-hansen-shiller-cuda-debug build on 'white' yesterday shown at:
A) Anasazi_Epetra_ModalSolversTester_MPI_4:
The failing test Anasazi_Epetra_ModalSolversTester_MPI_4 today with details shown at:
showed the failure:
Looking at all of the builds today that ran that test shown at:
this test fails in the same way (i.e. a numerical problem) on the builds Linux-gcc-4.8.4-MPI_RELEASE_12.12.1 and Linux-gcc-4.8.4-MPI_RELEASE_12.12.1_SHARED on the machine hansel.sandia.gov so this problem is not isolated to ATDM builds of Trilinos.
Also note that this test failed for the ATDM builds Trilinos-atdm-white-ride-gnu-opt-openmp and Trilinos-atdm-white-ride-gnu-opt-openmp with segfaults, but that is already being addressed by #2454 and is likely unrelated.
B) Anasazi_Epetra_OrthoManagerGenTester_0_MPI_4:
The failing test Anasazi_Epetra_OrthoManagerGenTester_0_MPI_4 today with details shown at:
showed:
Looking at all of the builds today that ran that test shown at:
you can see that this test also failed in a similar (numerical) way in the builds Linux-gcc-4.9.3-Sierra_MPI_release_DEV_ETI_SERIAL-ON_OPENMP-ON_PTHREAD-OFF_CUDA-OFF_COMPLEX-ON and Linux-GCC-4.9.3-openmpi-1.8.7_Debug_DEV_Werror so it looks like this problem is not isolated to ATDM builds of Trilinos. Note that one of those is a 'Sierra' build of Trilinos.