trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/

Tests Anasazi_Epetra_ModalSolversTester_MPI_4 and Anasazi_Epetra_OrthoManagerGenTester_[0,1]_MPI_4 failing in 'debug' builds on white/ride #2473

bartlettroscoe closed 6 years ago

bartlettroscoe commented 6 years ago

CC: @trilinos/anasazi, @mhoemmen

Next Action Status

PR #2621 merged on 4/24/2018 that re-enables the tests Anasazi_Epetra_ModalSolversTester_MPI_4 and Anasazi_Epetra_OrthoManagerGenTester_[0,1]_MPI_4 . Tests ran and passed in all promoted ATDM Trilinos builds between 5/20/2018 and 6/7/2018.

Description

The tests:

failed in Trilinos-atdm-hansen-shiller-cuda-debug build on 'ride' as shown at:

This build is targeted to be an auto PR build for Trilinos (see #2464) so we desire to clean up this build more quickly.

Interestingly, these tests did not fail in what should be the identical Trilinos-atdm-hansen-shiller-cuda-debug build on the identical machine 'white' as shown at:

Strangely, those tests did fail in the Trilinos-atdm-hansen-shiller-cuda-debug build on 'white' yesterday as shown at:

A) Anasazi_Epetra_ModalSolversTester_MPI_4:

The failing test Anasazi_Epetra_ModalSolversTester_MPI_4 today with details shown at:

showed the failure:

************* Householder Apply Test *************

             orthonorm error of V: 7.08978e-16
            orthonorm error of VQ: 0.375867
ERROR:  V*Q failed.
    orthonorm error of applyHouse: 0.375867
ERROR:  applyHouse failed.
        error(VQ - house(V,H,tau): 2.64481e-16

************* DirectSolver Test *************

Looking at all of the builds today that ran that test shown at:

this test fails in the same way (i.e. a numerical problem) on the builds Linux-gcc-4.8.4-MPI_RELEASE_12.12.1 and Linux-gcc-4.8.4-MPI_RELEASE_12.12.1_SHARED on the machine hansel.sandia.gov so this problem is not isolated to ATDM builds of Trilinos.

Also note that this test failed for the ATDM builds Trilinos-atdm-white-ride-gnu-opt-openmp and Trilinos-atdm-white-ride-gnu-opt-openmp with segfaults, but that is already being addressed by #2454 and is likely unrelated.

B) Anasazi_Epetra_OrthoManagerGenTester_0_MPI_4:

The failing test Anasazi_Epetra_OrthoManagerGenTester_0_MPI_4 today with details shown at:

showed:

Anasazi in Trilinos 12.13 (Dev)

 Generating Y1,Y2 for project() : testing... 
   || <Y1,Y1> - I || : 6.47718e-16
   || <Y2,Y2> - I || : 7.20309e-16
   || <X1,Y2> ||     : 1.64775e-16
   || <X1b,Y2> ||     : 6.9984e-15

p=3: *** Caught standard std::exception of type 'std::runtime_error' :

 /home/jenkins/ride/workspace/Trilinos-atdm-white-ride-cuda-debug/SRC_AND_BUILD/Trilinos/packages/anasazi/epetra/test/OrthoManager/cxx_gentest.cpp:274:

 Throw number = 1

 Throw test that evaluated to true: err > TOL

 New X1 did not meet tolerance: orthog(X1,Y2) == 0.547032

Looking at all of the builds today that ran that test shown at:

you can see that this test also failed in a similar (numerical) way in the builds Linux-gcc-4.9.3-Sierra_MPI_release_DEV_ETI_SERIAL-ON_OPENMP-ON_PTHREAD-OFF_CUDA-OFF_COMPLEX-ON and Linux-GCC-4.9.3-openmpi-1.8.7_Debug_DEV_Werror, so it looks like this problem is not isolated to ATDM builds of Trilinos. Note that one of those is a "Sierra" build of Trilinos.
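For readers unfamiliar with the quantities in these logs, here is a minimal sketch of how residuals such as `|| <X1,Y2> ||` and a check like `err > TOL` can be computed. It assumes Frobenius norms of small Gram matrices and uses a hypothetical tolerance; the function names and data are illustrative and are not taken from the Anasazi test source.

```python
import math

def gram(X, Y):
    """Return the small matrix X^T Y; X and Y are lists of rows with equal row counts."""
    return [[sum(xr[i] * yr[j] for xr, yr in zip(X, Y))
             for j in range(len(Y[0]))]
            for i in range(len(X[0]))]

def frobenius(M):
    """Frobenius norm of a small dense matrix (assumed norm, see lead-in)."""
    return math.sqrt(sum(v * v for row in M for v in row))

def orthonorm_error(X):
    """|| <X,X> - I ||: how far the columns of X are from being orthonormal."""
    G = gram(X, X)
    for i in range(len(G)):
        G[i][i] -= 1.0
    return frobenius(G)

def orthog_error(X, Y):
    """|| <X,Y> ||: how far the columns of X are from being orthogonal to those of Y."""
    return frobenius(gram(X, Y))

TOL = 1e-12  # hypothetical tolerance, not the value used by the real test

# X1 and Y2 span orthogonal subspaces; Y_bad leaks a component into span(X1).
X1 = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
Y2 = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
Y_bad = [[0.6, 0.0], [0.0, 0.0], [0.8, 0.0], [0.0, 1.0]]

for name, Y in (("Y2", Y2), ("Y_bad", Y_bad)):
    err = orthog_error(X1, Y)
    status = "FAILED" if err > TOL else "ok"
    print(f"|| <X1,{name}> || : {err:.6g}  [{status}]")
```

A healthy run keeps these residuals near machine precision (around 1e-16 to 1e-14, as in the first lines of the log), so values such as 0.547032 indicate that the bases are not orthogonal at all, not a borderline tolerance miss.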

bartlettroscoe commented 6 years ago

This test Anasazi_Epetra_OrthoManagerGenTester_0_MPI_4 newly failed in the build Trilinos-atdm-white-ride-cuda-debug on 'white' today as shown at:

showing:

Generating Y1,Y2 for project() : testing... 
   || <Y1,Y1> - I || : 7.13673e-16
   || <Y2,Y2> - I || : 7.85286e-16
   || <X1,Y2> ||     : 1.71386e-16
   || <X1b,Y2> ||     : 7.10285e-15

p=1: *** Caught standard std::exception of type 'std::runtime_error' :

 /home/rabartl/WHITE/ATDM_Driver/Trilinos-atdm-white-ride-cuda-debug/SRC_AND_BUILD/Trilinos/packages/anasazi/epetra/test/OrthoManager/cxx_gentest.cpp:274:

 Throw number = 1

 Throw test that evaluated to true: err > TOL

 New X1 did not meet tolerance: orthog(X1,Y2) == 0.356233

...

It passed yesterday in the same build as shown at:

Looking at the history of this test on this build on 'white' in the query:

it fails three other times on various days going back to 3/12/2018. This suggests non-deterministic behavior causing the test to randomly fail.

Does this test expose some non-deterministic behavior in Anasazi or in the underlying software being used? Could this be exposing a weakness in Trilinos software that could bite a user in a CUDA build?

In any case, I think this test should be disabled for now on these CUDA debug builds so that we can promote this build Trilinos-atdm-white-ride-cuda-debug to the "ATDM" CDash Track/Group which opens the door to using it as an auto PR build for Trilinos (which will be huge for stabilizing Trilinos for ATDM customers). Then, someone can debug this test offline when they get some time.

@mhoemmen, what do you think about this? Is it okay to disable this test for now until someone can debug what is causing the non-deterministic behavior?

mhoemmen commented 6 years ago

@bartlettroscoe wrote:

Is it okay to disable this test for now until someone can debug what is causing the non-deterministic behavior?

@hkthorn may have something to say, but I think it would be best to disable the test for now, as long as we don't "forget" (e.g., as long as we open a separate issue for the failing tests).

bartlettroscoe commented 6 years ago

I think it would be best to disable the test for now, as long as we don't "forget" (e.g., as long as we open a separate issue for the failing tests).

@mhoemmen and @hkthorn,

One option is to leave these issues open with the label "Disabled Tests" and assign them to the Product Lead for the area. Who is the Product Lead for Anasazi? Is that @srajama1?

srajama1 commented 6 years ago

Anasazi is a problem child that got stuck with a (linear solvers) family where it may not belong :). Yes, I am the lead. Let us wait for what @hkthorn says.

I worry this might be exposing something non-deterministic underneath.

bartlettroscoe commented 6 years ago

These randomly failing tests triggered the following CDash error email for the newly promoted build ??? this morning.

Can I go ahead and disable these randomly failing tests in these builds? The tests will only be disabled for these builds and not others where the tests are passing consistently.


From: CDash [mailto:trilinos-regression@sandia.gov]
Sent: Saturday, March 31, 2018 2:48 AM
To: Bartlett, Roscoe A rabartl@sandia.gov
Subject: FAILED (t=2): Trilinos/Anasazi - Trilinos-atdm-white-ride-gnu-debug-openmp - ATDM

A submission to CDash for the project Trilinos has failing tests. You have been identified as one of the authors who have checked in changes that are part of this submission or you are listed in the default contact list.

Details on the submission can be found at https://testing.sandia.gov/cdash/buildSummary.php?buildid=3474500

Project: Trilinos
SubProject: Anasazi
Site: white
Build Name: Trilinos-atdm-white-ride-gnu-debug-openmp
Build Time: 2018-03-31T06:45:53 UTC
Type: ATDM
Tests failing: 2

Tests failing
Anasazi_Epetra_ModalSolversTester_MPI_4 (https://testing.sandia.gov/cdash/testDetails.php?test=46065301&build=3474500)
Anasazi_Epetra_OrthoManagerGenTester_0_MPI_4 (https://testing.sandia.gov/cdash/testDetails.php?test=46065302&build=3474500)

-CDash on testing.sandia.gov

mhoemmen commented 6 years ago

@bartlettroscoe Please do; thanks!

hkthorn commented 6 years ago

@bartlettroscoe @srajama1 @mhoemmen Go ahead and disable the failing tests for this platform, I have seen this issue before. Thanks!

bartlettroscoe commented 6 years ago

From @hkthorn:

Go ahead and disable the failing tests for this platform, I have seen this issue before. Thanks!

Okay, I will disable these failing tests. However, also note that we saw two new failing Anasazi tests for this build today shown in the below email.

The first test Anasazi_Epetra_OrthoManagerGenTester_0_MPI_4 was a segfault. The last two look to be diffs.

Should we disable these tests as well? If not, does someone in the Linear Solvers area have time to triage these some more? We either need to fix the tests or disable them (and then leave this issue as a reminder to fix them, along with other approaches that we can consider to keep reminders of disabled tests).


From: CDash [mailto:trilinos-regression@sandia.gov]
Sent: Tuesday, April 03, 2018 1:32 AM
To: Bartlett, Roscoe A
Subject: FAILED (t=3): Trilinos/Anasazi - Trilinos-atdm-white-ride-gnu-debug-openmp - ATDM

A submission to CDash for the project Trilinos has failing tests. You have been identified as one of the authors who have checked in changes that are part of this submission or you are listed in the default contact list.

Details on the submission can be found at https://testing.sandia.gov/cdash/buildSummary.php?buildid=3480083

Project: Trilinos
SubProject: Anasazi
Site: ride
Build Name: Trilinos-atdm-white-ride-gnu-debug-openmp
Build Time: 2018-04-03T07:30:22 UTC
Type: ATDM
Tests failing: 3

Tests failing
Anasazi_Epetra_OrthoManagerGenTester_0_MPI_4 (https://testing.sandia.gov/cdash/testDetails.php?test=46173794&build=3480083)
Anasazi_Epetra_OrthoManagerGenTester_1_MPI_4 (https://testing.sandia.gov/cdash/testDetails.php?test=46173795&build=3480083)
Anasazi_Epetra_LOBPCG_solvertest_MPI_4 (https://testing.sandia.gov/cdash/testDetails.php?test=46173813&build=3480083)

-CDash on testing.sandia.gov

bartlettroscoe commented 6 years ago

If you look at the query:

(which shows all of the failing Anasazi tests in the last two weeks that have not already been disabled (see #2455) or are not in the 'opt' builds on white/ride (see #2454)), you can see that the tests:

fail multiple times on various days in the two builds:

All three of these tests failed multiple days in the Trilinos-atdm-white-ride-cuda-debug build which is being targeted for an auto PR testing build (see #2464). Therefore, these should be disabled (as @hkthorn noted above).

The test Anasazi_Epetra_LOBPCG_solvertest_MPI_4 only failed today in the build Trilinos-atdm-white-ride-gnu-debug-openmp as shown in the above query. Therefore, this might have been a fluke so we should not disable this yet.

bartlettroscoe commented 6 years ago

FYI: I created PR #2501 to disable these three randomly failing tests. I requested a review from @mhoemmen and/or @hkthorn.

bartlettroscoe commented 6 years ago

Just realized that the @trilinos/framework team ran into these same randomly failing tests in #1393 and they resolved the issue by disabling those tests as well. So it looks like this is the right decision to disable these tests in the ATDM builds.

But it also suggests that perhaps the problems with these tests should be studied more carefully, or that these tests just need to be disabled altogether. That way, other people and projects will not run into these randomly failing tests over and over again. And if these are the only real tests for "ModalSolvers" in Anasazi, then perhaps that feature is not ready to be used by people and should be disabled by default as experimental code or something? Then we could set up some build of Trilinos for all of this "Experimental" code so at least we know how it is doing.

bartlettroscoe commented 6 years ago

PR #2501 was merged just now, merging the commit 2e9da0c. Therefore, we should see these three tests disabled for these builds on white/ride tomorrow.

Putting this issue in review.

hkthorn commented 6 years ago

@bartlettroscoe @mhoemmen @srajama1 I have found the underlying issue in these tests. They use a Teuchos::SerialDenseMatrix, which is a serial object without MPI communication or implied synchronization of values. These matrices are randomized on each processor and then used to perform tests of the orthogonalization routines and modal solvers. Again, there is no explicit synchronization of Teuchos SDM objects, so when the randomization generates different matrices on different processors, the tests fail because the explicit expectations of the classes being tested, orthogonalization and modal solvers, are violated. I have a feeling this pattern might be in Belos as well. I will fix this today.
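To illustrate the failure mode described above, here is a minimal single-process sketch in plain Python. It is not Trilinos code: the four "ranks" are simulated as row blocks of one tall matrix `V`, the small per-rank matrix stands in for a serial dense matrix, and all names, sizes, and seeds are hypothetical.

```python
import math
import random

def random_matrix(rng, rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def orthonormalize_columns(A):
    """Modified Gram-Schmidt on the columns of A (list of rows)."""
    rows, cols = len(A), len(A[0])
    Q = [row[:] for row in A]
    for j in range(cols):
        for i in range(j):
            dot = sum(Q[k][i] * Q[k][j] for k in range(rows))
            for k in range(rows):
                Q[k][j] -= dot * Q[k][i]
        norm = math.sqrt(sum(Q[k][j] ** 2 for k in range(rows)))
        for k in range(rows):
            Q[k][j] /= norm
    return Q

def matmul(A, B):
    return [[sum(a * B[k][j] for k, a in enumerate(row))
             for j in range(len(B[0]))] for row in A]

def orthonorm_error(V):
    """Frobenius norm of V^T V - I."""
    n = len(V[0])
    total = 0.0
    for i in range(n):
        for j in range(n):
            g = sum(row[i] * row[j] for row in V) - (1.0 if i == j else 0.0)
            total += g * g
    return math.sqrt(total)

def apply_per_rank(V, small_by_rank, rows_per_rank):
    """Each simulated rank multiplies its row block of V by its own copy of the small matrix."""
    out = []
    for rank, small in enumerate(small_by_rank):
        block = V[rank * rows_per_rank:(rank + 1) * rows_per_rank]
        out.extend(matmul(block, small))
    return out

NUM_RANKS, ROWS_PER_RANK, NCOLS = 4, 25, 3  # hypothetical sizes
V = orthonormalize_columns(
    random_matrix(random.Random(0), NUM_RANKS * ROWS_PER_RANK, NCOLS))

# Consistent case: every rank holds the same small orthogonal matrix.
same = [orthonormalize_columns(random_matrix(random.Random(42), NCOLS, NCOLS))
        for _ in range(NUM_RANKS)]
# Inconsistent case: every rank randomizes its own copy independently.
diff = [orthonormalize_columns(random_matrix(random.Random(100 + r), NCOLS, NCOLS))
        for r in range(NUM_RANKS)]

err_same = orthonorm_error(apply_per_rank(V, same, ROWS_PER_RANK))
err_diff = orthonorm_error(apply_per_rank(V, diff, ROWS_PER_RANK))
print(f"orthonorm error of V:                       {orthonorm_error(V):.3e}")
print(f"orthonorm error of VQ (same Q everywhere):  {err_same:.3e}")
print(f"orthonorm error of VQ (different Q / rank): {err_diff:.3e}")
```

With identical copies the product stays orthonormal to roundoff. With independent copies each row block is rotated differently, so the global result is no longer `V` times a single orthogonal matrix and the error sits many orders of magnitude above roundoff. That is consistent with the pattern in the logs above, where `V` itself passes at about 1e-16 while `VQ` fails at order 0.1.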

mhoemmen commented 6 years ago

@hkthorn Wow! Thanks for finding this; sounds tricky!

bartlettroscoe commented 6 years ago

@hkthorn, so this is a defect in the tests not the library code that users depend on?

Let me know when you have merged the fix into the Trilinos 'develop' branch and then I will re-enable these tests and we will let them run in the ATDM builds of Trilinos.

hkthorn commented 6 years ago

@bartlettroscoe @mhoemmen Absolutely, this is a defect in the design of the test. I will let you know when the fix is in Trilinos 'develop' branch so we can re-enable the tests for ATDM builds.
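The change that actually went into Trilinos is not described in this thread. As a generic, hypothetical illustration of how a test can keep replicated serial data consistent, the sketch below generates the random values on one root rank and copies them to every other rank (a stand-in for an MPI broadcast). Ranks are simulated in a single process, and every name here is invented for the example.

```python
import random

def generate_local(rank, rows, cols):
    """Problematic pattern: each rank draws its own random values.

    The rank-dependent seed is a stand-in for whatever per-process state
    makes the generators diverge in a real MPI run.
    """
    rng = random.Random(1000 + rank)
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def generate_replicated(num_ranks, rows, cols, root=0):
    """Safer pattern: the root generates once and every rank receives a copy.

    In a real MPI program the copy step would be a broadcast from `root`.
    """
    root_values = generate_local(root, rows, cols)
    return [[row[:] for row in root_values] for _ in range(num_ranks)]

def all_ranks_agree(copies):
    """True when every simulated rank holds exactly the same values."""
    return all(copy == copies[0] for copy in copies[1:])

NUM_RANKS = 4  # matches the MPI_4 suffix of the tests; otherwise arbitrary
independent = [generate_local(r, 3, 3) for r in range(NUM_RANKS)]
replicated = generate_replicated(NUM_RANKS, 3, 3)

print("independent copies agree:", all_ranks_agree(independent))
print("replicated copies agree: ", all_ranks_agree(replicated))
```

Seeding every rank identically is another option, but it relies on each process consuming the generator in exactly the same order, which is why an explicit copy from one rank is often the more robust choice.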

bartlettroscoe commented 6 years ago

It looks like the test Anasazi_Epetra_LOBPCG_solvertest_MPI_4 may also have some random failures. We saw the following failure for this test in the build Trilinos-atdm-white-ride-gnu-debug-openmp on 'white' on 4/18/2018:

which showed:

Anasazi in Trilinos 12.13 (Dev)

Testing solver(default,default) with standard eigenproblem...
Testing solver(default,default) with generalized eigenproblem...
Testing solver(nev,false) with standard eigenproblem...
Testing solver(nev,true) with standard eigenproblem...
Testing solver(nev,false) with generalized eigenproblem...
Testing solver(nev,true) with generalized eigenproblem...
Testing solver(2*nev,false) with standard eigenproblem...
Testing solver(2*nev,true) with standard eigenproblem...
Testing solver(2*nev,false) with generalized eigenproblem...
Testing solver(2*nev,true) with generalized eigenproblem...
[white25:127665] *** Process received signal ***
[white25:127665] Signal: Segmentation fault (11)
[white25:127665] Signal code: Address not mapped (1)
[white25:127665] Failing at address: 0x10024850020
[white25:127665] [ 0] [0x100000050478]
[white25:127665] [ 1] [0x3ff0000000000000]
[white25:127665] *** End of error message ***
--------------------------------------------------------------------------
mpiexec noticed that process rank 3 with PID 127665 on node white25 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Looking at the query:

it looks like this test also failed on 'ride' in the same build on 4/3/2018 with the output:


Anasazi in Trilinos 12.13 (Dev)

Testing solver(default,default) with standard eigenproblem...
Testing solver(default,default) with generalized eigenproblem...
Testing solver(nev,false) with standard eigenproblem...
Testing solver(nev,true) with standard eigenproblem...
Testing solver(nev,false) with generalized eigenproblem...
Testing solver(nev,true) with generalized eigenproblem...
Testing solver(2*nev,false) with standard eigenproblem...
Testing solver(2*nev,true) with standard eigenproblem...
[ride13:114533] *** Process received signal ***
[ride13:114533] Signal: Segmentation fault (11)
[ride13:114533] Signal code: Address not mapped (1)
[ride13:114533] Failing at address: 0x10036020010
[ride13:114533] [ 0] [0x100000050478]
[ride13:114533] [ 1] [0x3ff0000000000000]
[ride13:114533] *** End of error message ***
--------------------------------------------------------------------------
mpiexec noticed that process rank 1 with PID 114533 on node ride13 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

We can keep an eye out to see if this test fails again in this build or some other build. But if it does, we should likely disable this test for now.

hkthorn commented 6 years ago

I'll give the test a look to see if there are any bad patterns there. I have merged the PR that fixes the testing for the OrthoManager and ModalSolvers:

https://github.com/trilinos/Trilinos/pull/2517

Thanks!

bartlettroscoe commented 6 years ago

The PR #2621 was merged that re-enables these tests. Now we wait and see how they run and if they fail or not in the coming days and weeks. I am removing the "Disabled Tests" label.

bartlettroscoe commented 6 years ago

NOTE: The test Anasazi_Epetra_LOBPCG_solvertest_MPI_4 that was randomly failing as described above is still randomly failing with a segfault, as recently as 2018-04-23. Therefore, since PR #2517 did not fix this test, we can assume it is unrelated to the other Anasazi tests covered in this issue. I created the new issue #2633 to address the issues with that test.

Therefore, all that is left for this current issue is to watch and see if we see any more random failures with the tests Anasazi_Epetra_ModalSolversTester_MPI_4 and Anasazi_Epetra_OrthoManagerGenTester_[0,1]_MPI_4 ...

bartlettroscoe commented 6 years ago

Looking at the recent history for these tests on CDash after 5/19/2018 (when the NETLIB BLAS and LAPACK got put back as described in https://github.com/trilinos/Trilinos/issues/2454#issuecomment-390451738) in the following queries:

We can see that these tests ran in the Trilinos-atdm-white-ride-gnu-debug-openmp and Trilinos-atdm-white-ride-cuda-debug builds and did not fail a single time.

Therefore, this issue appears to be resolved.

Closing as complete.