Closed fryeguy52 closed 4 years ago
@trilinos/muelu,
These tests are showing errors like the one shown here:
...
STS::magnitude(diagVec->norm1() - diagVec->getGlobalLength()) < 100*TMT::eps() = false == true = true : FAILED ==> /gpfs1/jenkins/serrano-slave/workspace/Trilinos-atdm-serrano-intel-opt-openmp/SRC_AND_BUILD/Trilinos/packages/muelu/test/unit_tests_kokkos/TentativePFactory_kokkos.cpp:362
...
[FAILED] (0.00249 sec) TentativePFactory_kokkos_double_int_int_Kokkos_Compat_KokkosOpenMPWrapperNode_MakeTentativeVectorBasedUsingDefaultNullSpace_UnitTest
Location: /gpfs1/jenkins/serrano-slave/workspace/Trilinos-atdm-serrano-intel-opt-openmp/SRC_AND_BUILD/Trilinos/packages/muelu/test/unit_tests_kokkos/TentativePFactory_kokkos.cpp:264
Can someone please update this unit test to use TEST_FLOATING_EQUALITY() so we can see the actual numbers and how far off the comparison is? For example, you could use:
TEST_FLOATING_EQUALITY(
  STS::magnitude(diagVec->norm1()),
  STS::magnitude(diagVec->getGlobalLength()),
  STS::magnitude(100*TMT::eps())
  );
That would print the numbers being compared and the tolerance so we can see why this is failing.
@bartlettroscoe sure, I can work on this. I recently modified a few things in MueLu to use the epsilon function from Teuchos instead of a hard-coded value, and that could be the reason for this failure.
@bartlettroscoe @fryeguy52 I have a PR in progress that updates a few tests in MueLu that were failing for various reasons. Among them is the offending unit test pointed to above.
I also want to point out that the pull request auto-tester does not seem to turn on Teuchos_GLOBALLY_REDUCE_UNITTEST_RESULTS,
which seems odd and potentially dangerous!
@lucbv said:
I also want to point out that the pull request auto-tester does not seem to turn on
Teuchos_GLOBALLY_REDUCE_UNITTEST_RESULTS,
which seems odd and potentially dangerous!
Right. Someone needs to clean up all of the flaky Trilinos tests so we can enable that. But for tests you control, just use the unit test driver Teuchos_StandardParallelUnitTestMain.cpp, which will globally reduce unit test results.
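As a rough illustration of why skipping the global reduction can hide failures, the following sketch simulates per-rank results with a plain vector. Both function names are hypothetical, the vector stands in for an MPI reduction, and the assumption that only rank 0's outcome is reported without the reduction is made for the example rather than taken from Teuchos.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical stand-in for an MPI reduction: each entry is one rank's
// local pass/fail result. The test passes only if every rank passed.
inline bool globallyReducedSuccess(const std::vector<bool>& localSuccess) {
  return std::all_of(localSuccess.begin(), localSuccess.end(),
                     [](bool ok) { return ok; });
}

// Assumed behavior without a reduction: only rank 0's result is observed.
inline bool rankZeroOnlySuccess(const std::vector<bool>& localSuccess) {
  return localSuccess.empty() ? true : localSuccess.front();
}
```

With results `{true, true, false, true}`, the rank-0-only view reports a pass while the reduced view reports a failure, which is the kind of masked failure the comment above warns about.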
@fryeguy52 @bartlettroscoe there is some progress: at least the serial test is now passing (see this query). @csiefer2 do you have any time to look at the issue with the RAPShift factory on waterman?
FYI: There are still lots of random failures of these tests in the build Trilinos-atdm-waterman_cuda-9.2_shared_opt, as shown here. The test MueLu_UnitTestsTpetra_MPI_1 failed once in the last 10 days and the test MueLu_UnitTestsTpetra_MPI_4 failed 8 times in the last 10 days.
But as shown in this query, it looks like Trilinos-atdm-waterman_cuda-9.2_shared_opt is the only build where these tests failed in the last 10 days.
FYI: These tests showed failures in unit tests with Compat_KokkosCudaWrapperNode in the unit test name 4 times from 9/1/2019 through 10/10/2019, as shown in this query:
Test Name | Status | Time | Details | Build Time | Processors |
---|---|---|---|---|---|
MueLu_UnitTestsTpetra_MPI_1 | Failed | 13s 360ms | Completed (Failed) | 2019-10-10T03:09:44 MDT | 1 |
MueLu_UnitTestsTpetra_MPI_1 | Failed | 16s 440ms | Completed (Failed) | 2019-09-16T03:06:54 MDT | 1 |
MueLu_UnitTestsTpetra_MPI_1 | Failed | 16s 370ms | Completed (Failed) | 2019-09-15T03:04:21 MDT | 1 |
MueLu_UnitTestsTpetra_MPI_4 | Failed | 28s 50ms | Completed (Failed) | 2019-09-06T03:05:34 MDT | 4 |
But there are 16 failures of these tests in this build over that time period, as shown in this query, with the errors:
mpiexec noticed that process rank <rankid> with PID 0 on node waterman<num> exited on signal 6 (Aborted).
and
mpiexec noticed that process rank <rankid> with PID 0 on node waterman<num> exited on signal 9 (Killed).
Lest one think these are just random failures that impact more than just MueLu tests in this build, this query shows that only MueLu tests are showing this error (20 in all). There are no Panzer, Tempus, or other downstream package tests that show these errors in this build. The MueLu tests showing this are:
Test Name | Status | Time | Details | Build Time | Processors |
---|---|---|---|---|---|
MueLu_BlockCrs-Tpetra_MPI_4 | Failed | 5s 600ms | Completed (Failed) | 2019-10-10T03:09:44 MDT | 4 |
MueLu_DriverTpetra_WithGlobalConstants_MPI_4 | Failed | 6s 210ms | Completed (Failed) | 2019-09-22T03:06:15 MDT | 4 |
MueLu_ImportPerformance_Tpetra_MPI_4 | Failed | 4s 320ms | Completed (Failed) | 2019-10-04T03:05:06 MDT | 4 |
MueLu_ParameterListInterpreterTpetra_MPI_1 | Failed | 1m 18s 800ms | Completed (Failed) | 2019-10-06T03:09:36 MDT | 1 |
MueLu_ParameterListInterpreterTpetra_MPI_1 | Failed | 1m 11s 280ms | Completed (Failed) | 2019-09-27T03:07:24 MDT | 1 |
MueLu_ParameterListInterpreterTpetra_MPI_1 | Failed | 1m 29s 720ms | Completed (Failed) | 2019-09-21T03:05:07 MDT | 1 |
MueLu_ParameterListInterpreterTpetra_MPI_1 | Failed | 1m 32s 10ms | Completed (Failed) | 2019-09-20T03:06:56 MDT | 1 |
MueLu_ParameterListInterpreterTpetra_MPI_4 | Failed | 44s 560ms | Completed (Failed) | 2019-09-24T03:06:13 MDT | 4 |
MueLu_SimpleTpetra_MPI_4 | Failed | 5s 200ms | Completed (Failed) | 2019-09-13T03:06:22 MDT | 4 |
MueLu_SimpleTpetraYaml_MPI_4 | Failed | 5s 510ms | Completed (Failed) | 2019-09-23T03:04:03 MDT | 4 |
MueLu_SimpleTpetraYaml_MPI_4 | Failed | 5s 100ms | Completed (Failed) | 2019-09-22T03:06:15 MDT | 4 |
MueLu_SimpleTpetraYaml_MPI_4 | Failed | 4s 520ms | Completed (Failed) | 2019-09-14T03:06:46 MDT | 4 |
MueLu_Structured_Laplace2D_Shift_Tpetra_MPI_4 | Failed | 5s 200ms | Completed (Failed) | 2019-10-05T03:04:22 MDT | 4 |
MueLu_Structured_Laplace2D_Tpetra_MPI_4 | Failed | 5s 110ms | Completed (Failed) | 2019-09-23T03:04:03 MDT | 4 |
MueLu_Structured_Laplace2D_Tpetra_MPI_4 | Failed | 5s 740ms | Completed (Failed) | 2019-09-21T03:05:07 MDT | 4 |
MueLu_UnitTestsTpetra_MPI_1 | Failed | 13s 360ms | Completed (Failed) | 2019-10-10T03:09:44 MDT | 1 |
MueLu_UnitTestsTpetra_MPI_1 | Failed | 16s 440ms | Completed (Failed) | 2019-09-16T03:06:54 MDT | 1 |
MueLu_UnitTestsTpetra_MPI_1 | Failed | 16s 370ms | Completed (Failed) | 2019-09-15T03:04:21 MDT | 1 |
MueLu_UnitTestsTpetra_MPI_4 | Failed | 28s 50ms | Completed (Failed) | 2019-09-06T03:05:34 MDT | 4 |
MueLu_VarDofDriver_MPI_2 | Failed | 15s 880ms | Completed (Failed) | 2019-10-10T03:09:44 MDT | 2 |
Bug Report
CC: @trilinos/muelu, @srajama1 (Trilinos Linear Solvers Product Lead), @bartlettroscoe, @fryeguy52
Next Action Status
New commits on 2019-06-02
``` *** Base Git Repo: Trilinos 3f8ed2b: Merge remote-tracking branch 'origin/develop' into atdm-nightly Author: Roscoe A. Bartlett