trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/

MueLu_UnitTestsTpetra_MPI* tests failing in build Trilinos-atdm-waterman_cuda-9.2_shared_opt starting 2019-06-02 #5310

Closed fryeguy52 closed 4 years ago

fryeguy52 commented 5 years ago

Bug Report

CC: @trilinos/muelu, @srajama1 (Trilinos Linear Solvers Product Lead), @bartlettroscoe, @fryeguy52

Next Action Status

## Description

As shown in [this query](https://testing.sandia.gov/cdash/queryTests.php?project=Trilinos&filtercount=5&showfilters=1&filtercombine=and&field1=buildname&compare1=65&value1=Trilinos-atdm-waterman_cuda-9.2_shared_opt&field2=testname&compare2=65&value2=MueLu_UnitTestsTpetra_MPI_&field3=site&compare3=61&value3=waterman&field4=buildstarttime&compare4=84&value4=2019-06-05T04%3A01%3A00UTC&field5=buildstarttime&compare5=83&value5=2019-05-05T04%3A01%3A00UTC) the tests:

* MueLu_UnitTestsTpetra_MPI_1
* MueLu_UnitTestsTpetra_MPI_4

are failing in the build since 2019-06-02:

* Trilinos-atdm-waterman_cuda-9.2_shared_opt

New commits on 2019-06-02

```
*** Base Git Repo: Trilinos
3f8ed2b: Merge remote-tracking branch 'origin/develop' into atdm-nightly
Author: Roscoe A. Bartlett
Date: Sat Jun 1 21:05:19 2019 -0600

d7322ba: Merge pull request #5287 from william76/xpetra-eti-TpetraBlockCrsMatrix-v001
Author: Chris Siefert
Date: Sat Jun 1 13:52:33 2019 -0600

e873f77: Tpetra: Allow CrsMatrix with StaticProfile to resize during import/export (#5268)
Author: Tim Fuller
Date: Sat Jun 1 08:29:53 2019 -0600

M packages/tpetra/core/src/Tpetra_CrsGraph_decl.hpp
M packages/tpetra/core/src/Tpetra_CrsGraph_def.hpp
M packages/tpetra/core/src/Tpetra_CrsMatrix_decl.hpp
M packages/tpetra/core/src/Tpetra_CrsMatrix_def.hpp
M packages/tpetra/core/src/Tpetra_Details_crsUtils.hpp
M packages/tpetra/core/test/CrsMatrix/CMakeLists.txt
A packages/tpetra/core/test/CrsMatrix/CrsMatrix_StaticImportExport.cpp
M packages/tpetra/core/test/CrsMatrix/Tpetra_Test_CrsMatrix_WithGraph.hpp

dd2d23e: Xpetra: ETI TpetraBlockCrsMatrix bug fixes #4 (compiles)
Author: William McLendon
Date: Thu May 30 17:16:10 2019 -0600

M packages/muelu/test/unit_tests/BlackBoxPFactory.cpp
M packages/xpetra/src/CrsMatrix/Xpetra_TpetraBlockCrsMatrix_decl.hpp
M packages/xpetra/src/CrsMatrix/Xpetra_TpetraBlockCrsMatrix_def.hpp

b738ce6: Xpetra: ETI TpetraBlockCrsMatrix bug fixes #3
Author: William McLendon
Date: Thu May 30 09:19:32 2019 -0600

M packages/xpetra/src/CrsMatrix/Xpetra_TpetraBlockCrsMatrix_decl.hpp
M packages/xpetra/src/CrsMatrix/Xpetra_TpetraBlockCrsMatrix_def.hpp

95840ee: Xpetra: ETI TpetraBlockCrsMatrix bug fixes #2
Author: William McLendon
Date: Wed May 29 17:35:14 2019 -0600

M packages/xpetra/src/CrsMatrix/Xpetra_TpetraBlockCrsMatrix_decl.hpp
M packages/xpetra/src/CrsMatrix/Xpetra_TpetraBlockCrsMatrix_def.hpp

488da42: Xpetra: ETI TpetraBlockCrsMatrix bug fixes #1
Author: William McLendon
Date: Tue May 28 18:03:22 2019 -0600

A packages/xpetra/src/CrsMatrix/Xpetra_TpetraBlockCrsMatrix_decl.hpp
A packages/xpetra/src/CrsMatrix/Xpetra_TpetraBlockCrsMatrix_def.hpp

df2dc43: Xpetra: ETI TpetraBlockCrsMatrix initial commit
Author: William McLendon
Date: Tue May 28 17:40:05 2019 -0600

M packages/xpetra/src/CMakeLists.txt
D packages/xpetra/src/CrsMatrix/Xpetra_TpetraBlockCrsMatrix.hpp
M packages/xpetra/src/Utils/ClassList/SC-LO-GO-NO.classList
M packages/xpetra/src/Utils/ExplicitInstantiation/ETI_SC_LO_GO_NO_classes.cmake
```

## Current Status on CDash

The current status of these tests can be found [here](https://testing.sandia.gov/cdash/queryTests.php?project=Trilinos&filtercount=5&showfilters=1&filtercombine=and&field1=buildname&compare1=61&value1=Trilinos-atdm-waterman_cuda-9.2_shared_opt&field2=testname&compare2=65&value2=MueLu_UnitTestsTpetra_MPI_&field3=site&compare3=61&value3=waterman&field4=buildstarttime&compare4=84&value4=today&field5=buildstarttime&compare5=83&value5=yesterday)

## Steps to Reproduce

One should be able to reproduce this failure on as described in:

* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md

More specifically, the commands given for are provided at:

* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#

The exact commands to reproduce this issue should be:

```
$ cd /
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh Trilinos-atdm-waterman_cuda-9.2_shared_opt
$ cmake \
 -GNinja \
 -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
 -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_MueLu=ON \
 $TRILINOS_DIR
$ make NP=16
$
```

bartlettroscoe commented 5 years ago

@trilinos/muelu,

These tests are showing errors like the ones shown here:

 ...
 STS::magnitude(diagVec->norm1() - diagVec->getGlobalLength()) < 100*TMT::eps() = false == true = true : FAILED ==> /gpfs1/jenkins/serrano-slave/workspace/Trilinos-atdm-serrano-intel-opt-openmp/SRC_AND_BUILD/Trilinos/packages/muelu/test/unit_tests_kokkos/TentativePFactory_kokkos.cpp:362
 ...
 [FAILED]  (0.00249 sec) TentativePFactory_kokkos_double_int_int_Kokkos_Compat_KokkosOpenMPWrapperNode_MakeTentativeVectorBasedUsingDefaultNullSpace_UnitTest
 Location: /gpfs1/jenkins/serrano-slave/workspace/Trilinos-atdm-serrano-intel-opt-openmp/SRC_AND_BUILD/Trilinos/packages/muelu/test/unit_tests_kokkos/TentativePFactory_kokkos.cpp:264

Can someone please update this unit testing code to use TEST_FLOATING_EQUALITY() so we can see the actual numbers to see how far this is failing by? For example, you could use:

TEST_FLOATING_EQUALITY(
   STS::magnitude(diagVec->norm1()),
   STS::magnitude(diagVec->getGlobalLength()),
   STS::magnitude(100*TMT::eps())
   );

That would print the numbers being compared and the tolerance so we can see why this is failing.
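
To illustrate why printing both values and the tolerance helps, here is a minimal self-contained sketch that does not use Teuchos at all. It assumes the floating-point check compares a relative error against the tolerance, which is my understanding of the macro but is not confirmed in this thread. The names `checkFloatingEquality` and `checkAbsolute` are hypothetical, and the numbers are made up for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <iostream>
#include <limits>

// Hypothetical stand-in for a relative-error check that reports its inputs.
// This is NOT the Teuchos macro; it only mimics the idea of printing both
// values, the relative error, and the tolerance.
bool checkFloatingEquality(double v1, double v2, double tol, std::ostream& out) {
  assert(tol >= 0.0);
  // Smallest positive normal double, added so that 0 vs 0 does not divide by zero.
  const double tiny = std::numeric_limits<double>::min();
  const double relErr =
      std::abs(v1 - v2) / (tiny + std::max(std::abs(v1), std::abs(v2)));
  const bool passed = relErr <= tol;
  out << "relErr(" << v1 << ", " << v2 << ") = " << relErr
      << " <= tol = " << tol << " : " << (passed ? "passed" : "FAILED") << "\n";
  return passed;
}

// The style of the original check: absolute difference against a fixed
// multiple of machine epsilon, with no diagnostic output.
bool checkAbsolute(double v1, double v2, double tol) {
  return std::abs(v1 - v2) < tol;
}
```

With values near 1e6, an absolute difference of about 1e-9 fails a fixed `100*eps` check even though the relative error is around 1e-15. Whether that is what is happening in the failing test cannot be known until the real numbers are printed.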

lucbv commented 5 years ago

@bartlettroscoe sure, I can work on this. I recently modified a few things in MueLu to use the epsilon function from Teuchos instead of hard-coding a value, and that could be the reason for this failure.
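
As a rough illustration of how switching from a hard-coded tolerance to an epsilon-based one can change test outcomes, here is a small sketch. It uses `std::numeric_limits` as a stand-in for the Teuchos epsilon function, and the helper names `scaledTolerance` and `closeEnough` are hypothetical, not MueLu or Teuchos code.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical helper: a tolerance that scales with the precision of T.
// std::numeric_limits<T>::epsilon() stands in for a Teuchos-style eps().
template <typename T>
T scaledTolerance(T factor) {
  assert(factor > T(0));
  return factor * std::numeric_limits<T>::epsilon();
}

// Returns true when |computed - expected| is below the given absolute tolerance.
template <typename T>
bool closeEnough(T computed, T expected, T tol) {
  return std::abs(computed - expected) < tol;
}
```

For `double`, `scaledTolerance(100.0)` is roughly 2.2e-14, which can be much tighter than a hand-picked constant such as 1e-12, so a check that used to pass may start failing. For `float`, the same call gives roughly 1.2e-5, which is looser than such a constant.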

lucbv commented 5 years ago

@bartlettroscoe @fryeguy52 I have a PR in progress that updates a few tests in MueLu that were failing for various reasons. Among them is the offending unit test pointed to above.

I also want to point out that the pull request auto-tester does not seem to turn on Teuchos_GLOBALLY_REDUCE_UNITTEST_RESULTS, which seems odd and potentially dangerous!

bartlettroscoe commented 5 years ago

@lucbv said:

I also want to point out that the pull request auto-tester does not seem to turn on Teuchos_GLOBALLY_REDUCE_UNITTEST_RESULTS, which seems odd and potentially dangerous!

Right. Someone needs to clean up all of the flaky Trilinos tests so we can enable that. But for tests you control, just use the unit test driver Teuchos_StandardParallelUnitTestMain.cpp, which will globally reduce unit test results.
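
To make the risk concrete, here is a minimal sketch of what globally reducing unit test results means. It simulates the ranks with a vector instead of using MPI, and it assumes for illustration that without reduction only rank 0's outcome is reported. Both function names are hypothetical and are not part of Teuchos.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Simulated per-rank outcomes: one entry per MPI rank, true if that rank's
// checks passed. A real driver would combine these with an MPI reduction;
// no MPI is used in this sketch.

// Assumed behavior without reduction: only rank 0's view decides the outcome.
bool reportWithoutReduction(const std::vector<bool>& passedOnRank) {
  assert(!passedOnRank.empty());
  return passedOnRank.front();
}

// Logical AND over all ranks: a single failing rank fails the whole test.
bool reportWithGlobalReduction(const std::vector<bool>& passedOnRank) {
  assert(!passedOnRank.empty());
  return std::all_of(passedOnRank.begin(), passedOnRank.end(),
                     [](bool passed) { return passed; });
}
```

With outcomes `{true, true, false, true}`, the first function reports a pass while the second reports a failure, which is the kind of hidden failure the global reduction is meant to catch.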

lucbv commented 5 years ago

@fryeguy52 @bartlettroscoe there is some progress: at least the serial test is now passing (see this query). @csiefer2 do you have any time to look at the issue with the RAPShift factory on waterman?

bartlettroscoe commented 5 years ago

FYI: Still lots of random failures of these tests in the build Trilinos-atdm-waterman_cuda-9.2_shared_opt as shown here. The test MueLu_UnitTestsTpetra_MPI_1 failed once in the last 10 days and the test MueLu_UnitTestsTpetra_MPI_4 failed 8 times in the last 10 days.

But as shown in this query it looks like the build Trilinos-atdm-waterman_cuda-9.2_shared_opt is the only build where these tests failed in the last 10 days.

bartlettroscoe commented 4 years ago

FYI: These tests showed failures in unit tests with Compat_KokkosCudaWrapperNode in the unit test name 4 times from 9/1/2019 through 10/10/2019, as shown in this query:

| Test Name | Status | Time | Details | Build Time | Processors |
| --- | --- | --- | --- | --- | --- |
| MueLu_UnitTestsTpetra_MPI_1 | Failed | 13s 360ms | Completed (Failed) | 2019-10-10T03:09:44 MDT | 1 |
| MueLu_UnitTestsTpetra_MPI_1 | Failed | 16s 440ms | Completed (Failed) | 2019-09-16T03:06:54 MDT | 1 |
| MueLu_UnitTestsTpetra_MPI_1 | Failed | 16s 370ms | Completed (Failed) | 2019-09-15T03:04:21 MDT | 1 |
| MueLu_UnitTestsTpetra_MPI_4 | Failed | 28s 50ms | Completed (Failed) | 2019-09-06T03:05:34 MDT | 4 |

But there are 16 failures of these tests in this build over that time period, as shown in this query, which shows the errors:

mpiexec noticed that process rank <rankid> with PID 0 on node waterman<num> exited on signal 6 (Aborted).

and

mpiexec noticed that process rank <rankid> with PID 0 on node waterman<num> exited on signal 9 (Killed).

Lest one think these are just random failures that impact more than just MueLu tests in this build, this query shows that only MueLu tests are showing this error (20 in all). There are no Panzer, Tempus, or other downstream packages that show these errors in this build. The MueLu tests showing this are:

| Test Name | Status | Time | Details | Build Time | Processors |
| --- | --- | --- | --- | --- | --- |
| MueLu_BlockCrs-Tpetra_MPI_4 | Failed | 5s 600ms | Completed (Failed) | 2019-10-10T03:09:44 MDT | 4 |
| MueLu_DriverTpetra_WithGlobalConstants_MPI_4 | Failed | 6s 210ms | Completed (Failed) | 2019-09-22T03:06:15 MDT | 4 |
| MueLu_ImportPerformance_Tpetra_MPI_4 | Failed | 4s 320ms | Completed (Failed) | 2019-10-04T03:05:06 MDT | 4 |
| MueLu_ParameterListInterpreterTpetra_MPI_1 | Failed | 1m 18s 800ms | Completed (Failed) | 2019-10-06T03:09:36 MDT | 1 |
| MueLu_ParameterListInterpreterTpetra_MPI_1 | Failed | 1m 11s 280ms | Completed (Failed) | 2019-09-27T03:07:24 MDT | 1 |
| MueLu_ParameterListInterpreterTpetra_MPI_1 | Failed | 1m 29s 720ms | Completed (Failed) | 2019-09-21T03:05:07 MDT | 1 |
| MueLu_ParameterListInterpreterTpetra_MPI_1 | Failed | 1m 32s 10ms | Completed (Failed) | 2019-09-20T03:06:56 MDT | 1 |
| MueLu_ParameterListInterpreterTpetra_MPI_4 | Failed | 44s 560ms | Completed (Failed) | 2019-09-24T03:06:13 MDT | 4 |
| MueLu_SimpleTpetra_MPI_4 | Failed | 5s 200ms | Completed (Failed) | 2019-09-13T03:06:22 MDT | 4 |
| MueLu_SimpleTpetraYaml_MPI_4 | Failed | 5s 510ms | Completed (Failed) | 2019-09-23T03:04:03 MDT | 4 |
| MueLu_SimpleTpetraYaml_MPI_4 | Failed | 5s 100ms | Completed (Failed) | 2019-09-22T03:06:15 MDT | 4 |
| MueLu_SimpleTpetraYaml_MPI_4 | Failed | 4s 520ms | Completed (Failed) | 2019-09-14T03:06:46 MDT | 4 |
| MueLu_Structured_Laplace2D_Shift_Tpetra_MPI_4 | Failed | 5s 200ms | Completed (Failed) | 2019-10-05T03:04:22 MDT | 4 |
| MueLu_Structured_Laplace2D_Tpetra_MPI_4 | Failed | 5s 110ms | Completed (Failed) | 2019-09-23T03:04:03 MDT | 4 |
| MueLu_Structured_Laplace2D_Tpetra_MPI_4 | Failed | 5s 740ms | Completed (Failed) | 2019-09-21T03:05:07 MDT | 4 |
| MueLu_UnitTestsTpetra_MPI_1 | Failed | 13s 360ms | Completed (Failed) | 2019-10-10T03:09:44 MDT | 1 |
| MueLu_UnitTestsTpetra_MPI_1 | Failed | 16s 440ms | Completed (Failed) | 2019-09-16T03:06:54 MDT | 1 |
| MueLu_UnitTestsTpetra_MPI_1 | Failed | 16s 370ms | Completed (Failed) | 2019-09-15T03:04:21 MDT | 1 |
| MueLu_UnitTestsTpetra_MPI_4 | Failed | 28s 50ms | Completed (Failed) | 2019-09-06T03:05:34 MDT | 4 |
| MueLu_VarDofDriver_MPI_2 | Failed | 15s 880ms | Completed (Failed) | 2019-10-10T03:09:44 MDT | 2 |

cgcgcg commented 4 years ago

Looks like it's resolved. https://testing.sandia.gov/cdash/queryTests.php?project=Trilinos&begin=2020-04-01&end=NOW&filtercount=3&showfilters=1&filtercombine=and&field1=buildname&compare1=61&value1=Trilinos-atdm-waterman_cuda-9.2_shared_opt&field2=testname&compare2=65&value2=MueLu_UnitTestsTpetra_MPI_&field3=site&compare3=61&value3=waterman