trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/

NOX_Tpetra_HouseholderBorderedSolve_MPI_4 failing randomly in many ATDM Trilinos builds starting before 2020-10-03 #8492

Closed bartlettroscoe closed 2 years ago

bartlettroscoe commented 3 years ago

CC: @trilinos/nox, @rppawlo (Trilinos Nonlinear Product Lead)

## Next Action Status

## Description

As shown in [this query](https://testing.sandia.gov/cdash/queryTests.php?project=Trilinos&begin=2020-05-01&end=2020-12-14&filtercount=5&showfilters=1&filtercombine=and&field1=groupname&compare1=62&value1=Experimental&field2=buildname&compare2=65&value2=Trilinos-atdm-&field3=testname&compare3=61&value3=NOX_Tpetra_HouseholderBorderedSolve_MPI_4&field4=status&compare4=62&value4=passed&field5=testoutput&compare5=97&value5=solver-.getSolverStatistics..-.numNonlinearIterations%20%3D%20.*%20%3D%3D%205%20%3D%205%20%3A%20FAILED%20%3D%3D.%20.*Tpetra_HouseholderBorderedSolve.cpp) (click "Show Matching Output" in the upper right), the test:

* `NOX_Tpetra_HouseholderBorderedSolve_MPI_4`

in the builds:

* `Trilinos-atdm-cee-rhel6_intel-19.0.3_mpich2-3.2_openmp_static_opt`
* `Trilinos-atdm-cts1-intel-19.0.4_openmpi-4.0.3_openmp_static_opt`
* `Trilinos-atdm-sems-rhel6-intel-17.0.1-openmp-release`
* `Trilinos-atdm-sems-rhel7-intel-18.0.5-openmp-shared-release-debug`
* `Trilinos-atdm-sems-rhel7-clang-7.0.1-openmp-shared-release-debug`
* `Trilinos-atdm-sems-rhel7-clang-7.0.1-openmp-shared-release`
* `Trilinos-atdm-van1-tx2_arm-20.0_openmpi-4.0.2_openmp_static_opt`
* `Trilinos-atdm-van1-tx2_arm-20.1_openmpi-4.0.3_openmp_static_opt`
* `Trilinos-atdm-ats1-knl_intel-19.0.4_mpich-7.7.15_openmp_static_opt`

is randomly failing starting before 2020-10-03.

As shown in [this query](https://testing.sandia.gov/cdash/queryTests.php?project=Trilinos&begin=2020-05-01&end=2020-12-14&filtercount=3&showfilters=1&filtercombine=and&field1=groupname&compare1=62&value1=Experimental&field2=buildname&compare2=65&value2=Trilinos-atdm-&field3=testname&compare3=61&value3=NOX_Tpetra_HouseholderBorderedSolve_MPI_4), it failed 42 times out of 2227 test runs over this time period, with the first failure coming on 2020-10-03. But CDash history only goes back to 2020-10-03, so it likely started randomly failing before that.
When it does fail, it shows different numbers of iterations, with error messages like the one shown [here](https://testing.sandia.gov/cdash/test/46896937):

```
solver->getSolverStatistics()->numNonlinearIterations = 8 == 5 = 5 : FAILED ==> /lustre/jenkins/stria/workspace/Trilinos-atdm-van1-tx2_arm-20.1_openmpi-4.0.3_openmp_static_opt/SRC_AND_BUILD/Trilinos/packages/nox/test/tpetra/tTpetra_HouseholderBorderedSolve.cpp:292
```

## Current Status on CDash

Run the [above query](https://testing.sandia.gov/cdash/queryTests.php?project=Trilinos&begin=2020-05-01&end=2021-02-8&filtercount=5&showfilters=1&filtercombine=and&field1=groupname&compare1=62&value1=Experimental&field2=buildname&compare2=65&value2=Trilinos-atdm-&field3=testname&compare3=61&value3=NOX_Tpetra_HouseholderBorderedSolve_MPI_4&field4=status&compare4=62&value4=passed&field5=testoutput&compare5=97&value5=solver-.getSolverStatistics..-.numNonlinearIterations%20%3D%20.*%20%3D%3D%205%20%3D%205%20%3A%20FAILED%20%3D%3D.%20.*Tpetra_HouseholderBorderedSolve.cpp), adjusting the "Begin" and "End" dates to match today or any other date range, or just click "CURRENT" in the top bar to see results for the current testing day.

## Steps to Reproduce

One should be able to reproduce this failure as described in:

* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md

and the system-specific instructions at:

* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#specific-instructions-for-each-system

Just log into any of the associated machines and copy and paste the full CDash build name `` listed above and run commands like:

```
$ cd /
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh
$ cmake \
  -GNinja \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
  -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_=ON \
  $TRILINOS_DIR
$ make NP=16
$ ctest -j4
```

where `` is any package that you want to enable to reproduce build and/or test results.
Again, for exact system-specific details on what commands to run to build and run tests, see:

* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#specific-instructions-for-each-system

And if you can't figure out what commands to run to produce the issue given the above-referenced documentation, please post a comment here and we will give you the exact minimal commands to reproduce the failures.
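
Because the failure is intermittent (42 failures in 2227 runs, as quoted above), a single `ctest` run will usually pass. The following is a rough sketch of how one might size a repeat loop and rerun only this test with CTest's `--repeat-until-fail` option. It assumes that runs fail independently at the aggregate rate across all builds, which may not hold if some builds or machines fail more often than others. The 95% target and the variable names are illustrative choices and are not part of the ATDM instructions.

```shell
#!/bin/sh
# Failure counts quoted from the CDash query in this issue.
failures=42
runs=2227

# Aggregate failure rate in percent (assumption: same rate for every build).
rate=$(awk -v f="$failures" -v r="$runs" 'BEGIN { printf "%.2f", 100 * f / r }')

# Smallest n with 1 - (1 - p)^n >= 0.95, i.e. n >= log(0.05) / log(1 - p).
repeats=$(awk -v f="$failures" -v r="$runs" 'BEGIN {
  p = f / r
  n = log(0.05) / log(1 - p)
  c = int(n)
  if (c < n) c++
  print c
}')

echo "observed failure rate: ${rate}%"
echo "repeats for ~95% chance of at least one failure: ${repeats}"

# Only attempt the rerun from inside a configured build directory.
if [ -f CTestTestfile.cmake ]; then
  ctest -R '^NOX_Tpetra_HouseholderBorderedSolve_MPI_4$' \
    --repeat-until-fail "$repeats" --output-on-failure
fi
```

If the test still passes after that many repetitions, the failure may depend on the specific build, machine, or system load rather than being uniformly random, so it may be worth repeating the loop on one of the builds listed above.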