trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/

Intrepid2_unit-test_Discretization_Basis_HierarchicalBases_Hierarchical_Basis_Tests_MPI_1 failing and timing out in 'ats2' and 'cee-rhel6' CUDA builds starting 2020-10-08 #8516

Closed. bartlettroscoe closed this issue 3 years ago.

bartlettroscoe commented 3 years ago

CC: @trilinos/intrepid2, @mperego (Trilinos Discretizations Product Lead), @CamelliaDPG, @jhux2

## Next Action Status

## Description

As shown in [this query](https://testing.sandia.gov/cdash/queryTests.php?project=Trilinos&begin=2020-06-01&end=2020-12-22&filtercount=20&showfilters=1&filtercombine=and&field1=groupname&compare1=62&value1=Experimental&field2=buildname&compare2=65&value2=Trilinos-atdm-&field3=status&compare3=62&value3=passed&field4=testoutput&compare4=94&value4=Error%20initializing%20RM%20connection.%20Exiting&field5=testoutput&compare5=96&value5=srun%3A%20error%3A%20s_p_parse_file%3A%20unable%20to%20read%20.%2Fetc%2Fslurm%2Fslurm.conf.%3A%20Permission%20denied&field6=testoutput&compare6=96&value6=cudaGetDeviceCount.*cudaErrorUnknown.*unknown%20error.*Kokkos_Cuda_Instance.cpp&field7=testoutput&compare7=96&value7=cudaMallocManaged.*cudaErrorUnknown.*unknown%20error.*Sacado_DynamicArrayTraits.hpp&field8=testoutput&compare8=96&value8=srun%3A%20error.*launch%20failed%3A%20Error%20configuring%20interconnect&field9=testoutput&compare9=94&value9=FORTRAN%20STOP&field10=testoutput&compare10=94&value10=HTS%20Test%3A%20Failed&field11=testoutput&compare11=94&value11=Test%20that%20code%20%7Bbuilder%20%3D%20Teuchos%3A%3Anull%3B%7D%20throws%20std%3A%3Alogic_error%3A%20failed%20(code%20did%20not%20throw%20an%20exception%20at%20all)&field12=testoutput&compare12=96&value12=solver-.getSolverStatistics..-.numNonlinearIterations%20%3D%20.*%20%3D%3D%205%20%3D%205%20%3A%20FAILED%20%3D%3D.%20.*Tpetra_HouseholderBorderedSolve.cpp&field13=testname&compare13=64&value13=Adelus_vector_random_&field14=testname&compare14=64&value14=ROL_example_poisson-inversion_example_01_MPI_1&field15=testname&compare15=64&value15=SEACAS&field16=testname&compare16=64&value16=ROL&field17=site&compare17=62&value17=stria&field18=testname&compare18=62&value18=Ifpack2_MTSGS_belos_MPI_1&field19=testoutput&compare19=94&value19=GPU%20awareness%20in%20PAMI%20requested&field20=testoutput&compare20=97&value20=block%3A%20.*%2C%20thread%3A%20.*%20Assertion%20.Allocation%20failed..%20failed.) (click "Show Matching Output" in the upper right), the test:

* `Intrepid2_unit-test_Discretization_Basis_HierarchicalBases_Hierarchical_Basis_Tests_MPI_1`

in the builds:

* `Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg`
* `Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg_cuda-aware-mpi`
* `Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt`
* `Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt_cuda-aware-mpi`
* `Trilinos-atdm-cee-rhel6_cuda-10.1.243_gcc-7.2.0_openmpi-4.0.3_shared_dbg`
* `Trilinos-atdm-cee-rhel6_cuda-10.1.243_gcc-7.2.0_openmpi-4.0.3_shared_opt`

started failing on testing day 2020-10-08.

This test either fails, as shown [here](https://testing.sandia.gov/cdash/test/48123231), with output like:

```
Running unit tests ...

0. AnalyticPolynomialsMatch_Hierarchical_HVOL_LINE_UnitTest ... [Passed] (0.012 sec)
1. AnalyticPolynomialsMatch_Sacado_Fad_DFadType_Sacado_Fad_DFadType_Hierarchical_HGRAD_LINE_UnitTest ... [Passed] (0.0723 sec)
2. AnalyticPolynomialsMatch_Sacado_Fad_DFadType_Sacado_Fad_DFadType_Hierarchical_LineBasisDerivativesAgree_UnitTest ... [Passed] (0.0694 sec)
3. AnalyticPolynomialsMatch_Sacado_Fad_DFadType_Sacado_Fad_DFadType_Hierarchical_HGRAD_TRI_UnitTest ... [Passed] (0.0765 sec)
4. AnalyticPolynomialsMatch_Sacado_Fad_DFadType_Sacado_Fad_DFadType_HierarchicalNodalComparisons_UnitTest ...
:0: : block: [1,54,2], thread: [10,1,1] Assertion `Allocation failed.` failed.
:0: : block: [1,0,1], thread: [10,1,1] Assertion `Allocation failed.` failed.
:0: : block: [1,20,1], thread: [10,1,1] Assertion `Allocation failed.` failed.
:0: : block: [0,9,2], thread: [15,1,1] Assertion `Allocation failed.` failed.
:0: : block: [1,61,4], thread: [10,1,1] Assertion `Allocation failed.` failed.
:0: : block: [1,57,4], thread: [10,1,1] Assertion `Allocation failed.` failed.
:0: : block: [0,60,4], thread: [15,1,1] Assertion `Allocation failed.` failed.
terminate called after throwing an instance of 'std::runtime_error'
  what():  cudaDeviceSynchronize() error( cudaErrorIllegalAddress): an illegal memory access was encountered /vscratch1/jenkins/vortex-slave/workspace/Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt/SRC_AND_BUILD/Trilinos/packages/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:143
Traceback functionality not available
[vortex77:121714] *** Process received signal ***
[vortex77:121714] Signal: Aborted (6)
[vortex77:121714] Signal code:  (-6)
<< Rank 0: Generating lwcore_cpu.187750_9.0 on vortex77 Sun Dec 20 08:47:15 MST 2020 (LLNL_COREDUMP_FORMAT_CPU=lwcore) >>
<< Rank 0: Generated lwcore_cpu.187750_9.0 on vortex77 Sun Dec 20 08:47:16 MST 2020 in 1 secs >>
<< Rank 0: Waiting 60 secs before aborting task on vortex77 Sun Dec 20 08:47:16 MST 2020 (LLNL_COREDUMP_WAIT_FOR_OTHERS=60) >>
<< Rank 0: Waited 60 secs -> now aborting task on vortex77 Sun Dec 20 08:48:16 MST 2020 (LLNL_COREDUMP_KILL=task) >>
[vortex77:121714] -----------------------
[vortex77:121714] -----------------------
[vortex77:121714] *** End of error message ***
ERROR: One or more process (first noticed rank 0) terminated with signal 6
jsrun return value: 134
```

or, as shown in [this query](https://testing.sandia.gov/cdash/queryTests.php?project=Trilinos&begin=2020-06-01&end=2020-12-22&filtercount=22&showfilters=1&filtercombine=and&field1=groupname&compare1=62&value1=Experimental&field2=buildname&compare2=65&value2=Trilinos-atdm-&field3=status&compare3=62&value3=passed&field4=testoutput&compare4=94&value4=Error%20initializing%20RM%20connection.%20Exiting&field5=testoutput&compare5=96&value5=srun%3A%20error%3A%20s_p_parse_file%3A%20unable%20to%20read%20.%2Fetc%2Fslurm%2Fslurm.conf.%3A%20Permission%20denied&field6=testoutput&compare6=96&value6=cudaGetDeviceCount.*cudaErrorUnknown.*unknown%20error.*Kokkos_Cuda_Instance.cpp&field7=testoutput&compare7=96&value7=cudaMallocManaged.*cudaErrorUnknown.*unknown%20error.*Sacado_DynamicArrayTraits.hpp&field8=testoutput&compare8=96&value8=srun%3A%20error.*launch%20failed%3A%20Error%20configuring%20interconnect&field9=testoutput&compare9=94&value9=FORTRAN%20STOP&field10=testoutput&compare10=94&value10=HTS%20Test%3A%20Failed&field11=testoutput&compare11=94&value11=Test%20that%20code%20%7Bbuilder%20%3D%20Teuchos%3A%3Anull%3B%7D%20throws%20std%3A%3Alogic_error%3A%20failed%20(code%20did%20not%20throw%20an%20exception%20at%20all)&field12=testoutput&compare12=96&value12=solver-.getSolverStatistics..-.numNonlinearIterations%20%3D%20.*%20%3D%3D%205%20%3D%205%20%3A%20FAILED%20%3D%3D.%20.*Tpetra_HouseholderBorderedSolve.cpp&field13=testname&compare13=64&value13=Adelus_vector_random_&field14=testname&compare14=64&value14=ROL_example_poisson-inversion_example_01_MPI_1&field15=testname&compare15=64&value15=SEACAS&field16=testname&compare16=64&value16=ROL&field17=site&compare17=62&value17=stria&field18=testname&compare18=62&value18=Ifpack2_MTSGS_belos_MPI_1&field19=testoutput&compare19=94&value19=GPU%20awareness%20in%20PAMI%20requested&field20=testname&compare20=61&value20=Intrepid2_unit-test_Discretization_Basis_HierarchicalBases_Hierarchical_Basis_Tests_MPI_1&field21=details&compare21=63&value21=Timeout&field22=testoutput&compare22=96&value22=block%3A%20.*%2C%20thread%3A%20.*%20Assertion%20.Allocation%20failed..%20failed.) and [here](https://testing.sandia.gov/cdash/test/48479551), it times out with output like:

```
Running unit tests ...

0. AnalyticPolynomialsMatch_Hierarchical_HVOL_LINE_UnitTest ... [Passed] (0.0143 sec)
1. AnalyticPolynomialsMatch_Sacado_Fad_DFadType_Sacado_Fad_DFadType_Hierarchical_HGRAD_LINE_UnitTest ... [Passed] (0.085 sec)
2. AnalyticPolynomialsMatch_Sacado_Fad_DFadType_Sacado_Fad_DFadType_Hierarchical_LineBasisDerivativesAgree_UnitTest ... [Passed] (0.0937 sec)
3. AnalyticPolynomialsMatch_Sacado_Fad_DFadType_Sacado_Fad_DFadType_Hierarchical_HGRAD_TRI_UnitTest ... [Passed] (0.104 sec)
```

Looking at the commits pulled on testing day 2020-10-08 [here](https://testing.sandia.gov/cdash/build/8035772/notes#note1), we see the new commits:

```
77e2815ae3e:  Merge Pull Request #8158 from trilinos/Trilinos/xpetra-fix-8157
Author: trilinos-autotester
Date:   Wed Oct 7 16:35:52 2020 -0600

a02ee26b224:  Xpetra: remove unused header
Author: Jonathan Hu
Date:   Wed Oct 7 12:07:03 2020 -0600

M  packages/xpetra/sup/Utils/Xpetra_MatrixMatrix.hpp

57f6e2d98af:  Intrepid2: prototype faster Fad comparisons for tests (#8154)
Author: Nate Roberts
Date:   Wed Oct 7 08:19:20 2020 -0600

M  packages/intrepid2/src/Shared/Intrepid2_TestUtils.hpp
M  packages/intrepid2/unit-test/Discretization/Basis/HierarchicalBases/AnalyticPolynomialsMatchTests.cpp
```

from @CamelliaDPG and @jhux2. The most likely culprit appears to be commit 57f6e2d98af from @CamelliaDPG.

## Current Status on CDash

Run the [above query](https://testing.sandia.gov/cdash/queryTests.php?project=Trilinos&begin=2020-06-01&end=2020-12-22&filtercount=20&showfilters=1&filtercombine=and&field1=groupname&compare1=62&value1=Experimental&field2=buildname&compare2=65&value2=Trilinos-atdm-&field3=status&compare3=62&value3=passed&field4=testoutput&compare4=94&value4=Error%20initializing%20RM%20connection.%20Exiting&field5=testoutput&compare5=96&value5=srun%3A%20error%3A%20s_p_parse_file%3A%20unable%20to%20read%20.%2Fetc%2Fslurm%2Fslurm.conf.%3A%20Permission%20denied&field6=testoutput&compare6=96&value6=cudaGetDeviceCount.*cudaErrorUnknown.*unknown%20error.*Kokkos_Cuda_Instance.cpp&field7=testoutput&compare7=96&value7=cudaMallocManaged.*cudaErrorUnknown.*unknown%20error.*Sacado_DynamicArrayTraits.hpp&field8=testoutput&compare8=96&value8=srun%3A%20error.*launch%20failed%3A%20Error%20configuring%20interconnect&field9=testoutput&compare9=94&value9=FORTRAN%20STOP&field10=testoutput&compare10=94&value10=HTS%20Test%3A%20Failed&field11=testoutput&compare11=94&value11=Test%20that%20code%20%7Bbuilder%20%3D%20Teuchos%3A%3Anull%3B%7D%20throws%20std%3A%3Alogic_error%3A%20failed%20(code%20did%20not%20throw%20an%20exception%20at%20all)&field12=testoutput&compare12=96&value12=solver-.getSolverStatistics..-.numNonlinearIterations%20%3D%20.*%20%3D%3D%205%20%3D%205%20%3A%20FAILED%20%3D%3D.%20.*Tpetra_HouseholderBorderedSolve.cpp&field13=testname&compare13=64&value13=Adelus_vector_random_&field14=testname&compare14=64&value14=ROL_example_poisson-inversion_example_01_MPI_1&field15=testname&compare15=64&value15=SEACAS&field16=testname&compare16=64&value16=ROL&field17=site&compare17=62&value17=stria&field18=testname&compare18=62&value18=Ifpack2_MTSGS_belos_MPI_1&field19=testoutput&compare19=94&value19=GPU%20awareness%20in%20PAMI%20requested&field20=testoutput&compare20=97&value20=block%3A%20.*%2C%20thread%3A%20.*%20Assertion%20.Allocation%20failed..%20failed.), adjusting the "Begin" and "End" dates to match today or any other date range, or just click "CURRENT" in the top bar to see results for the current testing day.

## Steps to Reproduce

One should be able to reproduce this failure as described in:

* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md

and in the system-specific instructions at:

* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#specific-instructions-for-each-system

Just log into one of the associated machines, copy and paste the full CDash build name `<full-build-name>` listed above, and run commands like:

```
$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh <full-build-name>
$ cmake \
  -GNinja \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
  -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_<Package>=ON \
  $TRILINOS_DIR
$ make NP=16
$ ctest -j4
```

where `<Package>` is any package that you want to enable to reproduce the build and/or test results. Again, for exact system-specific details on what commands to run to build and run tests, see:

* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#specific-instructions-for-each-system

And if you can't figure out what commands to run to reproduce the issue given the above-referenced documentation, please post a comment here and we will give you the exact minimal commands to reproduce the failures.
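As a concrete illustration, here is a minimal sketch of that workflow for one of the ats2 builds listed above, enabling only Intrepid2 and running just the failing test. The build directory name and `-j` values are arbitrary choices, and node-allocation/launch details (e.g. `bsub`/`lalloc`/`jsrun` on ats2) follow the system-specific README rather than this sketch:

```
# Sketch for the ats2 'static_opt' build named in this issue; job-launch
# specifics come from the system-specific ATDM README.
cd build-ats2-static-opt/    # arbitrary scratch build directory

# Load the ATDM environment for the exact CDash build name from above.
source $TRILINOS_DIR/cmake/std/atdm/load-env.sh \
  Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt

# Configure with only Intrepid2 and its tests enabled.
cmake -GNinja \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
  -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Intrepid2=ON \
  $TRILINOS_DIR

make NP=16

# Run only the failing test and print its output.
ctest --output-on-failure \
  -R Intrepid2_unit-test_Discretization_Basis_HierarchicalBases_Hierarchical_Basis_Tests_MPI_1
```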
CamelliaDPG commented 3 years ago

@bartlettroscoe, thanks for calling this to our attention. We have a significant Intrepid2 PR (#8457) under review right now, which affects the test implicated here. This afternoon I set up a Trilinos build on vortex so that I can check how these tests are doing in the PR. I'm taking vacation through January 4, so I plan to look at this more when I return.
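For anyone else who wants to check the PR's effect on these tests, one way (a sketch using GitHub's pull-request refs, not necessarily the exact workflow used here) is to fetch and build the PR branch in place of `develop`:

```
# Assumes a remote named 'origin' pointing at trilinos/Trilinos.
# Fetch PR #8457 into a local branch and switch to it; the configure,
# build, and test commands are then the same as in "Steps to Reproduce".
git fetch origin pull/8457/head:pr-8457
git checkout pr-8457
```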

grover-trilinos commented 3 years ago

Test results for issue #8516 as of 2020-12-27

Tests with issue trackers Failed: twif=4

Detailed test results:

| Site | Build Name | Test Name | Status | Details | Consecutive Non-pass Days | Non-pass Last 30 Days | Pass Last 30 Days | Issue Tracker |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg | Intrepid2_unit-test_Discretization_Basis_HierarchicalBases_Hierarchical_Basis_Tests_MPI_1 | Failed | Completed (Timeout) | 29 | 29 | 0 | #8516 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg_cuda-aware-mpi | Intrepid2_unit-test_Discretization_Basis_HierarchicalBases_Hierarchical_Basis_Tests_MPI_1 | Failed | Completed (Timeout) | 29 | 29 | 0 | #8516 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt | Intrepid2_unit-test_Discretization_Basis_HierarchicalBases_Hierarchical_Basis_Tests_MPI_1 | Failed | Completed (Timeout) | 29 | 29 | 0 | #8516 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt_cuda-aware-mpi | Intrepid2_unit-test_Discretization_Basis_HierarchicalBases_Hierarchical_Basis_Tests_MPI_1 | Failed | Completed (Timeout) | 30 | 30 | 0 | #8516 |

This is an automated comment generated by Grover. Each week, Grover collates and reports data from CDash in an automated way to make it easier for developers to stay on top of their issues. Grover saw that there are tests being tracked on CDash that are associated with this open issue. If you have a question, please reach out to Ross. I'm just a cat.

grover-trilinos commented 3 years ago

Test results for issue #8516 as of 2021-01-10

Tests with issue trackers Failed: twif=6

Detailed test results:

| Site | Build Name | Test Name | Status | Details | Consecutive Non-pass Days | Non-pass Last 30 Days | Pass Last 30 Days | Issue Tracker |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg | Intrepid2_unit-test_Discretization_Basis_HierarchicalBases_Hierarchical_Basis_Tests_MPI_1 | Failed | Completed (Failed) | 24 | 24 | 0 | #8516 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg_cuda-aware-mpi | Intrepid2_unit-test_Discretization_Basis_HierarchicalBases_Hierarchical_Basis_Tests_MPI_1 | Failed | Completed (Timeout) | 24 | 24 | 0 | #8516 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt | Intrepid2_unit-test_Discretization_Basis_HierarchicalBases_Hierarchical_Basis_Tests_MPI_1 | Failed | Completed (Timeout) | 23 | 23 | 0 | #8516 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt_cuda-aware-mpi | Intrepid2_unit-test_Discretization_Basis_HierarchicalBases_Hierarchical_Basis_Tests_MPI_1 | Failed | Completed (Timeout) | 23 | 23 | 0 | #8516 |
| cee-rhel7 | Trilinos-atdm-cee-rhel7_cuda-10.1.243_gnu-7.2.0_openmpi-4.0.3_shared_dbg | Intrepid2_unit-test_Discretization_Basis_HierarchicalBases_Hierarchical_Basis_Tests_MPI_1 | Failed | Completed (Failed) | 5 | 5 | 0 | #8516 |
| cee-rhel7 | Trilinos-atdm-cee-rhel7_cuda-10.1.243_gnu-7.2.0_openmpi-4.0.3_shared_opt | Intrepid2_unit-test_Discretization_Basis_HierarchicalBases_Hierarchical_Basis_Tests_MPI_1 | Failed | Completed (Failed) | 5 | 5 | 0 | #8516 |

This is an automated comment generated by Grover. Each week, Grover collates and reports data from CDash in an automated way to make it easier for developers to stay on top of their issues. Grover saw that there are tests being tracked on CDash that are associated with this open issue. If you have a question, please reach out to Ross. I'm just a cat.

CamelliaDPG commented 3 years ago

This should be fixed as part of #8457.

CamelliaDPG commented 3 years ago

As suggested above, in my testing of #8457 on vortex, it did appear that this was resolved, but since then I've found this to be fragile. I believe #8574 is a complete, durable fix.

bartlettroscoe commented 3 years ago

CC: @e10harvey

FYI: As shown here and here:

| Name | Status | Time | Proc Time | Details | Labels | History | Summary | Processors |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Intrepid2_unit-test_MonolithicExecutable_Intrepid2_Tests_MPI_1 | Failed | 10m 40ms | 10m 40ms | Completed (Timeout) | Intrepid2 | Broken | Unstable | 1 |
| Intrepid2_unit-test_Cell_TensorTopology_TensorTopologyTests_MPI_1 | Missing | | | | | | | |
| Intrepid2_unit-test_Discretization_Basis_BasisEquivalenceTests_MPI_1 | Missing | | | | | | | |
| Intrepid2_unit-test_Discretization_Basis_HierarchicalBases_Hierarchical_Basis_Tests_MPI_1 | Missing | | | | | | | |
| Intrepid2_unit-test_Shared_Polylib_LegendreJacobiPolynomials_JacobiLegendrePolynomial_Tests_MPI_1 | Missing | | | | | | | |
| Intrepid2_unit-test_Shared_ViewIterator_Hierarchical_Basis_Tests_MPI_1 | Missing | | | | | | | |

it looks like the test:

`Intrepid2_unit-test_Discretization_Basis_HierarchicalBases_Hierarchical_Basis_Tests_MPI_1`

got renamed or split up, because it went missing on Trilinos testing day 2021-01-11. And the new git commits shown here show PR #8457 getting pulled in.

@e10harvey, @CamelliaDPG,

Just an FYI: with the test name changing, trilinos_atdm_builds_status.sh will no longer report the actual status of this test but will instead show it as "Missing".

CamelliaDPG commented 3 years ago

@bartlettroscoe Thanks for the heads-up on this. We folded the test you mentioned into the MonolithicExecutable test. I see from the report above that we need to set the flag for serial running in the CUDA testing settings. I'll get a PR going for that.
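As a quick, hypothetical sanity check of that before a PR lands (not the fix itself), one could run the monolithic test by itself so nothing else competes for the GPU, against the same 10-minute limit seen in the report above:

```
# Run only the monolithic Intrepid2 test, alone on the node, with the
# 600 s (10 min) limit from CDash; passing here but timing out in the full
# suite would point at GPU contention rather than the test itself.
ctest -R Intrepid2_unit-test_MonolithicExecutable_Intrepid2_Tests_MPI_1 \
  --timeout 600 --output-on-failure
```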