trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/

KokkosCore_UnitTest_CudaTimingBased_MPI_1 failing intermittently in ATDM Trilinos build starting 2020-11-26 #8545

Closed e10harvey closed 3 years ago

e10harvey commented 3 years ago

CC: @trilinos/kokkos, @crtrott (Trilinos Data Services Product Lead), @bartlettroscoe

Next Action Status

## Description

As shown in [this query](https://testing-dev.sandia.gov/cdash/queryTests.php?project=Trilinos&begin=2020-11-26&end=2021-01-05&filtercount=7&showfilters=1&filtercombine=and&field1=groupname&compare1=62&value1=Experimental&field2=testname&compare2=61&value2=KokkosCore_UnitTest_CudaTimingBased_MPI_1&field3=status&compare3=61&value3=failed&field4=testoutput&compare4=94&value4=Error%20initializing%20RM%20connection.%20Exiting&field5=testoutput&compare5=96&value5=srun%3A%20error%3A%20s_p_parse_file%3A%20unable%20to%20read%20.%2Fetc%2Fslurm%2Fslurm.conf.%3A%20Permission%20denied&field6=testoutput&compare6=96&value6=cudaGetDeviceCount.*cudaErrorUnknown.*unknown%20error.*Kokkos_Cuda_Instance.cpp&field7=testoutput&compare7=97&value7=FAILED.*cuda.debug_serial_execution%7CFAILED.*cuda.debug_pin_um_to_host) (click "Shown Matching Output" in upper right) the tests:

* `KokkosCore_UnitTest_CudaTimingBased_MPI_1`

in the builds:

* `Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_complex_static_opt`
* `Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_complex_static_opt_cuda-aware-mpi`
* `Trilinos-atdm-cee-rhel6_cuda-10.1.243_gcc-7.2.0_openmpi-4.0.3_shared_dbg`
* `Trilinos-atdm-cee-rhel6_cuda-10.1.243_gcc-7.2.0_openmpi-4.0.3_shared_opt`
* `Trilinos-atdm-cee-rhel6_mini_cuda-10.1.243_gcc-7.2.0_openmpi-4.0.3_static_dbg`
* `Trilinos-atdm-cee-rhel7_cuda-10.1.243_gcc-7.2.0_openmpi-4.0.3_shared_opt`
* `Trilinos-atdm-cee-rhel7_mini_cuda-10.1.243_gcc-7.2.0_openmpi-4.0.3_static_dbg`
* `Trilinos-atdm-cee-rhel7_mini_cuda-10.1.243_gcc-7.2.0_openmpi-4.0.3_static_opt`
* `Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug`

started failing on testing day 2020-11-26.

All of the failing tests show output like that shown [here](https://testing-dev.sandia.gov/cdash/test/39961513):

```
[ FAILED ] cuda.debug_pin_um_to_host
```

or [here](https://testing-dev.sandia.gov/cdash/test/40008685):

```
[ FAILED ] cuda.debug_serial_execution
```

## Current Status on CDash

Run the [above query](https://testing-dev.sandia.gov/cdash/queryTests.php?project=Trilinos&begin=2020-11-26&end=2021-01-05&filtercount=7&showfilters=1&filtercombine=and&field1=groupname&compare1=62&value1=Experimental&field2=testname&compare2=61&value2=KokkosCore_UnitTest_CudaTimingBased_MPI_1&field3=status&compare3=61&value3=failed&field4=testoutput&compare4=94&value4=Error%20initializing%20RM%20connection.%20Exiting&field5=testoutput&compare5=96&value5=srun%3A%20error%3A%20s_p_parse_file%3A%20unable%20to%20read%20.%2Fetc%2Fslurm%2Fslurm.conf.%3A%20Permission%20denied&field6=testoutput&compare6=96&value6=cudaGetDeviceCount.*cudaErrorUnknown.*unknown%20error.*Kokkos_Cuda_Instance.cpp&field7=testoutput&compare7=97&value7=FAILED.*cuda.debug_serial_execution%7CFAILED.*cuda.debug_pin_um_to_host), adjusting the "Begin" and "End" dates to match today or any other date range, or just click "CURRENT" in the top bar to see results for the current testing day.

## Steps to Reproduce

One should be able to reproduce this failure as described in:

* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md

and the system-specific instructions at:

* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#specific-instructions-for-each-system

Just log into any of the associated machines and copy and paste the full CDash build name `` listed above and run commands like:

```
$ cd /
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh
$ cmake \
  -GNinja \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
  -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_=ON \
  $TRILINOS_DIR
$ make NP=16
$ ctest -j4
```

where `` is any package that you want to enable to reproduce build and/or test results.

Again, for exact system-specific details on what commands to run to build and run tests, see:

* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#specific-instructions-for-each-system

If you can't figure out what commands to run to reproduce the problem given this documentation, then please post a comment here and we will give you the exact minimal commands.

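Once the test has been run locally, the two failing GoogleTest cases can be picked out of the captured output with the same alternation that the CDash query in the description uses for its `value7` test-output filter. The following is only a minimal sketch: the log file name `timing_test.log`, the `cuda.some_other_case` line, and the sample contents written into the file are placeholders for illustration, not real CDash output.

```shell
#!/bin/sh
# Hypothetical log file name. In practice the output might be captured with
# something like:
#   ctest -R KokkosCore_UnitTest_CudaTimingBased_MPI_1 --output-on-failure > timing_test.log
log=timing_test.log

# Made-up sample content standing in for real test output (illustration only).
cat > "$log" <<'EOF'
[ OK ] cuda.some_other_case
[ FAILED ] cuda.debug_pin_um_to_host
[ FAILED ] cuda.debug_serial_execution
EOF

# Same alternation as the value7 filter in the CDash query URL (URL-decoded).
pattern='FAILED.*cuda.debug_serial_execution|FAILED.*cuda.debug_pin_um_to_host'

# Print the matching lines, then count them.
grep -E "$pattern" "$log"
failed_count=$(grep -E -c "$pattern" "$log")
echo "matching failures: $failed_count"
```

A non-zero count suggests the run hit the same timing-based failures tracked in this issue rather than one of the unrelated errors that the other query filters exclude.
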
grover-trilinos commented 3 years ago

Test results for issue #8545 as of 2021-01-10

Tests with issue trackers Passed: twip=4
Tests with issue trackers Failed: twif=3

Detailed test results:

Tests with issue trackers Passed: twip=4

| Site | Build Name | Test Name | Status | Details | Consecutive Pass Days | Non-pass Last 30 Days | Pass Last 30 Days | Issue Tracker |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| vortex | `Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_complex_static_opt` | `KokkosCore_UnitTest_CudaTimingBased_MPI_1` | Passed | Completed | 24 | 0 | 24 | #8545 |
| vortex | `Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_complex_static_opt_cuda-aware-mpi` | `KokkosCore_UnitTest_CudaTimingBased_MPI_1` | Passed | Completed | 24 | 0 | 24 | #8545 |
| cee-rhel7 | `Trilinos-atdm-cee-rhel7_cuda-10.1.243_gnu-7.2.0_openmpi-4.0.3_shared_dbg` | `KokkosCore_UnitTest_CudaTimingBased_MPI_1` | Passed | Completed | 5 | 0 | 5 | #8545 |
| cee-rhel7 | `Trilinos-atdm-cee-rhel7_mini_cuda-10.1.243_gnu-7.2.0_openmpi-4.0.3_static_dbg` | `KokkosCore_UnitTest_CudaTimingBased_MPI_1` | Passed | Completed | 2 | 2 | 3 | #8545 |

Tests with issue trackers Failed: twif=3

| Site | Build Name | Test Name | Status | Details | Consecutive Non-pass Days | Non-pass Last 30 Days | Pass Last 30 Days | Issue Tracker |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| cee-rhel7 | `Trilinos-atdm-cee-rhel7_cuda-10.1.243_gnu-7.2.0_openmpi-4.0.3_shared_opt` | `KokkosCore_UnitTest_CudaTimingBased_MPI_1` | Failed | Completed (Failed) | 2 | 2 | 3 | #8545 |
| cee-rhel7 | `Trilinos-atdm-cee-rhel7_mini_cuda-10.1.243_gnu-7.2.0_openmpi-4.0.3_static_opt` | `KokkosCore_UnitTest_CudaTimingBased_MPI_1` | Failed | Completed (Failed) | 1 | 1 | 4 | #8545 |
| sems-rhel7 | `Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug` | `KokkosCore_UnitTest_CudaTimingBased_MPI_1` | Failed | Completed (Failed) | 1 | 2 | 8 | #8545 |

This is an automated comment generated by Grover. Each week, Grover collates and reports data from CDash in an automated way to make it easier for developers to stay on top of their issues. Grover saw that there are tests being tracked on CDash that are associated with this open issue. If you have a question, please reach out to Ross. I'm just a cat.

grover-trilinos commented 3 years ago

Test results for issue #8545 as of 2021-01-17

Tests with issue trackers Passed: twip=6
Tests with issue trackers Failed: twif=1

Detailed test results:

Tests with issue trackers Passed: twip=6

| Site | Build Name | Test Name | Status | Details | Consecutive Pass Days | Non-pass Last 30 Days | Pass Last 30 Days | Issue Tracker |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| vortex | `Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_complex_static_opt` | `KokkosCore_UnitTest_CudaTimingBased_MPI_1` | Passed | Completed | 3 | 1 | 23 | #8545 |
| vortex | `Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_complex_static_opt_cuda-aware-mpi` | `KokkosCore_UnitTest_CudaTimingBased_MPI_1` | Passed | Completed | 3 | 1 | 23 | #8545 |
| cee-rhel7 | `Trilinos-atdm-cee-rhel7_cuda-10.1.243_gnu-7.2.0_openmpi-4.0.3_shared_dbg` | `KokkosCore_UnitTest_CudaTimingBased_MPI_1` | Passed | Completed | 1 | 1 | 11 | #8545 |
| cee-rhel7 | `Trilinos-atdm-cee-rhel7_mini_cuda-10.1.243_gnu-7.2.0_openmpi-4.0.3_static_dbg` | `KokkosCore_UnitTest_CudaTimingBased_MPI_1` | Passed | Completed | 4 | 3 | 9 | #8545 |
| cee-rhel7 | `Trilinos-atdm-cee-rhel7_mini_cuda-10.1.243_gnu-7.2.0_openmpi-4.0.3_static_opt` | `KokkosCore_UnitTest_CudaTimingBased_MPI_1` | Passed | Completed | 7 | 1 | 11 | #8545 |
| sems-rhel7 | `Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug` | `KokkosCore_UnitTest_CudaTimingBased_MPI_1` | Passed | Completed | 1 | 4 | 9 | #8545 |

Tests with issue trackers Failed: twif=1

| Site | Build Name | Test Name | Status | Details | Consecutive Non-pass Days | Non-pass Last 30 Days | Pass Last 30 Days | Issue Tracker |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| cee-rhel7 | `Trilinos-atdm-cee-rhel7_cuda-10.1.243_gnu-7.2.0_openmpi-4.0.3_shared_opt` | `KokkosCore_UnitTest_CudaTimingBased_MPI_1` | Failed | Completed (Failed) | 1 | 4 | 8 | #8545 |


crtrott commented 3 years ago

I thought we agreed that Trilinos would NOT run the TimingBased tests. That is why we pulled this out into its own thing.

e10harvey commented 3 years ago

> I thought we agreed that Trilinos would NOT run the TimingBased tests. That is why we pulled this out into its own thing.

Can you share that thread? I can assign this issue to myself and disable this test if that's the direction you'd like to move in.

CC: @bartlettroscoe

bartlettroscoe commented 3 years ago

> I thought we agreed that Trilinos would NOT run the TimingBased tests. That is why we pulled this out into its own thing.

@crtrott, @e10harvey

FYI, as shown in this query, this is the only issue mentioning this test.

Just need to add disables for these tests to the file:

as explained in:

Let me know if you have any questions about that or any of the documentation under:

or
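For reference, TriBITS-based projects such as Trilinos allow an individual test to be switched off at configure time through a cache variable named `<fullTestName>_DISABLE`. The sketch below shows what that option could look like for this test. It assumes the generic mechanism applies unchanged to the ATDM builds; it is not the content of the disables file referred to above, and it only prints the configure command rather than running it.

```shell
#!/bin/sh
# Full test name as reported on CDash for this issue.
test_name="KokkosCore_UnitTest_CudaTimingBased_MPI_1"

# TriBITS per-test disable option: -D<fullTestName>_DISABLE=ON
disable_opt="-D${test_name}_DISABLE=ON"

# Print (do not run) a configure line carrying the option. $TRILINOS_DIR is
# left unexpanded, as in the reproduction steps in the issue description.
echo "cmake ${disable_opt} \$TRILINOS_DIR"
```

Once a test is disabled this way it is no longer added to the CTest test list, which would be consistent with CDash later reporting the test as "Missing" instead of "Failed".
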

crtrott commented 3 years ago

Ah sorry, we probably should have made that more explicit: Adding this test was the reaction and resolution to #6799

bartlettroscoe commented 3 years ago

Need feedback from CDash before closing

grover-trilinos commented 3 years ago

Test results for issue #8545 as of 2021-01-24

Tests with issue trackers Missing: twim=5

Detailed test results:

Tests with issue trackers Missing: twim=5

| Site | Build Name | Test Name | Status | Details | Consecutive Missing Days | Non-pass Last 30 Days | Pass Last 30 Days | Issue Tracker |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| cee-rhel7 | `Trilinos-atdm-cee-rhel7_cuda-10.1.243_gnu-7.2.0_openmpi-4.0.3_shared_dbg` | `KokkosCore_UnitTest_CudaTimingBased_MPI_1` | Missing | Missing | 2 | 2 | 15 | #8545 |
| cee-rhel7 | `Trilinos-atdm-cee-rhel7_cuda-10.1.243_gnu-7.2.0_openmpi-4.0.3_shared_opt` | `KokkosCore_UnitTest_CudaTimingBased_MPI_1` | Missing | Missing | 2 | 4 | 13 | #8545 |
| cee-rhel7 | `Trilinos-atdm-cee-rhel7_mini_cuda-10.1.243_gnu-7.2.0_openmpi-4.0.3_static_dbg` | `KokkosCore_UnitTest_CudaTimingBased_MPI_1` | Missing | Missing | 2 | 5 | 12 | #8545 |
| cee-rhel7 | `Trilinos-atdm-cee-rhel7_mini_cuda-10.1.243_gnu-7.2.0_openmpi-4.0.3_static_opt` | `KokkosCore_UnitTest_CudaTimingBased_MPI_1` | Missing | Missing | 2 | 1 | 16 | #8545 |
| sems-rhel7 | `Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug` | `KokkosCore_UnitTest_CudaTimingBased_MPI_1` | Missing | Missing | 2 | 6 | 12 | #8545 |


grover-trilinos commented 3 years ago

Test results for issue #8545 as of 2021-01-31

Tests with issue trackers Missing: twim=7

Detailed test results:

Tests with issue trackers Missing: twim=7

| Site | Build Name | Test Name | Status | Details | Consecutive Missing Days | Non-pass Last 30 Days | Pass Last 30 Days | Issue Tracker |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| vortex | `Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_complex_static_opt` | `KokkosCore_UnitTest_CudaTimingBased_MPI_1` | Missing | Missing | 9 | 1 | 18 | #8545 |
| vortex | `Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_complex_static_opt_cuda-aware-mpi` | `KokkosCore_UnitTest_CudaTimingBased_MPI_1` | Missing | Missing | 9 | 1 | 18 | #8545 |
| cee-rhel7 | `Trilinos-atdm-cee-rhel7_cuda-10.1.243_gnu-7.2.0_openmpi-4.0.3_shared_dbg` | `KokkosCore_UnitTest_CudaTimingBased_MPI_1` | Missing | Missing | 9 | 2 | 15 | #8545 |
| cee-rhel7 | `Trilinos-atdm-cee-rhel7_cuda-10.1.243_gnu-7.2.0_openmpi-4.0.3_shared_opt` | `KokkosCore_UnitTest_CudaTimingBased_MPI_1` | Missing | Missing | 9 | 4 | 13 | #8545 |
| cee-rhel7 | `Trilinos-atdm-cee-rhel7_mini_cuda-10.1.243_gnu-7.2.0_openmpi-4.0.3_static_dbg` | `KokkosCore_UnitTest_CudaTimingBased_MPI_1` | Missing | Missing | 9 | 5 | 12 | #8545 |
| cee-rhel7 | `Trilinos-atdm-cee-rhel7_mini_cuda-10.1.243_gnu-7.2.0_openmpi-4.0.3_static_opt` | `KokkosCore_UnitTest_CudaTimingBased_MPI_1` | Missing | Missing | 9 | 1 | 16 | #8545 |
| sems-rhel7 | `Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug` | `KokkosCore_UnitTest_CudaTimingBased_MPI_1` | Missing | Missing | 9 | 6 | 12 | #8545 |


e10harvey commented 3 years ago

These tests have been disabled since 2021-01-22, as shown in this query.