trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/

KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 failing in ATDM Trilinos builds starting before 2020-07-08 #8544

Closed e10harvey closed 3 years ago

e10harvey commented 3 years ago

CC: @trilinos/kokkos, @crtrott (Trilinos Data Services Product Lead), @bartlettroscoe

Next Action Status

## Description

As shown in [this query](https://testing-dev.sandia.gov/cdash/queryTests.php?project=Trilinos&begin=2020-07-08&end=2021-01-05&filtercount=7&showfilters=1&filtercombine=and&field1=groupname&compare1=62&value1=Experimental&field2=buildname&compare2=65&value2=Trilinos-atdm-&field3=testname&compare3=61&value3=KokkosCore_UnitTest_CudaInterOpStreams_MPI_1&field4=status&compare4=61&value4=failed&field5=testoutput&compare5=94&value5=Error%20initializing%20RM%20connection.%20Exiting&field6=testoutput&compare6=96&value6=srun%3A%20error%3A%20s_p_parse_file%3A%20unable%20to%20read%20.%2Fetc%2Fslurm%2Fslurm.conf.%3A%20Permission%20denied&field7=testoutput&compare7=96&value7=cudaGetDeviceCount.*cudaErrorUnknown.*unknown%20error.*Kokkos_Cuda_Instance.cpp) (click "Show Matching Output" in upper right), the test:

* `KokkosCore_UnitTest_CudaInterOpStreams_MPI_1`

in the builds:

* `Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_complex_static_opt`
* `Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_complex_static_opt_cuda-aware-mpi`
* `Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg`
* `Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg_cuda-aware-mpi`
* `Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt`
* `Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt_cuda-aware-mpi`
* `Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug`
* `Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug`
* `Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release`
* `Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug`

started failing on testing day 2020-07-08.

All of the tests in debug builds show output like that shown [here](https://testing-dev.sandia.gov/cdash/test/40009284):

```
Kokkos::Cuda ERROR: Failed to call Kokkos::Cuda::finalize()
unknown file: Failure
C++ exception with description "cudaGetLastError() error( cudaErrorInvalidResourceHandle): invalid resource handle
```

## Current Status on CDash

Run the [above query](https://testing-dev.sandia.gov/cdash/queryTests.php?project=Trilinos&begin=2020-07-08&end=2021-01-05&filtercount=7&showfilters=1&filtercombine=and&field1=groupname&compare1=62&value1=Experimental&field2=buildname&compare2=65&value2=Trilinos-atdm-&field3=testname&compare3=61&value3=KokkosCore_UnitTest_CudaInterOpStreams_MPI_1&field4=status&compare4=61&value4=failed&field5=testoutput&compare5=94&value5=Error%20initializing%20RM%20connection.%20Exiting&field6=testoutput&compare6=96&value6=srun%3A%20error%3A%20s_p_parse_file%3A%20unable%20to%20read%20.%2Fetc%2Fslurm%2Fslurm.conf.%3A%20Permission%20denied&field7=testoutput&compare7=96&value7=cudaGetDeviceCount.*cudaErrorUnknown.*unknown%20error.*Kokkos_Cuda_Instance.cpp), adjusting the "Begin" and "End" dates to match today or any other date range, or just click "CURRENT" in the top bar to see results for the current testing day.

## Steps to Reproduce

One should be able to reproduce this failure as described in:

* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md

and the system-specific instructions at:

* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#specific-instructions-for-each-system

Just log into any of the associated machines, copy and paste the full CDash build name `<build-name>` listed above, and run commands like:

```
$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh <build-name>
$ cmake \
  -GNinja \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
  -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_<Package>=ON \
  $TRILINOS_DIR
$ make NP=16
$ ctest -j4
```

where `<Package>` is any package that you want to enable to reproduce build and/or test results. Again, for exact system-specific details on what commands to run to build and run tests, see:

* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#specific-instructions-for-each-system

If you can't figure out what commands to run to reproduce the problem given this documentation, then please post a comment here and we will give you the exact minimal commands.

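
Once a build directory is configured and built, it may help to rerun only the failing test rather than the whole suite. The following is a minimal sketch, not part of the ATDM instructions: the helper name `rerun_flaky_test` and the `CTEST` override variable are hypothetical, while `-R`, `--repeat-until-fail`, and `--output-on-failure` are standard `ctest` options.

```shell
# Rerun one CTest test repeatedly until it fails or all N runs pass.
# Set CTEST=echo to preview the command instead of running it.
rerun_flaky_test() {
  test_name="$1"        # exact CTest test name
  repeats="${2:-10}"    # number of repetitions (default 10)
  "${CTEST:-ctest}" -R "^${test_name}\$" \
    --repeat-until-fail "$repeats" --output-on-failure
}
```

For example, `rerun_flaky_test KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 20` would be run from the build directory. A single-test rerun may still pass if the failure only shows up when other tests run concurrently, so a clean result here does not rule the problem out.
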
grover-trilinos commented 3 years ago

Test results for issue #8544 as of 2021-01-10

Tests with issue trackers Passed: twip=6
Tests with issue trackers Failed: twif=4

Detailed test results:

Tests with issue trackers Passed: twip=6

| Site | Build Name | Test Name | Status | Details | Consecutive Pass Days | Non-pass Last 30 Days | Pass Last 30 Days | Issue Tracker |
|---|---|---|---|---|---|---|---|---|
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg_cuda-aware-mpi | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Passed | Completed | 3 | 14 | 10 | #8544 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Passed | Completed | 1 | 14 | 9 | #8544 |
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Passed | Completed | 1 | 12 | 12 | #8544 |
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Passed | Completed | 1 | 8 | 16 | #8544 |
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Passed | Completed | 1 | 10 | 14 | #8544 |
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Passed | Completed | 1 | 12 | 12 | #8544 |

Tests with issue trackers Failed: twif=4

| Site | Build Name | Test Name | Status | Details | Consecutive Non-pass Days | Non-pass Last 30 Days | Pass Last 30 Days | Issue Tracker |
|---|---|---|---|---|---|---|---|---|
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_complex_static_opt | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Failed | Completed (Failed) | 9 | 14 | 10 | #8544 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_complex_static_opt_cuda-aware-mpi | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Failed | Completed (Failed) | 1 | 12 | 12 | #8544 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Failed | Completed (Failed) | 2 | 11 | 13 | #8544 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt_cuda-aware-mpi | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Failed | Completed (Failed) | 1 | 8 | 15 | #8544 |

This is an automated comment generated by Grover. Each week, Grover collates and reports data from CDash in an automated way to make it easier for developers to stay on top of their issues. Grover saw that there are tests being tracked on CDash that are associated with this open issue. If you have a question, please reach out to Ross. I'm just a cat.

grover-trilinos commented 3 years ago

Test results for issue #8544 as of 2021-01-17

Tests with issue trackers Passed: twip=6
Tests with issue trackers Failed: twif=4

Detailed test results:

Tests with issue trackers Passed: twip=6

| Site | Build Name | Test Name | Status | Details | Consecutive Pass Days | Non-pass Last 30 Days | Pass Last 30 Days | Issue Tracker |
|---|---|---|---|---|---|---|---|---|
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_complex_static_opt | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Passed | Completed | 2 | 13 | 11 | #8544 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg_cuda-aware-mpi | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Passed | Completed | 1 | 14 | 9 | #8544 |
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Passed | Completed | 2 | 11 | 13 | #8544 |
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Passed | Completed | 1 | 8 | 16 | #8544 |
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Passed | Completed | 1 | 10 | 14 | #8544 |
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Passed | Completed | 3 | 11 | 13 | #8544 |

Tests with issue trackers Failed: twif=4

| Site | Build Name | Test Name | Status | Details | Consecutive Non-pass Days | Non-pass Last 30 Days | Pass Last 30 Days | Issue Tracker |
|---|---|---|---|---|---|---|---|---|
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_complex_static_opt_cuda-aware-mpi | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Failed | Completed (Failed) | 4 | 12 | 12 | #8544 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Failed | Completed (Failed) | 1 | 10 | 13 | #8544 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Failed | Completed (Failed) | 2 | 12 | 10 | #8544 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt_cuda-aware-mpi | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Failed | Completed (Failed) | 1 | 10 | 12 | #8544 |


crtrott commented 3 years ago

This test, as well as the one mentioned in #8543, tests interoperability with raw CUDA. In particular, they test situations where CUDA is already used before Kokkos initialize and/or after Kokkos finalize. As such, switching the GPU ID during Kokkos initialize will lead to the observed errors. One should NOT use any mechanism to tell Kokkos to choose a specific GPU. CUDA_VISIBLE_DEVICES probably works. In practice telling Kokkos to use device id 0 will also work (just not sure that CUDA guarantees that that is the default GPU).

bartlettroscoe commented 3 years ago

> One should NOT use any mechanism to tell Kokkos to choose a specific GPU. CUDA_VISIBLE_DEVICES probably works. In practice telling Kokkos to use device id 0 will also work (just not sure that CUDA guarantees that that is the default GPU).

@crtrott, that was not the approach/agreement we came to as part of:

Perhaps Kokkos needs to be updated to read in these CTest env vars earlier?

Changing to use CUDA_VISIBLE_DEVICES would require writing an intermediate wrapper in TriBITS for every test that reads in the CTest-set env vars and sets CUDA_VISIBLE_DEVICES accordingly. The design we came up with was for CTest to not have to know about GPUs in particular, and to not have to modify TriBITS to coordinate the communication between CTest and Kokkos. But, again, we can extend TriBITS to do the needed translations (and perhaps we should), but that is just adding more control and complexity to TriBITS and making it a thicker wrapper of CMake/CTest.
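
For illustration, the kind of intermediate wrapper described above could look roughly like the sketch below. It assumes the env vars in question are CTest's resource-allocation variables (`CTEST_RESOURCE_GROUP_0_GPUS` with values of the form `id:<id>,slots:<n>`, separated by `;`); the thread does not name them, and the function names are hypothetical. It is not the TriBITS implementation.

```shell
# Convert a CTest resource spec such as "id:1,slots:1;id:3,slots:1"
# into a comma-separated list of ids ("1,3").
ctest_gpus_to_cuda_visible() {
  printf '%s\n' "$1" | tr ';' '\n' \
    | sed -n 's/^id:\([^,]*\),.*$/\1/p' | paste -sd, -
}

# Hypothetical wrapper: if CTest assigned GPUs to resource group 0,
# export CUDA_VISIBLE_DEVICES to match, then run the real test command.
# Only group 0 is handled (enough for a single-rank test).
run_with_ctest_gpu() {
  if [ -n "${CTEST_RESOURCE_GROUP_0_GPUS:-}" ]; then
    CUDA_VISIBLE_DEVICES="$(ctest_gpus_to_cuda_visible "$CTEST_RESOURCE_GROUP_0_GPUS")"
    export CUDA_VISIBLE_DEVICES
  fi
  "$@"
}
```

A test would then be launched as `run_with_ctest_gpu <test-executable> <args>`. Because the restriction is applied through the process environment, it takes effect before any raw CUDA call in the test, which is the property these interop tests need. If Kokkos also reads the same CTest variables, a real wrapper would probably have to clear them as well so the device is not remapped a second time; that is left out here.
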

crtrott commented 3 years ago

By "one should NOT use that mechanism" I mean specifically for those two tests. As I said I would recommend either disabling these two tests, or mark them as not runnable in parallel with other tests (is that a thing you can do?).

bartlettroscoe commented 3 years ago

> As I said I would recommend either disabling these two tests, or mark them as not runnable in parallel with other tests (is that a thing you can do?).

Yes and yes. For the former:

and for the latter:

As shown here, this test finished in less than 3s, so I think we just need to add:

```
ATDM_SET_ENABLE(<fullTestName>_SET_RUN_SERIAL ON)
```

for each of these tests to:

right about here:

bartlettroscoe commented 3 years ago

Need feedback from CDash before closing

grover-trilinos commented 3 years ago

Test results for issue #8544 as of 2021-01-24

Tests with issue trackers Passed: twip=4

Detailed test results:

Tests with issue trackers Passed: twip=4

| Site | Build Name | Test Name | Status | Details | Consecutive Pass Days | Non-pass Last 30 Days | Pass Last 30 Days | Issue Tracker |
|---|---|---|---|---|---|---|---|---|
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Passed | Completed | 2 | 8 | 16 | #8544 |
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Passed | Completed | 3 | 7 | 16 | #8544 |
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Passed | Completed | 4 | 8 | 16 | #8544 |
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Passed | Completed | 3 | 10 | 13 | #8544 |


grover-trilinos commented 3 years ago

Test results for issue #8544 as of 2021-01-31

Tests with issue trackers Passed: twip=10

Detailed test results:

Tests with issue trackers Passed: twip=10

| Site | Build Name | Test Name | Status | Details | Consecutive Pass Days | Non-pass Last 30 Days | Pass Last 30 Days | Issue Tracker |
|---|---|---|---|---|---|---|---|---|
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_complex_static_opt | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Passed | Completed | 7 | 12 | 14 | #8544 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_complex_static_opt_cuda-aware-mpi | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Passed | Completed | 7 | 9 | 17 | #8544 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Passed | Completed | 7 | 11 | 13 | #8544 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg_cuda-aware-mpi | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Passed | Completed | 12 | 6 | 18 | #8544 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Passed | Completed | 7 | 6 | 17 | #8544 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt_cuda-aware-mpi | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Passed | Completed | 7 | 10 | 13 | #8544 |
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Passed | Completed | 9 | 8 | 20 | #8544 |
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Passed | Completed | 10 | 6 | 21 | #8544 |
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Passed | Completed | 11 | 7 | 21 | #8544 |
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Passed | Completed | 10 | 8 | 19 | #8544 |


e10harvey commented 3 years ago

Closing as this has been passing since 01-23-2021 as shown in this query.