trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/

KokkosCore_UnitTest_CudaInterOpInit_MPI_1 failing in ATDM Trilinos builds starting before 2020-07-08 #8543

Closed e10harvey closed 3 years ago

e10harvey commented 3 years ago

CC: @trilinos/kokkos, @crtrott (Trilinos Data Services Product Lead), @bartlettroscoe

Next Action Status

## Description

As shown in [this query](https://testing-dev.sandia.gov/cdash/queryTests.php?project=Trilinos&begin=2020-07-08&end=2021-01-05&filtercount=7&showfilters=1&filtercombine=and&field1=groupname&compare1=62&value1=Experimental&field2=buildname&compare2=65&value2=Trilinos-atdm-&field3=testname&compare3=61&value3=KokkosCore_UnitTest_CudaInterOpInit_MPI_1&field4=status&compare4=61&value4=failed&field5=testoutput&compare5=94&value5=Error%20initializing%20RM%20connection.%20Exiting&field6=testoutput&compare6=96&value6=srun%3A%20error%3A%20s_p_parse_file%3A%20unable%20to%20read%20.%2Fetc%2Fslurm%2Fslurm.conf.%3A%20Permission%20denied&field7=testoutput&compare7=96&value7=cudaGetDeviceCount.*cudaErrorUnknown.*unknown%20error.*Kokkos_Cuda_Instance.cpp) (click "Show Matching Output" in upper right), the test:

* `KokkosCore_UnitTest_CudaInterOpInit_MPI_1`

in the builds:

* `Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_complex_static_opt`
* `Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_complex_static_opt_cuda-aware-mpi`
* `Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg`
* `Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg_cuda-aware-mpi`
* `Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt`
* `Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt_cuda-aware-mpi`
* `Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug`
* `Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug`
* `Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release`
* `Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug`

started failing on testing day 2020-07-08.

All of the tests on 'vortex' show output like that shown [here](https://testing-dev.sandia.gov/cdash/test/39998932):

```
ERROR: One or more process (first noticed rank 0) terminated with signal 6
```

All of the tests on 'ride' show output like that shown [here](https://testing-dev.sandia.gov/cdash/test/40026178):

```
Kokkos::Cuda ERROR: Failed to call Kokkos::Cuda::finalize()
unknown file: Failure
C++ exception with description "cudaGetLastError() error( cudaErrorIllegalAddress): an illegal memory access was encountered
```

## Current Status on CDash

Run the [above query](https://testing-dev.sandia.gov/cdash/queryTests.php?project=Trilinos&begin=2020-07-08&end=2021-01-05&filtercount=7&showfilters=1&filtercombine=and&field1=groupname&compare1=62&value1=Experimental&field2=buildname&compare2=65&value2=Trilinos-atdm-&field3=testname&compare3=61&value3=KokkosCore_UnitTest_CudaInterOpInit_MPI_1&field4=status&compare4=61&value4=failed&field5=testoutput&compare5=94&value5=Error%20initializing%20RM%20connection.%20Exiting&field6=testoutput&compare6=96&value6=srun%3A%20error%3A%20s_p_parse_file%3A%20unable%20to%20read%20.%2Fetc%2Fslurm%2Fslurm.conf.%3A%20Permission%20denied&field7=testoutput&compare7=96&value7=cudaGetDeviceCount.*cudaErrorUnknown.*unknown%20error.*Kokkos_Cuda_Instance.cpp), adjusting the "Begin" and "End" dates to match today or any other date range, or just click "CURRENT" in the top bar to see results for the current testing day.

## Steps to Reproduce

One should be able to reproduce this failure as described in:

* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md

and the system-specific instructions at:

* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#specific-instructions-for-each-system

Just log into any of the associated machines and copy and paste the full CDash build name `` listed above and run commands like:

```
$ cd /

$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh

$ cmake \
  -GNinja \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
  -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_=ON \
  $TRILINOS_DIR

$ make NP=16

$ ctest -j4
```

where `` is any package that you want to enable to reproduce build and/or test results.

Again, for exact system-specific details on what commands to run to build and run tests, see:

* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#specific-instructions-for-each-system

If you can't figure out what commands to run to reproduce the problem given this documentation, then please post a comment here and we will give you the exact minimal commands.

bartlettroscoe commented 3 years ago

FYI: As shown in this query (click "Show Matching Output" in upper right) a lot of these are failing with the error message:

ERROR:  One or more process (first noticed rank 0) terminated with signal 6

I think this happens when the test is not run on GPU 0 (something the updated CTest driver will do in order to spread tests out over the GPUs, as per #6840).
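
One way to check the GPU 0 hypothesis would be to log which GPU CTest handed to the test. The sketch below is only an illustration: it assumes the driver uses CTest's resource-allocation feature and that the resource type is named `gpus` (both are assumptions, not confirmed in this thread), and the helper name `assigned_gpu_ids` is made up for this example. It reads the `CTEST_RESOURCE_GROUP_*` environment variables that CTest sets for a test with resource groups.

```python
import os


def assigned_gpu_ids(env=None):
    """Return the GPU IDs CTest allocated to this test, in group order.

    Hypothetical helper: assumes the resource type is called "gpus", so the
    per-group variable is CTEST_RESOURCE_GROUP_<n>_GPUS.
    """
    env = os.environ if env is None else env
    ids = []
    for group in range(int(env.get("CTEST_RESOURCE_GROUP_COUNT", "0"))):
        # Entries are ';'-separated and each looks like "id:<id>,slots:<n>".
        spec = env.get(f"CTEST_RESOURCE_GROUP_{group}_GPUS", "")
        for entry in filter(None, spec.split(";")):
            fields = dict(item.split(":", 1) for item in entry.split(","))
            ids.append(fields["id"])  # IDs stay strings; CTest does not require numbers
    return ids


# Example environment as CTest might set it for a test given GPU 2 (made-up values).
example_env = {
    "CTEST_RESOURCE_GROUP_COUNT": "1",
    "CTEST_RESOURCE_GROUP_0": "gpus",
    "CTEST_RESOURCE_GROUP_0_GPUS": "id:2,slots:1",
}
print(assigned_gpu_ids(example_env))  # ['2']
```

If the failing runs consistently report an ID other than `0` and the passing runs report `0`, that would support the explanation above; an empty list would mean resource allocation was not in effect for that run.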

grover-trilinos commented 3 years ago

Test results for issue #8543 as of 2021-01-10

Tests with issue trackers Passed: twip=5
Tests with issue trackers Failed: twif=5

Detailed test results:

Tests with issue trackers Passed: twip=5

| Site | Build Name | Test Name | Status | Details | Consecutive Pass Days | Non-pass Last 30 Days | Pass Last 30 Days | Issue Tracker |
|---|---|---|---|---|---|---|---|---|
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_complex_static_opt_cuda-aware-mpi | KokkosCore_UnitTest_CudaInterOpInit_MPI_1 | Passed | Completed | 2 | 12 | 12 | #8543 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg | KokkosCore_UnitTest_CudaInterOpInit_MPI_1 | Passed | Completed | 2 | 11 | 13 | #8543 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt_cuda-aware-mpi | KokkosCore_UnitTest_CudaInterOpInit_MPI_1 | Passed | Completed | 2 | 17 | 6 | #8543 |
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug | KokkosCore_UnitTest_CudaInterOpInit_MPI_1 | Passed | Completed | 2 | 5 | 19 | #8543 |
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug | KokkosCore_UnitTest_CudaInterOpInit_MPI_1 | Passed | Completed | 3 | 7 | 17 | #8543 |

Tests with issue trackers Failed: twif=5

| Site | Build Name | Test Name | Status | Details | Consecutive Non-pass Days | Non-pass Last 30 Days | Pass Last 30 Days | Issue Tracker |
|---|---|---|---|---|---|---|---|---|
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_complex_static_opt | KokkosCore_UnitTest_CudaInterOpInit_MPI_1 | Failed | Completed (Failed) | 1 | 10 | 14 | #8543 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg_cuda-aware-mpi | KokkosCore_UnitTest_CudaInterOpInit_MPI_1 | Failed | Completed (Failed) | 2 | 7 | 17 | #8543 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt | KokkosCore_UnitTest_CudaInterOpInit_MPI_1 | Failed | Completed (Failed) | 1 | 4 | 19 | #8543 |
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug | KokkosCore_UnitTest_CudaInterOpInit_MPI_1 | Failed | Completed (Failed) | 1 | 10 | 14 | #8543 |
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release | KokkosCore_UnitTest_CudaInterOpInit_MPI_1 | Failed | Completed (Failed) | 1 | 9 | 15 | #8543 |

This is an automated comment generated by Grover. Each week, Grover collates and reports data from CDash in an automated way to make it easier for developers to stay on top of their issues. Grover saw that there are tests being tracked on CDash that are associated with this open issue. If you have a question, please reach out to Ross. I'm just a cat.
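
The "Non-pass Last 30 Days" and "Pass Last 30 Days" columns can be turned into a rough pass rate, which helps show how intermittent the failure is on each build. The following is a minimal sketch; the helper `pass_fraction` is a hypothetical name, and the two sample rows are copied from the 2021-01-10 report above.

```python
def pass_fraction(non_pass_days, pass_days):
    """Fraction of reported testing days on which the test passed."""
    total = non_pass_days + pass_days
    return pass_days / total if total else float("nan")


# (site, build name, non-pass days, pass days) taken from the 2021-01-10 report.
rows = [
    ("vortex",
     "Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt_cuda-aware-mpi",
     17, 6),
    ("ride", "Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug", 5, 19),
]

for site, build, non_pass, passed in rows:
    print(f"{site} {build}: {pass_fraction(non_pass, passed):.0%} passing")
# vortex ...rolling_static_opt_cuda-aware-mpi: 26% passing
# ride ...cuda-9.2-gnu-7.2.0-debug: 79% passing
```

Note that the two columns do not always sum to 30, presumably because some testing days have no result for a given build, so the denominator here is the number of reported days rather than 30.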

grover-trilinos commented 3 years ago

Test results for issue #8543 as of 2021-01-17

Tests with issue trackers Passed: twip=5
Tests with issue trackers Failed: twif=5

Detailed test results:

Tests with issue trackers Passed: twip=5

| Site | Build Name | Test Name | Status | Details | Consecutive Pass Days | Non-pass Last 30 Days | Pass Last 30 Days | Issue Tracker |
|---|---|---|---|---|---|---|---|---|
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_complex_static_opt_cuda-aware-mpi | KokkosCore_UnitTest_CudaInterOpInit_MPI_1 | Passed | Completed | 3 | 11 | 13 | #8543 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg | KokkosCore_UnitTest_CudaInterOpInit_MPI_1 | Passed | Completed | 8 | 9 | 14 | #8543 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt_cuda-aware-mpi | KokkosCore_UnitTest_CudaInterOpInit_MPI_1 | Passed | Completed | 2 | 13 | 9 | #8543 |
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug | KokkosCore_UnitTest_CudaInterOpInit_MPI_1 | Passed | Completed | 1 | 7 | 17 | #8543 |
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug | KokkosCore_UnitTest_CudaInterOpInit_MPI_1 | Passed | Completed | 2 | 10 | 14 | #8543 |

Tests with issue trackers Failed: twif=5

| Site | Build Name | Test Name | Status | Details | Consecutive Non-pass Days | Non-pass Last 30 Days | Pass Last 30 Days | Issue Tracker |
|---|---|---|---|---|---|---|---|---|
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_complex_static_opt | KokkosCore_UnitTest_CudaInterOpInit_MPI_1 | Failed | Completed (Failed) | 3 | 12 | 12 | #8543 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg_cuda-aware-mpi | KokkosCore_UnitTest_CudaInterOpInit_MPI_1 | Failed | Completed (Failed) | 4 | 8 | 15 | #8543 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt | KokkosCore_UnitTest_CudaInterOpInit_MPI_1 | Failed | Completed (Failed) | 1 | 6 | 16 | #8543 |
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug | KokkosCore_UnitTest_CudaInterOpInit_MPI_1 | Failed | Completed (Failed) | 1 | 7 | 17 | #8543 |
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release | KokkosCore_UnitTest_CudaInterOpInit_MPI_1 | Failed | Completed (Failed) | 1 | 10 | 14 | #8543 |


bartlettroscoe commented 3 years ago

Need feedback from CDash before closing

grover-trilinos commented 3 years ago

Test results for issue #8543 as of 2021-01-24

Tests with issue trackers Passed: twip=4

Detailed test results:

Tests with issue trackers Passed: twip=4

| Site | Build Name | Test Name | Status | Details | Consecutive Pass Days | Non-pass Last 30 Days | Pass Last 30 Days | Issue Tracker |
|---|---|---|---|---|---|---|---|---|
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug | KokkosCore_UnitTest_CudaInterOpInit_MPI_1 | Passed | Completed | 3 | 7 | 17 | #8543 |
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug | KokkosCore_UnitTest_CudaInterOpInit_MPI_1 | Passed | Completed | 4 | 6 | 17 | #8543 |
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug | KokkosCore_UnitTest_CudaInterOpInit_MPI_1 | Passed | Completed | 2 | 7 | 17 | #8543 |
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release | KokkosCore_UnitTest_CudaInterOpInit_MPI_1 | Passed | Completed | 4 | 8 | 15 | #8543 |


grover-trilinos commented 3 years ago

Test results for issue #8543 as of 2021-01-31

Tests with issue trackers Passed: twip=10

Detailed test results:

Tests with issue trackers Passed: twip=10

| Site | Build Name | Test Name | Status | Details | Consecutive Pass Days | Non-pass Last 30 Days | Pass Last 30 Days | Issue Tracker |
|---|---|---|---|---|---|---|---|---|
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_complex_static_opt | KokkosCore_UnitTest_CudaInterOpInit_MPI_1 | Passed | Completed | 7 | 9 | 17 | #8543 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_complex_static_opt_cuda-aware-mpi | KokkosCore_UnitTest_CudaInterOpInit_MPI_1 | Passed | Completed | 15 | 8 | 18 | #8543 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg | KokkosCore_UnitTest_CudaInterOpInit_MPI_1 | Passed | Completed | 19 | 3 | 21 | #8543 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg_cuda-aware-mpi | KokkosCore_UnitTest_CudaInterOpInit_MPI_1 | Passed | Completed | 7 | 11 | 13 | #8543 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt | KokkosCore_UnitTest_CudaInterOpInit_MPI_1 | Passed | Completed | 7 | 10 | 13 | #8543 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt_cuda-aware-mpi | KokkosCore_UnitTest_CudaInterOpInit_MPI_1 | Passed | Completed | 13 | 4 | 19 | #8543 |
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug | KokkosCore_UnitTest_CudaInterOpInit_MPI_1 | Passed | Completed | 10 | 7 | 21 | #8543 |
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug | KokkosCore_UnitTest_CudaInterOpInit_MPI_1 | Passed | Completed | 11 | 5 | 22 | #8543 |
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug | KokkosCore_UnitTest_CudaInterOpInit_MPI_1 | Passed | Completed | 9 | 6 | 22 | #8543 |
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release | KokkosCore_UnitTest_CudaInterOpInit_MPI_1 | Passed | Completed | 11 | 7 | 20 | #8543 |


e10harvey commented 3 years ago

Closing as this has been passing since 01-23-2021, as shown in this query.