trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/

cuda MPI tests failing in Trilinos builds starting 2021-05-29 #9272

Closed ZUUL42 closed 2 years ago

ZUUL42 commented 3 years ago

SUMMARY: 2021-05-29

CC: @trilinos/ifpack2 @trilinos/kokkos-kernels, @jhux2 @srajama1 (Trilinos Linear Solvers & Data Services Triage Contact (or "Current ATDM contact"))

## Next Action Status

## Description

As shown in [this query](https://testing-dev.sandia.gov/cdash/queryTests.php?project=Trilinos&begin=2021-05-28&end=2021-06-30&filtercount=6&showfilters=1&filtercombine=and&field1=groupname&compare1=62&value1=Experimental&field2=buildname&compare2=65&value2=Trilinos-atdm-white&field3=buildname&compare3=63&value3=rdc-release-debug&field4=status&compare4=62&value4=passed&field5=testoutput&compare5=95&value5=Kokkos%3A%3AImpl%3A%3AParallelReduce&field6=testoutput&compare6=95&value6=requested%20too%20large%20team%20size.) (click "Show Matching Output" in upper right) the tests:

* `Ifpack2_unit_tests_MPI_4`
* `KokkosKernels_sparse_cuda_MPI_1`

in the builds:

* `Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug`

started failing on testing day 2021-05-29.

```
*** Caught standard std::exception of type 'std::runtime_error' : Kokkos::Impl::ParallelReduce< Cuda > requested too large team size.
```

## Current Status on CDash

Run the [above query](https://testing-dev.sandia.gov/cdash/queryTests.php?project=Trilinos&begin=2021-05-28&end=2021-06-30&filtercount=6&showfilters=1&filtercombine=and&field1=groupname&compare1=62&value1=Experimental&field2=buildname&compare2=65&value2=Trilinos-atdm-white&field3=buildname&compare3=63&value3=rdc-release-debug&field4=status&compare4=62&value4=passed&field5=testoutput&compare5=95&value5=Kokkos%3A%3AImpl%3A%3AParallelReduce&field6=testoutput&compare6=95&value6=requested%20too%20large%20team%20size.), adjusting the "Begin" and "End" dates to match today or any other date range, or just click "CURRENT" in the top bar to see results for the current testing day.

## Steps to Reproduce

One should be able to reproduce this failure as described in:

* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md

and the system-specific instructions at:

* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#specific-instructions-for-each-system

Just log into any of the associated machines and copy and paste the full CDash build name `` listed above and run commands like:

```
$ cd /
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh
$ cmake \
  -GNinja \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
  -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_=ON \
  $TRILINOS_DIR
$ make NP=16
$ ctest -j4
```

where `` is any package that you want to enable to reproduce build and/or test results.

Again, for exact system-specific details on what commands to run to build and run tests, see:

* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#specific-instructions-for-each-system

If you can't figure out what commands to run to reproduce the problem given this documentation, then please post a comment here and we will give you the exact minimal commands.

jhux2 commented 3 years ago

@brian-kelley One of the MTGS "long-row" tests is failing. Do you have time to look at this?

brian-kelley commented 3 years ago

@jhux2 I replicated this. It looks like `Kokkos::AUTO()` for the team size is ending up with too many threads (544). `TeamPolicy::team_size_recommended` gives 512, and that number works.
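
For readers following along, here is a minimal sketch of the idea in that comment, written in plain C++ so it does not depend on a Kokkos install. Both helpers are hypothetical and are not Kokkos APIs: `clamp_team_size` stands in for "never launch with more threads than the recommended size", and `check_team_size` stands in for the launch-time check that raised the error in the test output. The numbers 544 and 512 are the ones reported above; this is not necessarily what the actual fix does.

```cpp
#include <algorithm>
#include <cassert>
#include <stdexcept>

// Hypothetical helper (not part of Kokkos): pick a team size that never
// exceeds the size recommended for the kernel being launched.
// A non-positive request is treated as "no preference".
inline int clamp_team_size(int requested, int recommended) {
  assert(recommended > 0);
  if (requested <= 0) {
    return recommended;
  }
  return std::min(requested, recommended);
}

// Hypothetical stand-in for the launch-time check: it throws the same kind
// of exception as seen in the failing tests when the team size is too large.
inline void check_team_size(int team_size, int max_team_size) {
  if (team_size > max_team_size) {
    throw std::runtime_error("requested too large team size.");
  }
}
```

With the values from this issue, `check_team_size(544, 512)` throws, while `check_team_size(clamp_team_size(544, 512), 512)` passes because the clamped size is 512. In real Kokkos code the limit would come from querying the policy (the comment above mentions `TeamPolicy::team_size_recommended`) rather than from a hard-coded number.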

brian-kelley commented 3 years ago

The fix is ready in #9278.

grover-trilinos commented 3 years ago

Test results for issue #9272 as of 2021-06-13

Tests with issue trackers Failed: twif=2

| Site | Build Name | Test Name | Status | Details | Consecutive Non-pass Days | Non-pass Last 30 Days | Pass Last 30 Days | Issue Tracker |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug | Ifpack2_unit_tests_MPI_4 | Failed | Completed (Failed) | 16 | 16 | 13 | #9272 |
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug | KokkosKernels_sparse_cuda_MPI_1 | Failed | Completed (Failed) | 16 | 16 | 13 | #9272 |

This is an automated comment generated by Grover. Each week, Grover collates and reports data from CDash in an automated way to make it easier for developers to stay on top of their issues. Grover saw that there are tests being tracked on CDash that are associated with this open issue. If you have a question, please reach out to Ross. I'm just a cat.

grover-trilinos commented 3 years ago

Test results for issue #9272 as of 2021-06-20

Tests with issue trackers Passed: twip=2

| Site | Build Name | Test Name | Status | Details | Consecutive Pass Days | Non-pass Last 30 Days | Pass Last 30 Days | Issue Tracker |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug | Ifpack2_unit_tests_MPI_4 | Passed | Completed | 2 | 21 | 8 | #9272 |
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug | KokkosKernels_sparse_cuda_MPI_1 | Passed | Completed | 2 | 21 | 8 | #9272 |


github-actions[bot] commented 2 years ago

This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity. If you would like to keep this issue open please add a comment and/or remove the MARKED_FOR_CLOSURE label. If this issue should be kept open even with no activity beyond the time limits you can add the label DO_NOT_AUTOCLOSE. If it is ok for this issue to be closed, feel free to go ahead and close it. Please do not add any comments or change any labels or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.

github-actions[bot] commented 2 years ago

This issue was closed due to inactivity for 395 days.