trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/

Address failing test Belos_Tpetra_PseudoBlockCG_hb_test_MPI_4 in the debug builds on Power8 white and ride and Power9 waterman #2466

Closed bartlettroscoe closed 6 years ago

bartlettroscoe commented 6 years ago

CC: @trilinos/belos

Next Action Status

Since the test was disabled in commit a68547f, there have been no recent signs of this test failure.

Description

As shown at:

the test Belos_Tpetra_PseudoBlockCG_hb_test_MPI_4 fails in the builds:

run on white and ride and passes in every other build of Trilinos, including, ironically, the opt builds on white and ride, which otherwise show a lot of failing Belos tests as described in #2454. This failing test for the cuda-debug build shows a segfault:

Teuchos::GlobalMPISession::GlobalMPISession(): started processor with name white24 and rank 3!
Teuchos::GlobalMPISession::GlobalMPISession(): started processor with name white24 and rank 1!
Teuchos::GlobalMPISession::GlobalMPISession(): started processor with name white24 and rank 2!
Teuchos::GlobalMPISession::GlobalMPISession(): started processor with name white24 and rank 0!
[white24:56629] *** Process received signal ***
[white24:56629] Signal: Segmentation fault (11)
[white24:56629] Signal code: Invalid permissions (2)
[white24:56629] Failing at address: 0x3fffd33fb038
...

and for the gnu-debug-openmp build shows:

Teuchos::GlobalMPISession::GlobalMPISession(): started processor with name white24 and rank 3!
Teuchos::GlobalMPISession::GlobalMPISession(): started processor with name white24 and rank 1!
Teuchos::GlobalMPISession::GlobalMPISession(): started processor with name white24 and rank 2!
Teuchos::GlobalMPISession::GlobalMPISession(): started processor with name white24 and rank 0!
[white24:56629] *** Process received signal ***
[white24:56629] Signal: Segmentation fault (11)
[white24:56629] Signal code: Invalid permissions (2)
[white24:56629] Failing at address: 0x3fffd33fb038
...

Related Issues:

bartlettroscoe commented 6 years ago

With the timing-out KokkosKernels test being addressed in #2439, I think this failing test is the last one blocking the promotion of the build Trilinos-atdm-white-ride-cuda-debug to the "ATDM" CDash Track/Group. This is especially important because this build is being targeted as an auto PR testing build for Trilinos as described in #2464.

Therefore, I am going to go ahead and disable this test for these builds. Then someone with the interest can try to see why these tests are segfaulting. But given the problems we are seeing on this platform, like those described in #1208, that may not be worth it. And besides, this Power8 platform is just a stepping stone to the Power9 platform targeted for the ATS-2 machine Sierra, so there is no reason to kill ourselves with this.

mhoemmen commented 6 years ago

@bartlettroscoe Do we have a list somewhere of "tests that we disabled because they are blocking CUDA builds"? I'm just a bit worried that we might lose track of what's failing.

rppawlo commented 6 years ago

A good addition to tribits would be a cmake function that disables tests but allows you to query for all disabled tests at configure time.

TRIBITS_ADD_TEST( ... DISABLED white,opt)

Then at configure time: -D <project_name|package_name>_SHOW_DISABLED_TESTS

This way we could very quickly get a sense of what works and what doesn't without having to dig through tickets.

bartlettroscoe commented 6 years ago

Do we have a list somewhere of "tests that we disabled because they are blocking CUDA builds"? I'm just a bit worried that we might lose track of what's failing.

@mhoemmen, yes. Short-term you can just grep the tweaks files:

$ find cmake/std/atdm/ -name "*.cmake" -exec grep -nH "DISABLE" {} \; | grep -i cuda
cmake/std/atdm/ride/tweaks/CUDA-DEBUG-CUDA.cmake:4:ATDM_SET_ENABLE(TeuchosNumerics_LAPACK_test_MPI_1_DISABLE ON)
cmake/std/atdm/ride/tweaks/CUDA-DEBUG-CUDA.cmake:7:ATDM_SET_ENABLE(Belos_Tpetra_PseudoBlockCG_hb_test_MPI_4_DISABLE ON)
cmake/std/atdm/ride/tweaks/CUDA-RELEASE-CUDA.cmake:4:ATDM_SET_ENABLE(PanzerAdaptersSTK_main_driver_energy-ss-loca-eigenvalue_DISABLE ON)
cmake/std/atdm/ride/tweaks/CUDA_COMMON_TWEAKS.cmake:2:ATDM_SET_ENABLE(PanzerAdaptersSTK_MixedPoissonExample-ConvTest-Hex-Order-3_DISABLE ON)
cmake/std/atdm/shiller/tweaks/CUDA_COMMON_TWEAKS.cmake:2:ATDM_SET_ENABLE(PanzerAdaptersSTK_MixedPoissonExample-ConvTest-Hex-Order-3_DISABLE ON)
cmake/std/atdm/shiller/tweaks/CUDA_COMMON_TWEAKS.cmake:5:ATDM_SET_ENABLE(Anasazi_Epetra_BlockDavidson_auxtest_MPI_4_DISABLE ON)
cmake/std/atdm/shiller/tweaks/CUDA_COMMON_TWEAKS.cmake:8:ATDM_SET_ENABLE(Anasazi_Epetra_LOBPCG_auxtest_MPI_4_DISABLE ON)

(see explanation of this setup in https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#directory-structure-and-contents).

And if you look at the comments above these SET() statements, they list the GitHub issue IDs for each of these so it is easy to trace back to why they were disabled.
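For anyone who wants the test names and issue IDs in one listing, here is a minimal Python sketch of that lookup. It assumes the issue ID appears as `#NNNN` in the comment lines directly above each `ATDM_SET_ENABLE(..._DISABLE ON)` call; the function names and the sample comment text are hypothetical and only for illustration.

```python
import re
from pathlib import Path

# Matches lines such as: ATDM_SET_ENABLE(Some_Test_Name_DISABLE ON)
DISABLE_RE = re.compile(r"ATDM_SET_ENABLE\(\s*(\S+?)_DISABLE\s+ON\s*\)")
# Assumption: issue IDs are written as "#1234" in the comments.
ISSUE_RE = re.compile(r"#(\d+)")


def disabled_tests_in_text(text):
    """Return (test_name, [issue_ids]) pairs found in one tweaks file's text."""
    results = []
    pending_issues = []  # issue IDs from the comment block directly above
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            pending_issues.extend(ISSUE_RE.findall(stripped))
            continue
        match = DISABLE_RE.search(stripped)
        if match:
            results.append((match.group(1), pending_issues))
        # Any non-comment line (including a blank one) ends the comment block.
        pending_issues = []
    return results


def scan_tweaks(root="cmake/std/atdm"):
    """Yield (file, test_name, issue_ids) for every *.cmake file under root."""
    for path in sorted(Path(root).rglob("*.cmake")):
        for name, issues in disabled_tests_in_text(path.read_text()):
            yield str(path), name, issues


# Hypothetical sample: the comment wording is made up for this example.
sample = """\
# Disable test that segfaults in the debug builds (#2466)
ATDM_SET_ENABLE(Belos_Tpetra_PseudoBlockCG_hb_test_MPI_4_DISABLE ON)
"""
print(disabled_tests_in_text(sample))
```

Running `scan_tweaks()` from the top of a Trilinos checkout would walk the same files as the `find` command above.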

After we upgrade CMake and CDash, these disabled tests will show up on CDash as "Not Run" tests with the Details field "Disabled" (but those tests will not trigger CDash error emails), and you will be able to query for "Disabled" tests to see them all. But that requires CMake 3.10+ and the upgraded CDash that we are evaluating in https://gitlab.kitware.com/snl/project-1/issues/33.

A good addition to tribits would be a cmake function that disables tests but allows you to query for all disabled tests at configure time.

@rppawlo, that basically already exists. You just set Trilinos_TRACE_ADD_TEST=ON and then grep for "NOT added". For example, for the cuda-debug configure of Belos this morning on 'white' on CDash at:

you can see:

-- Belos_Tpetra_PseudoBlockCG_hb_test_MPI_4: NOT added test because Belos_Tpetra_PseudoBlockCG_hb_test_MPI_4_DISABLE='ON'!

But if you only wanted to see disabled tests, we could add support for <project_name|package_name>_SHOW_DISABLED_TESTS=ON if that would be desired.
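In the meantime, filtering the configure output is easy to script. The following is a rough Python sketch that pulls out only the tests skipped because of a `_DISABLE='ON'` variable, based on the line format shown above; the function name is hypothetical, and it assumes every such line follows exactly that format.

```python
import re

# Line format taken from the configure output quoted above:
# -- <test_name>: NOT added test because <test_name>_DISABLE='ON'!
NOT_ADDED_RE = re.compile(r"^-- (\S+): NOT added test because \S+_DISABLE='ON'!")


def disabled_tests(configure_output):
    """Return the names of tests that were not added due to a _DISABLE variable."""
    names = []
    for line in configure_output.splitlines():
        match = NOT_ADDED_RE.match(line.strip())
        if match:
            names.append(match.group(1))
    return names


sample_output = """\
-- Belos_Tpetra_PseudoBlockCG_hb_test_MPI_4: NOT added test because Belos_Tpetra_PseudoBlockCG_hb_test_MPI_4_DISABLE='ON'!
-- Configuring done
"""
print(disabled_tests(sample_output))
```

Tests that were not added for other reasons are left out on purpose, since the regular expression only accepts the `_DISABLE='ON'` form.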

bartlettroscoe commented 6 years ago

On the topic of disabled tests and GitHub issues, an idea occurred to me: instead of closing GitHub issues that were resolved by just disabling tests, we could add labels called something like Disabled Tests and Stalled and leave the issues open. We could then filter them out using -label:"Disabled Tests" in most views, or search for them specifically using label:"Disabled Tests". That way, disabled tests could be searched for statically and in configure output as I showed above, on CDash (after a CMake and CDash upgrade), and also in GitHub.
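To make the two filters concrete, here is a small Python sketch that builds the corresponding GitHub issue-search URLs. The helper name is hypothetical, and restricting the search with `is:issue is:open` is an assumption about what a typical view would want.

```python
from urllib.parse import urlencode

REPO_ISSUES_URL = "https://github.com/trilinos/Trilinos/issues"


def issues_url(show_disabled):
    """Build a search URL that shows only, or hides, "Disabled Tests" issues."""
    label_filter = 'label:"Disabled Tests"'
    if not show_disabled:
        label_filter = "-" + label_filter  # leading '-' excludes the label
    query = f"is:issue is:open {label_filter}"
    return REPO_ISSUES_URL + "?" + urlencode({"q": query})


print(issues_url(show_disabled=True))
print(issues_url(show_disabled=False))
```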

What do people think about that idea?

mhoemmen commented 6 years ago

@bartlettroscoe I LIKE THAT IDEA

I think @csiefer2 agrees :D

mhoemmen commented 6 years ago

If people choose to remove the test, that's a different thing -- it's like closing the issue with "wontfix".

bartlettroscoe commented 6 years ago

CC: @trilinos/framework

I added the labels "Disabled Tests" and "Stalled" and applied them to, for example, #2474. See the updated documentation on this at:

bartlettroscoe commented 6 years ago

After this test was disabled in these builds in commit a68547f, from looking at this query, there is no sign of this test failing recently in any of the promoted Trilinos builds in the "ATDM" CDash Group (at least in the last month since 5/7/2018).

Therefore, I think we can close this issue.