Panzer has not changed recently. According to EMPIRE testing (thanks @jmgate!), these are the candidate commits that could have caused the new Panzer failure:
* b3a8dc8 (kyukim@sandia.gov) Mon May 13 21:28:33 2019 -0600
|  Merge pull request #5167 from kyungjoo-kim/ifpack2-develop
|  Ifpack2 develop
* 085e9d8 (trilinos-autotester@trilinos.org) Mon May 13 18:57:23 2019 -0600
|  Merge Pull Request #5138 from trilinos/Trilinos/zoltan_fix5106
|  PR Title: zoltan: minor change to fix #5106
|  PR Author: kddevin
* 238800a (kyukim@sandia.gov) Mon May 13 15:05:45 2019 -0600
|  Merge pull request #5163 from kyungjoo-kim/fix-5148
|  Ifpack2 - fix for #5148
* 7b827c7 (tjfulle@sandia.gov) Mon May 13 14:57:31 2019 -0600
   Tpetra: resolution to #5161 (#5162)
@kyungjoo-kim, @tjfulle, do you guys know if your recent commits listed above might have caused this?
@rppawlo @bartlettroscoe No, my commits are only intended for ifpack2 blocktridicontainer. The commits are not related to Panzer.
Does Panzer use Ifpack2?
Panzer may use other Ifpack2 components, but it does not use the BlockTriDiContainer solver. The solver I am working on is only used by SPARC.
@rppawlo and I talked about this over e-mail. The issue is that Trilinos does not yet work correctly when deprecated Tpetra code is disabled (`Tpetra_ENABLE_DEPRECATED_CODE:BOOL=OFF`). See, e.g., the following issues:
@trilinos/tpetra is working on fixing these. The work-around for now is to leave deprecated code enabled.
@mhoemmen correct me if I'm wrong, but these failures don't have `Tpetra_ENABLE_DEPRECATED_CODE:BOOL=OFF` set.
As @bathmatt mentioned, deprecated code is enabled in these builds, so the errors from these tests are something different. One test shows:
terminate called after throwing an instance of 'std::runtime_error'
what(): cudaDeviceSynchronize() error( cudaErrorIllegalAddress): an illegal memory access was encountered /home/jenkins/waterman/workspace/Trilinos-atdm-waterman-cuda-9.2-opt/SRC_AND_BUILD/Trilinos/packages/kokkos/core/src/Cuda/Kokkos_Cuda_Impl.cpp:119
Traceback functionality not available
While the two other failures show:
terminate called after throwing an instance of 'std::runtime_error'
what(): /home/jenkins/waterman/workspace/Trilinos-atdm-waterman-cuda-9.2-opt/SRC_AND_BUILD/Trilinos/packages/tpetra/core/src/Tpetra_CrsGraph_def.hpp:3958:
Throw number = 1
Throw test that evaluated to true: (makeIndicesLocalResult.first != 0)
Tpetra::CrsGraph<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::fillComplete: (Process 1) When converting column indices from global to local, we encountered 462 indices that do not live in the column Map on this process. That's too many to print.
[waterman2:05756] *** Process received signal ***
@mhoemmen - are there any changes to Tpetra in the last 2 days that might have triggered this?
I don't think so, but it's possible. @trilinos/tpetra
For debugging, to see if this is a Panzer issue, we could adjust that print threshold temporarily.
Try also setting the environment variable `TPETRA_DEBUG=1`. In the worst case, we could also set `TPETRA_VERBOSE=CrsGraph`.
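For reference, both switches are ordinary process environment variables that Tpetra reads at runtime. A minimal sketch of how such a flag is typically consulted (illustrative only, not Tpetra's actual implementation, which lives in `Tpetra::Details::Behavior` and is more elaborate):

```cpp
#include <cstdlib>
#include <string>

// Illustrative only: reading an environment-variable switch in the style
// of TPETRA_DEBUG.  Unset (or "0"/"OFF"/"FALSE") means disabled.
bool envFlagEnabled (const char* const name) {
  const char* const val = std::getenv (name);
  if (val == nullptr) {
    return false; // variable not set
  }
  const std::string s (val);
  return s != "0" && s != "OFF" && s != "FALSE";
}
```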
I'm now seeing the second error Roger mentioned in EMPIRE with the EMPlasma Trilinos. So this isn't a new bug; it is an older bug that, it looks like, is starting to pop up more often.
My statement might be incorrect; I wiped everything clean and it looks like it isn't popping up anymore.
After rebuilding from scratch, this looks like the parallel level is too high and the CUDA card is running out of memory with multiple tests hitting the same card. In the steps to reproduce, the tests are run with `ctest -j20`. I could not reproduce the errors running the tests manually or when the ctest parallel level was reduced. I think we run the other CUDA machines at `-j8`. Maybe we need to do that here also?
@rppawlo, looking at the Jenkins driver at:
it shows:
ATDM_CONFIG_CTEST_PARALLEL_LEVEL=8
Therefore, it is running them with `ctest -j8`.
But that may be too much for some of these Panzer tests?
I think that is ok. The instructions at the top of this ticket have `-j20`, so I assumed that is what the tests were running. With `-j20` I see a bunch of failures; with `-j8` nothing fails. Do the ATDM build scripts wipe the build directory? Some of the reported failures went away for both Matt and me with a clean build.
@rppawlo asked:
Do the ATDM build scripts wipe the build directory? Some of the reported failures went away for both Matt and me with a clean build.
By default, all of the nightly ATDM Trilinos builds build from scratch each day. We can look on Jenkins to be sure that is the case. For example, at:
it shows:
09:30:08 Cleaning out binary directory '/home/jenkins/waterman/workspace/Trilinos-atdm-waterman-cuda-9.2-opt/SRC_AND_BUILD/BUILD' ...
and does not show any errors so I would assume that it is blowing away the directories.
It looks like these are also failing on non-waterman builds. There are 74 failing PanzerAdaptersSTK* tests across 4 builds between 2019-05-01 and 2019-05-29, shown [here](https://testing.sandia.gov/cdash/queryTests.php?project=Trilinos&filtercount=7&showfilters=1&filtercombine=and&field1=buildname&compare1=65&value1=Trilinos-atdm-&field2=buildname&compare2=64&value2=-white-ride-&field3=buildname&compare3=64&value3=-mutrino-&field4=testname&compare4=65&value4=PanzerAdaptersSTK&field5=status&compare5=61&value5=Failed&field6=buildstarttime&compare6=84&value6=2019-05-29T00%3A00%3A00&field7=buildstarttime&compare7=83&value7=2019-05-01T00%3A00%3A00)
Note that the above link filters out builds on white and ride, because we have seen a lot of failures on those machines recently, but these tests may be failing there too (see failures on white/ride in the last 2 weeks).
All failures are in CUDA builds using the Tpetra deprecated dynamic profile. I've tracked the multiblock test failure to a separate issue and will push a fix shortly.
The majority of the random errors look to be in the fillComplete on the CrsMatrix. I have not had good luck reproducing in raw panzer tests. EMPIRE is also seeing similar failures and @bathmatt was able to get the following stack trace:
#0 0x0000000017f868bc in std::enable_if<Kokkos::is_view<Kokkos::View<long long const*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<0u> > >::value&&Kokkos::is_view<Kokkos::View<long long*, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, void, void> >::value, void>::type Tpetra::Distributor::doPosts<Kokkos::View<long long const*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<0u> >, Kokkos::View<long long*, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, void, void> >(Kokkos::View<long long const*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<0u> > const&, Teuchos::ArrayView<unsigned long const> const&, Kokkos::View<long long*, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, void, void> const&, Teuchos::ArrayView<unsigned long const> const&) ()
#1 0x0000000017f87420 in std::enable_if<Kokkos::is_view<Kokkos::View<long long const*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<0u> > >::value&&Kokkos::is_view<Kokkos::View<long long*, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, void, void> >::value, void>::type Tpetra::Distributor::doPostsAndWaits<Kokkos::View<long long const*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<0u> >, Kokkos::View<long long*, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, void, void> >(Kokkos::View<long long const*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<0u> > const&, Teuchos::ArrayView<unsigned long const> const&, Kokkos::View<long long*, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, void, void> const&, Teuchos::ArrayView<unsigned long const> const&) ()
#2 0x0000000017f8ce58 in Tpetra::DistObject<long long, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::doTransferNew(Tpetra::SrcDistObject const&, Tpetra::CombineMode, unsigned long, Kokkos::DualView<int const*, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, void, void> const&, Kokkos::DualView<int const*, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, void, void> const&, Kokkos::DualView<int const*, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, void, void> const&, Kokkos::DualView<int const*, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, void, void> const&, Tpetra::Distributor&, Tpetra::DistObject<long long, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::ReverseOption, bool, bool) ()
#3 0x0000000017f73bc0 in Tpetra::DistObject<long long, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::doTransfer(Tpetra::SrcDistObject const&, Tpetra::Details::Transfer<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> > const&, char const*, Tpetra::DistObject<long long, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::ReverseOption, Tpetra::CombineMode, bool) ()
#4 0x0000000017f6fd1c in Tpetra::DistObject<long long, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::doExport(Tpetra::SrcDistObject const&, Tpetra::Export<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> > const&, Tpetra::CombineMode, bool) ()
#5 0x0000000016f3ff34 in Tpetra::CrsGraph<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::globalAssemble() ()
#6 0x0000000016f40d90 in Tpetra::CrsGraph<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::fillComplete(Teuchos::RCP<Tpetra::Map<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> > const> const&, Teuchos::RCP<Tpetra::Map<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> > const> const&, Teuchos::RCP<Teuchos::ParameterList> const&) ()
#7 0x00000000130b5aa4 in panzer::BlockedTpetraLinearObjFactory<panzer::Traits, double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::buildTpetraGraph(int, int) const ()
#8 0x00000000130cf5d0 in panzer::BlockedTpetraLinearObjFactory<panzer::Traits, double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::getGraph(int, int) const ()
#9 0x00000000130ba304 in panzer::BlockedTpetraLinearObjFactory<panzer::Traits, double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::getTpetraMatrix(int, int) const ()
#10 0x0000000012fe0430 in panzer::L2Projection<int, long long>::buildMassMatrix(bool, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, double, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, double> > > const*) ()
The failures occur in different ways - maybe a race condition? Sometimes we see a raw seg fault and sometimes we get the following two different errors reported from tpetra:
terminate called after throwing an instance of 'std::runtime_error'
what(): /scratch/jenkins/ascicgpu14/workspace/Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug/SRC_AND_BUILD/Trilinos/packages/tpetra/core/src/Tpetra_CrsGraph_def.hpp:3958:
Throw number = 1
Throw test that evaluated to true: (makeIndicesLocalResult.first != 0)
Tpetra::CrsGraph<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::fillComplete: (Process 3) When converting column indices from global to local, we encountered 72 indices that does not live in the column Map on this process.
(Process 3) Here are the bad global indices, listed by local row:
(Process 3) Local row 262 (global row 558): [550,551,560,561,666,667,668,669,670]
(Process 3) Local row 263 (global row 559): [550,551,560,561,666,667,668,669,670]
(Process 3) Local row 264 (global row 570): [562,563,572,573,686,687,688,689,690]
(Process 3) Local row 265 (global row 571): [562,563,572,573,686,687,688,689,690]
(Process 3) Local row 266 (global row 582): [574,575,584,585,706,707,708,709,710]
(Process 3) Local row 267 (global row 583): [574,575,584,585,706,707,708,709,710]
(Process 3) Local row 270 (global row 606): [598,599,608,609,746,747,748,749,750]
(Process 3) Local row 271 (global row 607): [598,599,608,609,746,747,748,749,750]
Teuchos::GlobalMPISession::GlobalMPISession(): started processor with name ascicgpu14 and rank 2!
Teuchos::GlobalMPISession::GlobalMPISession(): started processor with name ascicgpu14 and rank 3!
Teuchos::GlobalMPISession::GlobalMPISession(): started processor with name ascicgpu14 and rank 0!
Teuchos::GlobalMPISession::GlobalMPISession(): started processor with name ascicgpu14 and rank 1!
terminate called after throwing an instance of 'std::runtime_error'
what(): View bounds error of view MV::DualView ( -1 < 297 , 0 < 1 )
Traceback functionality not available
[ascicgpu14:80428] *** Process received signal ***
[ascicgpu14:80428] Signal: Aborted (6)
[ascicgpu14:80428] Signal code: (-6)
[ascicgpu14:80428] [ 0] /lib64/libpthread.so.0(+0xf5d0)[0x7f0ea620a5d0]
[ascicgpu14:80428] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x7f0ea55b7207]
[ascicgpu14:80428] [ 2] /lib64/libc.so.6(abort+0x148)[0x7f0ea55b88f8]
[ascicgpu14:80428] [ 3] /projects/sems/install/rhel7-x86_64/sems/compiler/gcc/7.2.0/base/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x125)[0x7f0ea5efa695]
[ascicgpu14:80428] [ 4] /projects/sems/install/rhel7-x86_64/sems/compiler/gcc/7.2.0/base/lib64/libstdc++.so.6(+0x8f316)[0x7f0ea5ef8316]
[ascicgpu14:80428] [ 5] /projects/sems/install/rhel7-x86_64/sems/compiler/gcc/7.2.0/base/lib64/libstdc++.so.6(+0x8f361)[0x7f0ea5ef8361]
[ascicgpu14:80428] [ 6] /projects/sems/install/rhel7-x86_64/sems/compiler/gcc/7.2.0/base/lib64/libstdc++.so.6(+0x8f614)[0x7f0ea5ef8614]
[ascicgpu14:80428] [ 7] /scratch/jenkins/ascicgpu14/workspace/Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug/SRC_AND_BUILD/BUILD/packages/kokkos/core/src/libkokkoscore.so.12(_ZN6Kokkos4Impl23throw_runtime_exceptionERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x369)[0x7f0ea7e8c809]
[ascicgpu14:80428] [ 8] /scratch/jenkins/ascicgpu14/workspace/Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug/SRC_AND_BUILD/BUILD/packages/panzer/dof-mgr/src/libpanzer-dof-mgr.so.12(_ZN6panzer10DOFManagerIixE22buildTaggedMultiVectorERKNS1_18ElementBlockAccessE+0xb7b)[0x7f0ee9a45e0b]
[ascicgpu14:80428] [ 9] /scratch/jenkins/ascicgpu14/workspace/Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug/SRC_AND_BUILD/BUILD/packages/panzer/dof-mgr/src/libpanzer-dof-mgr.so.12(_ZN6panzer10DOFManagerIixE19buildGlobalUnknownsERKN7Teuchos3RCPIKNS_12FieldPatternEEE+0x2ac)[0x7f0ee9a4838c]
[ascicgpu14:80428] [10] /scratch/jenkins/ascicgpu14/workspace/Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug/SRC_AND_BUILD/BUILD/packages/panzer/dof-mgr/src/libpanzer-dof-mgr.so.12(_ZN6panzer10DOFManagerIixE19buildGlobalUnknownsEv+0x245)[0x7f0ee9a4b235]
[ascicgpu14:80428] [11] /scratch/jenkins/ascicgpu14/workspace/Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug/SRC_AND_BUILD/BUILD/packages/panzer/disc-fe/src/libpanzer-disc-fe.so.12(_ZNK6panzer17DOFManagerFactoryIixE24buildUniqueGlobalIndexerINS_10DOFManagerIixEEEEN7Teuchos3RCPINS_19UniqueGlobalIndexerIixEEEERKNS6_IKNS5_13OpaqueWrapperIP19ompi_communicator_tEEEERKSt6vectorINS6_INS_12PhysicsBlockEEESaISK_EERKNS6_INS_11ConnManagerEEERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x642)[0x7f0eead9dd22]
[ascicgpu14:80428] [12] /scratch/jenkins/ascicgpu14/workspace/Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug/SRC_AND_BUILD/BUILD/packages/panzer/disc-fe/src/libpanzer-disc-fe.so.12(_ZNK6panzer17DOFManagerFactoryIixE24buildUniqueGlobalIndexerERKN7Teuchos3RCPIKNS2_13OpaqueWrapperIP19ompi_communicator_tEEEERKSt6vectorINS3_INS_12PhysicsBlockEEESaISE_EERKNS3_INS_11ConnManagerEEERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x9)[0x7f0eead9e449]
[ascicgpu14:80428] [13] /scratch/jenkins/ascicgpu14/workspace/Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug/SRC_AND_BUILD/BUILD/packages/panzer/adapters-stk/example/CurlLaplacianExample/PanzerAdaptersSTK_CurlLaplacianExample.exe[0x47aae0]
[ascicgpu14:80428] [14] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f0ea55a33d5]
[ascicgpu14:80428] [15] /scratch/jenkins/ascicgpu14/workspace/Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug/SRC_AND_BUILD/BUILD/packages/panzer/adapters-stk/example/CurlLaplacianExample/PanzerAdaptersSTK_CurlLaplacianExample.exe[0x47c768]
[ascicgpu14:80428] *** End of error message ***
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 80428 on node ascicgpu14 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
--------------------------------------------------------------------------------
TEST_2: Return code = 134
TEST_2: Pass criteria = Match REGEX {ALL PASSED: Tpetra} [FAILED]
TEST_2: Result = FAILED
================================================================================
@mhoemmen and @tjfulle - have there been any changes recently to tpetra that might cause these kinds of errors?
Tpetra changes to that code haven't landed yet. Try a debug build, and set the environment variable `TPETRA_DEBUG` to `1`. If that doesn't help, set the environment variable `TPETRA_VERBOSE` to `1`.
I'm trying that now. I have a test that fails in an optimized build 3 out of 4 runs; in debug it hasn't failed yet. Maybe this flag will show something.
I have the log files for two failed runs. They are different; nothing is reported, but it looks like a CUDA memory error. Should I rerun with verbose?
@bathmatt The output looks like you set `TPETRA_VERBOSE` -- did you not?
My suspicion is that something in EMPIRE is giving a CrsGraph or Map a host pointer, when they expect a UVM pointer. I remember @rppawlo finding something like this before.
cuda-memcheck shows a lot of
========= Program hit CUDA_ERROR_ALREADY_MAPPED (error 208) due to "resource already mapped" on CUDA API call to cuIpcOpenMemHandle.
errors in Tpetra. Any idea if this means anything?
Here is an example; it is still running very slowly:
========= Host Frame:/home/projects/ppc64le-pwr9-nvidia/openmpi/2.1.2/gcc/7.2.0/cuda/9.2.88/lib/libmpi.so.20 (PMPI_Send + 0x168) [0xb0178]
========= Host Frame:../EMPIRE_PIC.exe (_ZNK7Teuchos7MpiCommIiE4sendEiPKcii + 0x30) [0x209d5a0]
========= Host Frame:../EMPIRE_PIC.exe (_ZN7Teuchos4sendIicEEvPKT0_T_iiRKNS_4CommIS4_EE + 0xd8) [0x7806268]
========= Host Frame:../EMPIRE_PIC.exe (_ZN6Tpetra11Distributor7doPostsIN6Kokkos4ViewIPKcJNS2_10LayoutLeftENS2_6DeviceINS2_4CudaENS2_9CudaSpaceEEENS2_12MemoryTraitsILj0EEEEEENS3_IPcJSA_vvEEEEENSt9enable_ifIXaasrNS2_7is_viewIT_EE5valuesrNSH_IT0_EE5valueEvE4typeERKSI_RKN7Teuchos9ArrayViewIKmEERKSK_SV_ + 0x1124) [0x7f6b324]
========= Host Frame:../EMPIRE_PIC.exe (_ZN6Tpetra11Distributor15doPostsAndWaitsIN6Kokkos4ViewIPKcJNS2_10LayoutLeftENS2_6DeviceINS2_4CudaENS2_9CudaSpaceEEENS2_12MemoryTraitsILj0EEEEEENS3_IPcJSA_vvEEEEENSt9enable_ifIXaasrNS2_7is_viewIT_EE5valuesrNSH_IT0_EE5valueEvE4typeERKSI_RKN7Teuchos9ArrayViewIKmEERKSK_SV_ + 0x30) [0x7f6c040]
========= Host Frame:../EMPIRE_PIC.exe (_ZN6Tpetra10DistObjectIcixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_4CudaENS1_12CudaUVMSpaceEEEE13doTransferNewERKNS_13SrcDistObjectENS_11CombineModeEmRKNS1_8DualViewIPKiNS1_6DeviceIS4_NS1_9CudaSpaceEEEvvEESK_SK_SK_RNS_11DistributorENS7_13ReverseOptionEbb + 0x1be8) [0x7fbf6d8]
========= Host Frame:../EMPIRE_PIC.exe (_ZN6Tpetra10DistObjectIcixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_4CudaENS1_12CudaUVMSpaceEEEE10doTransferERKNS_13SrcDistObjectERKNS_7Details8TransferIixS6_EEPKcNS7_13ReverseOptionENS_11CombineModeEb + 0x4cc) [0x7fb6fcc]
========= Host Frame:../EMPIRE_PIC.exe (_ZN6Tpetra10DistObjectIcixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_4CudaENS1_12CudaUVMSpaceEEEE8doExportERKNS_13SrcDistObjectERKNS_6ExportIixS6_EENS_11CombineModeEb + 0x3ec) [0x7fb4c5c]
========= Host Frame:../EMPIRE_PIC.exe (_ZN6panzer12L2ProjectionIixE15buildMassMatrixEbPKSt13unordered_mapINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEdSt4hashIS8_ESt8equal_toIS8_ESaISt4pairIKS8_dEEE + 0xe74) [0x2fd9f34]
========= Host Frame:../EMPIRE_PIC.exe (_ZN6empire7solvers30createMassMatrixFromDOFManagerERKN7Teuchos3RCPIN6panzer10DOFManagerIixEEEERKNS2_IN10panzer_stk13STK_InterfaceEEERKNS2_INS3_11ConnManagerEEEPKSt13unordered_mapINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEdSt4hashISO_ESt8equal_toISO_ESaISt4pairIKSO_dEEE + 0xa50) [0x2911eb0]
========= Host Frame:../EMPIRE_PIC.exe (_ZN6empire30ElectroMagneticSolverInterfaceC1E18MainParameterListsNS_13MeshContainerERNS_5utils9TimeStampEb + 0x1ac8) [0x2a475b8]
========= Host Frame:../EMPIRE_PIC.exe (_Z16meshSpecificMainIN6empire10MeshTraitsIN6shards11TetrahedronILj4EEELi1EEEEvRN7Teuchos3RCPINS6_12StackedTimerEEERKNS7_IKNS6_7MpiCommIiEEEEdR21MainPicParameterListsbRNS0_13MeshContainerERNS0_5utils9TimeStampE + 0x3d00) [0x211fd60]
========= Host Frame:../EMPIRE_PIC.exe (main + 0xda8) [0x1f658e8]
========= Host Frame:/lib64/libc.so.6 [0x25100]
========= Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xc4) [0x252f4]
@bathmatt It smells like something in EMPIRE is giving Tpetra a host pointer wrapped in a device Kokkos::View. I am adding a function to Tpetra now that will diagnose and report this issue.
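Such a check is easy to sketch with the CUDA runtime API. The following is an illustration of the idea only (a hypothetical helper, not the actual function being added to Tpetra; field names follow the CUDA 9.x runtime API):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical diagnostic: report whether a pointer handed to a device
// View is device memory, managed (UVM) memory, or plain host memory.
void reportPointerKind (const void* ptr, const char* label) {
  cudaPointerAttributes attr;
  const cudaError_t err = cudaPointerGetAttributes (&attr, ptr);
  if (err == cudaErrorInvalidValue) {
    cudaGetLastError (); // clear the sticky error state
    // Plain host memory never registered with the CUDA runtime lands here.
    std::printf ("%s: unregistered HOST pointer %p\n", label, ptr);
  } else if (attr.isManaged) {
    std::printf ("%s: managed (UVM) pointer %p\n", label, ptr);
  } else if (attr.memoryType == cudaMemoryTypeDevice) {
    std::printf ("%s: device pointer %p\n", label, ptr);
  } else {
    std::printf ("%s: registered host pointer %p\n", label, ptr);
  }
}
```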
@mhoemmen said:
smells like something in EMPIRE is giving Tpetra a host pointer wrapped in a device Kokkos::View
I think this is occurring in Panzer tests as well so it is not unique to EMPIRE code.
Thanks. If that is the cause, how is this a heisenbug? Wouldn't it always fail that way?
I can run it repeatedly and see fail, pass, fail, ...
I'll take your patch and try it and see.
FYI: I had a talk with @rppawlo this morning about these failing Panzer tests. If you look at all of the ATDM Trilinos failing Panzer tests since 2019-06-01, you can see that after @rppawlo's merge of PR #5346 on 6/7/2019, which fixed a file read/write race, the only failing Panzer tests on any ATDM Trilinos platform have been:
The test PanzerAdaptersSTK_tSTK_IO_MPI_1 failed due to a random I/O failure:
Exodus Library Warning/Error: [ex_create_int]
ERROR: file create failed for output.exo
ERROR: Could not open output database 'output.exo' of type 'exodus'
0. tSTK_IO_fields_UnitTest ... -------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
which looks to be unrelated.
The only other failing test PanzerAdaptersSTK_MixedCurlLaplacianExample-ConvTest-Quad-Order-2 shows:
================================================================================
TEST_2
Running: "/home/projects/ppc64le-pwr8-nvidia/openmpi/2.1.2/gcc/7.2.0/cuda/9.2.88/bin/mpiexec" "-np" "4" "-map-by" "socket:PE=4" "/home/jenkins/ride/workspace/Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug-pt/SRC_AND_BUILD/BUILD/packages/panzer/adapters-stk/example/MixedCurlLaplacianExample/PanzerAdaptersSTK_MixedCurlLaplacianExample.exe" "--use-tpetra" "--use-twod" "--cell=Quad" "--x-elements=16" "--y-elements=16" "--z-elements=4" "--basis-order=2"
Writing output to file "/home/jenkins/ride/workspace/Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug-pt/SRC_AND_BUILD/BUILD/packages/panzer/adapters-stk/example/MixedCurlLaplacianExample/MPE-ConvTest-Quad-2-16"
--------------------------------------------------------------------------------
Teuchos::GlobalMPISession::GlobalMPISession(): started processor with name ride11 and rank 0!
Teuchos::GlobalMPISession::GlobalMPISession(): started processor with name ride11 and rank 1!
Teuchos::GlobalMPISession::GlobalMPISession(): started processor with name ride11 and rank 2!
Teuchos::GlobalMPISession::GlobalMPISession(): started processor with name ride11 and rank 3!
:0: : block: [1,0,0], thread: [0,75,0] Assertion `View bounds error of view FixedHashTable::pairs` failed.
...
:0: : block: [2,0,0], thread: [0,158,0] Assertion `View bounds error of view FixedHashTable::pairs` failed.
terminate called after throwing an instance of 'std::runtime_error'
what(): cudaGetLastError() error( cudaErrorAssert): device-side assert triggered /home/jenkins/ride/workspace/Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug-pt/SRC_AND_BUILD/Trilinos/packages/kokkos/core/src/Cuda/Kokkos_CudaExec.hpp:401
Traceback functionality not available
[ride11:53997] *** Process received signal ***
[ride11:53997] Signal: Aborted (6)
[ride11:53997] Signal code: (-6)
[ride11:53997] [ 0] [0x3fff8da20478]
[ride11:53997] [ 1] /lib64/libc.so.6(abort+0x2b4)[0x3fff80141f94]
...
@rppawlo thinks this might be due to a race condition, but there is not enough context to determine where, and he can't reproduce the problem himself.
To put this in context, as shown in this query there have been over 11K Panzer tests run across all ATDM CUDA builds since 6/7/2019, and only one fails in this way! So we should expect that this failure would be very hard to reproduce.
What we really need is for someone to add a stack trace when an error like:
:0: : block: [1,0,0], thread: [0,75,0] Assertion `View bounds error of view FixedHashTable::pairs` failed.
occurs. Then, when this occurs due to a rare random event, we can see where it is happening.
For example, we have code in Teuchos that uses BinUtils functions to generate a readable stack trace whenever we want, as long as the code was built with `-g`. This code has zero overhead until you ask for the stack trace (unlike a verbose printing option like Tpetra has).
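As a reminder of how that facility is used, a minimal sketch (assuming Trilinos was configured with stacktrace support, e.g. `Teuchos_ENABLE_STACKTRACE=ON`, and the code was built with `-g`):

```cpp
#include "Teuchos_stacktrace.hpp"

int main (int argc, char* argv[]) {
  // Install a handler that prints a readable stack trace (file names,
  // line numbers, and demangled symbols via BinUtils) on a segfault.
  Teuchos::print_stack_on_segfault ();

  // ... run the code under suspicion; a crash now reports where it died.
  return 0;
}
```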
Why can't we create a stack trace like this with Kokkos and Tpetra code? That would seem to be a critical bit of functionality for debugging rare failures like this.
Otherwise, since this is just very rarely randomly failing, let's see what happens over the next few weeks to gather more data.
In the meantime, @rppawlo is going to see if he can reproduce this with EMPIRE since @bathmatt reported that he can get EMPIRE to fail in a similar way about 1/3 of the time.
This morning @bathmatt reran the empire problem that was exhibiting this issue and he can't reproduce it either anymore.
@rppawlo said:
This morning @bathmatt reran the empire problem that was exhibiting this issue and he can't reproduce it either anymore.
Then let's just put this Issue in review and watch it for a bit.
@bartlettroscoe wrote:
Why can't we create a stack track like this with Kokkos and Tpetra code? That would seem to be a critical bit of functionality to debug rare failures like this.
I'm pretty sure that "API to get stack trace of device code" isn't a thing. The best we could do is catch those errors at the kernel launch level. Would that be OK?
@mhoemmen said:
I'm pretty sure that "API to get stack trace of device code" isn't a thing. The best we could do is catch those errors at the kernel launch level. Would that be OK?
Yeah, getting the stack trace for the host code would be a huge help, I think, to start to track this down. Can Kokkos be made to use the Teuchos stack trace stuff? Any code using TEUCHOS_TEST_FOR_EXCEPTION() and TEUCHOS_STANDARD_CATCH() automatically has this functionality.
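For context, a small example of the macro usage being referred to; code written this way picks up the stack-trace machinery automatically when it is enabled at configure time (a sketch, with a hypothetical `checkSize` function):

```cpp
#include "Teuchos_TestForException.hpp"
#include <stdexcept>

// Hypothetical example: reject negative sizes with a descriptive
// exception.  With stacktraces enabled, the throw site is recorded too.
void checkSize (const int n) {
  TEUCHOS_TEST_FOR_EXCEPTION(
    n < 0, std::invalid_argument,
    "checkSize: n = " << n << " must be nonnegative.");
}
```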
@bartlettroscoe wrote:
Can Kokkos be made to use the Teuchos stack trace stuff?
Using the macros directly would introduce a circular package dependency (Teuchos already depends optionally on Kokkos). However, Kokkos could perfectly well reimplement that stuff (e.g., the contents of `Teuchos_stacktrace.cpp`).
@bartlettroscoe I added a Kokkos feature request that points back to the discussion here. Does Teuchos normally pull in that BFD stuff (in `Teuchos_stacktrace.cpp`)? Is it needed for an effective stack trace?
@mhoemmen asked:
Does Teuchos normally pull in that BFD stuff (in `Teuchos_stacktrace.cpp`)? Is it needed for an effective stack trace?
I think yes. You need to provide file names and line numbers and demangled names. That is what that code does.
I added a Kokkos feature request that points back to the discussion here.
Using the macros directly would introduce a circular package dependency (Teuchos already depends optionally on Kokkos).
Perhaps, but it could be done through dependency inversion and dependency injection. This would allow users to inject any behavior they wanted when Kokkos raised an error and decided to abort. If you think about it, every library that aborts execution should provide a hook for these cases. The C++ standard library does.
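A minimal sketch of the kind of hook being proposed (all names here are hypothetical, not an existing Kokkos API; the pattern mirrors `std::set_terminate` from the standard library):

```cpp
#include <cstdlib>
#include <functional>
#include <utility>

// Hypothetical dependency-injection hook: the low-level library exposes a
// registration point, and clients inject behavior (e.g., printing a stack
// trace) to run before the library aborts.  No Teuchos dependency needed.
namespace example_lib {

std::function<void()> abort_hook; // empty by default

void set_abort_hook (std::function<void()> f) {
  abort_hook = std::move (f);
}

[[noreturn]] void fatal_error () {
  if (abort_hook) {
    abort_hook (); // e.g., a client-registered Teuchos stack-trace printer
  }
  std::abort ();
}

} // namespace example_lib
```

With a hook like this, Teuchos (or an application) could inject its stack-trace printer without Kokkos ever depending on Teuchos.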
However, Kokkos could perfectly well reimplement that stuff (e.g., the contents of `Teuchos_stacktrace.cpp`).
Fine with me. Just make sure they write strong automated tests like the ones that exist in Teuchos.
If you think about it, every library that aborts execution should provide a hook for these cases.
Kokkos does, actually; it's called `push_finalize_hook`. (I added it :-).)
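For reference, a minimal usage sketch of that hook (note that it fires during `Kokkos::finalize()`, which is not quite the abort-time hook discussed above):

```cpp
#include <Kokkos_Core.hpp>
#include <iostream>

int main (int argc, char* argv[]) {
  Kokkos::initialize (argc, argv);

  // Register a callable that Kokkos will invoke during finalize().
  Kokkos::push_finalize_hook ([] {
    std::cout << "Kokkos is about to finalize" << std::endl;
  });

  // ... application work ...

  Kokkos::finalize (); // runs the registered hooks
  return 0;
}
```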
@mhoemmen, we need to change Kokkos to support a mode so that, instead of throwing `std::runtime_error`, it gathers and prints a stacktrace and aborts. We should not underestimate how huge of an impact functionality like this will have on enabling the debugging of random errors like this.
Here is another one, for the test `PanzerAdaptersSTK_MixedCurlLaplacianMultiblockExample-ConvTest-Quad-Order-1` in the build `Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug` on 'ride' yesterday, showing:
TEST_2
Running: "/home/projects/ppc64le-pwr8-nvidia/openmpi/2.1.2/gcc/7.2.0/cuda/9.2.88/bin/mpiexec" "-np" "4" "-map-by" "socket:PE=4" "/home/jenkins/ride/workspace/Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug/SRC_AND_BUILD/BUILD/packages/panzer/adapters-stk/example/MixedCurlLaplacianExample/PanzerAdaptersSTK_MixedCurlLaplacianExample.exe" "--use-tpetra" "--use-twod" "--x-blocks=2" "--cell=Quad" "--x-elements=32" "--y-elements=32" "--z-elements=4" "--basis-order=1" "--output-filename=multiblock-"
Writing output to file "/home/jenkins/ride/workspace/Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug/SRC_AND_BUILD/BUILD/packages/panzer/adapters-stk/example/MixedCurlLaplacianExample/MPE-Multiblock-ConvTest-Quad-1-32"
--------------------------------------------------------------------------------
Teuchos::GlobalMPISession::GlobalMPISession(): started processor with name ride13 and rank 0!
Teuchos::GlobalMPISession::GlobalMPISession(): started processor with name ride13 and rank 1!
Teuchos::GlobalMPISession::GlobalMPISession(): started processor with name ride13 and rank 2!
Teuchos::GlobalMPISession::GlobalMPISession(): started processor with name ride13 and rank 3!
:0: : block: [1,0,0], thread: [0,224,0] Assertion `View bounds error of view FixedHashTable::pairs` failed.
:0: : block: [1,0,0], thread: [0,225,0] Assertion `View bounds error of view FixedHashTable::pairs` failed.
:0: : block: [1,0,0], thread: [0,227,0] Assertion `View bounds error of view FixedHashTable::pairs` failed.
:0: : block: [1,0,0], thread: [0,228,0] Assertion `View bounds error of view FixedHashTable::pairs` failed.
:0: : block: [1,0,0], thread: [0,230,0] Assertion `View bounds error of view FixedHashTable::pairs` failed.
:0: : block: [1,0,0], thread: [0,231,0] Assertion `View bounds error of view FixedHashTable::pairs` failed.
...
terminate called after throwing an instance of 'std::runtime_error'
what(): cudaGetLastError() error( cudaErrorAssert): device-side assert triggered /home/jenkins/ride/workspace/Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug/SRC_AND_BUILD/Trilinos/packages/kokkos/core/src/Cuda/Kokkos_CudaExec.hpp:401
Traceback functionality not available
[ride13:97525] *** Process received signal ***
[ride13:97525] Signal: Aborted (6)
[ride13:97525] Signal code: (-6)
[ride13:97525] [ 0] [0x3fff97370478]
@crtrott and @mhoemmen, who on the Kokkos team has time to put in the hooks to generate a stacktrace when random failures like this occur?
@bartlettroscoe Both of us have a deadline this week. I'll be happy to work on this afterwards.
@mhoemmen said:
Both of us have a deadline this week. I'll be happy to work on this afterwards.
That is fine. As far as I know, this is not super urgent (but it is important). @rppawlo has spent days trying to reproduce this and can't.
(Saving days of @rppawlo's and other developers' time tracking down random failures coming out of Kokkos would likely be classified as important :-) )
@rppawlo, @mhoemmen, @trilinos/tpetra,
We are also seeing the test `TpetraCore_ImportExport2_UnitTests_MPI_4` randomly failing as well, 3 times in the last 30 days as shown here, with the same `Assertion 'View bounds error of view FixedHashTable::pairs' failed` error, like:
0: : block: [4,0,0], thread: [0,32,0] Assertion `View bounds error of view FixedHashTable::pairs` failed.
...
:0: : block: [4,0,0], thread: [0,159,0] Assertion `View bounds error of view FixedHashTable::pairs` failed.
[ascicgpu14:08319] *** Process received signal ***
[ascicgpu14:08319] Signal: Floating point exception (8)
[ascicgpu14:08319] Signal code: Integer divide-by-zero (1)
[ascicgpu14:08319] Failing at address: 0x7f76a9998261
The final error shows:
[ascicgpu14:08319] Signal: Floating point exception (8)
[ascicgpu14:08319] Signal code: Integer divide-by-zero (1)
instead of an exception being thrown.
What is the chance these random TpetraCore test failures are not related to these random failures seen in Panzer?
NOTE: There were only two other Tpetra test failures in the last 30 days on any of the CUDA builds, as shown here, but they look to be unrelated, or at least they don't show this view bounds error.
@trilinos/xpetra
We also saw this `Assertion 'View bounds error of view FixedHashTable::pairs' failed` error in the test `Xpetra_BlockedCrsMatrix_UnitTests_MPI_4` in the same build `Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug`, shown here, showing:
27. BlockedCrsMatrix_double_int_int_Kokkos_Compat_KokkosCudaWrapperNode_ReorderBlockOperatorThyra_UnitTest ... :0: : block: [0,0,0], thread: [0,64,0] Assertion `View bounds error of view FixedHashTable::pairs` failed.
:0: : block: [1,0,0], thread: [0,64,0] Assertion `View bounds error of view FixedHashTable::pairs` failed.
...
:0: : block: [0,0,0], thread: [0,31,0] Assertion `View bounds error of view FixedHashTable::pairs` failed.
[ascicgpu14:14395] *** Process received signal ***
[ascicgpu14:14395] Signal: Floating point exception (8)
[ascicgpu14:14395] Signal code: Integer divide-by-zero (1)
Huh, these look rather new. @trilinos/tpetra
FYI: As shown in this long query, we are not only seeing these failures in the build `Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug` (though most are there).
We also saw this same "View Bounds" error in the test:
in the build:
on 'ride' on 2019-06-12 showing:
:0: : block: [1,0,0], thread: [0,224,0] Assertion `View bounds error of view FixedHashTable::pairs` failed.
...
:0: : block: [0,0,0], thread: [0,95,0] Assertion `View bounds error of view FixedHashTable::pairs` failed.
terminate called after throwing an instance of 'std::runtime_error'
what(): cudaGetLastError() error( cudaErrorAssert): device-side assert triggered /home/jenkins/ride/workspace/Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug/SRC_AND_BUILD/Trilinos/packages/kokkos/core/src/Cuda/Kokkos_CudaExec.hpp:401
Traceback functionality not available
[ride13:97525] *** Process received signal ***
[ride13:97525] Signal: Aborted (6)
[ride13:97525] Signal code: (-6)
[ride13:97525] [ 0] [0x3fff97370478]
If this is related to overloading the GPUs, how could we tell?
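One hedged way to check the overload theory (a sketch, not something the test harness currently does) would be to have each test log free device memory at startup with `cudaMemGetInfo`; if several concurrently scheduled tests were exhausting the same card, a failing test would report near-zero free bytes:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Sketch: log free vs. total device memory at test startup so that
// "the GPU ran out of memory because -j was too high" shows up in logs.
int main () {
  size_t freeBytes = 0, totalBytes = 0;
  if (cudaMemGetInfo (&freeBytes, &totalBytes) == cudaSuccess) {
    std::printf ("GPU memory: %zu MiB free of %zu MiB total\n",
                 freeBytes >> 20, totalBytes >> 20);
  }
  return 0;
}
```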
CC: @trilinos/muelu, @srajama1 (Trilinos Linear Solvers Product Area Lead)
We also saw this error in the test:
in the build:
on 'waterman' on 2019-06-19, showing:
10. BlockedDirectSolver_double_int_int_Kokkos_Compat_KokkosCudaWrapperNode_NestedI53I42II01II_Thyra_BlockedDirectSolver_Setup_Apply_UnitTest ... MueLu::Amesos2Smoother: using "Klu"
Setup BlockedDirectSolver (MueLu::BlockedDirectSolver{type = blocked direct solver})
Setup Smoother (MueLu::Amesos2Smoother{type = Klu})
MergedBlockedMatrix (MueLu::MergedBlockedMatrixFactory)
A (merged) size = 640 x 640, nnz = 640
[empty list]
ReorderBlockA factory (MueLu::ReorderBlockAFactory)
Got a 6x6 blocked operator as input
Reordering A using [ 5 3 [ 4 2 ] [ 0 1 ] ] block gives a 4x4 blocked operators
Reorder Type = [5 3 [4 2] [0 1]]
:0: : block: [0,0,0], thread: [0,96,0] Assertion `View bounds error of view FixedHashTable::pairs` failed.
...
:0: : block: [1,0,0], thread: [0,155,0] Assertion `View bounds error of view FixedHashTable::pairs` failed.
terminate called after throwing an instance of 'std::runtime_error'
what(): cudaDeviceSynchronize() error( cudaErrorAssert): device-side assert triggered /home/jenkins/waterman/workspace/Trilinos-atdm-waterman-cuda-9.2-release-debug/SRC_AND_BUILD/Trilinos/packages/kokkos/core/src/Cuda/Kokkos_Cuda_Impl.cpp:119
Traceback functionality not available
[waterman2:103528] *** Process received signal ***
[waterman2:103528] Signal: Aborted (6)
[waterman2:103528] Signal code: (-6)
[waterman2:103528] [ 0] [0x7fff895104d8]
Thus, this is not isolated to the machine 'ascicgpu14' running the 'sems-rhel7' builds (though most failures are in the `Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug` build).
Need some more info from Kokkos to help us debug these failures.
FYI: The `MueLu_UnitTestsBlockedTpetra_MPI_4` failure was not a fluke, as it happened again on 6/20/2019 as shown here (but not today). That was two days in a row on the same machine.
@ibaned, @crtrott, @nmhamster, @rppawlo,
Is there any chance that overloading the GPU could cause errors like this, or is it more likely a code defect (with a race condition)?
But all of these failures are in Tpetra tests or in tests downstream from Tpetra. If it were a problem with overloading the GPU, would that not also cause native Kokkos tests to fail randomly too?
@trilinos/amesos2, @srajama1 (Trilinos Linear Solvers Product Lead)
As shown in this query, the test:
`Amesos2_SuperLU_DIST_Solver_Test_MPI_4`
in the build:
`Trilinos-atdm-waterman-cuda-9.2-release-debug`
has randomly failed 3 times in the last 12 days, showing failures like the one shown here:
:0: : block: [0,0,0], thread: [0,96,0] Assertion `View bounds error of view FixedHashTable::pairs` failed.
...
:0: : block: [0,0,0], thread: [0,95,0] Assertion `View bounds error of view FixedHashTable::pairs` failed.
terminate called after throwing an instance of 'std::runtime_error'
what(): cudaGetLastError() error( cudaErrorAssert): device-side assert triggered /home/jenkins/waterman/workspace/Trilinos-atdm-waterman-cuda-9.2-release-debug/SRC_AND_BUILD/Trilinos/packages/kokkos/core/src/Cuda/Kokkos_CudaExec.hpp:401
Traceback functionality not available
[waterman2:27305] *** Process received signal ***
[waterman2:27305] Signal: Aborted (6)
[waterman2:27305] Signal code: User function (kill, sigsend, abort, etc.) (0)
[waterman2:27305] [ 0] [0x7fff804b04d8]
@bartlettroscoe: This feels like an environment issue or some new check-in to Tpetra. I will add @kddevin so she is aware of it.
@srajama1 said:
This feels like an environment issue or some new check-in to Tpetra. I will add @kddevin so she is aware of it.
As far as I know, the env on 'waterman' has not changed in a long time. Therefore, if it is an env issue, then a change in Trilinos is triggering this. Because these are all random failures, it is hard to determine what change might have triggered this. One thing all of these failing tests have in common is that they all use Tpetra. If we could get a stack trace out of Kokkos when this occurs, then we would be able to start narrowing this down perhaps.
Otherwise, we will just continue to monitor these failures. As long as the APPs are not seeing these failures, this is not high urgency to address. (It just creates extra noise in our triaging process, but I think we can extend our tools to more effectively filter based on this error.)
Bug Report
CC: @trilinos/panzer, @kddevin (Trilinos Data Services Product Lead), @srajama1 (Trilinos Linear Solvers Product Lead), @mperego (Trilinos Discretizations Product Lead), @bartlettroscoe, @fryeguy52
Next Action Status
Since PR #5346 (which fixed a file read/write race in the test) was merged on 6/7/2019, there has been only one failing Panzer test on any ATDM Trilinos platform as of 6/11/2019 that looks to be related. Also, on 6/11/2019 @bathmatt reported that EMPIRE is no longer failing in a similar way in his recent tests. Next: watch results over the next few weeks to see if more random failures like this occur ...
Description
As shown in this query the tests:
are failing in the build:
Additionally, the test:
is failing in a different build on the same machine:
Expand to see new commits on 2019-05-14
```
*** Base Git Repo: Trilinos
7b6d69a: Merge remote-tracking branch 'origin/develop' into atdm-nightly
Author: Roscoe A. Bartlett
```
Current Status on CDash
Results for the current testing day
Steps to Reproduce
One should be able to reproduce this failure on waterman as described in:
More specifically, the commands given for waterman are provided at:
The exact commands to reproduce this issue should be: