Panzer has not changed recently. According to EMPIRE testing (thanks @jmgate!), these are the candidate commits that could have caused the new Panzer failure:
* b3a8dc8 (kyukim@sandia.gov) Mon May 13 21:28:33 2019 -0600
|  Merge pull request #5167 from kyungjoo-kim/ifpack2-develop
|  Ifpack2 develop
* 085e9d8 (trilinos-autotester@trilinos.org) Mon May 13 18:57:23 2019 -0600
|  Merge Pull Request #5138 from trilinos/Trilinos/zoltan_fix5106
|  PR Title: zoltan: minor change to fix #5106
|  PR Author: kddevin
* 238800a (kyukim@sandia.gov) Mon May 13 15:05:45 2019 -0600
|  Merge pull request #5163 from kyungjoo-kim/fix-5148
|  Ifpack2 - fix for #5148
* 7b827c7 (tjfulle@sandia.gov) Mon May 13 14:57:31 2019 -0600
   Tpetra: resolution to #5161 (#5162)
@kyungjoo-kim, @tjfulle, do you guys know if your recent commits listed above might have caused this?
@rppawlo @bartlettroscoe No, my commits are only intended for ifpack2 blocktridicontainer. The commits are not related to Panzer.
Does Panzer use Ifpack2?
Panzer may use other Ifpack2 components, but it does not use the BlockTriDiContainer solver. The solver I am working on is only used by SPARC.
@rppawlo and I talked about this over e-mail. The issue is that Trilinos does not yet work correctly when deprecated Tpetra code is disabled (`Tpetra_ENABLE_DEPRECATED_CODE:BOOL=OFF`). See, e.g., the following issues:
@trilinos/tpetra is working on fixing these. The work-around for now is to leave deprecated code enabled.
@mhoemmen correct me if I'm wrong, but these failures don't have `Tpetra_ENABLE_DEPRECATED_CODE:BOOL=OFF` set.
As @bathmatt mentioned, deprecated code is enabled in these builds, so the errors from these tests are something different. One test shows:
terminate called after throwing an instance of 'std::runtime_error'
what(): cudaDeviceSynchronize() error( cudaErrorIllegalAddress): an illegal memory access was encountered /home/jenkins/waterman/workspace/Trilinos-atdm-waterman-cuda-9.2-opt/SRC_AND_BUILD/Trilinos/packages/kokkos/core/src/Cuda/Kokkos_Cuda_Impl.cpp:119
Traceback functionality not available
While the two other failures show:
terminate called after throwing an instance of 'std::runtime_error'
what(): /home/jenkins/waterman/workspace/Trilinos-atdm-waterman-cuda-9.2-opt/SRC_AND_BUILD/Trilinos/packages/tpetra/core/src/Tpetra_CrsGraph_def.hpp:3958:
Throw number = 1
Throw test that evaluated to true: (makeIndicesLocalResult.first != 0)
Tpetra::CrsGraph<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::fillComplete: (Process 1) When converting column indices from global to local, we encountered 462 indices that do not live in the column Map on this process. That's too many to print.
[waterman2:05756] *** Process received signal ***
@mhoemmen - are there any changes to Tpetra in the last 2 days that might have triggered this?
I don't think so, but it's possible. @trilinos/tpetra
For debugging, to see if this is a Panzer issue, we could adjust that print threshold temporarily.
Try also setting the environment variable `TPETRA_DEBUG=1`. In the worst case, we could also set `TPETRA_VERBOSE=CrsGraph`.
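For reference, both switches are ordinary process environment variables that Tpetra reads at runtime. A minimal sketch of how such a flag is typically consulted (illustrative only, not Tpetra's actual implementation, which lives in `Tpetra::Details::Behavior` and is more elaborate):

```cpp
#include <cstdlib>
#include <string>

// Illustrative only: reading an environment-variable switch in the style
// of TPETRA_DEBUG.  Unset (or "0"/"OFF"/"FALSE") means disabled.
bool envFlagEnabled (const char* const name) {
  const char* const val = std::getenv (name);
  if (val == nullptr) {
    return false; // variable not set
  }
  const std::string s (val);
  return s != "0" && s != "OFF" && s != "FALSE";
}
```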
I'm now seeing the second error Roger mentioned in EMPIRE with the EMPlasma Trilinos. So this isn't a new bug; it is an older bug that, it looks like, is starting to pop up more often.
My statement might be incorrect; I wiped everything clean and it looks like it isn't popping up anymore.
After rebuilding from scratch, this looks like the parallel level is too high and the CUDA card is running out of memory with multiple tests hitting the same card. In the steps to reproduce, the tests are run with `ctest -j20`. I could not reproduce the errors running the tests manually or when the ctest parallel level was reduced. I think we run the other CUDA machines at `-j8`. Maybe we need to do that here also?
@rppawlo, looking at the Jenkins driver at:
it shows:
ATDM_CONFIG_CTEST_PARALLEL_LEVEL=8
Therefore, it is running them with `ctest -j8`.
But that may be too much for some of these Panzer tests?
I think that is ok. The instructions at the top of this ticket have `-j20`, so I assumed that is what the tests were running. With `-j20` I see a bunch of failures; with `-j8` nothing fails. Do the ATDM build scripts wipe the build directory? Some of the reported failures went away for both Matt and me with a clean build.
@rppawlo asked:
Do the ATDM build scripts wipe the build directory? Some of the reported failures went away for both Matt and me with a clean build.
By default, all of the nightly ATDM Trilinos builds build from scratch each day. We can look on Jenkins to be sure that is the case. For example, at:
it shows:
09:30:08 Cleaning out binary directory '/home/jenkins/waterman/workspace/Trilinos-atdm-waterman-cuda-9.2-opt/SRC_AND_BUILD/BUILD' ...
and does not show any errors so I would assume that it is blowing away the directories.
It looks like these are also failing on non-waterman builds. There are 74 failing PanzerAdaptersSTK* tests across 4 builds between 2019-05-01 and 2019-05-29, shown [here](https://testing.sandia.gov/cdash/queryTests.php?project=Trilinos&filtercount=7&showfilters=1&filtercombine=and&field1=buildname&compare1=65&value1=Trilinos-atdm-&field2=buildname&compare2=64&value2=-white-ride-&field3=buildname&compare3=64&value3=-mutrino-&field4=testname&compare4=65&value4=PanzerAdaptersSTK&field5=status&compare5=61&value5=Failed&field6=buildstarttime&compare6=84&value6=2019-05-29T00%3A00%3A00&field7=buildstarttime&compare7=83&value7=2019-05-01T00%3A00%3A00)
Note that the above link filters out builds on white and ride, because we have seen a lot of failures on those machines recently, but these tests may be failing there too (see failures on white/ride in the last 2 weeks).
All failures are in CUDA builds using the Tpetra deprecated dynamic profile. I've tracked the multiblock test failure to a separate issue and will push a fix shortly.
The majority of the random errors look to be in the fillComplete on the CrsMatrix. I have not had good luck reproducing in raw panzer tests. EMPIRE is also seeing similar failures and @bathmatt was able to get the following stack trace:
#0 0x0000000017f868bc in std::enable_if<Kokkos::is_view<Kokkos::View<long long const*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<0u> > >::value&&Kokkos::is_view<Kokkos::View<long long*, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, void, void> >::value, void>::type Tpetra::Distributor::doPosts<Kokkos::View<long long const*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<0u> >, Kokkos::View<long long*, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, void, void> >(Kokkos::View<long long const*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<0u> > const&, Teuchos::ArrayView<unsigned long const> const&, Kokkos::View<long long*, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, void, void> const&, Teuchos::ArrayView<unsigned long const> const&) ()
#1 0x0000000017f87420 in std::enable_if<Kokkos::is_view<Kokkos::View<long long const*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<0u> > >::value&&Kokkos::is_view<Kokkos::View<long long*, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, void, void> >::value, void>::type Tpetra::Distributor::doPostsAndWaits<Kokkos::View<long long const*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<0u> >, Kokkos::View<long long*, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, void, void> >(Kokkos::View<long long const*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<0u> > const&, Teuchos::ArrayView<unsigned long const> const&, Kokkos::View<long long*, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, void, void> const&, Teuchos::ArrayView<unsigned long const> const&) ()
#2 0x0000000017f8ce58 in Tpetra::DistObject<long long, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::doTransferNew(Tpetra::SrcDistObject const&, Tpetra::CombineMode, unsigned long, Kokkos::DualView<int const*, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, void, void> const&, Kokkos::DualView<int const*, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, void, void> const&, Kokkos::DualView<int const*, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, void, void> const&, Kokkos::DualView<int const*, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, void, void> const&, Tpetra::Distributor&, Tpetra::DistObject<long long, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::ReverseOption, bool, bool) ()
#3 0x0000000017f73bc0 in Tpetra::DistObject<long long, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::doTransfer(Tpetra::SrcDistObject const&, Tpetra::Details::Transfer<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> > const&, char const*, Tpetra::DistObject<long long, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::ReverseOption, Tpetra::CombineMode, bool) ()
#4 0x0000000017f6fd1c in Tpetra::DistObject<long long, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::doExport(Tpetra::SrcDistObject const&, Tpetra::Export<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> > const&, Tpetra::CombineMode, bool) ()
#5 0x0000000016f3ff34 in Tpetra::CrsGraph<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::globalAssemble() ()
#6 0x0000000016f40d90 in Tpetra::CrsGraph<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::fillComplete(Teuchos::RCP<Tpetra::Map<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> > const> const&, Teuchos::RCP<Tpetra::Map<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> > const> const&, Teuchos::RCP<Teuchos::ParameterList> const&) ()
#7 0x00000000130b5aa4 in panzer::BlockedTpetraLinearObjFactory<panzer::Traits, double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::buildTpetraGraph(int, int) const ()
#8 0x00000000130cf5d0 in panzer::BlockedTpetraLinearObjFactory<panzer::Traits, double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::getGraph(int, int) const ()
#9 0x00000000130ba304 in panzer::BlockedTpetraLinearObjFactory<panzer::Traits, double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::getTpetraMatrix(int, int) const ()
#10 0x0000000012fe0430 in panzer::L2Projection<int, long long>::buildMassMatrix(bool, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, double, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, double> > > const*) ()
The failures occur in different ways - maybe a race condition? Sometimes we see a raw seg fault and sometimes we get the following two different errors reported from tpetra:
terminate called after throwing an instance of 'std::runtime_error'
what(): /scratch/jenkins/ascicgpu14/workspace/Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug/SRC_AND_BUILD/Trilinos/packages/tpetra/core/src/Tpetra_CrsGraph_def.hpp:3958:
Throw number = 1
Throw test that evaluated to true: (makeIndicesLocalResult.first != 0)
Tpetra::CrsGraph<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::fillComplete: (Process 3) When converting column indices from global to local, we encountered 72 indices that does not live in the column Map on this process.
(Process 3) Here are the bad global indices, listed by local row:
(Process 3) Local row 262 (global row 558): [550,551,560,561,666,667,668,669,670]
(Process 3) Local row 263 (global row 559): [550,551,560,561,666,667,668,669,670]
(Process 3) Local row 264 (global row 570): [562,563,572,573,686,687,688,689,690]
(Process 3) Local row 265 (global row 571): [562,563,572,573,686,687,688,689,690]
(Process 3) Local row 266 (global row 582): [574,575,584,585,706,707,708,709,710]
(Process 3) Local row 267 (global row 583): [574,575,584,585,706,707,708,709,710]
(Process 3) Local row 270 (global row 606): [598,599,608,609,746,747,748,749,750]
(Process 3) Local row 271 (global row 607): [598,599,608,609,746,747,748,749,750]
Teuchos::GlobalMPISession::GlobalMPISession(): started processor with name ascicgpu14 and rank 2!
Teuchos::GlobalMPISession::GlobalMPISession(): started processor with name ascicgpu14 and rank 3!
Teuchos::GlobalMPISession::GlobalMPISession(): started processor with name ascicgpu14 and rank 0!
Teuchos::GlobalMPISession::GlobalMPISession(): started processor with name ascicgpu14 and rank 1!
terminate called after throwing an instance of 'std::runtime_error'
what(): View bounds error of view MV::DualView ( -1 < 297 , 0 < 1 )
Traceback functionality not available
[ascicgpu14:80428] *** Process received signal ***
[ascicgpu14:80428] Signal: Aborted (6)
[ascicgpu14:80428] Signal code: (-6)
[ascicgpu14:80428] [ 0] /lib64/libpthread.so.0(+0xf5d0)[0x7f0ea620a5d0]
[ascicgpu14:80428] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x7f0ea55b7207]
[ascicgpu14:80428] [ 2] /lib64/libc.so.6(abort+0x148)[0x7f0ea55b88f8]
[ascicgpu14:80428] [ 3] /projects/sems/install/rhel7-x86_64/sems/compiler/gcc/7.2.0/base/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x125)[0x7f0ea5efa695]
[ascicgpu14:80428] [ 4] /projects/sems/install/rhel7-x86_64/sems/compiler/gcc/7.2.0/base/lib64/libstdc++.so.6(+0x8f316)[0x7f0ea5ef8316]
[ascicgpu14:80428] [ 5] /projects/sems/install/rhel7-x86_64/sems/compiler/gcc/7.2.0/base/lib64/libstdc++.so.6(+0x8f361)[0x7f0ea5ef8361]
[ascicgpu14:80428] [ 6] /projects/sems/install/rhel7-x86_64/sems/compiler/gcc/7.2.0/base/lib64/libstdc++.so.6(+0x8f614)[0x7f0ea5ef8614]
[ascicgpu14:80428] [ 7] /scratch/jenkins/ascicgpu14/workspace/Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug/SRC_AND_BUILD/BUILD/packages/kokkos/core/src/libkokkoscore.so.12(_ZN6Kokkos4Impl23throw_runtime_exceptionERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x369)[0x7f0ea7e8c809]
[ascicgpu14:80428] [ 8] /scratch/jenkins/ascicgpu14/workspace/Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug/SRC_AND_BUILD/BUILD/packages/panzer/dof-mgr/src/libpanzer-dof-mgr.so.12(_ZN6panzer10DOFManagerIixE22buildTaggedMultiVectorERKNS1_18ElementBlockAccessE+0xb7b)[0x7f0ee9a45e0b]
[ascicgpu14:80428] [ 9] /scratch/jenkins/ascicgpu14/workspace/Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug/SRC_AND_BUILD/BUILD/packages/panzer/dof-mgr/src/libpanzer-dof-mgr.so.12(_ZN6panzer10DOFManagerIixE19buildGlobalUnknownsERKN7Teuchos3RCPIKNS_12FieldPatternEEE+0x2ac)[0x7f0ee9a4838c]
[ascicgpu14:80428] [10] /scratch/jenkins/ascicgpu14/workspace/Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug/SRC_AND_BUILD/BUILD/packages/panzer/dof-mgr/src/libpanzer-dof-mgr.so.12(_ZN6panzer10DOFManagerIixE19buildGlobalUnknownsEv+0x245)[0x7f0ee9a4b235]
[ascicgpu14:80428] [11] /scratch/jenkins/ascicgpu14/workspace/Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug/SRC_AND_BUILD/BUILD/packages/panzer/disc-fe/src/libpanzer-disc-fe.so.12(_ZNK6panzer17DOFManagerFactoryIixE24buildUniqueGlobalIndexerINS_10DOFManagerIixEEEEN7Teuchos3RCPINS_19UniqueGlobalIndexerIixEEEERKNS6_IKNS5_13OpaqueWrapperIP19ompi_communicator_tEEEERKSt6vectorINS6_INS_12PhysicsBlockEEESaISK_EERKNS6_INS_11ConnManagerEEERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x642)[0x7f0eead9dd22]
[ascicgpu14:80428] [12] /scratch/jenkins/ascicgpu14/workspace/Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug/SRC_AND_BUILD/BUILD/packages/panzer/disc-fe/src/libpanzer-disc-fe.so.12(_ZNK6panzer17DOFManagerFactoryIixE24buildUniqueGlobalIndexerERKN7Teuchos3RCPIKNS2_13OpaqueWrapperIP19ompi_communicator_tEEEERKSt6vectorINS3_INS_12PhysicsBlockEEESaISE_EERKNS3_INS_11ConnManagerEEERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x9)[0x7f0eead9e449]
[ascicgpu14:80428] [13] /scratch/jenkins/ascicgpu14/workspace/Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug/SRC_AND_BUILD/BUILD/packages/panzer/adapters-stk/example/CurlLaplacianExample/PanzerAdaptersSTK_CurlLaplacianExample.exe[0x47aae0]
[ascicgpu14:80428] [14] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f0ea55a33d5]
[ascicgpu14:80428] [15] /scratch/jenkins/ascicgpu14/workspace/Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug/SRC_AND_BUILD/BUILD/packages/panzer/adapters-stk/example/CurlLaplacianExample/PanzerAdaptersSTK_CurlLaplacianExample.exe[0x47c768]
[ascicgpu14:80428] *** End of error message ***
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 80428 on node ascicgpu14 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
--------------------------------------------------------------------------------
TEST_2: Return code = 134
TEST_2: Pass criteria = Match REGEX {ALL PASSED: Tpetra} [FAILED]
TEST_2: Result = FAILED
================================================================================
@mhoemmen and @tjfulle - have there been any changes recently to tpetra that might cause these kinds of errors?
Tpetra changes to that code haven't landed yet. Try a debug build, and set the environment variable `TPETRA_DEBUG` to `1`. If that doesn't help, set the environment variable `TPETRA_VERBOSE` to `1`.
I'm trying that now. I have a test that fails in an optimized build 3 out of 4 runs; in debug it hasn't failed yet. Maybe this flag will show something.
I have the log files for two failed runs. They are different; nothing is reported, but it looks like a CUDA memory error. Should I rerun with verbose?
@bathmatt The output looks like you set `TPETRA_VERBOSE` -- did you not?
My suspicion is that something in EMPIRE is giving a CrsGraph or Map a host pointer, when they expect a UVM pointer. I remember @rppawlo finding something like this before.
cuda-memcheck shows a lot of
========= Program hit CUDA_ERROR_ALREADY_MAPPED (error 208) due to "resource already mapped" on CUDA API call to cuIpcOpenMemHandle.
errors in Tpetra. Any idea if this means anything?
Here is an example; it is still running very slowly:
========= Host Frame:/home/projects/ppc64le-pwr9-nvidia/openmpi/2.1.2/gcc/7.2.0/cuda/9.2.88/lib/libmpi.so.20 (PMPI_Send + 0x168) [0xb0178]
========= Host Frame:../EMPIRE_PIC.exe (_ZNK7Teuchos7MpiCommIiE4sendEiPKcii + 0x30) [0x209d5a0]
========= Host Frame:../EMPIRE_PIC.exe (_ZN7Teuchos4sendIicEEvPKT0_T_iiRKNS_4CommIS4_EE + 0xd8) [0x7806268]
========= Host Frame:../EMPIRE_PIC.exe (_ZN6Tpetra11Distributor7doPostsIN6Kokkos4ViewIPKcJNS2_10LayoutLeftENS2_6DeviceINS2_4CudaENS2_9CudaSpaceEEENS2_12MemoryTraitsILj0EEEEEENS3_IPcJSA_vvEEEEENSt9enable_ifIXaasrNS2_7is_viewIT_EE5valuesrNSH_IT0_EE5valueEvE4typeERKSI_RKN7Teuchos9ArrayViewIKmEERKSK_SV_ + 0x1124) [0x7f6b324]
========= Host Frame:../EMPIRE_PIC.exe (_ZN6Tpetra11Distributor15doPostsAndWaitsIN6Kokkos4ViewIPKcJNS2_10LayoutLeftENS2_6DeviceINS2_4CudaENS2_9CudaSpaceEEENS2_12MemoryTraitsILj0EEEEEENS3_IPcJSA_vvEEEEENSt9enable_ifIXaasrNS2_7is_viewIT_EE5valuesrNSH_IT0_EE5valueEvE4typeERKSI_RKN7Teuchos9ArrayViewIKmEERKSK_SV_ + 0x30) [0x7f6c040]
========= Host Frame:../EMPIRE_PIC.exe (_ZN6Tpetra10DistObjectIcixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_4CudaENS1_12CudaUVMSpaceEEEE13doTransferNewERKNS_13SrcDistObjectENS_11CombineModeEmRKNS1_8DualViewIPKiNS1_6DeviceIS4_NS1_9CudaSpaceEEEvvEESK_SK_SK_RNS_11DistributorENS7_13ReverseOptionEbb + 0x1be8) [0x7fbf6d8]
========= Host Frame:../EMPIRE_PIC.exe (_ZN6Tpetra10DistObjectIcixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_4CudaENS1_12CudaUVMSpaceEEEE10doTransferERKNS_13SrcDistObjectERKNS_7Details8TransferIixS6_EEPKcNS7_13ReverseOptionENS_11CombineModeEb + 0x4cc) [0x7fb6fcc]
========= Host Frame:../EMPIRE_PIC.exe (_ZN6Tpetra10DistObjectIcixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_4CudaENS1_12CudaUVMSpaceEEEE8doExportERKNS_13SrcDistObjectERKNS_6ExportIixS6_EENS_11CombineModeEb + 0x3ec) [0x7fb4c5c]
========= Host Frame:../EMPIRE_PIC.exe (_ZN6panzer12L2ProjectionIixE15buildMassMatrixEbPKSt13unordered_mapINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEdSt4hashIS8_ESt8equal_toIS8_ESaISt4pairIKS8_dEEE + 0xe74) [0x2fd9f34]
========= Host Frame:../EMPIRE_PIC.exe (_ZN6empire7solvers30createMassMatrixFromDOFManagerERKN7Teuchos3RCPIN6panzer10DOFManagerIixEEEERKNS2_IN10panzer_stk13STK_InterfaceEEERKNS2_INS3_11ConnManagerEEEPKSt13unordered_mapINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEdSt4hashISO_ESt8equal_toISO_ESaISt4pairIKSO_dEEE + 0xa50) [0x2911eb0]
========= Host Frame:../EMPIRE_PIC.exe (_ZN6empire30ElectroMagneticSolverInterfaceC1E18MainParameterListsNS_13MeshContainerERNS_5utils9TimeStampEb + 0x1ac8) [0x2a475b8]
========= Host Frame:../EMPIRE_PIC.exe (_Z16meshSpecificMainIN6empire10MeshTraitsIN6shards11TetrahedronILj4EEELi1EEEEvRN7Teuchos3RCPINS6_12StackedTimerEEERKNS7_IKNS6_7MpiCommIiEEEEdR21MainPicParameterListsbRNS0_13MeshContainerERNS0_5utils9TimeStampE + 0x3d00) [0x211fd60]
========= Host Frame:../EMPIRE_PIC.exe (main + 0xda8) [0x1f658e8]
========= Host Frame:/lib64/libc.so.6 [0x25100]
========= Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xc4) [0x252f4]
@bathmatt It smells like something in EMPIRE is giving Tpetra a host pointer wrapped in a device Kokkos::View. I am adding a function to Tpetra now that will diagnose and report this issue.
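Such a check is easy to sketch with the CUDA runtime API. The following is an illustration of the idea only (a hypothetical helper, not the actual function being added to Tpetra; field names follow the CUDA 9.x runtime API):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical diagnostic: report whether a pointer handed to a device
// View is device memory, managed (UVM) memory, or plain host memory.
void reportPointerKind (const void* ptr, const char* label) {
  cudaPointerAttributes attr;
  const cudaError_t err = cudaPointerGetAttributes (&attr, ptr);
  if (err == cudaErrorInvalidValue) {
    cudaGetLastError (); // clear the sticky error state
    // Plain host memory never registered with the CUDA runtime lands here.
    std::printf ("%s: unregistered HOST pointer %p\n", label, ptr);
  } else if (attr.isManaged) {
    std::printf ("%s: managed (UVM) pointer %p\n", label, ptr);
  } else if (attr.memoryType == cudaMemoryTypeDevice) {
    std::printf ("%s: device pointer %p\n", label, ptr);
  } else {
    std::printf ("%s: registered host pointer %p\n", label, ptr);
  }
}
```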
@mhoemmen said:
smells like something in EMPIRE is giving Tpetra a host pointer wrapped in a device Kokkos::View
I think this is occurring in Panzer tests as well so it is not unique to EMPIRE code.
Thanks. If that is the cause, how is this a heisenbug? Wouldn't it always fail that way?
I can run it repeatedly and see fail, pass, fail, ...
I'll take your patch and try it and see.
FYI: I had a talk with @rppawlo this morning about these failing Panzer tests. If you look at all of the ATDM Trilinos failing Panzer tests since 2019-06-01, you can see that after @rppawlo's merge of PR #5346 on 6/7/2019, which fixed a file read/write race, the only failing Panzer tests on any ATDM Trilinos platform have been:
The test PanzerAdaptersSTK_tSTK_IO_MPI_1 failed due to a random I/O failure:
Exodus Library Warning/Error: [ex_create_int]
ERROR: file create failed for output.exo
ERROR: Could not open output database 'output.exo' of type 'exodus'
0. tSTK_IO_fields_UnitTest ... -------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
which looks to be unrelated.
The only other failing test PanzerAdaptersSTK_MixedCurlLaplacianExample-ConvTest-Quad-Order-2 shows:
================================================================================
TEST_2
Running: "/home/projects/ppc64le-pwr8-nvidia/openmpi/2.1.2/gcc/7.2.0/cuda/9.2.88/bin/mpiexec" "-np" "4" "-map-by" "socket:PE=4" "/home/jenkins/ride/workspace/Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug-pt/SRC_AND_BUILD/BUILD/packages/panzer/adapters-stk/example/MixedCurlLaplacianExample/PanzerAdaptersSTK_MixedCurlLaplacianExample.exe" "--use-tpetra" "--use-twod" "--cell=Quad" "--x-elements=16" "--y-elements=16" "--z-elements=4" "--basis-order=2"
Writing output to file "/home/jenkins/ride/workspace/Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug-pt/SRC_AND_BUILD/BUILD/packages/panzer/adapters-stk/example/MixedCurlLaplacianExample/MPE-ConvTest-Quad-2-16"
--------------------------------------------------------------------------------
Teuchos::GlobalMPISession::GlobalMPISession(): started processor with name ride11 and rank 0!
Teuchos::GlobalMPISession::GlobalMPISession(): started processor with name ride11 and rank 1!
Teuchos::GlobalMPISession::GlobalMPISession(): started processor with name ride11 and rank 2!
Teuchos::GlobalMPISession::GlobalMPISession(): started processor with name ride11 and rank 3!
:0: : block: [1,0,0], thread: [0,75,0] Assertion `View bounds error of view FixedHashTable::pairs` failed.
...
:0: : block: [2,0,0], thread: [0,158,0] Assertion `View bounds error of view FixedHashTable::pairs` failed.
terminate called after throwing an instance of 'std::runtime_error'
what(): cudaGetLastError() error( cudaErrorAssert): device-side assert triggered /home/jenkins/ride/workspace/Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug-pt/SRC_AND_BUILD/Trilinos/packages/kokkos/core/src/Cuda/Kokkos_CudaExec.hpp:401
Traceback functionality not available
[ride11:53997] *** Process received signal ***
[ride11:53997] Signal: Aborted (6)
[ride11:53997] Signal code: (-6)
[ride11:53997] [ 0] [0x3fff8da20478]
[ride11:53997] [ 1] /lib64/libc.so.6(abort+0x2b4)[0x3fff80141f94]
...
@rppawlo thinks this might be due to a race condition, but there is not enough context to determine where, and he can't reproduce the problem himself.
To put this in context, as shown in this query there have been over 11K Panzer tests run across all ATDM CUDA builds since 6/7/2019, and only one fails in this way! So we should expect that this failure would be very hard to reproduce.
What we really need is for someone to add a stack trace when an error like:
:0: : block: [1,0,0], thread: [0,75,0] Assertion `View bounds error of view FixedHashTable::pairs` failed.
occurs. Then, when this occurs due to a rare random event, we can see where it is happening.
For example, we have code in Teuchos that uses BinUtils functions to generate a readable stack trace whenever we want, as long as the code was built with `-g`. This code has zero overhead until you ask for the stack trace (unlike a verbose printing option like Tpetra has).
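As a reminder of how that facility is used, a minimal sketch (assuming Trilinos was configured with stacktrace support, e.g. `Teuchos_ENABLE_STACKTRACE=ON`, and the code was built with `-g`):

```cpp
#include "Teuchos_stacktrace.hpp"

int main (int argc, char* argv[]) {
  // Install a handler that prints a readable stack trace (file names,
  // line numbers, and demangled symbols via BinUtils) on a segfault.
  Teuchos::print_stack_on_segfault ();

  // ... run the code under suspicion; a crash now reports where it died.
  return 0;
}
```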
Why can't we create a stack trace like this with Kokkos and Tpetra code? That would seem to be a critical bit of functionality for debugging rare failures like this.
Otherwise, since this is just very rarely randomly failing, let's see what happens over the next few weeks to gather more data.
In the meantime, @rppawlo is going to see if he can reproduce this with EMPIRE since @bathmatt reported that he can get EMPIRE to fail in a similar way about 1/3 of the time.
This morning @bathmatt reran the empire problem that was exhibiting this issue and he can't reproduce it either anymore.
@rppawlo said:
This morning @bathmatt reran the empire problem that was exhibiting this issue and he can't reproduce it either anymore.
Then let's just put this Issue in review and watch it for a bit.
@bartlettroscoe wrote:
Why can't we create a stack track like this with Kokkos and Tpetra code? That would seem to be a critical bit of functionality to debug rare failures like this.
I'm pretty sure that "API to get stack trace of device code" isn't a thing. The best we could do is catch those errors at the kernel launch level. Would that be OK?
@mhoemmen said:
I'm pretty sure that "API to get stack trace of device code" isn't a thing. The best we could do is catch those errors at the kernel launch level. Would that be OK?
Yeah, getting the stack trace for the host code would be a huge help, I think, to start to track this down. Can Kokkos be made to use the Teuchos stack trace stuff? Any code using TEUCHOS_TEST_FOR_EXCEPTION() and TEUCHOS_STANDARD_CATCH() automatically has this functionality.
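For context, a small example of the macro usage being referred to; code written this way picks up the stack-trace machinery automatically when it is enabled at configure time (a sketch, with a hypothetical `checkSize` function):

```cpp
#include "Teuchos_TestForException.hpp"
#include <stdexcept>

// Hypothetical example: reject negative sizes with a descriptive
// exception.  With stacktraces enabled, the throw site is recorded too.
void checkSize (const int n) {
  TEUCHOS_TEST_FOR_EXCEPTION(
    n < 0, std::invalid_argument,
    "checkSize: n = " << n << " must be nonnegative.");
}
```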
@bartlettroscoe wrote:
Can Kokkos be made to use the Teuchos stack trace stuff?
Using the macros directly would introduce a circular package dependency (Teuchos already depends optionally on Kokkos). However, Kokkos could perfectly well reimplement that stuff (e.g., the contents of `Teuchos_stacktrace.cpp`).
@bartlettroscoe I added a Kokkos feature request that points back to the discussion here. Does Teuchos normally pull in that BFD stuff (in `Teuchos_stacktrace.cpp`)? Is it needed for an effective stack trace?
@mhoemmen asked:
Does Teuchos normally pull in that BFD stuff (in `Teuchos_stacktrace.cpp`)? Is it needed for an effective stack trace?
I think yes. You need to provide file names and line numbers and demangled names. That is what that code does.
I added a Kokkos feature request that points back to the discussion here.
Using the macros directly would introduce a circular package dependency (Teuchos already depends optionally on Kokkos).
Perhaps, but it could be done through dependency inversion and dependency injection. This would allow users to inject any behavior they wanted when Kokkos raised an error and decided to abort. If you think about it, every library that aborts execution should provide a hook for these cases. The C++ standard library does.
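A minimal sketch of the kind of hook being proposed (all names here are hypothetical, not an existing Kokkos API; the pattern mirrors `std::set_terminate` from the standard library):

```cpp
#include <cstdlib>
#include <functional>
#include <utility>

// Hypothetical dependency-injection hook: the low-level library exposes a
// registration point, and clients inject behavior (e.g., printing a stack
// trace) to run before the library aborts.  No Teuchos dependency needed.
namespace example_lib {

std::function<void()> abort_hook; // empty by default

void set_abort_hook (std::function<void()> f) {
  abort_hook = std::move (f);
}

[[noreturn]] void fatal_error () {
  if (abort_hook) {
    abort_hook (); // e.g., a client-registered Teuchos stack-trace printer
  }
  std::abort ();
}

} // namespace example_lib
```

With a hook like this, Teuchos (or an application) could inject its stack-trace printer without Kokkos ever depending on Teuchos.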
However, Kokkos could perfectly well reimplement that stuff (e.g., the contents of `Teuchos_stacktrace.cpp`).
Fine with me. Just make sure they write strong automated tests like the ones that exist in Teuchos.
If you think about it, every library that aborts execution should provide a hook for these cases.
Kokkos does, actually; it's called `push_finalize_hook`. (I added it :-).)
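For reference, a minimal usage sketch of that hook (note that it fires during `Kokkos::finalize()`, which is not quite the abort-time hook discussed above):

```cpp
#include <Kokkos_Core.hpp>
#include <iostream>

int main (int argc, char* argv[]) {
  Kokkos::initialize (argc, argv);

  // Register a callable that Kokkos will invoke during finalize().
  Kokkos::push_finalize_hook ([] {
    std::cout << "Kokkos is about to finalize" << std::endl;
  });

  // ... application work ...

  Kokkos::finalize (); // runs the registered hooks
  return 0;
}
```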
@mhoemmen, we need to change Kokkos to support a mode so that, instead of throwing `std::runtime_error`, it gathers and prints a stacktrace and aborts. We should not underestimate how huge of an impact functionality like this will have on enabling the debugging of random errors like this.
Here is another one, for the test `PanzerAdaptersSTK_MixedCurlLaplacianMultiblockExample-ConvTest-Quad-Order-1` in the build `Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug` on 'ride' yesterday, showing:
TEST_2
Running: "/home/projects/ppc64le-pwr8-nvidia/openmpi/2.1.2/gcc/7.2.0/cuda/9.2.88/bin/mpiexec" "-np" "4" "-map-by" "socket:PE=4" "/home/jenkins/ride/workspace/Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug/SRC_AND_BUILD/BUILD/packages/panzer/adapters-stk/example/MixedCurlLaplacianExample/PanzerAdaptersSTK_MixedCurlLaplacianExample.exe" "--use-tpetra" "--use-twod" "--x-blocks=2" "--cell=Quad" "--x-elements=32" "--y-elements=32" "--z-elements=4" "--basis-order=1" "--output-filename=multiblock-"
Writing output to file "/home/jenkins/ride/workspace/Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug/SRC_AND_BUILD/BUILD/packages/panzer/adapters-stk/example/MixedCurlLaplacianExample/MPE-Multiblock-ConvTest-Quad-1-32"
--------------------------------------------------------------------------------
Teuchos::GlobalMPISession::GlobalMPISession(): started processor with name ride13 and rank 0!
Teuchos::GlobalMPISession::GlobalMPISession(): started processor with name ride13 and rank 1!
Teuchos::GlobalMPISession::GlobalMPISession(): started processor with name ride13 and rank 2!
Teuchos::GlobalMPISession::GlobalMPISession(): started processor with name ride13 and rank 3!
:0: : block: [1,0,0], thread: [0,224,0] Assertion `View bounds error of view FixedHashTable::pairs` failed.
:0: : block: [1,0,0], thread: [0,225,0] Assertion `View bounds error of view FixedHashTable::pairs` failed.
:0: : block: [1,0,0], thread: [0,227,0] Assertion `View bounds error of view FixedHashTable::pairs` failed.
:0: : block: [1,0,0], thread: [0,228,0] Assertion `View bounds error of view FixedHashTable::pairs` failed.
:0: : block: [1,0,0], thread: [0,230,0] Assertion `View bounds error of view FixedHashTable::pairs` failed.
:0: : block: [1,0,0], thread: [0,231,0] Assertion `View bounds error of view FixedHashTable::pairs` failed.
...
terminate called after throwing an instance of 'std::runtime_error'
what(): cudaGetLastError() error( cudaErrorAssert): device-side assert triggered /home/jenkins/ride/workspace/Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug/SRC_AND_BUILD/Trilinos/packages/kokkos/core/src/Cuda/Kokkos_CudaExec.hpp:401
Traceback functionality not available
[ride13:97525] *** Process received signal ***
[ride13:97525] Signal: Aborted (6)
[ride13:97525] Signal code: (-6)
[ride13:97525] [ 0] [0x3fff97370478]
@crtrott and @mhoemmen, who on the Kokkos team has time to put in the hooks to generate a stacktrace when random failures like this occur?
@bartlettroscoe Both of us have a deadline this week. I'll be happy to work on this afterwards.
@mhoemmen said:
Both of us have a deadline this week. I'll be happy to work on this afterwards.
That is fine. As far as I know, this is not super urgent (but it is important). @rppawlo has spent days trying to reproduce this and can't.
(Saving days of @rppawlo's and other developers' time tracking down random failures coming out of Kokkos would likely be classified as important :-) )
@rppawlo, @mhoemmen, @trilinos/tpetra,
We are also seeing the test `TpetraCore_ImportExport2_UnitTests_MPI_4` randomly failing as well, 3 times in the last 30 days as shown here, with the same `Assertion 'View bounds error of view FixedHashTable::pairs' failed` error, like:
0: : block: [4,0,0], thread: [0,32,0] Assertion `View bounds error of view FixedHashTable::pairs` failed.
...
:0: : block: [4,0,0], thread: [0,159,0] Assertion `View bounds error of view FixedHashTable::pairs` failed.
[ascicgpu14:08319] *** Process received signal ***
[ascicgpu14:08319] Signal: Floating point exception (8)
[ascicgpu14:08319] Signal code: Integer divide-by-zero (1)
[ascicgpu14:08319] Failing at address: 0x7f76a9998261
The final error shows:
[ascicgpu14:08319] Signal: Floating point exception (8)
[ascicgpu14:08319] Signal code: Integer divide-by-zero (1)
instead of an exception being thrown.
What is the chance these random TpetraCore test failures are not related to these random failures seen in Panzer?
NOTE: There were only two other Tpetra test failures in the last 30 days on any of the CUDA builds, as shown here, but they look to be unrelated, or at least they don't show this view bounds error.
@trilinos/xpetra
We also saw this `Assertion 'View bounds error of view FixedHashTable::pairs' failed` error in the test `Xpetra_BlockedCrsMatrix_UnitTests_MPI_4` in the same build `Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug`, shown here, showing:
27. BlockedCrsMatrix_double_int_int_Kokkos_Compat_KokkosCudaWrapperNode_ReorderBlockOperatorThyra_UnitTest ... :0: : block: [0,0,0], thread: [0,64,0] Assertion `View bounds error of view FixedHashTable::pairs` failed.
:0: : block: [1,0,0], thread: [0,64,0] Assertion `View bounds error of view FixedHashTable::pairs` failed.
...
:0: : block: [0,0,0], thread: [0,31,0] Assertion `View bounds error of view FixedHashTable::pairs` failed.
[ascicgpu14:14395] *** Process received signal ***
[ascicgpu14:14395] Signal: Floating point exception (8)
[ascicgpu14:14395] Signal code: Integer divide-by-zero (1)
Huh, these look rather new. @trilinos/tpetra
FYI: As shown in this long query, we are not only seeing these failures in the build `Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug` (though most are there).
We also saw this same "View Bounds" error in the test:
in the build:
on 'ride' on 2019-06-12 showing:
:0: : block: [1,0,0], thread: [0,224,0] Assertion `View bounds error of view FixedHashTable::pairs` failed.
...
:0: : block: [0,0,0], thread: [0,95,0] Assertion `View bounds error of view FixedHashTable::pairs` failed.
terminate called after throwing an instance of 'std::runtime_error'
what(): cudaGetLastError() error( cudaErrorAssert): device-side assert triggered /home/jenkins/ride/workspace/Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug/SRC_AND_BUILD/Trilinos/packages/kokkos/core/src/Cuda/Kokkos_CudaExec.hpp:401
Traceback functionality not available
[ride13:97525] *** Process received signal ***
[ride13:97525] Signal: Aborted (6)
[ride13:97525] Signal code: (-6)
[ride13:97525] [ 0] [0x3fff97370478]
If this is related to overloading the GPUs, how could we tell?
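One hedged way to check the overload theory (a sketch, not something the test harness currently does) would be to have each test log free device memory at startup with `cudaMemGetInfo`; if several concurrently scheduled tests were exhausting the same card, a failing test would report near-zero free bytes:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Sketch: log free vs. total device memory at test startup so that
// "the GPU ran out of memory because -j was too high" shows up in logs.
int main () {
  size_t freeBytes = 0, totalBytes = 0;
  if (cudaMemGetInfo (&freeBytes, &totalBytes) == cudaSuccess) {
    std::printf ("GPU memory: %zu MiB free of %zu MiB total\n",
                 freeBytes >> 20, totalBytes >> 20);
  }
  return 0;
}
```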
CC: @trilinos/muelu, @srajama1 (Trilinos Linear Solvers Product Area Lead)
We also saw this error in the test:
in the build:
on 'waterman' on 2019-06-19, showing:
10. BlockedDirectSolver_double_int_int_Kokkos_Compat_KokkosCudaWrapperNode_NestedI53I42II01II_Thyra_BlockedDirectSolver_Setup_Apply_UnitTest ... MueLu::Amesos2Smoother: using "Klu"
Setup BlockedDirectSolver (MueLu::BlockedDirectSolver{type = blocked direct solver})
Setup Smoother (MueLu::Amesos2Smoother{type = Klu})
MergedBlockedMatrix (MueLu::MergedBlockedMatrixFactory)
A (merged) size = 640 x 640, nnz = 640
[empty list]
ReorderBlockA factory (MueLu::ReorderBlockAFactory)
Got a 6x6 blocked operator as input
Reordering A using [ 5 3 [ 4 2 ] [ 0 1 ] ] block gives a 4x4 blocked operators
Reorder Type = [5 3 [4 2] [0 1]]
:0: : block: [0,0,0], thread: [0,96,0] Assertion `View bounds error of view FixedHashTable::pairs` failed.
...
:0: : block: [1,0,0], thread: [0,155,0] Assertion `View bounds error of view FixedHashTable::pairs` failed.
terminate called after throwing an instance of 'std::runtime_error'
what(): cudaDeviceSynchronize() error( cudaErrorAssert): device-side assert triggered /home/jenkins/waterman/workspace/Trilinos-atdm-waterman-cuda-9.2-release-debug/SRC_AND_BUILD/Trilinos/packages/kokkos/core/src/Cuda/Kokkos_Cuda_Impl.cpp:119
Traceback functionality not available
[waterman2:103528] *** Process received signal ***
[waterman2:103528] Signal: Aborted (6)
[waterman2:103528] Signal code: (-6)
[waterman2:103528] [ 0] [0x7fff895104d8]
Thus, this is not isolated to the machine 'ascicgpu14' running the 'sems-rhel7' builds (though most failures are in the `Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug` build).
Need some more info from Kokkos to help us debug these failures.
FYI: The `MueLu_UnitTestsBlockedTpetra_MPI_4` failure was not a fluke, as it happened again on 6/20/2019 as shown here (but not today). That was two days in a row on the same machine.
@ibaned, @crtrott, @nmhamster, @rppawlo,
Is there any chance that overloading the GPU could cause errors like this, or is it more likely a code defect (with a race condition)?
But all of these failures are in Tpetra tests or in tests downstream from Tpetra. If it were a problem with overloading the GPU, would that not also cause native Kokkos tests to fail randomly too?
@trilinos/amesos2, @srajama1 (Trilinos Linear Solvers Product Lead)
As shown in this query, the test:
`Amesos2_SuperLU_DIST_Solver_Test_MPI_4`
in the build:
`Trilinos-atdm-waterman-cuda-9.2-release-debug`
has randomly failed 3 times in the last 12 days, showing failures like the one shown here:
:0: : block: [0,0,0], thread: [0,96,0] Assertion `View bounds error of view FixedHashTable::pairs` failed.
...
:0: : block: [0,0,0], thread: [0,95,0] Assertion `View bounds error of view FixedHashTable::pairs` failed.
terminate called after throwing an instance of 'std::runtime_error'
what(): cudaGetLastError() error( cudaErrorAssert): device-side assert triggered /home/jenkins/waterman/workspace/Trilinos-atdm-waterman-cuda-9.2-release-debug/SRC_AND_BUILD/Trilinos/packages/kokkos/core/src/Cuda/Kokkos_CudaExec.hpp:401
Traceback functionality not available
[waterman2:27305] *** Process received signal ***
[waterman2:27305] Signal: Aborted (6)
[waterman2:27305] Signal code: User function (kill, sigsend, abort, etc.) (0)
[waterman2:27305] [ 0] [0x7fff804b04d8]
@bartlettroscoe: This feels like an environment issue or some new check-in to Tpetra. I will add @kddevin so she is aware of it.
@srajama1 said:
This feels like an environment issue or some new check-in to Tpetra. I will add @kddevin so she is aware of it.
As far as I know, the env on 'waterman' has not changed in a long time. Therefore, if it is an env issue, then a change in Trilinos is triggering this. Because these are all random failures, it is hard to determine what change might have triggered this. One thing all of these failing tests have in common is that they all use Tpetra. If we could get a stack trace out of Kokkos when this occurs, then we would be able to start narrowing this down perhaps.
Otherwise, we will just continue to monitor these failures. As long as the APPs are not seeing these failures, this is not high urgency to address. (It just creates extra noise in our triaging process, but I think we can extend our tools to more effectively filter based on this error.)
Bug Report
CC: @trilinos/panzer, @kddevin (Trilinos Data Services Product Lead), @srajama1 (Trilinos Linear Solvers Product Lead), @mperego (Trilinos Discretizations Product Lead), @bartlettroscoe, @fryeguy52
Next Action Status
Since PR #5346 (which fixed a file read/write race in the test) was merged on 6/7/2019, there has been only one failing Panzer test on any ATDM Trilinos platform as of 6/11/2019 that looks to be related. Also, on 6/11/2019 @bathmatt reported that EMPIRE is no longer failing in a similar way in his recent tests. Next: watch results over the next few weeks to see if more random failures like this occur ...
Description
As shown in this query the tests:
are failing in the build:
Additionally, the test:
is failing in a different build on the same machine:
Expand to see new commits on 2019-05-14
```
*** Base Git Repo: Trilinos
7b6d69a: Merge remote-tracking branch 'origin/develop' into atdm-nightly
Author: Roscoe A. Bartlett
```
Current Status on CDash
Results for the current testing day
Steps to Reproduce
One should be able to reproduce this failure on waterman as described in:
More specifically, the commands given for waterman are provided at:
The exact commands to reproduce this issue should be: