trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/

Tpetra, Xpetra, Amesos2, MueLu, and PanzerAdaptersSTK_* tests failing in many ATDM cuda 9.2 builds with Kokkos view bounds errors starting before 2019-05-15 #5179

Closed fryeguy52 closed 4 years ago

fryeguy52 commented 5 years ago

Bug Report

CC: @trilinos/panzer, @kddevin (Trilinos Data Services Product Lead), @srajama1 (Trilinos Linear Solver Data Services), @mperego (Trilinos Discretizations Product Lead), @bartlettroscoe, @fryeguy52

Next Action Status

Since PR #5346, which fixed a file read/write race in the test, was merged on 6/7/2019, only one failing Panzer test that looks related has appeared on any ATDM Trilinos platform as of 6/11/2019. Also, on 6/11/2019 @bathmatt reported that EMPIRE is not failing in a similar way in his recent tests. Next: Watch results over the next few weeks to see if more random failures like this occur ...

Description

As shown in this query the tests:

are failing in the build:

Additionally the test:

is failing in a different build on the same machine:

New commits on 2019-05-14:

```
*** Base Git Repo: Trilinos
7b6d69a: Merge remote-tracking branch 'origin/develop' into atdm-nightly
Author: Roscoe A. Bartlett
Date: Mon May 13 21:05:15 2019 -0600

085e9d8: Merge Pull Request #5138 from trilinos/Trilinos/zoltan_fix5106
Author: trilinos-autotester
Date: Mon May 13 18:57:23 2019 -0600

238800a: Merge pull request #5163 from kyungjoo-kim/fix-5148
Author: kyungjoo-kim
Date: Mon May 13 15:05:45 2019 -0600

7b827c7: Tpetra: resolution to #5161 (#5162)
Author: Tim Fuller
Date: Mon May 13 14:57:31 2019 -0600

D packages/tpetra/core/src/Tpetra_Experimental_BlockMultiVector.cpp

925a0a7: Ifpack2 - fix for #5148
Author: Kyungjoo Kim
Date: Mon May 13 11:52:27 2019 -0600

M packages/ifpack2/src/Ifpack2_BlockTriDiContainer_impl.hpp

a847648: Made even bigger.
Author: K. Devine
Date: Wed May 8 10:48:35 2019 -0600

M packages/zoltan/src/driver/dr_main.c
M packages/zoltan/src/driver/dr_mainCPP.cpp

9051d9f: zoltan: minor change to fix #5106
Author: K. Devine
Date: Wed May 8 10:38:59 2019 -0600

M packages/zoltan/src/driver/dr_main.c
M packages/zoltan/src/driver/dr_mainCPP.cpp
```

Current Status on CDash

Results for the current testing day

Steps to Reproduce

One should be able to reproduce this failure on waterman as described in:

More specifically, the commands given for waterman are provided at:

The exact commands to reproduce this issue should be:

$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh Trilinos-atdm-waterman-cuda-9.2-opt
$ cmake \
 -GNinja \
 -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
 -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Panzer=ON \
 $TRILINOS_DIR
$ make NP=16
$ bsub -x -Is -n 20 ctest -j20
bathmatt commented 5 years ago

EMPIRE was seeing the integer divide by 0. It happened when creating a view, and there was a long discussion on the Kokkos IM channel. It came and went, and I believe it was a CUDA compiler bug. DavidH was looking at it as well.

srajama1 commented 5 years ago

We can follow up with NVIDIA if we have an example to reproduce this.

bartlettroscoe commented 5 years ago

@srajama1

> We can follow up with NVIDIA if we have an example to reproduce this.

Doesn't someone first need to isolate the code in Trilinos that is triggering this before anyone can create a reproducer for NVIDIA? Currently I don't think we know what code is triggering this. Having stack traces should be a good start.

trmcnealy commented 5 years ago

I'm having a similar issue with (Tpetra::CrsGraph<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::fillComplete), but it's not random. It happens every time. When I reduced the size of the mesh, fillComplete did not throw an error.

rppawlo commented 5 years ago

@kddevin @srajama1 - @bathmatt pinged you on a few issues. This is the original ticket with more information about the failures across the packages.

mhoemmen commented 5 years ago

We're stuck on trying to get a TPL into Kokkos. I need some advice from someone like @ibaned or @ndellingwood on that.

srajama1 commented 5 years ago

Is there a Kokkos issue corresponding to that ?

mhoemmen commented 5 years ago

There's a Kokkos PR: https://github.com/kokkos/kokkos/pull/2226 . We don't want to merge just yet because backtrace* are not POSIX standard functions. This is why we need to figure out how to add a Kokkos TPL. The usual TriBITS TPL mechanism doesn't work, because KokkosCore_config.h doesn't use the usual TriBITS header file generation process.
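To make the portability concern concrete, here is a minimal sketch (not the code in the Kokkos PR) that prints a stack trace with glibc's `backtrace` and `backtrace_symbols_fd` from `<execinfo.h>`. The helper name `print_stack_trace` and the `HAVE_EXECINFO` macro are made up for this illustration. The `__has_include` guard is only a stand-in for the configure-time TPL check discussed here, which a real build system would perform instead.

```cpp
#include <cstdio>

// Assumption: detect <execinfo.h> at compile time. A real build system
// would normally run this check at configure time (e.g. as a TPL).
#if defined(__has_include)
#  if __has_include(<execinfo.h>)
#    include <execinfo.h>
#    define HAVE_EXECINFO 1  // hypothetical macro, for this sketch only
#  endif
#endif

// Hypothetical helper: writes up to 64 frames of the current call stack
// to stderr and returns the number of frames captured (0 if unsupported).
int print_stack_trace() {
#ifdef HAVE_EXECINFO
  void* frames[64];
  const int n = backtrace(frames, 64);   // glibc extension, not POSIX
  backtrace_symbols_fd(frames, n, 2);    // 2 is the stderr file descriptor
  return n;
#else
  std::fputs("Traceback functionality not available\n", stderr);
  return 0;
#endif
}
```

Calling `print_stack_trace()` from an error handler would emit raw frame addresses and, where symbols are exported, function names. On platforms without `<execinfo.h>` it degrades to a one-line message, which is why the functionality has to be optional.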

jhux2 commented 5 years ago

The MueLu tests MueLu_DriverTpetraILU_MPI_4, MueLu_DriverTpetra_WithGlobalConstants_MPI_4, and MueLu_UnitTestsTpetra_MPI_4 are failing randomly and frequently with the error

terminate called after throwing an instance of 'std::runtime_error'
  what():  cudaDeviceSynchronize() error( cudaErrorIllegalAddress): an illegal memory access was encountered /home/jenkins/waterman/workspace/Trilinos-atdm-waterman-cuda-9.2-debug/SRC_AND_BUILD/Trilinos/packages/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:120
Traceback functionality not available

There are a few other MueLu tests that fail the same way, but less frequently.

Sometimes MueLu_UnitTestsTpetra_MPI_4 fails instead with an error like

:0: : block: [9,0,0], thread: [0,224,0] Assertion `View bounds error of view FixedHashTable::pairs` failed.

Here's the search query I used.

jhux2 commented 5 years ago

I have been able to reproduce on waterman in the MueLu scaling driver "Driver.cpp". In that code, the error manifests either during the initial matrix map construction or during the matrix construction. Here is the stack trace with Trilinos dev, SHA 6550bd788b, with some minor modifications to Driver.cpp to make reproducing easier.

#0  0x00007fff7572faf0 in raise () from /lib64/libc.so.6
#1  0x00007fff75731e6c in abort () from /lib64/libc.so.6
#2  0x00007fff759d0774 in __gnu_cxx::__verbose_terminate_handler () at ../../.././libstdc++-v3/libsupc++/vterminate.cc:95
#3  0x00007fff759cb504 in __cxxabiv1::__terminate (handler=<optimized out>) at ../../.././libstdc++-v3/libsupc++/eh_terminate.cc:47
#4  0x00007fff759c9928 in __cxa_call_terminate (ue_header=0x33f90950) at ../../.././libstdc++-v3/libsupc++/eh_call.cc:54
#5  0x00007fff759caaec in __cxxabiv1::__gxx_personality_v0 (version=<optimized out>, actions=<optimized out>, exception_class=<optimized out>, ue_header=0x33f90950, context=0x7ffff2b4c770) at ../../.././libstdc++-v3/libsupc++/eh_personality.cc:676
#6  0x00007fff758ec084 in _Unwind_RaiseException_Phase2 (exc=0x33f90950, context=0x7ffff2b4c770) at ../.././libgcc/unwind.inc:62
#7  0x00007fff758ecc04 in _Unwind_Resume (exc=0x33f90950) at ../.././libgcc/unwind.inc:230
#8  0x0000000012e571cc in Kokkos::Impl::ViewValueFunctor<Kokkos::Cuda, Kokkos::pair<long long, int>, false>::execute (this=0x339678e0, arg=true) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/kokkos/core/src/impl/Kokkos_ViewMapping.hpp:2535
#9  0x0000000012e5b01c in Kokkos::Impl::ViewValueFunctor<Kokkos::Cuda, Kokkos::pair<long long, int>, false>::destroy_shared_allocation (this=0x339678e0) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/kokkos/core/src/impl/Kokkos_ViewMapping.hpp:2553
#10 0x0000000012e599b4 in Kokkos::Impl::(anonymous namespace)::deallocate<Kokkos::CudaUVMSpace, Kokkos::Impl::ViewValueFunctor<Kokkos::Cuda, Kokkos::pair<long long, int>, false> > (record_ptr=0x33967890) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/kokkos/core/src/impl/Kokkos_SharedAlloc.hpp:196
#11 0x0000000014ca4eac in Kokkos::Impl::SharedAllocationRecord<void, void>::decrement (arg_record=0x33967890) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/kokkos/core/src/impl/Kokkos_SharedAlloc.cpp:273
#12 0x0000000012e363b4 in ~SharedAllocationTracker (this=0x7ffff2b4db50, __in_chrg=<optimized out>) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/kokkos/core/src/impl/Kokkos_SharedAlloc.hpp:358
#13 Kokkos::View<Kokkos::pair<long long, int>*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Kokkos::MemoryTraits<0u> >::~View (this=0x7ffff2b4db50, __in_chrg=<optimized out>) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/kokkos/core/src/Kokkos_View.hpp:1972
#14 0x0000000012e237fc in Tpetra::Details::FixedHashTable<long long, int, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::init (this=0x7ffff2b4df50, keys=..., startingValue=50, initMinKey=5150, initMaxKey=5199, firstContigKey=5050, lastContigKey=5099, computeInitContigKeys=true) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/tpetra/core/src/Tpetra_Details_FixedHashTable_def.hpp:1188
#15 0x0000000012e20828 in Tpetra::Details::FixedHashTable<long long, int, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::FixedHashTable (this=0x7ffff2b4df50, keys=..., firstContigKey=5050, lastContigKey=5099, startingValue=50, keepKeys=false) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/tpetra/core/src/Tpetra_Details_FixedHashTable_def.hpp:803
#16 0x0000000012f35868 in Tpetra::Map<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::initWithNonownedHostIndexList (this=0x33dbab10, numGlobalElements=10000, entryList_host=..., indexBase=0, comm=...) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/tpetra/core/src/Tpetra_Map_def.hpp:695
#17 0x0000000012f2eb5c in Tpetra::Map<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::Map (this=0x33dbab10, numGlobalElements=10000, entryList=..., indexBase=0, comm=..., __in_chrg=<optimized out>, __vtt_parm=<optimized out>) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/tpetra/core/src/Tpetra_Map_def.hpp:866
#18 0x000000001289d3d8 in Xpetra::TpetraMap<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::TpetraMap (this=0x33f904c0, numGlobalElements=10000, elementList=..., indexBase=0, comm=..., __in_chrg=<optimized out>, __vtt_parm=<optimized out>) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/xpetra/src/Map/Xpetra_TpetraMap_def.hpp:133
#19 0x0000000010148edc in Galeri::Xpetra::MapTraits<long long, Xpetra::TpetraMap<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> > >::Build (numGlobalElements=10000, elementList=..., indexBase=0, comm=...) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/galeri/src-xpetra/Galeri_MapTraits.hpp:123
#20 0x0000000010120990 in Galeri::Xpetra::Maps::Cartesian2D<int, long long, Xpetra::TpetraMap<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> > > (comm=..., nx=100, ny=100, mx=2, my=2, list=...) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/galeri/src-xpetra/Galeri_XpetraCartesian.hpp:147
#21 0x00000000100f7c88 in Galeri::Xpetra::CreateMap<int, long long, Xpetra::TpetraMap<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> > > (mapType="Cartesian2D", comm=..., list=...) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/galeri/src-xpetra/Galeri_XpetraMaps.hpp:256
#22 0x00000000100d4138 in Galeri::Xpetra::CreateMap<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> > (lib=Xpetra::UseTpetra, mapType="Cartesian2D", comm=..., list=...) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/galeri/src-xpetra/Galeri_XpetraMaps.hpp:146
#23 0x00000000100bf8f0 in MatrixLoad<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> > (comm=..., lib=@0x7ffff2b52514: Xpetra::UseTpetra, binaryFormat=false, matrixFile="", rhsFile="", rowMapFile="", colMapFile="", domainMapFile="", rangeMapFile="", coordFile="", nullFile="", map=..., A=..., coordinates=..., nullspace=..., X=..., B=..., numVectors=1, galeriParameters=..., xpetraParameters=..., galeriStream=...) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/muelu/test/scaling/MatrixLoad.hpp:125
#24 0x00000000100ade74 in main_<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> > (clp=..., lib=@0x7ffff2b52514: Xpetra::UseTpetra, argc=1, argv=0x7ffff2b52ca8) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/muelu/test/scaling/Driver.cpp:353
#25 0x000000001009f494 in Automatic_Test_ETI (argc=1, argv=0x7ffff2b52ca8) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/muelu/test/scaling/../unit_tests/MueLu_Test_ETI.hpp:162
#26 0x00000000100a0534 in main (argc=1, argv=0x7ffff2b52ca8) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/muelu/test/scaling/Driver.cpp:604
jhux2 commented 5 years ago

To reproduce on waterman:

do-config.txt 0001-MueLu-DO-NOT-COMMIT.txt

bathmatt commented 5 years ago

I saw lots of errors in this section when I compiled with -fsanitize and openmp. It was use of stack data after function or some such stuff.

kokkos has a lot of these warnings, not sure if they are real or not.

jhux2 commented 5 years ago

I added an extra fence in Tpetra_Details_FixedHashTable_def.hpp on line 1182, and the errors of the form

:0: : block: [3,0,0], thread: [0,151,0] Assertion `View bounds error of view FixedHashTable::pairs` failed.

go away. Changing line 1164 from if (buildInParallel) to if (false) also seems to make these types of errors go away.

CORRECTION: The extra fence has no effect. Changing line 1164 does make the error go away.
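For readers unfamiliar with this kind of table, the sketch below shows a sequential, bounds-checked fill of a CSR-style hash table (an offsets array plus a pairs array). It is plain C++ for illustration only and is not the Tpetra implementation: `SimpleFixedHashTable`, `bucket_of`, `build_sequential`, and `lookup` are hypothetical names, and the modulo hash is an assumption. It is meant to show what a non-parallel fill path, like the one presumably taken when the branch is forced to `if (false)`, has to guarantee: every write into the pairs array lands inside its bounds.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Illustrative CSR-style hash table: bucket b owns pairs[offsets[b]..offsets[b+1]).
struct SimpleFixedHashTable {
  std::vector<std::size_t> offsets;              // size = numBuckets + 1
  std::vector<std::pair<long long, int>> pairs;  // size = number of keys
};

// Hypothetical hash function: non-negative key modulo the bucket count.
inline std::size_t bucket_of(long long key, std::size_t numBuckets) {
  const long long n = static_cast<long long>(numBuckets);
  return static_cast<std::size_t>(((key % n) + n) % n);
}

// Sequential build: count per bucket, prefix-sum into offsets, then fill.
SimpleFixedHashTable build_sequential(const std::vector<long long>& keys,
                                      int startingValue,
                                      std::size_t numBuckets) {
  assert(numBuckets > 0);
  SimpleFixedHashTable t;
  t.offsets.assign(numBuckets + 1, 0);
  for (long long k : keys) ++t.offsets[bucket_of(k, numBuckets) + 1];
  for (std::size_t b = 0; b < numBuckets; ++b) t.offsets[b + 1] += t.offsets[b];

  t.pairs.resize(keys.size());
  // next[b] is the next free slot in bucket b.
  std::vector<std::size_t> next(t.offsets.begin(), t.offsets.end() - 1);
  for (std::size_t i = 0; i < keys.size(); ++i) {
    const std::size_t pos = next[bucket_of(keys[i], numBuckets)]++;
    assert(pos < t.pairs.size());  // analogue of the View bounds check
    t.pairs[pos] = {keys[i], startingValue + static_cast<int>(i)};
  }
  return t;
}

// Returns the value stored for key, or -1 if the key is absent.
int lookup(const SimpleFixedHashTable& t, long long key) {
  const std::size_t b = bucket_of(key, t.offsets.size() - 1);
  for (std::size_t p = t.offsets[b]; p < t.offsets[b + 1]; ++p)
    if (t.pairs[p].first == key) return t.pairs[p].second;
  return -1;
}
```

In a parallel fill, the `next[...]++` step has to become an atomic fetch-and-add, and the counts must be complete before the fill starts. If either condition is violated, `pos` can exceed the array size, which is the kind of failure the bounds assertion reports.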

jhux2 commented 5 years ago

I'm still seeing another type of error during FillComplete of the matrix in MueLu's Driver.cpp. Here is that backtrace.

#0  0x000000001402a678 in Tpetra::Distributor::doPosts<Kokkos::View<char const*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<0u> >, Kokkos::View<char*, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, void, void> > (this=0x6fbe2c30, exports=..., numExportPacketsPerLID=..., imports=..., numImportPacketsPerLID=...) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/tpetra/core/src/Tpetra_Distributor.hpp:2771
#1  0x0000000014024a30 in Tpetra::Distributor::doPostsAndWaits<Kokkos::View<char const*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<0u> >, Kokkos::View<char*, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, void, void> > (this=0x6fbe2c30, exports=..., numExportPacketsPerLID=..., imports=..., numImportPacketsPerLID=...) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/tpetra/core/src/Tpetra_Distributor.hpp:2049
#2  0x0000000014020704 in Tpetra::DistObject<char, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::doTransferNew (this=0x6fb75a08, src=..., CM=Tpetra::ADD, numSameIDs=0, permuteToLIDs=..., permuteFromLIDs=..., remoteLIDs=..., exportLIDs=..., distor=..., revOp=Tpetra::DistObject<char, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::DoForward, commOnHost=false, restrictedMode=false) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/tpetra/core/src/Tpetra_DistObject_def.hpp:1184
#3  0x0000000014019bc8 in Tpetra::DistObject<char, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::doTransfer (this=0x6fb75a08, src=..., transfer=..., modeString=0x7ffff48c1238 "doExport (forward mode)", revOp=Tpetra::DistObject<char, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::DoForward, CM=Tpetra::ADD, restrictedMode=false) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/tpetra/core/src/Tpetra_DistObject_def.hpp:606
#4  0x000000001401631c in Tpetra::DistObject<char, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::doExport (this=0x6fb75a08, source=..., exporter=..., CM=Tpetra::ADD, restrictedMode=false) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/tpetra/core/src/Tpetra_DistObject_def.hpp:347
#5  0x0000000013b78a74 in Tpetra::CrsMatrix<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::globalAssemble (this=0x6fb75a00) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/tpetra/core/src/Tpetra_CrsMatrix_def.hpp:4850
#6  0x0000000013b79a18 in Tpetra::CrsMatrix<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::fillComplete (this=0x6fb75a00, domainMap=..., rangeMap=..., params=...) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/tpetra/core/src/Tpetra_CrsMatrix_def.hpp:5039
#7  0x0000000013b7ab14 in Tpetra::CrsMatrix<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::fillComplete (this=0x6fb75a00, params=...) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/tpetra/core/src/Tpetra_CrsMatrix_def.hpp:4974
#8  0x00000000128c2040 in Xpetra::TpetraCrsMatrix<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::fillComplete (this=0x6fb759a0, params=...) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/xpetra/src/CrsMatrix/Xpetra_TpetraCrsMatrix_def.hpp:232
#9  0x00000000128857b0 in Xpetra::CrsMatrixWrap<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::fillComplete (this=0x6fb74f10, params=...) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/xpetra/sup/Matrix/Xpetra_CrsMatrixWrap_def.hpp:213
#10 0x0000000010235ef8 in Galeri::Xpetra::Cross2D<double, int, long long, Xpetra::Map<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >, Xpetra::CrsMatrixWrap<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> > > (map=..., nx=100, ny=100, a=4, b=-1, c=-1, d=-1, e=-1, DirichletBC=63, keepBCs=false) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/galeri/src-xpetra/Galeri_XpetraMatrixTypes.hpp:272
#11 0x0000000010208a44 in Galeri::Xpetra::Laplace2DProblem<double, int, long long, Xpetra::Map<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >, Xpetra::CrsMatrixWrap<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >, Xpetra::MultiVector<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> > >::BuildMatrix (this=0x6fb749a0) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/galeri/src-xpetra/Galeri_StencilProblems.hpp:146
#12 0x00000000100c0208 in MatrixLoad<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> > (comm=..., lib=@0x7ffff48c4d64: Xpetra::UseTpetra, binaryFormat=false, matrixFile="", rhsFile="", rowMapFile="", colMapFile="", domainMapFile="", rangeMapFile="", coordFile="", nullFile="", map=..., A=..., coordinates=..., nullspace=..., X=..., B=..., numVectors=1, galeriParameters=..., xpetraParameters=..., galeriStream=...) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/muelu/test/scaling/MatrixLoad.hpp:155
#13 0x00000000100adbbc in main_<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> > (clp=..., lib=@0x7ffff48c4d64: Xpetra::UseTpetra, argc=6, argv=0x7ffff48c5428) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/muelu/test/scaling/Driver.cpp:352
#14 0x000000001009f4d8 in Automatic_Test_ETI (argc=6, argv=0x7ffff48c5428) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/muelu/test/scaling/../unit_tests/MueLu_Test_ETI.hpp:160
#15 0x00000000100a029c in main (argc=6, argv=0x7ffff48c5428) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/muelu/test/scaling/Driver.cpp:571
jhux2 commented 5 years ago

Here's a third type of error:

```
#0 0x00007fff8fd8faf0 in raise () from /lib64/libc.so.6
#1 0x00007fff8fd91e6c in abort () from /lib64/libc.so.6
#2 0x00007fff90030774 in __gnu_cxx::__verbose_terminate_handler () at ../../.././libstdc++-v3/libsupc++/vterminate.cc:95
#3 0x00007fff9002b504 in __cxxabiv1::__terminate (handler=) at ../../.././libstdc++-v3/libsupc++/eh_terminate.cc:47
#4 0x00007fff90029928 in __cxa_call_terminate (ue_header=0x5a748d20) at ../../.././libstdc++-v3/libsupc++/eh_call.cc:54
#5 0x00007fff9002aaec in __cxxabiv1::__gxx_personality_v0 (version=, actions=, exception_class=, ue_header=0x5a748d20, context=0x7fffea4dcc00) at ../../.././libstdc++-v3/libsupc++/eh_personality.cc:676
#6 0x00007fff8ff4c084 in _Unwind_RaiseException_Phase2 (exc=0x5a748d20, context=0x7fffea4dcc00) at ../.././libgcc/unwind.inc:62
#7 0x00007fff8ff4cc04 in _Unwind_Resume (exc=0x5a748d20) at ../.././libgcc/unwind.inc:230
#8 0x0000000014ca939c in Kokkos::Impl::cuda_internal_error_throw (e=cudaErrorIllegalAddress, name=0x1535c810 "cudaDeviceSynchronize()", file=0x1535c7a8 "/ascldap/users/jhu/software/trilinos/Trilinos/packages/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp", line=120) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:125
#9 0x00000000100a4d74 in Kokkos::Impl::cuda_internal_safe_call (e=cudaErrorIllegalAddress, name=0x1535c810 "cudaDeviceSynchronize()", file=0x1535c7a8 "/ascldap/users/jhu/software/trilinos/Trilinos/packages/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp", line=120) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/kokkos/core/src/Cuda/Kokkos_Cuda_Error.hpp:58
#10 0x0000000014ca91a4 in Kokkos::Impl::cuda_device_synchronize () at /ascldap/users/jhu/software/trilinos/Trilinos/packages/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:120
#11 0x0000000014cabefc in Kokkos::Cuda::impl_static_fence () at /ascldap/users/jhu/software/trilinos/Trilinos/packages/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:794
#12 0x0000000014cae450 in Kokkos::CudaUVMSpace::deallocate (this=0x5a879428, arg_alloc_ptr=0x7ffec0c00000) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/kokkos/core/src/Cuda/Kokkos_CudaSpace.cpp:220
#13 0x0000000014caf074 in Kokkos::Impl::SharedAllocationRecord::~SharedAllocationRecord (this=0x5a8793e0, __in_chrg=) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/kokkos/core/src/Cuda/Kokkos_CudaSpace.cpp:400
#14 0x000000001077d7d8 in Kokkos::Impl::SharedAllocationRecord >::~SharedAllocationRecord (this=0x5a8793e0, __in_chrg=) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/kokkos/core/src/impl/Kokkos_SharedAlloc.hpp:214
#15 0x000000001077d828 in Kokkos::Impl::SharedAllocationRecord >::~SharedAllocationRecord (this=0x5a8793e0, __in_chrg=) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/kokkos/core/src/impl/Kokkos_SharedAlloc.hpp:214
#16 0x000000001077d8c4 in Kokkos::Impl::(anonymous namespace)::deallocate > (record_ptr=0x5a8793e0) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/kokkos/core/src/impl/Kokkos_SharedAlloc.hpp:198
#17 0x0000000014ca038c in Kokkos::Impl::SharedAllocationRecord::decrement (arg_record=0x5a8793e0) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/kokkos/core/src/impl/Kokkos_SharedAlloc.cpp:273
#18 0x000000001101a9b4 in ~SharedAllocationTracker (this=0x7fffea4de668, __in_chrg=) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/kokkos/core/src/impl/Kokkos_SharedAlloc.hpp:358
#19 Kokkos::View::~View (this=0x7fffea4de668, __in_chrg=) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/kokkos/core/src/Kokkos_View.hpp:1972
#20 0x000000001107673c in KokkosKernels::Impl::UniformMemoryPool::~UniformMemoryPool (this=0x7fffea4de620, __in_chrg=) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/kokkos-kernels/src/common/KokkosKernels_Uniform_Initialized_MemoryPool.hpp:238
#21 0x0000000012cfcc34 in KokkosSparse::Impl::KokkosSPGEMM, Kokkos::View, Kokkos::MemoryTraits<1u> >, Kokkos::View, Kokkos::MemoryTraits<1u> >, Kokkos::View, Kokkos::MemoryTraits<1u> >, Kokkos::View, Kokkos::MemoryTraits<1u> >, Kokkos::View, Kokkos::MemoryTraits<1u> >, Kokkos::View, Kokkos::MemoryTraits<1u> > >::KokkosSPGEMM_numeric_hash, Kokkos::MemoryTraits<1u> >, Kokkos::View, Kokkos::MemoryTraits<1u> >, Kokkos::View, Kokkos::MemoryTraits<1u> > > (this=0x7fffea4def58, rowmapC_=..., entriesC_=..., valuesC_=..., lcl_my_exec_space=KokkosKernels::Impl::Exec_CUDA) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/kokkos-kernels/src/sparse/impl/KokkosSparse_spgemm_impl_kkmem.hpp:1398
#22 0x0000000012cc35f8 in KokkosSparse::Impl::KokkosSPGEMM, Kokkos::View, Kokkos::MemoryTraits<1u> >, Kokkos::View, Kokkos::MemoryTraits<1u> >, Kokkos::View, Kokkos::MemoryTraits<1u> >, Kokkos::View, Kokkos::MemoryTraits<1u> >, Kokkos::View, Kokkos::MemoryTraits<1u> >, Kokkos::View, Kokkos::MemoryTraits<1u> > >::KokkosSPGEMM_numeric, Kokkos::MemoryTraits<1u> >, Kokkos::View, Kokkos::MemoryTraits<1u> >, Kokkos::View, Kokkos::MemoryTraits<1u> > > (this=0x7fffea4def58, rowmapC_=..., entriesC_=..., valuesC_=...) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/kokkos-kernels/src/sparse/impl/KokkosSparse_spgemm_impl_def.hpp:76
#23 0x0000000012c8382c in KokkosSparse::Impl::SPGEMM_NUMERIC, Kokkos::View, Kokkos::MemoryTraits<1u> >, Kokkos::View, Kokkos::MemoryTraits<1u> >, Kokkos::View, Kokkos::MemoryTraits<1u> >, Kokkos::View, Kokkos::MemoryTraits<1u> >, Kokkos::View, Kokkos::MemoryTraits<1u> >, Kokkos::View, Kokkos::MemoryTraits<1u> >, Kokkos::View, Kokkos::MemoryTraits<1u> >, Kokkos::View, Kokkos::MemoryTraits<1u> >, Kokkos::View, Kokkos::MemoryTraits<1u> >, false, false>::spgemm_numeric (handle=0x7fffea4df1c0, m=468, n=2500, k=486, row_mapA=..., entriesA=..., valuesA=..., transposeA=false, row_mapB=..., entriesB=..., valuesB=..., transposeB=false, row_mapC=..., entriesC=..., valuesC=...) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/kokkos-kernels/src/sparse/impl/KokkosSparse_spgemm_numeric_spec.hpp:286
#24 0x0000000012c59b04 in KokkosSparse::Experimental::spgemm_numeric, Kokkos::View, Kokkos::MemoryTraits<0u> >, Kokkos::View, Kokkos::MemoryTraits<0u> >, Kokkos::View, void>, Kokkos::View, Kokkos::MemoryTraits<0u> >, Kokkos::View, Kokkos::MemoryTraits<0u> >, Kokkos::View, void>, Kokkos::View, Kokkos::MemoryTraits<0u> >, Kokkos::View, Kokkos::MemoryTraits<0u> >, Kokkos::View, Kokkos::MemoryTraits<0u> > > (handle=0x7fffea4dfd18, m=468, n=2500, k=486, row_mapA=..., entriesA=..., valuesA=..., transposeA=false, row_mapB=..., entriesB=..., valuesB=..., transposeB=false, row_mapC=..., entriesC=..., valuesC=...) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/kokkos-kernels/src/sparse/KokkosSparse_spgemm_numeric.hpp:249
#25 0x0000000012c02c68 in Tpetra::MMdetails::KernelWrappers, Kokkos::View > >::mult_A_B_newmatrix_kernel_wrapper (Aview=..., Bview=..., Acol2Brow=..., Acol2Irow=..., Bcol2Ccol=..., Icol2Ccol=..., C=..., Cimport=..., label="Laplace2D: MueLu::R*(AP)-implicit-1", params=...) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/tpetra/core/ext/TpetraExt_MatrixMatrix_Cuda.hpp:219
#26 0x0000000012b6a4c0 in Tpetra::MMdetails::mult_A_B_newmatrix > (Aview=..., Bview=..., C=..., label="Laplace2D: MueLu::R*(AP)-implicit-1", params=...) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/tpetra/core/ext/TpetraExt_MatrixMatrix_def.hpp:2068
#27 0x0000000012ba3e00 in Tpetra::MMdetails::mult_AT_B_newmatrix > (A=..., B=..., C=..., label="Laplace2D: MueLu::R*(AP)-implicit-1", params=...) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/tpetra/core/ext/TpetraExt_MatrixMatrix_def.hpp:1618
#28 0x0000000012b394dc in Tpetra::MatrixMatrix::Multiply > (A=..., transposeA=true, B=..., transposeB=false, C=..., call_FillComplete_on_result=true, label="Laplace2D: MueLu::R*(AP)-implicit-1", params=...) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/tpetra/core/ext/TpetraExt_MatrixMatrix_def.hpp:253
#29 0x00000000106baa60 in Xpetra::MatrixMatrix >::Multiply (A=..., transposeA=true, B=..., transposeB=false, C=..., call_FillComplete_on_result=true, doOptimizeStorage=true, label="Laplace2D: MueLu::R*(AP)-implicit-1", params=...) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/xpetra/sup/Utils/Xpetra_MatrixMatrix.hpp:283
#30 0x00000000106b6780 in Xpetra::MatrixMatrix >::Multiply (A=..., transposeA=true, B=..., transposeB=false, C_in=..., fos=..., doFillComplete=true, doOptimizeStorage=true, label="Laplace2D: MueLu::R*(AP)-implicit-1", params=...) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/xpetra/sup/Utils/Xpetra_MatrixMatrix.hpp:366
#31 0x0000000010abe774 in MueLu::RAPFactory >::Build (this=0x5a83b550, fineLevel=..., coarseLevel=...) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/muelu/src/Misc/MueLu_RAPFactory_def.hpp:202
#32 0x0000000010265d44 in MueLu::TwoLevelFactoryBase::CallBuild (this=0x5a83b550, requestedLevel=...) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/muelu/src/Interface/../MueCentral/MueLu_TwoLevelFactoryBase.hpp:151
#33 0x00000000103d6f8c in MueLu::Level::Get > > > (this=0x5a85fd10, ename="A", factory=0x5a83b550) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/muelu/src/Interface/../MueCentral/MueLu_Level.hpp:203
#34 0x00000000104ea8b8 in MueLu::Factory::Get > > > (this=0x5a8072b0, level=..., varName="A") at /ascldap/users/jhu/software/trilinos/Trilinos/packages/muelu/src/Interface/../MueCentral/MueLu_Factory.hpp:156
#35 0x0000000010b6294c in MueLu::RepartitionFactory >::Build (this=0x5a8072b0, currentLevel=...) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/muelu/src/Rebalancing/MueLu_RepartitionFactory_def.hpp:118
#36 0x00000000102666f0 in MueLu::SingleLevelFactoryBase::CallBuild (this=0x5a8072b0, requestedLevel=...) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/muelu/src/Interface/../MueCentral/MueLu_SingleLevelFactoryBase.hpp:133
#37 0x00000000108b8f2c in MueLu::Level::Get > const> > (this=0x5a85fd10, ename="Importer", factory=0x5a8072b0) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/muelu/src/MueCentral/MueLu_Level.hpp:203
#38 0x0000000010ae2690 in MueLu::Factory::Get > const> > (this=0x5a62be40, level=..., varName="Importer") at /ascldap/users/jhu/software/trilinos/Trilinos/packages/muelu/src/MueCentral/MueLu_Factory.hpp:156
#39 0x0000000010b28ce8 in MueLu::RebalanceTransferFactory >::Build (this=0x5a62be40, fineLevel=..., coarseLevel=...) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/muelu/src/Rebalancing/MueLu_RebalanceTransferFactory_def.hpp:141
#40 0x0000000010265d44 in MueLu::TwoLevelFactoryBase::CallBuild (this=0x5a62be40, requestedLevel=...) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/muelu/src/Interface/../MueCentral/MueLu_TwoLevelFactoryBase.hpp:151
#41 0x000000001037a1a4 in MueLu::Level::Get > > > (this=0x5a85fd10, ename="P", factory=0x5a62be40) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/muelu/src/Interface/../MueCentral/MueLu_Level.hpp:203
#42 0x0000000010c7bf00 in MueLu::TopRAPFactory >::Build (this=0x7fffea4ea910, coarseLevel=...) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/muelu/src/MueCentral/MueLu_TopRAPFactory_def.hpp:57
#43 0x00000000108a81c4 in MueLu::Hierarchy >::Setup (this=0x5a7b9ed0, coarseLevelID=1, fineLevelManager=..., coarseLevelManager=..., nextLevelManager=...) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/muelu/src/MueCentral/MueLu_Hierarchy_def.hpp:377
#44 0x000000001032587c in MueLu::HierarchyManager >::SetupHierarchy (this=0x5a74acb0, H=...) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/muelu/src/Interface/MueLu_HierarchyManager.hpp:238
#45 0x00000000102ba33c in MueLu::ParameterListInterpreter >::SetupHierarchy (this=0x5a74acb0, H=...) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/muelu/src/Interface/MueLu_ParameterListInterpreter_def.hpp:2203
#46 0x00000000100dd6b0 in MueLu::CreateXpetraPreconditioner > (op=..., inParamList=...) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/muelu/adapters/xpetra/MueLu_CreateXpetraPreconditioner.hpp:97
#47 0x00000000100c3548 in PreconditionerSetup > (A=..., coordinates=..., nullspace=..., mueluList=..., profileSetup=false, useAMGX=false, useML=false, numRebuilds=0, H=..., Prec=...) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/muelu/test/scaling/DriverCore.hpp:210
#48 0x00000000100af338 in main_ > (clp=..., lib=@0x7fffea4ed914: Xpetra::UseTpetra, argc=2, argv=0x7fffea4ee0a8) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/muelu/test/scaling/Driver.cpp:461
#49 0x000000001009f414 in Automatic_Test_ETI (argc=2, argv=0x7fffea4ee0a8) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/muelu/test/scaling/../unit_tests/MueLu_Test_ETI.hpp:160
#50 0x00000000100a04b4 in main (argc=2, argv=0x7fffea4ee0a8) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/muelu/test/scaling/Driver.cpp:597
```
kddevin commented 5 years ago

@jhux2 How do you produce the second and third type of error above once you modify the fixed hash table?

jhux2 commented 5 years ago

@kddevin Here are some instructions on reproducing the type 2/3 errors on waterman:

0001-MueLu-DO-NOT-COMMIT.txt

kddevin commented 5 years ago

I submitted PR #5715 that provides a temporary workaround for the most reproducible of these errors. We'll reverse the workaround as we understand the issue better. Until then, it may be worth trying in the application to determine whether it allows the application to make progress. @bathmatt

kddevin commented 5 years ago

@bartlettroscoe Is there an easy way for me to build these tests with a different compiler and/or CUDA version on waterman? Since the tests pass with many other compilers on other platforms, I'd just like to try them on waterman with a different configuration. Thanks.

bartlettroscoe commented 5 years ago

@kddevin asked:

Is there an easy way for me to build these tests with a different compiler and/or CUDA version on waterman?

You can edit the files under:

Trilinos/cmake/std/atdm/waterman/

locally and put in whatever you want.

I am also working on #4933 that will allow you to set up and load any env you want.

Just be warned that you may have issues with the TPLs needed if you pick a compiler or options that don't work with the TPLs already installed. We were hoping to be able to do that with Spack but it has been going very slowly and we can't do that yet on 'waterman'.

kddevin commented 5 years ago

@bartlettroscoe Thanks. I am terrible at getting all the TPLs, etc., aligned. I was hoping that you had, say, just one other configuration that you knew "worked" and that I could load easily. I'm not picky, as long as it is different from the one used here. Do you have anything like that?

bartlettroscoe commented 5 years ago

@kddevin, specifically, what compilers/configurations do you want to try that are not already supported in:

Trilinos/cmake/std/atdm/waterman/environment.sh

?

We don't test or support many different configurations because builds are expensive, and the APPs don't need them. We only try to support just what the APPs need (and even struggle with just that).

jhux2 commented 5 years ago

@kddevin I talked with @crtrott, who recommended this option:

-DKokkos_ENABLE_Profiling:ON

It's explicitly turned off on the dashboard. To enable it, you must toggle its value in /ascldap/users/jhu/software/trilinos/Trilinos/cmake/std/atdm/ATDMDevEnvSettings.cmake.

With this option enabled, I'm seeing an assertion right away in FixedHashTable.

I was mistaken and did not have the FixedHashTable patch applied. With the patch applied, I've not seen any assertions yet.

crtrott commented 5 years ago

I think what you are looking at is misleading. This is all just delayed error checking for CUDA kernels. When using the profiling tool on the thing Jonathan ran, it crashes on two ranks with an illegal memory access inside of KokkosKernels SPGEMM inside the Laplace2D MueLu setup:

[1,1]<stdout>:KokkosP: Allocate<CudaUVM> name: entriesC pointer: 0x7fff0002a080 size: 8128
[1,1]<stdout>:KokkosP: Allocate<CudaUVM> name: valuesC pointer: 0x7fff00030280 size: 16256
[1,1]<stdout>:KokkosP: Allocate<CudaUVM> name: pool data pointer: 0x7fff00a00080 size: 6029312
[1,1]<stdout>:KokkosP: Allocate<CudaUVM> name: locks pointer: 0x7fff00815080 size: 131072
[1,1]<stdout>:KokkosP: Executing parallel-for kernel on device 0 with unique execution identifier 1137
[1,1]<stdout>:KokkosP: Driver: S - Global Time
[1,1]<stdout>:KokkosP:   timername
[1,1]<stdout>:KokkosP:     MueLu setup time (Laplace2D)
[1,1]<stdout>:KokkosP:       Kokkos::View::initialization
[1,1]<stdout>:KokkosP: Execution of kernel 1137 is completed.
[1,1]<stdout>:KokkosP: Executing parallel-for kernel on device 0 with unique execution identifier 1138
[1,1]<stdout>:KokkosP: Driver: S - Global Time
[1,1]<stdout>:KokkosP:   timername
[1,1]<stdout>:KokkosP:     MueLu setup time (Laplace2D)
[1,1]<stdout>:KokkosP:       Kokkos::ViewFill-1D
[1,1]<stdout>:KokkosP: Execution of kernel 1138 is completed.
[1,1]<stdout>:KokkosP: Executing parallel-for kernel on device 0 with unique execution identifier 1139
[1,1]<stdout>:KokkosP: Driver: S - Global Time
[1,1]<stdout>:KokkosP:   timername
[1,1]<stdout>:KokkosP:     MueLu setup time (Laplace2D)
[1,1]<stdout>:KokkosP:       KOKKOSPARSE::SPGEMM::SPGEMM_KK_MEMORY
[1,1]<stderr>:terminate called after throwing an instance of 'std::runtime_error'
[1,1]<stderr>:  what():  cudaDeviceSynchronize() error( cudaErrorIllegalAddress): an illegal memory access was encountered /ascldap/users/jhu/software/trilinos/Trilinos/packages/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:120
[1,1]<stderr>:Traceback functionality not available
[1,1]<stderr>:
[1,1]<stderr>:[waterman1:136183] *** Process received signal ***
[1,1]<stderr>:[waterman1:136183] Signal: Aborted (6)
[1,1]<stderr>:[waterman1:136183] Signal code:  (-6)
[1,1]<stderr>:[waterman1:136183] [ 0] [0x7fff834604d8]
[1,1]<stderr>:[waterman1:136183] [ 1] [1,1]<stderr>:/lib64/libc.so.6(abort+0x2b4)[0x7fff75b91f94]
[1,1]<stderr>:[waterman1:136183] [ 2] [1,1]<stderr>:/home/projects/ppc64le/gcc/7.2.0/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x1c4)[0x7fff75e30774]
[1,1]<stderr>:[waterman1:136183] [ 3] /home/projects/ppc64le/gcc/7.2.0/lib64/libstdc++.so.6(+0xab504)[0x7fff75e2b504]
[1,1]<stderr>:[waterman1:136183] [ 4] /home/projects/ppc64le/gcc/7.2.0/lib64/libstdc++.so.6(+0xa9928)[0x7fff75e29928]
[1,1]<stderr>:[waterman1:136183] [ 5] [1,1]<stderr>:/home/projects/ppc64le/gcc/7.2.0/lib64/libstdc++.so.6(__gxx_personality_v0+0x52c)[0x7fff75e2aaec]
[1,1]<stderr>:[waterman1:136183] [ 6] /home/projects/ppc64le/gcc/7.2.0/lib64/libgcc_s.so.1(+0xc084)[0x7fff75d4c084]
[1,1]<stderr>:[waterman1:136183] [ 7] /home/projects/ppc64le/gcc/7.2.0/lib64/libgcc_s.so.1(_Unwind_Resume+0x174)[0x7fff75d4cc04]
[1,1]<stderr>:[waterman1:136183] [1,1]<stderr>:[ 8] ./MueLu_Driver.exe-prof[0x14f32be0]
[1,1]<stderr>:[waterman1:136183] [ 9] ./MueLu_Driver.exe-prof[0x100b83bc]
[1,1]<stderr>:[waterman1:136183] [10] ./MueLu_Driver.exe-prof[0x14f329e8]
[1,1]<stderr>:[waterman1:136183] [11] ./MueLu_Driver.exe-prof[0x14f35750]
[1,1]<stderr>:[waterman1:136183] [12] ./MueLu_Driver.exe-prof[0x14f389d0]
[1,1]<stderr>:[waterman1:136183] [13] ./MueLu_Driver.exe-prof[0x10798410]

I am running with:

export KOKKOS_NUM_DEVICES=1
mpiexec -np 8 --tag-output ./MueLu_Driver.exe-prof --linAlgebra=Tpetra --nx=100 --ny=100 --xml=sa_with_ilu.xml --notimings

crtrott commented 5 years ago

Now I also see the FixedHashTable error occasionally.

[1,4]<stdout>:KokkosP: Driver: S - Global Time
[1,4]<stdout>:KokkosP:   Driver: 1 - Matrix Build
[1,4]<stdout>:KokkosP:     Kokkos::View::initialization
[1,4]<stdout>:KokkosP: Execution of kernel 0 is completed.
[1,4]<stdout>:KokkosP: Allocate<CudaUVM> name: nonContigGids pointer: 0x7fff20002880 size: 9600
[1,4]<stdout>:KokkosP: Executing parallel-for kernel on device 0 with unique execution identifier 1
[1,4]<stdout>:KokkosP: Driver: S - Global Time
[1,4]<stdout>:KokkosP:   Driver: 1 - Matrix Build
[1,4]<stdout>:KokkosP:     Kokkos::View::initialization
[1,4]<stdout>:KokkosP: Execution of kernel 1 is completed.
[1,4]<stdout>:KokkosP: Allocate<CudaUVM> name: FixedHashTable::counts pointer: 0x7fff20004e80 size: 6176
[1,4]<stdout>:KokkosP: Executing parallel-for kernel on device 0 with unique execution identifier 2
[1,4]<stdout>:KokkosP: Driver: S - Global Time
[1,4]<stdout>:KokkosP:   Driver: 1 - Matrix Build
[1,4]<stdout>:KokkosP:     Kokkos::View::initialization
[1,4]<stdout>:KokkosP: Execution of kernel 2 is completed.
[1,4]<stdout>:KokkosP: Executing parallel-for kernel on device 0 with unique execution identifier 3
[1,4]<stdout>:KokkosP: Driver: S - Global Time
[1,4]<stdout>:KokkosP:   Driver: 1 - Matrix Build
[1,4]<stdout>:KokkosP:     N6Tpetra7Details3FHT12CountBucketsIN6Kokkos4ViewIPiJNS3_10LayoutLeftENS3_6DeviceINS3_4CudaENS3_12CudaUVMSpaceEEENS3_12MemoryTraitsILj0EEEEEENS4_IPKxJS6_SA_EEEjEE
[1,4]<stdout>:KokkosP: Execution of kernel 3 is completed.
[1,4]<stdout>:KokkosP: Allocate<CudaUVM> name: FixedHashTable::ptr pointer: 0x7fff20006880 size: 6176
[1,4]<stdout>:KokkosP: Executing parallel-for kernel on device 0 with unique execution identifier 4
[1,4]<stdout>:KokkosP: Driver: S - Global Time
[1,4]<stdout>:KokkosP:   Driver: 1 - Matrix Build
[1,4]<stdout>:KokkosP:     Kokkos::View::initialization
[1,4]<stdout>:KokkosP: Execution of kernel 4 is completed.
[1,4]<stdout>:KokkosP: Executing parallel-scan kernel on device 0 with unique execution identifier 5
[1,4]<stdout>:KokkosP: Driver: S - Global Time
[1,4]<stdout>:KokkosP:   Driver: 1 - Matrix Build
[1,4]<stdout>:KokkosP:     Tpetra::Details::computeOffsetsFromCounts
[1,4]<stdout>:KokkosP: Execution of kernel 5 is completed.
[1,4]<stdout>:KokkosP: Allocate<CudaUVM> name: FixedHashTable::pairs pointer: 0x7fff20200080 size: 18416
[1,4]<stdout>:KokkosP: Executing parallel-reduce kernel on device 0 with unique execution identifier 6
[1,4]<stdout>:KokkosP: Driver: S - Global Time
[1,4]<stdout>:KokkosP:   Driver: 1 - Matrix Build
[1,4]<stdout>:KokkosP:     N6Tpetra7Details3FHT9FillPairsIN6Kokkos4ViewIPNS3_4pairIxiEEJNS3_10LayoutLeftENS3_6DeviceINS3_4CudaENS3_12CudaUVMSpaceEEENS3_12MemoryTraitsILj0EEEEEENS4_IPKxJS8_SC_EEENS4_IPiJS8_SC_SE_EEEjEE
[1,4]<stderr>::0: : block: [4,0,0], thread: [0,224,0] Assertion `View bounds error of view FixedHashTable::pairs[1,4]<stderr>:` failed.

crtrott commented 5 years ago

Note that if the failure happens inside of the FillPairs thing for the FixedHash, it never reaches the SPGEMM. So this may be that something goes wrong in that earlier thing, and then manifests later in SPGEMM as an invalid indexing.
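To see which kernel each rank was in when the run died, one option is to scan the KokkosP output for kernels that were announced but never reported as completed. This is only a minimal sketch: the regular expressions are taken from the log lines quoted above, `unfinished_kernels` is a hypothetical helper name, and it assumes the `[job,rank]<stdout>:` prefix that `mpiexec --tag-output` adds. Other profiling tools or Kokkos versions may print these lines differently.

```python
import re

# Prefix added by `mpiexec --tag-output`, e.g. "[1,4]<stdout>:" (job 1, rank 4).
RANK = re.compile(r"^\[\d+,(\d+)\]<std(?:out|err)>:")
# Kernel begin/end lines as they appear in the KokkosP output quoted above.
BEGIN = re.compile(
    r"KokkosP: Executing parallel-(?:for|reduce|scan) kernel on device \d+ "
    r"with unique execution identifier (\d+)"
)
END = re.compile(r"KokkosP: Execution of kernel (\d+) is completed\.")


def unfinished_kernels(lines):
    """Map each MPI rank to the kernel ids it started but never completed.

    Lines without a rank prefix are filed under rank -1.
    """
    pending = {}
    for line in lines:
        tag = RANK.match(line)
        rank = int(tag.group(1)) if tag else -1
        begin = BEGIN.search(line)
        end = END.search(line)
        if begin:
            pending.setdefault(rank, []).append(int(begin.group(1)))
        elif end:
            kernel_id = int(end.group(1))
            if kernel_id in pending.get(rank, []):
                pending[rank].remove(kernel_id)
    # Keep only ranks that still have an open kernel.
    return {rank: ids for rank, ids in pending.items() if ids}
```

A kernel reported here is where the error surfaced on that rank, which is not necessarily where it originated.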

jhux2 commented 5 years ago

@crtrott wrote

Now I also see the FixedHashTable error occasionally.

My screw-up, I built without the patch applied. Please recopy the exec from my build area.

jhux2 commented 5 years ago

I think what you are looking at is misleading. This is all just delayed error checking for CUDA kernels. When using the profiling tool on the thing Jonathan ran, it crashes on two ranks with an illegal memory access inside of KokkosKernels SPGEMM inside the Laplace2D MueLu setup:

Ah -- that's real and consistent with the so-called error type 3 above.

crtrott commented 5 years ago

My gut feeling is that something goes wrong somewhere and as a consequence some column indices in the local matrix are off, which down the line leads to memory access faults. Do we actually understand why the FixedHash thingy failed? Or did we just see a fast way to not trigger that code? Because if we don't understand why it failed, it might just be a symptom of the same root cause. And since in one of my tests without the patch the FixedHash thing failed before any SPGEMM got called, it might be easier to use that as the debugging starting point.
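One cheap way to test the "column indices are off" hypothesis would be to validate the local matrix's column indices before it reaches SPGEMM. The sketch below does this over plain CSR arrays in Python. The names `find_bad_column_indices`, `row_ptr`, `col_idx`, and `num_cols` are hypothetical and not Tpetra or KokkosKernels API; it assumes the local graph has been copied out to host arrays.

```python
def find_bad_column_indices(row_ptr, col_idx, num_cols):
    """Return (row, position, column) for every out-of-range column index.

    row_ptr has one entry per row plus a final end offset (CSR layout);
    col_idx holds the local column index of each stored entry.
    """
    bad = []
    for row in range(len(row_ptr) - 1):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            col = col_idx[k]
            # A valid local column index lies in [0, num_cols).
            if col < 0 or col >= num_cols:
                bad.append((row, k, col))
    return bad
```

An empty result would not rule out the hypothesis (an index can be in range and still wrong), but a non-empty one would point at the stage that produced the bad entry.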

jhux2 commented 5 years ago

Do we actually understand why the FixedHash thingy failed? Or did we just see a fast way to not trigger that code?

That's a fair point. I don't know why the FixedHashTable occasionally fails. The patch is simply to avoid the path that triggers the type 1 error.

@bathmatt is still seeing failures in EMPIRE, even with Trilinos that includes the patch for the type 1 error.

mhoemmen commented 5 years ago

My gut feeling is that something goes wrong somewhere and as a consequence some column indices in the local matrix are off, which down the line leads to memory access faults.

^^^ we could extract a unit test from EMPIRE just to make sure.

kddevin commented 5 years ago

PR #5735 is a more effective workaround for this issue. @trilinos/tpetra still has some investigating to do, but this workaround seems robust enough for experimentation by @bathmatt.

Thanks, @jhux2. @srajama1

srajama1 commented 5 years ago

@kddevin @jhux2 : Thank you ! I appreciate your help.

bathmatt commented 5 years ago

I'm seeing errors in some map code, trying to track it down. Not sure why it is happening. It is in an initialization step and hopefully I'll be able to trace it down to something simple.

bathmatt commented 5 years ago

Status update: I don't believe this is in a map; I believe it is something in MPI.

@kddevin's patches got me further, to a similar-looking bug, but now it looks like what is being sent isn't what is being received. I'll keep you posted on this though.

jhux2 commented 5 years ago

I'm running MueLu_Driver.exe on vortex with 8 MPI ranks (single node) using the branch that @kddevin, @mhoemmen, @crtrott, and I have been debugging with. None of the work-around ifdef's are enabled, i.e., this should be running stock Tpetra code. No errors yet, but I'll let it continue.

bathmatt commented 5 years ago

Do you have CUDA_MPI turned on???

jhux2 commented 5 years ago

If you mean Tpetra_ASSUME_CUDA_AWARE_MPI, that variable is set to no.

jhux2 commented 5 years ago

@mhoemmen Is there any way to toggle that from the command line, or must I reconfigure?

jhux2 commented 5 years ago

I'm running MueLu_Driver.exe on vortex with 8 MPI ranks (single node) using the branch that @kddevin, @mhoemmen, @crtrott, and I have been debugging with. None of the work-around ifdef's are enabled, i.e., this should be running stock Tpetra code. No errors yet, but I'll let it continue.

I started a second job on two nodes and got it to fail after about 50 minutes. My runline is

jsrun -r4 -a1 -c4 -g1 -brs ./MueLu_Driver.exe --xml=sa_with_ilu.xml --notimings

This build has none of the temporary work-arounds enabled.

Update: The build is with -DTpetra_ASSUME_CUDA_AWARE_MPI:FALSE.

jhux2 commented 5 years ago

Unfortunately, there's no associated core file, so I have no idea where this crash happened.

mhoemmen commented 5 years ago

@jhux2 wrote:

Is there any way to toggle that from the command line, or must I reconfigure?

Yes, you can set this at run time. Set the TPETRA_ASSUME_CUDA_AWARE_MPI environment variable to 1 (or 0, if you want it off).
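Since the variable is read at run time, the same executable can be launched with both settings without reconfiguring. A minimal sketch of a launcher helper follows; `env_with_cuda_aware_mpi` is a hypothetical name, and the only facts taken from this thread are the variable name and its 1/0 values.

```python
import os


def env_with_cuda_aware_mpi(assume_cuda_aware, base_env=None):
    """Return a copy of the environment with TPETRA_ASSUME_CUDA_AWARE_MPI set.

    The caller's environment (or base_env) is left unmodified.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["TPETRA_ASSUME_CUDA_AWARE_MPI"] = "1" if assume_cuda_aware else "0"
    return env


# Usage sketch (not executed here), reusing the jsrun line from this thread:
#   import subprocess
#   cmd = ["jsrun", "-r4", "-a1", "-c4", "-g1", "-brs", "./MueLu_Driver.exe",
#          "--xml=sa_with_ilu.xml", "--notimings"]
#   for flag in (False, True):
#       subprocess.run(cmd, env=env_with_cuda_aware_mpi(flag), check=False)
```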

jjellio commented 5 years ago

@jhux2

Is Vortex dumping lwcore files? If it does, those are just text files, and you can look inside to see a stack trace.

jhux2 commented 5 years ago

Do you have CUDA_MPI turned on???

@bathmatt Should I set TPETRA_ASSUME_CUDA_AWARE_MPI to be true?

jhux2 commented 5 years ago

Is Vortex dumping lwcore files? If it does, those are just text files, and you can look inside to see a stack trace.

There are no lwcore files in my run directory. Is there an LSF directive that might control this? Could they be elsewhere?

jhux2 commented 5 years ago

@mhoemmen Thanks, but I rebuilt a separate exec before I saw your response :(. The good news is that this guy drops core!

Backtrace, vortex, with -DTPETRA_ASSUME_CUDA_AWARE_MPI:BOOL=TRUE

```
#0 0x000020000e02b14c in __memcpy_power7 () from /lib64/libc.so.6
#1 0x000020000fb2f8cc in PAMI::Device::Shmem::Packet >::writePayload(PAMI::Fifo::FifoPacket<64u, 4096u>&, iovec*, unsigned long) () from /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib/pami_port/libpami.so.3
#2 0x000020000fb4f8b8 in bool PAMI::Device::Interface::PacketModel, PAMI::Counter::IndirectBounded, 256u>, PAMI::Counter::Indirect, PAMI::Device::Shmem::CMAShaddr, 256u, 512u> > >::postPacket<2u>(unsigned long, unsigned long, void*, unsigned long, iovec (&) [2u]) () from /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib/pami_port/libpami.so.3
#3 0x000020000fb504c4 in PAMI::Protocol::Send::EagerSimple, PAMI::Counter::IndirectBounded, 256u>, PAMI::Counter::Indirect, PAMI::Device::Shmem::CMAShaddr, 256u, 512u> >, (PAMI::Protocol::Send::configuration_t)5>::immediate_impl(pami_send_immediate_t*) () from /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib/pami_port/libpami.so.3
#4 0x000020000fb5084c in PAMI::Protocol::Send::Eager, PAMI::Counter::IndirectBounded, 256u>, PAMI::Counter::Indirect, PAMI::Device::Shmem::CMAShaddr, 256u, 512u> >, PAMI::Device::IBV::PacketModel >::EagerImpl<(PAMI::Protocol::Send::configuration_t)5, true>::immediate(pami_send_immediate_t*) () from /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib/pami_port/libpami.so.3
#5 0x000020000fa915b4 in PAMI_Send_immediate () from /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib/pami_port/libpami.so.3
#6 0x000020000f8dfa8c in mca_pml_pami_send () from /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib/spectrum_mpi/mca_pml_pami.so
#7 0x000020000dbaf918 in PMPI_Send () from /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib/libmpi_ibm.so.3
#8 0x0000000014cad1c4 in Teuchos::(anonymous namespace)::sendImpl (sendBuffer=0x20007f200480, count=25, destRank=1, tag=0, comm=...) at /ascldap/users/jhu/trilinos/Trilinos/packages/teuchos/comm/src/Teuchos_CommHelpers.cpp:705
#9 0x0000000014ca8af0 in Teuchos::send (sendBuffer=0x20007f200480, count=25, destRank=1, tag=0, comm=...) at /ascldap/users/jhu/trilinos/Trilinos/packages/teuchos/comm/src/Teuchos_CommHelpers.cpp:1032
#10 0x00000000144bb734 in Teuchos::send, Kokkos::MemoryTraits<0u> > > (sendBuffer=..., count=25, destRank=1, tag=0, comm=...) at /ascldap/users/jhu/trilinos/Trilinos/packages/teuchos/kokkoscomm/src/Kokkos_TeuchosCommAdapters.hpp:71
#11 0x00000000144b5fb0 in Tpetra::Distributor::doPosts, Kokkos::MemoryTraits<0u> >, Kokkos::View, void, void> > (this=0x3fda61d0, exports=..., numPackets=1, imports=...) at /ascldap/users/jhu/trilinos/Trilinos/packages/tpetra/core/src/Tpetra_Distributor.hpp:2329
#12 0x00000000144a676c in Tpetra::Distributor::doPostsAndWaits, Kokkos::MemoryTraits<0u> >, Kokkos::View, void, void> > (this=0x3fda61d0, exports=..., numPackets=1, imports=...) at /ascldap/users/jhu/trilinos/Trilinos/packages/tpetra/core/src/Tpetra_Distributor.hpp:2024
#13 0x000000001456d4b0 in Tpetra::DistObject >::doTransferNew (this=0x3ff08d90, src=..., CM=Tpetra::INSERT, numSameIDs=1250, permuteToLIDs=..., permuteFromLIDs=..., remoteLIDs=..., exportLIDs=..., distor=..., revOp=Tpetra::DistObject >::DoForward, commOnHost=false, restrictedMode=false) at /ascldap/users/jhu/trilinos/Trilinos/packages/tpetra/core/src/Tpetra_DistObject_def.hpp:1243
#14 0x00000000145662bc in Tpetra::DistObject >::doTransfer (this=0x3ff08d90, src=..., transfer=..., modeString=0x7ffff74d6fc8 "doImport (forward mode)", revOp=Tpetra::DistObject >::DoForward, CM=Tpetra::INSERT, restrictedMode=false) at /ascldap/users/jhu/trilinos/Trilinos/packages/tpetra/core/src/Tpetra_DistObject_def.hpp:606
#15 0x000000001456255c in Tpetra::DistObject >::doImport (this=0x3ff08d90, source=..., importer=..., CM=Tpetra::INSERT, restrictedMode=false) at /ascldap/users/jhu/trilinos/Trilinos/packages/tpetra/core/src/Tpetra_DistObject_def.hpp:305
#16 0x00000000141db988 in Tpetra::CrsMatrix >::applyNonTranspose (this=0x3fcf4710, X_in=..., Y_in=..., alpha=1, beta=0) at /ascldap/users/jhu/trilinos/Trilinos/packages/tpetra/core/src/Tpetra_CrsMatrix_def.hpp:5440
#17 0x00000000141afad0 in Tpetra::CrsMatrix >::apply (this=0x3fcf4710, X=..., Y=..., mode=Teuchos::NO_TRANS, alpha=1, beta=0) at /ascldap/users/jhu/trilinos/Trilinos/packages/tpetra/core/src/Tpetra_CrsMatrix_def.hpp:5745
#18 0x0000000012e698ac in Xpetra::TpetraCrsMatrix >::apply (this=0x3fd10ca0, X=..., Y=..., mode=Teuchos::NO_TRANS, alpha=1, beta=0) at /ascldap/users/jhu/trilinos/Trilinos/packages/xpetra/src/CrsMatrix/Xpetra_TpetraCrsMatrix_def.hpp:335
#19 0x0000000012e2a8d0 in Xpetra::CrsMatrixWrap >::apply (this=0x3fce8280, X=..., Y=..., mode=Teuchos::NO_TRANS, alpha=1, beta=0) at /ascldap/users/jhu/trilinos/Trilinos/packages/xpetra/sup/Matrix/Xpetra_CrsMatrixWrap_def.hpp:340
#20 0x00000000100ed184 in MatrixLoad > (comm=..., lib=@0x7ffff74d90fc: Xpetra::UseTpetra, binaryFormat=false, matrixFile="", rhsFile="", rowMapFile="", colMapFile="", domainMapFile="", rangeMapFile="", coordFile="", nullFile="", map=..., A=..., coordinates=..., nullspace=..., X=..., B=..., numVectors=1, galeriParameters=..., xpetraParameters=..., galeriStream=...) at /ascldap/users/jhu/trilinos/Trilinos/packages/muelu/test/scaling/MatrixLoad.hpp:207
#21 0x00000000100d2fd8 in main_ > (clp=..., lib=@0x7ffff74d90fc: Xpetra::UseTpetra, argc=3, argv=0x7ffff74d98b8) at /ascldap/users/jhu/trilinos/Trilinos/packages/muelu/test/scaling/Driver.cpp:353
#22 0x00000000100bb870 in Automatic_Test_ETI (argc=3, argv=0x7ffff74d98b8) at /ascldap/users/jhu/trilinos/Trilinos/packages/muelu/test/scaling/../unit_tests/MueLu_Test_ETI.hpp:160
#23 0x00000000100bca00 in main (argc=3, argv=0x7ffff74d98b8) at /ascldap/users/jhu/trilinos/Trilinos/packages/muelu/test/scaling/Driver.cpp:613
```

mhoemmen commented 5 years ago

@jhux2 It looks like this Spectrum MPI wasn't built with CUDA support. Does it have the equivalent of ompi_info? If so, could you query it to see if it has the correct CUDA support?

jhux2 commented 5 years ago

Ok, heard back that @bathmatt does not have TPETRA_ASSUME_CUDA_AWARE_MPI enabled.

jhux2 commented 5 years ago

@mhoemmen Is this what you mean?

(/vscratch1/jhu/lets-dump-core) ompi_info | grep -i cuda
          MPI extensions: affinity, cuda

mhoemmen commented 5 years ago

@jhux2 That could be, though I'm not sure whether that's the right ompi_info executable for Spectrum MPI. Tpetra uses the following command with OpenMPI:

ompi_info --parsable --all | grep "mpi_built_with_cuda_support:value"

and it should print something like this:

mca:mpi:base:param:mpi_built_with_cuda_support:value:true

The value is "false" instead of "true" if that installation of OpenMPI was not built with CUDA support.
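The same check can be scripted. The sketch below parses the parsable `ompi_info` output for the key shown above; `parse_cuda_support` and `query_cuda_support` are hypothetical helper names, and whether Spectrum MPI ships an `ompi_info` that reports this parameter is an assumption that has not been verified in this thread.

```python
import subprocess

# Key prefix taken from the ompi_info output line quoted above.
KEY = "mca:mpi:base:param:mpi_built_with_cuda_support:value:"


def parse_cuda_support(ompi_info_output):
    """Return True/False from `ompi_info --parsable --all` text.

    Returns None when the parameter is not reported at all.
    """
    for line in ompi_info_output.splitlines():
        line = line.strip()
        if line.startswith(KEY):
            return line[len(KEY):] == "true"
    return None


def query_cuda_support(executable="ompi_info"):
    """Run ompi_info (must be on PATH) and parse its output."""
    result = subprocess.run(
        [executable, "--parsable", "--all"],
        capture_output=True, text=True, check=True,
    )
    return parse_cuda_support(result.stdout)
```

A `None` result means the parameter was not found, which is different from a confirmed `False`; the shorter `grep -i cuda` output shown earlier would fall into that case.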