Closed fryeguy52 closed 4 years ago
EMPIRE was seeing the integer divide by 0, it was when creating a view, there was a long discussino on the kokkos IM channel. It came and went and I believe was a cuda compiler bug. DavidH was looking at it as well
We can follow up with NVIDIA if we have an example to reproduce this.
@srajama1
We can follow up with NVIDIA if we have an example to reproduce this.
Does someone not need to isolate the code in Trilinos that is triggering this first before someone can create a reproducer for NVIDA? Currently I don't think we know what code is triggering this. Having stack traces should be a good start.
I'm having a similar issue with (Tpetra::CrsGraph<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::fillComplete) but its not random. Its every time. I reduced the size of the mesh and fillComplete did not throw an error.
@kddevin @srajama1 - @bathmatt ping you on a few issues. This is the original ticket with more information about the failures across the packages.
We're stuck on trying to get a TPL into Kokkos. I need some advice from someone like @ibaned or @ndellingwood on that.
Is there a Kokkos issue corresponding to that ?
There's a Kokkos PR: https://github.com/kokkos/kokkos/pull/2226 . We don't want to merge just yet because backtrace*
are not POSIX standard functions. This is why we need to figure out how to add a Kokkos TPL. The usual TriBITS TPL mechanism doesn't work, because KokkosCore_config.h
doesn't use the usual TriBITS header file generation process.
The MueLu tests MueLu_DriverTpetraILU_MPI_4
, MueLu_DriverTpetra_WithGlobalConstants_MPI_4
, and MueLu_UnitTestsTpetra_MPI_4
are failing randomly and frequently with the error
terminate called after throwing an instance of 'std::runtime_error'
what(): cudaDeviceSynchronize() error( cudaErrorIllegalAddress): an illegal memory access was encountered /home/jenkins/waterman/workspace/Trilinos-atdm-waterman-cuda-9.2-debug/SRC_AND_BUILD/Trilinos/packages/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:120
Traceback functionality not available
There are a few other MueLu tests that fail the same way, but less frequently.
Sometimes MueLu_UnitTestsTpetra_MPI_4
fails instead with an error like
:0: : block: [9,0,0], thread: [0,224,0] Assertion `View bounds error of view FixedHashTable::pairs` failed.
Here's the search query I used.
I have been able to reproduce on waterman in the MueLu scaling driver "Driver.cpp". In that code, the error manifests either during the initial matrix map construction or during the matrix construction. Here is the stack trace with Trilinos dev, SHA 6550bd788b, with some minor modifications to Driver.cpp to make reproducing easier.
#0 0x00007fff7572faf0 in raise () from /lib64/libc.so.6
#1 0x00007fff75731e6c in abort () from /lib64/libc.so.6
#2 0x00007fff759d0774 in __gnu_cxx::__verbose_terminate_handler () at ../../.././libstdc++-v3/libsupc++/vterminate.cc:95
#3 0x00007fff759cb504 in __cxxabiv1::__terminate (handler=<optimized out>) at ../../.././libstdc++-v3/libsupc++/eh_terminate.cc:47
#4 0x00007fff759c9928 in __cxa_call_terminate (ue_header=0x33f90950) at ../../.././libstdc++-v3/libsupc++/eh_call.cc:54
#5 0x00007fff759caaec in __cxxabiv1::__gxx_personality_v0 (version=<optimized out>, actions=<optimized out>, exception_class=<optimized out>, ue_header=0x33f90950, context=0x7ffff2b4c770) at ../../.././libstdc++-v3/libsupc++/eh_personality.cc:676
#6 0x00007fff758ec084 in _Unwind_RaiseException_Phase2 (exc=0x33f90950, context=0x7ffff2b4c770) at ../.././libgcc/unwind.inc:62
#7 0x00007fff758ecc04 in _Unwind_Resume (exc=0x33f90950) at ../.././libgcc/unwind.inc:230
#8 0x0000000012e571cc in Kokkos::Impl::ViewValueFunctor<Kokkos::Cuda, Kokkos::pair<long long, int>, false>::execute (this=0x339678e0, arg=true) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/kokkos/core/src/impl/Kokkos_ViewMapping.hpp:2535
#9 0x0000000012e5b01c in Kokkos::Impl::ViewValueFunctor<Kokkos::Cuda, Kokkos::pair<long long, int>, false>::destroy_shared_allocation (this=0x339678e0) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/kokkos/core/src/impl/Kokkos_ViewMapping.hpp:2553
#10 0x0000000012e599b4 in Kokkos::Impl::(anonymous namespace)::deallocate<Kokkos::CudaUVMSpace, Kokkos::Impl::ViewValueFunctor<Kokkos::Cuda, Kokkos::pair<long long, int>, false> > (record_ptr=0x33967890) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/kokkos/core/src/impl/Kokkos_SharedAlloc.hpp:196
#11 0x0000000014ca4eac in Kokkos::Impl::SharedAllocationRecord<void, void>::decrement (arg_record=0x33967890) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/kokkos/core/src/impl/Kokkos_SharedAlloc.cpp:273
#12 0x0000000012e363b4 in ~SharedAllocationTracker (this=0x7ffff2b4db50, __in_chrg=<optimized out>) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/kokkos/core/src/impl/Kokkos_SharedAlloc.hpp:358
#13 Kokkos::View<Kokkos::pair<long long, int>*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>, Kokkos::MemoryTraits<0u> >::~View (this=0x7ffff2b4db50, __in_chrg=<optimized out>) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/kokkos/core/src/Kokkos_View.hpp:1972
#14 0x0000000012e237fc in Tpetra::Details::FixedHashTable<long long, int, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::init (this=0x7ffff2b4df50, keys=..., startingValue=50, initMinKey=5150, initMaxKey=5199, firstContigKey=5050, lastContigKey=5099, computeInitContigKeys=true) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/tpetra/core/src/Tpetra_Details_FixedHashTable_def.hpp:1188
#15 0x0000000012e20828 in Tpetra::Details::FixedHashTable<long long, int, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::FixedHashTable (this=0x7ffff2b4df50, keys=..., firstContigKey=5050, lastContigKey=5099, startingValue=50, keepKeys=false) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/tpetra/core/src/Tpetra_Details_FixedHashTable_def.hpp:803
#16 0x0000000012f35868 in Tpetra::Map<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::initWithNonownedHostIndexList (this=0x33dbab10, numGlobalElements=10000, entryList_host=..., indexBase=0, comm=...) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/tpetra/core/src/Tpetra_Map_def.hpp:695
#17 0x0000000012f2eb5c in Tpetra::Map<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::Map (this=0x33dbab10, numGlobalElements=10000, entryList=..., indexBase=0, comm=..., __in_chrg=<optimized out>, __vtt_parm=<optimized out>) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/tpetra/core/src/Tpetra_Map_def.hpp:866
#18 0x000000001289d3d8 in Xpetra::TpetraMap<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::TpetraMap (this=0x33f904c0, numGlobalElements=10000, elementList=..., indexBase=0, comm=..., __in_chrg=<optimized out>, __vtt_parm=<optimized out>) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/xpetra/src/Map/Xpetra_TpetraMap_def.hpp:133
#19 0x0000000010148edc in Galeri::Xpetra::MapTraits<long long, Xpetra::TpetraMap<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> > >::Build (numGlobalElements=10000, elementList=..., indexBase=0, comm=...) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/galeri/src-xpetra/Galeri_MapTraits.hpp:123
#20 0x0000000010120990 in Galeri::Xpetra::Maps::Cartesian2D<int, long long, Xpetra::TpetraMap<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> > > (comm=..., nx=100, ny=100, mx=2, my=2, list=...) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/galeri/src-xpetra/Galeri_XpetraCartesian.hpp:147
#21 0x00000000100f7c88 in Galeri::Xpetra::CreateMap<int, long long, Xpetra::TpetraMap<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> > > (mapType="Cartesian2D", comm=..., list=...) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/galeri/src-xpetra/Galeri_XpetraMaps.hpp:256
#22 0x00000000100d4138 in Galeri::Xpetra::CreateMap<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> > (lib=Xpetra::UseTpetra, mapType="Cartesian2D", comm=..., list=...) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/galeri/src-xpetra/Galeri_XpetraMaps.hpp:146
#23 0x00000000100bf8f0 in MatrixLoad<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> > (comm=..., lib=@0x7ffff2b52514: Xpetra::UseTpetra, binaryFormat=false, matrixFile="", rhsFile="", rowMapFile="", colMapFile="", domainMapFile="", rangeMapFile="", coordFile="", nullFile="", map=..., A=..., coordinates=..., nullspace=..., X=..., B=..., numVectors=1, galeriParameters=..., xpetraParameters=..., galeriStream=...) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/muelu/test/scaling/MatrixLoad.hpp:125
#24 0x00000000100ade74 in main_<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> > (clp=..., lib=@0x7ffff2b52514: Xpetra::UseTpetra, argc=1, argv=0x7ffff2b52ca8) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/muelu/test/scaling/Driver.cpp:353
#25 0x000000001009f494 in Automatic_Test_ETI (argc=1, argv=0x7ffff2b52ca8) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/muelu/test/scaling/../unit_tests/MueLu_Test_ETI.hpp:162
#26 0x00000000100a0534 in main (argc=1, argv=0x7ffff2b52ca8) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/muelu/test/scaling/Driver.cpp:604
To reproduce on waterman:
Apply the attached patch to Trilinos dev. I'm using SHA 6550bd788b.
Configure using the attached script, then build as usual.
bsub -x -Is -n 20 bash
mpiexec -np 5 -map-by socket:PE=4 MueLu_Driver.exe
This should run fine, but seems necessary to trigger the failure on 4 MPI ranks. After the PIDs print and pressing a key, let it run for about 10 seconds, then control-c.
mpiexec -np 4 -map-by socket:PE=4 MueLu_Driver.exe
Follow the screen command to attach gdb to the processes. It should fail almost immediately. If it doesn't fail, run the 5-rank example again, followed by this one.
I saw lots of errors in this section when I compiled with -fsanitize and openmp. It was use of stack data after function or some such stuff.
kokkos has a lot of these warnings, not sure if they are real or not.
I added an extra fence in Tpetra_Details_FixedHashTable_def.hpp
on line 1182, and the errors of the form
:0: : block: [3,0,0], thread: [0,151,0] Assertion `View bounds error of view FixedHashTable::pairs` failed.
go away. Changing line 1164 from if (buildInParallel)
to if (false)
also seems to make these types of errors go away.
CORRECTION: The extra fence has no effect. Changing line 1164 does make the error go away.
I'm still seeing another type of error during FillComplete of the matrix in MueLu's Driver.cpp
. Here is that backtrace.
#0 0x000000001402a678 in Tpetra::Distributor::doPosts<Kokkos::View<char const*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<0u> >, Kokkos::View<char*, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, void, void> > (this=0x6fbe2c30, exports=..., numExportPacketsPerLID=..., imports=..., numImportPacketsPerLID=...) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/tpetra/core/src/Tpetra_Distributor.hpp:2771
#1 0x0000000014024a30 in Tpetra::Distributor::doPostsAndWaits<Kokkos::View<char const*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<0u> >, Kokkos::View<char*, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, void, void> > (this=0x6fbe2c30, exports=..., numExportPacketsPerLID=..., imports=..., numImportPacketsPerLID=...) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/tpetra/core/src/Tpetra_Distributor.hpp:2049
#2 0x0000000014020704 in Tpetra::DistObject<char, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::doTransferNew (this=0x6fb75a08, src=..., CM=Tpetra::ADD, numSameIDs=0, permuteToLIDs=..., permuteFromLIDs=..., remoteLIDs=..., exportLIDs=..., distor=..., revOp=Tpetra::DistObject<char, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::DoForward, commOnHost=false, restrictedMode=false) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/tpetra/core/src/Tpetra_DistObject_def.hpp:1184
#3 0x0000000014019bc8 in Tpetra::DistObject<char, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::doTransfer (this=0x6fb75a08, src=..., transfer=..., modeString=0x7ffff48c1238 "doExport (forward mode)", revOp=Tpetra::DistObject<char, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::DoForward, CM=Tpetra::ADD, restrictedMode=false) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/tpetra/core/src/Tpetra_DistObject_def.hpp:606
#4 0x000000001401631c in Tpetra::DistObject<char, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::doExport (this=0x6fb75a08, source=..., exporter=..., CM=Tpetra::ADD, restrictedMode=false) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/tpetra/core/src/Tpetra_DistObject_def.hpp:347
#5 0x0000000013b78a74 in Tpetra::CrsMatrix<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::globalAssemble (this=0x6fb75a00) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/tpetra/core/src/Tpetra_CrsMatrix_def.hpp:4850
#6 0x0000000013b79a18 in Tpetra::CrsMatrix<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::fillComplete (this=0x6fb75a00, domainMap=..., rangeMap=..., params=...) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/tpetra/core/src/Tpetra_CrsMatrix_def.hpp:5039
#7 0x0000000013b7ab14 in Tpetra::CrsMatrix<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::fillComplete (this=0x6fb75a00, params=...) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/tpetra/core/src/Tpetra_CrsMatrix_def.hpp:4974
#8 0x00000000128c2040 in Xpetra::TpetraCrsMatrix<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::fillComplete (this=0x6fb759a0, params=...) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/xpetra/src/CrsMatrix/Xpetra_TpetraCrsMatrix_def.hpp:232
#9 0x00000000128857b0 in Xpetra::CrsMatrixWrap<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::fillComplete (this=0x6fb74f10, params=...) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/xpetra/sup/Matrix/Xpetra_CrsMatrixWrap_def.hpp:213
#10 0x0000000010235ef8 in Galeri::Xpetra::Cross2D<double, int, long long, Xpetra::Map<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >, Xpetra::CrsMatrixWrap<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> > > (map=..., nx=100, ny=100, a=4, b=-1, c=-1, d=-1, e=-1, DirichletBC=63, keepBCs=false) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/galeri/src-xpetra/Galeri_XpetraMatrixTypes.hpp:272
#11 0x0000000010208a44 in Galeri::Xpetra::Laplace2DProblem<double, int, long long, Xpetra::Map<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >, Xpetra::CrsMatrixWrap<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >, Xpetra::MultiVector<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> > >::BuildMatrix (this=0x6fb749a0) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/galeri/src-xpetra/Galeri_StencilProblems.hpp:146
#12 0x00000000100c0208 in MatrixLoad<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> > (comm=..., lib=@0x7ffff48c4d64: Xpetra::UseTpetra, binaryFormat=false, matrixFile="", rhsFile="", rowMapFile="", colMapFile="", domainMapFile="", rangeMapFile="", coordFile="", nullFile="", map=..., A=..., coordinates=..., nullspace=..., X=..., B=..., numVectors=1, galeriParameters=..., xpetraParameters=..., galeriStream=...) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/muelu/test/scaling/MatrixLoad.hpp:155
#13 0x00000000100adbbc in main_<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> > (clp=..., lib=@0x7ffff48c4d64: Xpetra::UseTpetra, argc=6, argv=0x7ffff48c5428) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/muelu/test/scaling/Driver.cpp:352
#14 0x000000001009f4d8 in Automatic_Test_ETI (argc=6, argv=0x7ffff48c5428) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/muelu/test/scaling/../unit_tests/MueLu_Test_ETI.hpp:160
#15 0x00000000100a029c in main (argc=6, argv=0x7ffff48c5428) at /ascldap/users/jhu/software/trilinos/Trilinos/packages/muelu/test/scaling/Driver.cpp:571
Here's a third type of error:
@jhux2 How do you produce the second and third type of error above once you modify the fixed hash table?
@kddevin Here are some instructions on reproducing the type 2/3 errors on waterman:
bsub -x -Is -n 20 bash
mpiexec -np 5 -map-by socket:PE=4 MueLu_Driver.exe --linAlgebra=Tpetra --nx=100 --ny=100 --xml=sa_with_ilu.xml --notimings
This is designed to loop infinitely. Just let it run 10 seconds or so, then control-c.mpiexec -np 4 -map-by socket:PE=4 MueLu_Driver.exe --linAlgebra=Tpetra --nx=100 --ny=100 --xml=sa_with_ilu.xml --notimings --gdb
This will pause to let you attach debuggers, then run infinitely. Eventually, an error will occur. In my experiments, this could be several minutes, maybe even 10s of minutes.I submitted PR #5715 that provides a temporary workaround for the most reproducible of these errors. We'll reverse the workaround as we understand the issue better. Until then, it may be worth trying in the application to determine whether it allows the application to make progress. @bathmatt
@bartlettroscoe Is there an easy way for me to build these tests with a different compiler and/or CUDA version on waterman? Since the tests pass with many other compilers on other platforms, I'd just like to try them on waterman with a different configuration. Thanks.
@kddevin asked:
Is there an easy way for me to build these tests with a different compiler and/or CUDA version on waterman?
You can edit the files under:
Trilinos/cmake/std/atdm/waterman/
locally and put in whatever you want.
I am also working on #4933 that will allow you to set up and load any env you want.
Just be warned that you may have issues with the TPLs needed if you pick a compiler or options that don't already work with TPLs already installed. We were hoping to be able to do that with Spack but it has been going very slowly and we can't do that yet on 'waterman'.
@bartlettroscoe Thanks. I am terrible at getting all the TPLs, etc., aligned. I was hoping that you had, say, just one other configuration that you knew "worked" and that I could load easily. I'm not picky, as long as it is different from the one used here. Do you have anything like that?
@kddevin, specifically, what compilers/configurations do you want to try that are not already supported in:
Trilinos/cmake/std/atdm/waterman/environment.sh
?
We don't test or support many different configurations because builds are expensive, and the APPs don't need them. We only try to support just what the APPs need (and even struggle with just that).
@kddevin I talked with @crtrott, who recommended this option:
-DKokkos_ENABLE_Profiling:ON
It's explicitly turned off on the dashboard. To enable it, you much toggle its value in /ascldap/users/jhu/software/trilinos/Trilinos/cmake/std/atdm/ATDMDevEnvSettings.cmake
.
With this option enabled, I'm seeing an assertion right away in FixedHashTable.
I was mistaken and did not have the FixedHashTable patch applied. With the patch applied, I've not seen any assertions yet.
I think what you are looking at is misleading. This is all just delayed error checking for CUDA Kernels. When using the profiling tool on the thing Jonathan run it crashes on two ranks with an illegal memory access inside of KokkosKernels SPGEMM inside the Laplace2d MueLu setup:
[1,1]<stdout>:KokkosP: Allocate<CudaUVM> name: entriesC pointer: 0x7fff0002a080 size: 8128
[1,1]<stdout>:KokkosP: Allocate<CudaUVM> name: valuesC pointer: 0x7fff00030280 size: 16256
[1,1]<stdout>:KokkosP: Allocate<CudaUVM> name: pool data pointer: 0x7fff00a00080 size: 6029312
[1,1]<stdout>:KokkosP: Allocate<CudaUVM> name: locks pointer: 0x7fff00815080 size: 131072
[1,1]<stdout>:KokkosP: Executing parallel-for kernel on device 0 with unique execution identifier 1137
[1,1]<stdout>:KokkosP: Driver: S - Global Time
[1,1]<stdout>:KokkosP: timername
[1,1]<stdout>:KokkosP: MueLu setup time (Laplace2D)
[1,1]<stdout>:KokkosP: Kokkos::View::initialization
[1,1]<stdout>:KokkosP: Execution of kernel 1137 is completed.
[1,1]<stdout>:KokkosP: Executing parallel-for kernel on device 0 with unique execution identifier 1138
[1,1]<stdout>:KokkosP: Driver: S - Global Time
[1,1]<stdout>:KokkosP: timername
[1,1]<stdout>:KokkosP: MueLu setup time (Laplace2D)
[1,1]<stdout>:KokkosP: Kokkos::ViewFill-1D
[1,1]<stdout>:KokkosP: Execution of kernel 1138 is completed.
[1,1]<stdout>:KokkosP: Executing parallel-for kernel on device 0 with unique execution identifier 1139
[1,1]<stdout>:KokkosP: Driver: S - Global Time
[1,1]<stdout>:KokkosP: timername
[1,1]<stdout>:KokkosP: MueLu setup time (Laplace2D)
[1,1]<stdout>:KokkosP: KOKKOSPARSE::SPGEMM::SPGEMM_KK_MEMORY
[1,1]<stderr>:terminate called after throwing an instance of 'std::runtime_error'
[1,1]<stderr>: what(): cudaDeviceSynchronize() error( cudaErrorIllegalAddress): an illegal memory access was encountered /ascldap/users/jhu/software/trilinos/Trilinos/packages/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:120
[1,1]<stderr>:Traceback functionality not available
[1,1]<stderr>:
[1,1]<stderr>:[waterman1:136183] *** Process received signal ***
[1,1]<stderr>:[waterman1:136183] Signal: Aborted (6)
[1,1]<stderr>:[waterman1:136183] Signal code: (-6)
[1,1]<stderr>:[waterman1:136183] [ 0] [0x7fff834604d8]
[1,1]<stderr>:[waterman1:136183] [ 1] [1,1]<stderr>:/lib64/libc.so.6(abort+0x2b4)[0x7fff75b91f94]
[1,1]<stderr>:[waterman1:136183] [ 2] [1,1]<stderr>:/home/projects/ppc64le/gcc/7.2.0/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x1c4)[0x7fff75e30774]
[1,1]<stderr>:[waterman1:136183] [ 3] /home/projects/ppc64le/gcc/7.2.0/lib64/libstdc++.so.6(+0xab504)[0x7fff75e2b504]
[1,1]<stderr>:[waterman1:136183] [ 4] /home/projects/ppc64le/gcc/7.2.0/lib64/libstdc++.so.6(+0xa9928)[0x7fff75e29928]
[1,1]<stderr>:[waterman1:136183] [ 5] [1,1]<stderr>:/home/projects/ppc64le/gcc/7.2.0/lib64/libstdc++.so.6(__gxx_personality_v0+0x52c)[0x7fff75e2aaec]
[1,1]<stderr>:[waterman1:136183] [ 6] /home/projects/ppc64le/gcc/7.2.0/lib64/libgcc_s.so.1(+0xc084)[0x7fff75d4c084]
[1,1]<stderr>:[waterman1:136183] [ 7] /home/projects/ppc64le/gcc/7.2.0/lib64/libgcc_s.so.1(_Unwind_Resume+0x174)[0x7fff75d4cc04]
[1,1]<stderr>:[waterman1:136183] [1,1]<stderr>:[ 8] ./MueLu_Driver.exe-prof[0x14f32be0]
[1,1]<stderr>:[waterman1:136183] [ 9] ./MueLu_Driver.exe-prof[0x100b83bc]
[1,1]<stderr>:[waterman1:136183] [10] ./MueLu_Driver.exe-prof[0x14f329e8]
[1,1]<stderr>:[waterman1:136183] [11] ./MueLu_Driver.exe-prof[0x14f35750]
[1,1]<stderr>:[waterman1:136183] [12] ./MueLu_Driver.exe-prof[0x14f389d0]
[1,1]<stderr>:[waterman1:136183] [13] ./MueLu_Driver.exe-prof[0x10798410]
I am running with:
export KOKKOS_NUM_DEVICES=1
mpiexec -np 8 --tag-output ./MueLu_Driver.exe-prof --linAlgebra=Tpetra --nx=100 --ny=100 --xml=sa_with_ilu.xml --notimings
Now I also see the FixedHashTable error occasionally.
[1,4]<stdout>:KokkosP: Driver: S - Global Time
[1,4]<stdout>:KokkosP: Driver: 1 - Matrix Build
[1,4]<stdout>:KokkosP: Kokkos::View::initialization
[1,4]<stdout>:KokkosP: Execution of kernel 0 is completed.
[1,4]<stdout>:KokkosP: Allocate<CudaUVM> name: nonContigGids pointer: 0x7fff20002880 size: 9600
[1,4]<stdout>:KokkosP: Executing parallel-for kernel on device 0 with unique execution identifier 1
[1,4]<stdout>:KokkosP: Driver: S - Global Time
[1,4]<stdout>:KokkosP: Driver: 1 - Matrix Build
[1,4]<stdout>:KokkosP: Kokkos::View::initialization
[1,4]<stdout>:KokkosP: Execution of kernel 1 is completed.
[1,4]<stdout>:KokkosP: Allocate<CudaUVM> name: FixedHashTable::counts pointer: 0x7fff20004e80 size: 6176
[1,4]<stdout>:KokkosP: Executing parallel-for kernel on device 0 with unique execution identifier 2
[1,4]<stdout>:KokkosP: Driver: S - Global Time
[1,4]<stdout>:KokkosP: Driver: 1 - Matrix Build
[1,4]<stdout>:KokkosP: Kokkos::View::initialization
[1,4]<stdout>:KokkosP: Execution of kernel 2 is completed.
[1,4]<stdout>:KokkosP: Executing parallel-for kernel on device 0 with unique execution identifier 3
[1,4]<stdout>:KokkosP: Driver: S - Global Time
[1,4]<stdout>:KokkosP: Driver: 1 - Matrix Build
[1,4]<stdout>:KokkosP: N6Tpetra7Details3FHT12CountBucketsIN6Kokkos4ViewIPiJNS3_10LayoutLeftENS3_6DeviceINS3_4CudaENS3_12CudaUVMSpaceEEENS3_12MemoryTraitsILj0EEEEEENS4_IPKxJS6_SA_EEEjEE
[1,4]<stdout>:KokkosP: Execution of kernel 3 is completed.
[1,4]<stdout>:KokkosP: Allocate<CudaUVM> name: FixedHashTable::ptr pointer: 0x7fff20006880 size: 6176
[1,4]<stdout>:KokkosP: Executing parallel-for kernel on device 0 with unique execution identifier 4
[1,4]<stdout>:KokkosP: Driver: S - Global Time
[1,4]<stdout>:KokkosP: Driver: 1 - Matrix Build
[1,4]<stdout>:KokkosP: Kokkos::View::initialization
[1,4]<stdout>:KokkosP: Execution of kernel 4 is completed.
[1,4]<stdout>:KokkosP: Executing parallel-scan kernel on device 0 with unique execution identifier 5
[1,4]<stdout>:KokkosP: Driver: S - Global Time
[1,4]<stdout>:KokkosP: Driver: 1 - Matrix Build
[1,4]<stdout>:KokkosP: Tpetra::Details::computeOffsetsFromCounts
[1,4]<stdout>:KokkosP: Execution of kernel 5 is completed.
[1,4]<stdout>:KokkosP: Allocate<CudaUVM> name: FixedHashTable::pairs pointer: 0x7fff20200080 size: 18416
[1,4]<stdout>:KokkosP: Executing parallel-reduce kernel on device 0 with unique execution identifier 6
[1,4]<stdout>:KokkosP: Driver: S - Global Time
[1,4]<stdout>:KokkosP: Driver: 1 - Matrix Build
[1,4]<stdout>:KokkosP: N6Tpetra7Details3FHT9FillPairsIN6Kokkos4ViewIPNS3_4pairIxiEEJNS3_10LayoutLeftENS3_6DeviceINS3_4CudaENS3_12CudaUVMSpaceEEENS3_12MemoryTraitsILj0EEEEEENS4_IPKxJS8_SC_EEENS4_IPiJS8_SC_SE_EEEjEE
[1,4]<stderr>::0: : block: [4,0,0], thread: [0,224,0] Assertion `View bounds error of view FixedHashTable::pairs[1,4]<stderr>:` failed.
Note that if the failure happens inside of the FillPairs thing for the FixedHash, it never reaches the SPGEMM. So this may be that something goes wrong in that earlier thing, and then manifests later in SPGEMM as an invalid indexing.
@crtrott wrote
Now I also see the FixedHashTable error occasionally.
My screw-up, I built without the patch applied. Please recopy the exec from my build area.
I think what you are looking at is misleading. This is all just delayed error checking for CUDA Kernels. When using the profiling tool on the thing Jonathan run it crashes on two ranks with an illegal memory access inside of KokkosKernels SPGEMM inside the Laplace2d MueLu setup:
Ah -- that's real and consistent with the so-called error type 3 above.
My gut feeling is that something goes wrong somewhere and as a consequence some column indicies in the local matrix are off, which down the line leads to memory access faults. Do we actually understand why the FixedHash thingy failed? Or did we just see a fast way to not trigger that code? Because if we don't understand why it failed, it might just be a symptom of the same root cause. And since in one of my tests without the patch the FixedHash thing failed before any SPGEMM got called, it might be easier to use that as the debugging starting point.
Do we actually understand why the FixedHash thingy failed? Or did we just see a fast way to not trigger that code?
That's a fair point. I don't know why the FixedHashTable occasionally fails. The patch is simply to avoid the path that triggers the type 1 error.
@bathmatt is still seeing failures in EMPIRE, even with Trilinos that includes the patch for the type 1 error.
My gut feeling is that something goes wrong somewhere and as a consequence some column indices in the local matrix are off, which down the line leads to memory access faults.
^^^ we could extract a unit test from EMPIRE just to make sure.
Thanks, @jhux2. @srajama1
@kddevin @jhux2 : Thank you ! I appreciate your help.
I'm seeing errors some map code, trying to track it down. Not sure why it is happening. It is in an initialization step and hopefully I'll be able to trace it down to something simple.
I'm at this status, I don't believe that this is in a map, I believe it is in something in MPI.
@kddevin patches got me further to a similar looking bug, but now it looks like what is being sent isn't what is being received. I'll keep you posted on this though.
I'm running MueLu_Driver.exe
on vortex with 8 MPI ranks (single node) using the branch that @kddevin, @mhoemmen, @crtrott, and I have been debugging with. None of the work-around ifdef's are enabled, i.e., this should be running stock Tpetra code. No errors yet, but I'll let it continue.
Do you have CUDA_MPI turned on???
If you mean Tpetra_ASSUME_CUDA_AWARE_MPI
, that variable is set to no
.
@mhoemmen Is there anyway to toggle that from the command line, or must I reconfigure?
I'm running
MueLu_Driver.exe
on vortex with 8 MPI ranks (single node) using the branch that @kddevin, @mhoemmen, @crtrott, and I have been debugging with. None of the work-around ifdef's are enabled, i.e., this should be running stock Tpetra code. No errors yet, but I'll let it continue.
I started a second job on two nodes and got it to fail after about 50 minutes. My runline is
jsrun -r4 -a1 -c4 -g1 -brs ./MueLu_Driver.exe --xml=sa_with_ilu.xml --notimings
This build has none of the temporary work-arounds enabled.
Update: The build is with -DTpetra_ASSUME_CUDA_AWARE_MPI:FALSE
.
Unfortunately, there's no associated core file, so I have no idea where this crash happened.
@jhux2 wrote:
Is there anyway to toggle that from the command line, or must I reconfigure?
Yes, you can set this at run time. Set the TPETRA_ASSUME_CUDA_AWARE_MPI
environment variable to 1 (or 0, if you want it off).
@jhux2
Is Vortex dumping lwcore files? If it does, those are just text files, and you can look inside to see a stack trace.
Do you have CUDA_MPI turned on???
@bathmatt Should I set TPETRA_ASSUME_CUDA_AWARE_MPI
to be true?
Is Vortex dumping lwcore files? If it does, those are just text files, and you can look inside to see a stack trace.
There are no lwcore files in my run directory. Is there an LSF directive that might control this? Could they be elsewhere?
@mhoemmen Thanks, but I rebuilt a separate exec before I saw your response :(. The good news is that this guy drops core!
@jhux2 It looks like this Spectrum MPI wasn't built with CUDA support. Does it have the equivalent of ompi_info
? If so, could you query it to see if it has the correct CUDA support?
Ok, heard back that @bathmatt does not have TPETRA_ASSUME_CUDA_AWARE_MPI
enabled.
@mhoemmen Is this what you mean?
(/vscratch1/jhu/lets-dump-core) ompi_info | grep -i cuda
MPI extensions: affinity, cuda
@jhux2 That could be, though I'm not sure whether that's the right ompi_info
executable for Spectrum MPI. Tpetra uses the following command with OpenMPI:
ompi_info --parsable --all | grep "mpi_built_with_cuda_support:value"
and it should print something like this:
mca:mpi:base:param:mpi_built_with_cuda_support:value:true
Replace "true" with "false" if that installation of OpenMPI was not built with CUDA support.
Bug Report
CC: @trilinos/panzer, @kddevin (Trilinos Data Services Product Lead), @srajama1 (Trilinos Linear Solver Data Services), @mperego (Trilinos Discretizations Product Lead), @bartlettroscoe, @fryeguy52
Next Action Status
Since PR #5346 was merged on 6/7/2019 which fixed a file read/write race in the test, there has only been one failing Panzer test on any ATDM Trilinos platform as of 6/11/2019 looking to be related. Also, on 6/11/2019 @bathmatt reported EMPIRE is not failing in a similar way in his recent tests. Next: Watch results over next few weeks to see if more random failures like this occur ...
Description
As shown in this query the tests:
are failing in the build:
Additionally the test:
is failing in a different build on the same machine:
Expand to see new commits on 2019-05-14
``` *** Base Git Repo: Trilinos 7b6d69a: Merge remote-tracking branch 'origin/develop' into atdm-nightly Author: Roscoe A. BartlettCurrent Status on CDash
Results for the current testing day
Steps to Reproduce
One should be able to reproduce this failure on waterman as described in:
More specifically, the commands given for waterman are provided at:
The exact commands to reproduce this issue should be: