Open glhenni opened 2 months ago
@glhenni @vqd8a the 4.4 release included thread-safety fixes that exposed some incorrect usages of Views in a couple of places in Trilinos, resulting in tests deadlocking or hanging. The most common cases were View creation or destruction within parallel regions, often with View-of-Views usage where creation and/or destruction was not properly handled. Based on your report and the hanging tests, I suspect something similar might be occurring.
@glhenni a new tool is in progress that was very helpful in finding the View usage issues in Trilinos (https://github.com/kokkos/kokkos-tools/pull/267). I suggest running the hanging test with this tool to see if any culprit usage is flagged.
I think I am seeing this in Nalu.... However, my final bisect iteration does not actually build:
commit f8ff2ad41462ea8af664241df5044928799e5984 (HEAD)
Author: Nathan Ellingwood <ndellin@sandia.gov>
Date: Wed Aug 7 16:39:21 2024 -0600
stk: modify test to prevent allocation in parallel region
modify NgpMeshTest.volatileFastSharedCommMap to prevent allocation in a parallel region, which can result in deadlock with kokkos version 4.4
address issue #13328
Co-authored-by: Christian Trott <crtrott@sandia.gov>
Signed-off-by: Nathan Ellingwood <ndellin@sandia.gov>
[ 45%] Building CXX object packages/kokkos/containers/src/CMakeFiles/kokkoscontainers.dir/impl/Kokkos_UnorderedMap_impl.cpp.o
In file included from /fgs/spdomin/nightly/Trilinos/packages/kokkos/core/src/View/MDSpan/Kokkos_MDSpan_Extents.hpp:25,
                 from /fgs/spdomin/nightly/Trilinos/packages/kokkos/core/src/Kokkos_View.hpp:40,
                 from /fgs/spdomin/nightly/Trilinos/packages/kokkos/core/src/Kokkos_Parallel.hpp:31,
                 from /fgs/spdomin/nightly/Trilinos/packages/kokkos/core/src/Kokkos_MemoryPool.hpp:26,
                 from /fgs/spdomin/nightly/Trilinos/packages/kokkos/core/src/Kokkos_TaskScheduler.hpp:34,
                 from /fgs/spdomin/nightly/Trilinos/packages/kokkos/core/src/Serial/Kokkos_Serial.hpp:37,
                 from /fgs/spdomin/nightly/Trilinos/packages/kokkos/core/src/decl/Kokkos_Declare_SERIAL.hpp:21,
                 from /fgs/spdomin/nightly/Trilinos/build_nightly_release_10.3.0/packages/kokkos/KokkosCore_Config_DeclareBackend.hpp:22,
                 from /fgs/spdomin/nightly/Trilinos/packages/kokkos/core/src/Kokkos_Core.hpp:45,
                 from /fgs/spdomin/nightly/Trilinos/packages/kokkos/containers/src/Kokkos_UnorderedMap.hpp:30,
                 from /fgs/spdomin/nightly/Trilinos/packages/kokkos/containers/src/impl/Kokkos_UnorderedMap_impl.cpp:21:
/fgs/spdomin/nightly/Trilinos/packages/kokkos/core/src/View/MDSpan/Kokkos_MDSpan_Header.hpp:47:10: fatal error: mdspan/mdspan.hpp: No such file or directory
   47 | #include <mdspan/mdspan.hpp>
@spdomin did the build failure occur with a clean build? If needed, you can disable mdspan with the `-D Kokkos_ENABLE_IMPL_MDSPAN=OFF` option to get past the error above.
This build is part of a bisect to figure out the hang I am seeing. I configure Trilinos at each step, so yes, I think this is a clean build.
I added:
-DKokkos_ENABLE_ATOMICS_BYPASS=ON \
-DKokkos_ENABLE_IMPL_MDSPAN=OFF \
Sorry, I am somewhat taking over this support ticket... I will post back if our new hang points to this commit, while using some of the advice given above.
@spdomin can you post your Trilinos configuration reproducer? We saw similar build errors in Trilinos that were resolved by https://github.com/kokkos/kokkos/pull/7103 (included in the 4.4 snapshot). We'll need to reproduce this and open an issue to figure out why that fix does not help in your configuration.
I use this:
https://github.com/NaluCFD/Nalu/blob/master/build/do-configTrilinos_release
with:
1) binutils/2.41
2) gcc/10.3.0
3) openmpi/4.1.6-gcc-10.3.0
4) cmake/3.27.7
5) anaconda3/2023.09
The current build with the new MDSPAN=OFF is proceeding.
There you go:)
`a5eb4d4e1436e5594ce73ffe62e1cb0f460c99b0 is the first bad commit
commit a5eb4d4e1436e5594ce73ffe62e1cb0f460c99b0
Author: Nathan Ellingwood <ndellin@sandia.gov>
Date: Thu Aug 8 15:37:54 2024 -0600
Snapshot of kokkos.git from commit 948c1346301ff9b42b136a8c72eed91c839e3105
From repository at git@github.com:kokkos/kokkos.git
At commit:
commit 948c1346301ff9b42b136a8c72eed91c839e3105
Author: Nathan Ellingwood <ndellin@sandia.gov>
Date: Thu Aug 8 14:54:40 2024 -0600
`
I will review the notes above. Offhand, I do not know about this view-of-views pattern in Nalu...
@spdomin the tool can be used more generally, beyond View of Views, to detect allocations, deallocations, and fences within parallel regions (the naming was inspired by the first cases that showed up with this issue). If the hang is caused by something along these lines, the tool will help by listing the potential culprit View(s).
@spdomin so far I am not able to reproduce the compilation error you saw. I tested on solo, which had the closest match I could find to the modules you listed, and pared back some of the configuration script you pointed to: the error occurs in kokkos, so enabling netcdf and the packages that use it (seacas etc.) was not necessary to try to reproduce. The error should occur when just building the kokkos library; I also enabled the kokkos tests for added coverage, but no luck.
Here is what I tried on solo with sha 1eb0af7328f4dcb0e79bcb43115c6718740a3387 (includes the snapshot sha listed above):
# environment
module load gnu/10.3.1 openmpi-gnu/4.1 cmake
export blas_install_lib=/usr/lib64/libblas.so.3
export lapack_install_lib=/usr/lib64/liblapack.so.3
# build dir
mkdir -p Build
cd Build
# configuration
export TRILINOS_DIR=<path-to-Trilinos>
cmake \
-DCMAKE_INSTALL_PREFIX=$PWD/install \
-DTrilinos_ENABLE_CXX11=ON \
-DCMAKE_BUILD_TYPE=RELEASE \
-DTrilinos_ENABLE_EXPLICIT_INSTANTIATION:BOOL=ON \
-DTpetra_INST_DOUBLE:BOOL=ON \
-DTpetra_INST_INT_LONG:BOOL=ON \
-DTpetra_INST_INT_LONG_LONG:BOOL=OFF \
-DTpetra_INST_COMPLEX_DOUBLE=OFF \
-DTrilinos_ENABLE_TESTS:BOOL=OFF \
-DTrilinos_ENABLE_ALL_OPTIONAL_PACKAGES=OFF \
-DTrilinos_ALLOW_NO_PACKAGES:BOOL=OFF \
-DTPL_ENABLE_MPI=ON \
-DTPL_ENABLE_SuperLU=OFF \
-DTPL_ENABLE_Boost:BOOL=OFF \
-DTrilinos_ENABLE_Epetra:BOOL=OFF \
-DTrilinos_ENABLE_Kokkos:BOOL=ON \
-DKokkos_ENABLE_TESTS:BOOL=ON \
-DTrilinos_ENABLE_Tpetra:BOOL=ON \
-DTrilinos_ENABLE_ML:BOOL=OFF \
-DTrilinos_ENABLE_MueLu:BOOL=ON \
-DTrilinos_ENABLE_Stratimikos:BOOL=OFF \
-DTrilinos_ENABLE_Thyra:BOOL=OFF \
-DTrilinos_ENABLE_EpetraExt:BOOL=OFF \
-DTrilinos_ENABLE_AztecOO:BOOL=OFF \
-DTrilinos_ENABLE_Belos:BOOL=ON \
-DTrilinos_ENABLE_Ifpack2:BOOL=ON \
-DTrilinos_ENABLE_Amesos2:BOOL=ON \
-DTrilinos_ENABLE_Zoltan2:BOOL=ON \
-DTrilinos_ENABLE_Ifpack:BOOL=OFF \
-DTrilinos_ENABLE_Amesos:BOOL=OFF \
-DTrilinos_ENABLE_Zoltan:BOOL=ON \
-DTrilinos_ENABLE_STKMesh:BOOL=ON \
-DTrilinos_ENABLE_STKSimd:BOOL=ON \
-DTrilinos_ENABLE_STKIO:BOOL=OFF \
-DTrilinos_ENABLE_STKTransfer:BOOL=ON \
-DTrilinos_ENABLE_STKSearch:BOOL=ON \
-DTrilinos_ENABLE_STKUtil:BOOL=ON \
-DTrilinos_ENABLE_STKTopology:BOOL=ON \
-DTrilinos_ENABLE_STKBalance:BOOL=OFF \
-DTrilinos_ENABLE_STKUnit_tests:BOOL=OFF \
-DTrilinos_ENABLE_STKUnit_test_utils:BOOL=OFF \
-DTrilinos_ENABLE_Gtest:BOOL=ON \
-DKokkos_ENABLE_ATOMICS_BYPASS=ON \
-DTPL_ENABLE_Netcdf:BOOL=OFF \
-DTPL_BLAS_LIBRARIES=${blas_install_lib} \
-DTPL_LAPACK_LIBRARIES=${lapack_install_lib} \
$EXTRA_ARGS \
$TRILINOS_DIR
# build kokkos and tests
cd packages/kokkos
make -j16
Would you be able to test the configuration above in a manual build on the machine where you see the issue?
@ndellingwood, let's take the build error from my bisect offline, or open a new ticket, so that this ticket can focus on apps using "views of views". It turns out I was able to locate the offending code in the failing unit tests; @alanw0 may have more insight. It does not look like our core Nalu assembly has this issue. The first hang occurs at: `rhs_ = Kokkos::View<double*>("rhs_",rhs.extent(0));`. I suppose that is the views of views:) Let me know what is conceptually wrong with this and the easiest fix.
virtual void sumInto(
    unsigned numEntities,
    const stk::mesh::Entity* entities,
    const sierra::nalu::SharedMemView<const double*> & rhs,
    const sierra::nalu::SharedMemView<const double**> & lhs,
    const sierra::nalu::SharedMemView<int*> & localIds,
    const sierra::nalu::SharedMemView<int*> & sortPermutation,
    const char * trace_tag
    )
{
  if (numSumIntoCalls_ == 0) {
    rhs_ = Kokkos::View<double*>("rhs_",rhs.extent(0));
    for(size_t i=0; i<rhs.extent(0); ++i) {
      rhs_(i) = rhs(i);
    }
    lhs_ = Kokkos::View<double**>("lhs_",lhs.extent(0), lhs.extent(1));
    for(size_t i=0; i<lhs.extent(0); ++i) {
      for(size_t j=0; j<lhs.extent(1); ++j) {
        lhs_(i,j) = lhs(i,j);
      }
    }
  }
  Kokkos::atomic_add(&numSumIntoCalls_, 1u);
}
Hmm, that's not a view-of-views, but it is a view allocation which is probably happening within a Kokkos::parallel_for. It turns out that is not legal even in Kokkos::Serial. I can probably help fix this.
The T/F team pointed me to this:
https://kokkos.org/kokkos-core-wiki/ProgrammingGuide/View.html#can-i-make-a-view-of-views
> Hmm, that's not a view-of-views, but it is a view allocation which is probably happening within a Kokkos::parallel_for. It turns out that is not legal even in Kokkos::Serial. I can probably help fix this.
Are you sure about this not being a view of a view? rhs_ is a Kokkos::View<double*>. Why do we not simply use this view itself?
I did manage to build and use the vov debugger library, but it's throwing an error at a location prior to the one causing the hang. I'm assuming that will have to be fixed as well. I'm behind the curve on this one because I'm not a kokkos programmer. I'm acting as the intermediary, since the people with actual knowledge of kokkos and gemma aren't on github. Anyway, with `KOKKOS_TOOLS_LIBS=<root dir>/libkp_view_of_views_bug_finder.so` set, this is what I see:
Total number of MPI threads: 3
Total number of Tpetra processes: 1
Tpetra in Trilinos 16.1.0-dev
Gemma: Version 2023.0.0
Parsing command line inputs.
Finished parsing command line inputs.
Kokkos execution space N6Kokkos6OpenMPE
Moment method selected
Reading input file ...
Number of Unknowns initialized: 36
Constructing and solving the matrix equation ...
dbg( lvl: 2 ): Allocated system matrix
Constructing matrix and right-hand side for 3.00000e+07 Hz
deallocating "[unlabeled]" within parallel region "interaction_block_fill"
[cee-build032:1412318] *** Process received signal ***
[cee-build032:1412318] Signal: Aborted (6)
[cee-build032:1412318] Signal code: (-6)
[cee-build032:1412318] [ 0] /lib64/libpthread.so.0(+0x12cf0)[0x7f992ae29cf0]
[cee-build032:1412318] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x7f992aaa0acf]
[cee-build032:1412318] [ 2] /lib64/libc.so.6(abort+0x127)[0x7f992aa73ea5]
[cee-build032:1412318] [ 3] /ascldap/users/glhenni/Projects/Kokkos/kokkos-tools.dalg24/build/debugging/vov-bug-finder/libkp_view_of_views_bug_finder.so(+0xb7ab)[0x7f99280d27ab]
[cee-build032:1412318] [ 4] /ascldap/users/glhenni/Projects/Kokkos/kokkos-tools.dalg24/build/debugging/vov-bug-finder/libkp_view_of_views_bug_finder.so(kokkosp_deallocate_data+0x6c)[0x7f99280d294b]
[cee-build032:1412318] [ 5] /scratch/glhenni/gemma/install/trilinos/aue.gnu.opt/lib64/libkokkoscore.so.16(void Kokkos::Tools::Experimental::invoke_kokkosp_callback<void (*)(Kokkos_Profiling_SpaceHandle, char const*, void const*, unsigned long), Kokkos_Profiling_SpaceHandle const&, char const*, void const*&, unsigned long const&>(Kokkos::Tools::Experimental::MayRequireGlobalFencing, void (* const&)(Kokkos_Profiling_SpaceHandle, char const*, void const*, unsigned long), Kokkos_Profiling_SpaceHandle const&, char const*&&, void const*&, unsigned long const&)+0x132)[0x7f992bc78e83]
[cee-build032:1412318] [ 6] /scratch/glhenni/gemma/install/trilinos/aue.gnu.opt/lib64/libkokkoscore.so.16(Kokkos::Tools::deallocateData(Kokkos_Profiling_SpaceHandle, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void const*, unsigned long)+0x51)[0x7f992bc7673e]
[cee-build032:1412318] [ 7] /scratch/glhenni/gemma/install/trilinos/aue.gnu.opt/lib64/libkokkoscore.so.16(Kokkos::HostSpace::impl_deallocate(char const*, void*, unsigned long, unsigned long, Kokkos_Profiling_SpaceHandle) const+0xc4)[0x7f992bc70778]
[cee-build032:1412318] [ 8] /scratch/glhenni/gemma/install/trilinos/aue.gnu.opt/lib64/libkokkoscore.so.16(Kokkos::Impl::OpenMPInternal::resize_thread_data(unsigned long, unsigned long, unsigned long, unsigned long)+0x2af)[0x7f992bc7f831]
[cee-build032:1412318] [ 9] /scratch/glhenni/gemma/build/gemma.gnu.opt/src/libgemmalibrary.so(Kokkos::Impl::ParallelFor<gemma::assembly::LBUComputeMatrixFunctor<gemma::assembly::LBUTestIntegrand<gemma::assembly::LBUTestArgs<gemma::assembly::LBUTopoDef<2>, gemma::assembly::LBUTopoDef<2>, gemma::assembly::ScratchElementVertexViewSingleElement<gemma::codeStructures::ElementTraits<(gemma::codeStructures::DefinedTopology)0, 1> >, gemma::assembly::ScratchElementVertexViewSingleElement<gemma::codeStructures::ElementTraits<(gemma::codeStructures::DefinedTopology)0, 1> >, gemma::assembly::LBUSourceIntegrand<gemma::assembly::LBUSourceIntegrandArgs<gemma::assembly::LBUTopoDef<2>, gemma::assembly::LBUTopoDef<2>, gemma::assembly::ScratchElementVertexViewSingleElement<gemma::codeStructures::ElementTraits<(gemma::codeStructures::DefinedTopology)0, 1> >, gemma::assembly::ScratchElementVertexViewSingleElement<gemma::codeStructures::ElementTraits<(gemma::codeStructures::DefinedTopology)0, 1> > >, (gemma::codeStructures::DefinedTopology)0> >, true>, gemma::assembly::SystemMatrixIncrementer<Kokkos::View<Kokkos::complex<double>**, Kokkos::LayoutLeft, Kokkos::HostSpace> >, false>, Kokkos::TeamPolicy<Kokkos::OpenMP>, Kokkos::OpenMP>::execute() const+0x88)[0x7f993f698ad4]
[cee-build032:1412318] [10] /scratch/glhenni/gemma/build/gemma.gnu.opt/src/libgemmalibrary.so(void Kokkos::parallel_for<Kokkos::TeamPolicy<Kokkos::OpenMP>, gemma::assembly::LBUComputeMatrixFunctor<gemma::assembly::LBUTestIntegrand<gemma::assembly::LBUTestArgs<gemma::assembly::LBUTopoDef<2>, gemma::assembly::LBUTopoDef<2>, gemma::assembly::ScratchElementVertexViewSingleElement<gemma::codeStructures::ElementTraits<(gemma::codeStructures::DefinedTopology)0, 1> >, gemma::assembly::ScratchElementVertexViewSingleElement<gemma::codeStructures::ElementTraits<(gemma::codeStructures::DefinedTopology)0, 1> >, gemma::assembly::LBUSourceIntegrand<gemma::assembly::LBUSourceIntegrandArgs<gemma::assembly::LBUTopoDef<2>, gemma::assembly::LBUTopoDef<2>, gemma::assembly::ScratchElementVertexViewSingleElement<gemma::codeStructures::ElementTraits<(gemma::codeStructures::DefinedTopology)0, 1> >, gemma::assembly::ScratchElementVertexViewSingleElement<gemma::codeStructures::ElementTraits<(gemma::codeStructures::DefinedTopology)0, 1> > >, (gemma::codeStructures::DefinedTopology)0> >, true>, gemma::assembly::SystemMatrixIncrementer<Kokkos::View<Kokkos::complex<double>**, Kokkos::LayoutLeft, Kokkos::HostSpace> >, false>, void>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, Kokkos::TeamPolicy<Kokkos::OpenMP> const&, gemma::assembly::LBUComputeMatrixFunctor<gemma::assembly::LBUTestIntegrand<gemma::assembly::LBUTestArgs<gemma::assembly::LBUTopoDef<2>, gemma::assembly::LBUTopoDef<2>, gemma::assembly::ScratchElementVertexViewSingleElement<gemma::codeStructures::ElementTraits<(gemma::codeStructures::DefinedTopology)0, 1> >, gemma::assembly::ScratchElementVertexViewSingleElement<gemma::codeStructures::ElementTraits<(gemma::codeStructures::DefinedTopology)0, 1> >, gemma::assembly::LBUSourceIntegrand<gemma::assembly::LBUSourceIntegrandArgs<gemma::assembly::LBUTopoDef<2>, gemma::assembly::LBUTopoDef<2>, 
gemma::assembly::ScratchElementVertexViewSingleElement<gemma::codeStructures::ElementTraits<(gemma::codeStructures::DefinedTopology)0, 1> >, gemma::assembly::ScratchElementVertexViewSingleElement<gemma::codeStructures::ElementTraits<(gemma::codeStructures::DefinedTopology)0, 1> > >, (gemma::codeStructures::DefinedTopology)0> >, true>, gemma::assembly::SystemMatrixIncrementer<Kokkos::View<Kokkos::complex<double>**, Kokkos::LayoutLeft, Kokkos::HostSpace> >, false> const&)+0x89)[0x7f993f69881c]
[cee-build032:1412318] [11] /scratch/glhenni/gemma/build/gemma.gnu.opt/src/libgemmalibrary.so(void gemma::assembly::fillMatrixInteractionBlock<gemma::assembly::LBUTestIntegrand<gemma::assembly::LBUTestArgs<gemma::assembly::LBUTopoDef<2>, gemma::assembly::LBUTopoDef<2>, gemma::assembly::ScratchElementVertexViewSingleElement<gemma::codeStructures::ElementTraits<(gemma::codeStructures::DefinedTopology)0, 1> >, gemma::assembly::ScratchElementVertexViewSingleElement<gemma::codeStructures::ElementTraits<(gemma::codeStructures::DefinedTopology)0, 1> >, gemma::assembly::LBUSourceIntegrand<gemma::assembly::LBUSourceIntegrandArgs<gemma::assembly::LBUTopoDef<2>, gemma::assembly::LBUTopoDef<2>, gemma::assembly::ScratchElementVertexViewSingleElement<gemma::codeStructures::ElementTraits<(gemma::codeStructures::DefinedTopology)0, 1> >, gemma::assembly::ScratchElementVertexViewSingleElement<gemma::codeStructures::ElementTraits<(gemma::codeStructures::DefinedTopology)0, 1> > >, (gemma::codeStructures::DefinedTopology)0> >, true>, gemma::assembly::SystemMatrixIncrementer<Kokkos::View<Kokkos::complex<double>**, Kokkos::LayoutLeft, Kokkos::HostSpace> > >(gemma::FrequencyIndependentProblemData const&, Kokkos::View<gemma::misc::PropertyConstants*, Kokkos::HostSpace> const&, gemma::assembly::MatrixFillComputation const&, Kokkos::pair<long long, long long>, Kokkos::pair<long long, long long>, bool, gemma::assembly::SystemMatrixIncrementer<Kokkos::View<Kokkos::complex<double>**, Kokkos::LayoutLeft, Kokkos::HostSpace> >&, Kokkos::OpenMP)+0x252)[0x7f993f698076]
[cee-build032:1412318] [12] /scratch/glhenni/gemma/build/gemma.gnu.opt/src/libgemmalibrary.so(auto gemma::assembly::fillMatrixSharedMemByUnknowns<2, 2>(gemma::FrequencyIndependentProblemData const&, double const&, gemma::codeStructures::RunOptions const&)::{lambda(auto:1&)#1}::operator()<Kokkos::View<Kokkos::complex<double>**, Kokkos::LayoutLeft, Kokkos::HostSpace> >(Kokkos::View<Kokkos::complex<double>**, Kokkos::LayoutLeft, Kokkos::HostSpace>&) const+0xd6)[0x7f993f697b4a]
[cee-build032:1412318] [13] /scratch/glhenni/gemma/build/gemma.gnu.opt/src/libgemmalibrary.so(void std::__invoke_impl<void, gemma::assembly::fillMatrixSharedMemByUnknowns<2, 2>(gemma::FrequencyIndependentProblemData const&, double const&, gemma::codeStructures::RunOptions const&)::{lambda(auto:1&)#1}, Kokkos::View<Kokkos::complex<double>**, Kokkos::LayoutLeft, Kokkos::HostSpace>&>(std::__invoke_other, gemma::assembly::fillMatrixSharedMemByUnknowns<2, 2>(gemma::FrequencyIndependentProblemData const&, double const&, gemma::codeStructures::RunOptions const&)::{lambda(auto:1&)#1}&&, Kokkos::View<Kokkos::complex<double>**, Kokkos::LayoutLeft, Kokkos::HostSpace>&)+0x37)[0x7f993f698c25]
[cee-build032:1412318] [14] /scratch/glhenni/gemma/build/gemma.gnu.opt/src/libgemmalibrary.so(std::__invoke_result<gemma::assembly::fillMatrixSharedMemByUnknowns<2, 2>(gemma::FrequencyIndependentProblemData const&, double const&, gemma::codeStructures::RunOptions const&)::{lambda(auto:1&)#1}, Kokkos::View<Kokkos::complex<double>**, Kokkos::LayoutLeft, Kokkos::HostSpace>&>::type std::__invoke<gemma::assembly::fillMatrixSharedMemByUnknowns<2, 2>(gemma::FrequencyIndependentProblemData const&, double const&, gemma::codeStructures::RunOptions const&)::{lambda(auto:1&)#1}, Kokkos::View<Kokkos::complex<double>**, Kokkos::LayoutLeft, Kokkos::HostSpace>&>(gemma::assembly::fillMatrixSharedMemByUnknowns<2, 2>(gemma::FrequencyIndependentProblemData const&, double const&, gemma::codeStructures::RunOptions const&)::{lambda(auto:1&)#1}&&, Kokkos::View<Kokkos::complex<double>**, Kokkos::LayoutLeft, Kokkos::HostSpace>&)+0x37)[0x7f993f6988de]
[cee-build032:1412318] [15] /scratch/glhenni/gemma/build/gemma.gnu.opt/src/libgemmalibrary.so(std::__detail::__variant::__gen_vtable_impl<std::__detail::__variant::_Multi_array<std::__detail::__variant::__deduce_visit_result<void> (*)(gemma::assembly::fillMatrixSharedMemByUnknowns<2, 2>(gemma::FrequencyIndependentProblemData const&, double const&, gemma::codeStructures::RunOptions const&)::{lambda(auto:1&)#1}&&, std::variant<Kokkos::View<Kokkos::complex<double>**, Kokkos::LayoutLeft, Kokkos::HostSpace> >&)>, std::integer_sequence<unsigned long, 0ul> >::__visit_invoke(gemma::assembly::fillMatrixSharedMemByUnknowns<2, 2>(gemma::FrequencyIndependentProblemData const&, double const&, gemma::codeStructures::RunOptions const&)::{lambda(auto:1&)#1}&&, std::variant<Kokkos::View<Kokkos::complex<double>**, Kokkos::LayoutLeft, Kokkos::HostSpace> >&)+0x3f)[0x7f993f698240]
[cee-build032:1412318] [16] /scratch/glhenni/gemma/build/gemma.gnu.opt/src/libgemmalibrary.so(decltype(auto) std::__do_visit<std::__detail::__variant::__deduce_visit_result<void>, gemma::assembly::fillMatrixSharedMemByUnknowns<2, 2>(gemma::FrequencyIndependentProblemData const&, double const&, gemma::codeStructures::RunOptions const&)::{lambda(auto:1&)#1}, std::variant<Kokkos::View<Kokkos::complex<double>**, Kokkos::LayoutLeft, Kokkos::HostSpace> >&>(gemma::assembly::fillMatrixSharedMemByUnknowns<2, 2>(gemma::FrequencyIndependentProblemData const&, double const&, gemma::codeStructures::RunOptions const&)::{lambda(auto:1&)#1}&&, std::variant<Kokkos::View<Kokkos::complex<double>**, Kokkos::LayoutLeft, Kokkos::HostSpace> >&)+0x74)[0x7f993f6982bb]
[cee-build032:1412318] [17] /scratch/glhenni/gemma/build/gemma.gnu.opt/src/libgemmalibrary.so(std::invoke_result<gemma::assembly::fillMatrixSharedMemByUnknowns<2, 2>(gemma::FrequencyIndependentProblemData const&, double const&, gemma::codeStructures::RunOptions const&)::{lambda(auto:1&)#1}, std::__conditional<is_lvalue_reference_v<std::variant<Kokkos::View<Kokkos::complex<double>**, Kokkos::LayoutLeft, Kokkos::HostSpace> >&> >::type<std::variant_alternative<0ul, std::remove_reference<decltype (__as((declval<std::variant<Kokkos::View<Kokkos::complex<double>**, Kokkos::LayoutLeft, Kokkos::HostSpace> >&>)()))>::type>::type&, std::variant_alternative<0ul, std::remove_reference<decltype (__as((declval<std::variant<Kokkos::View<Kokkos::complex<double>**, Kokkos::LayoutLeft, Kokkos::HostSpace> >&>)()))>::type>::type&&> >::type std::visit<gemma::assembly::fillMatrixSharedMemByUnknowns<2, 2>(gemma::FrequencyIndependentProblemData const&, double const&, gemma::codeStructures::RunOptions const&)::{lambda(auto:1&)#1}, std::variant<Kokkos::View<Kokkos::complex<double>**, Kokkos::LayoutLeft, Kokkos::HostSpace> >&>(gemma::assembly::fillMatrixSharedMemByUnknowns<2, 2>(gemma::FrequencyIndependentProblemData const&, double const&, gemma::codeStructures::RunOptions const&)::{lambda(auto:1&)#1}&&, std::variant<Kokkos::View<Kokkos::complex<double>**, Kokkos::LayoutLeft, Kokkos::HostSpace> >&)+0x59)[0x7f993f69831c]
[cee-build032:1412318] [18] /scratch/glhenni/gemma/build/gemma.gnu.opt/src/libgemmalibrary.so(void gemma::assembly::fillMatrixSharedMemByUnknowns<2, 2>(gemma::FrequencyIndependentProblemData const&, double const&, gemma::codeStructures::RunOptions const&)+0x2c7)[0x7f993f69790e]
[cee-build032:1412318] [19] /scratch/glhenni/gemma/build/gemma.gnu.opt/src/libgemmalibrary.so(void gemma::assembly::computeSourceInteractions<2, (gemma::assembly::FILL_TOPO_TYPE)0, (gemma::assembly::FILL_TOPO_TYPE)1, (gemma::assembly::FILL_TOPO_TYPE)2, (gemma::assembly::FILL_TOPO_TYPE)3>(gemma::FrequencyIndependentProblemData const&, double, gemma::codeStructures::RunOptions const&, std::integer_sequence<gemma::assembly::FILL_TOPO_TYPE, (gemma::assembly::FILL_TOPO_TYPE)0, (gemma::assembly::FILL_TOPO_TYPE)1, (gemma::assembly::FILL_TOPO_TYPE)2, (gemma::assembly::FILL_TOPO_TYPE)3>)+0x6a)[0x7f993f7136c2]
[cee-build032:1412318] [20] /scratch/glhenni/gemma/build/gemma.gnu.opt/src/libgemmalibrary.so(void gemma::assembly::computeMatrixInteractions<(gemma::assembly::FILL_TOPO_TYPE)0, (gemma::assembly::FILL_TOPO_TYPE)1, (gemma::assembly::FILL_TOPO_TYPE)2, (gemma::assembly::FILL_TOPO_TYPE)3, (gemma::assembly::FILL_TOPO_TYPE)4>(gemma::FrequencyIndependentProblemData const&, double, gemma::codeStructures::RunOptions const&, std::integer_sequence<gemma::assembly::FILL_TOPO_TYPE, (gemma::assembly::FILL_TOPO_TYPE)0, (gemma::assembly::FILL_TOPO_TYPE)1, (gemma::assembly::FILL_TOPO_TYPE)2, (gemma::assembly::FILL_TOPO_TYPE)3, (gemma::assembly::FILL_TOPO_TYPE)4>)+0x69)[0x7f993f71339d]
[cee-build032:1412318] [21] /scratch/glhenni/gemma/build/gemma.gnu.opt/src/libgemmalibrary.so(gemma::assembly::fillSystemMatrixAndRightHandSide(gemma::FrequencyIndependentProblemData const&, double const&, Kokkos::View<gemma::source::FieldExcitation const*, Kokkos::HostSpace> const&, Kokkos::View<Kokkos::complex<double>**, Kokkos::LayoutRight, Kokkos::HostSpace> const&, gemma::codeStructures::RunOptions const&)+0x3b)[0x7f993f70f90a]
[cee-build032:1412318] [22] /scratch/glhenni/gemma/build/gemma.gnu.opt/src/libgemmalibrary.so(gemma::MoMLoop::fillAndSolveForFrequency(gemma::ProblemData const&, gemma::source::Frequency const&, Kokkos::View<Kokkos::complex<double>**, Kokkos::LayoutRight, Kokkos::HostSpace> const&, Kokkos::View<Kokkos::complex<double>**, Kokkos::LayoutRight, Kokkos::HostSpace> const&, Kokkos::View<Kokkos::complex<double>**, Kokkos::LayoutRight, Kokkos::HostSpace> const&, gemma::codeStructures::RunOptions const&, bool, std::optional<gemma::linearAlgebra::SPCASolverInfo>&, bool)+0x3a4)[0x7f993fa17dfc]
[cee-build032:1412318] [23] /scratch/glhenni/gemma/build/gemma.gnu.opt/src/libgemmalibrary.so(gemma::MoMLoop::FrequencyIterator::computeOrReadSolution(Kokkos::pair<int, int> const&)+0x9dc)[0x7f993f9f1994]
[cee-build032:1412318] [24] /scratch/glhenni/gemma/build/gemma.gnu.opt/src/libgemmalibrary.so(gemma::MoMLoop::FrequencyListIterator::solveForAllFrequenciesAndExcitations()+0x100)[0x7f993f9f865e]
[cee-build032:1412318] [25] /scratch/glhenni/gemma/build/gemma.gnu.opt/src/libgemmalibrary.so(gemma::solveMomentMethodProblem(Teuchos::RCP<Teuchos::StackedTimer>&, Teuchos::RCP<Teuchos::Comm<int> const>, gemma::codeStructures::RunOptions)+0x511)[0x7f993fbe6f48]
[cee-build032:1412318] [26] /scratch/glhenni/gemma/build/gemma.gnu.opt/src/libgemmalibrary.so(gemma::selectSolverAndRun(Teuchos::RCP<Teuchos::StackedTimer>&, Teuchos::RCP<Teuchos::Comm<int> const>, gemma::codeStructures::RunOptions)+0x35a)[0x7f993fba3e57]
[cee-build032:1412318] [27] /scratch/glhenni/gemma/build/gemma.gnu.opt/Debug/gemma[0x418e75]
[cee-build032:1412318] [28] /lib64/libc.so.6(__libc_start_main+0xe5)[0x7f992aa8cd85]
[cee-build032:1412318] [29] /scratch/glhenni/gemma/build/gemma.gnu.opt/Debug/gemma[0x4188ce]
[cee-build032:1412318] *** End of error message ***
@spdomin echoing @alanw0, the culprit may be a View construction called within a parallel_for (not a view-of-views), which triggers an allocation in a parallel region and can deadlock. If the function in the code snippet above is called within a parallel region, that could be the issue. The snippet shows assignment of a newly constructed View to an existing View (a View of Views would look something like `Kokkos::View< Kokkos::View<T>* > v_of_v("v_of_v", N);`).
It looks like `rhs_` and `lhs_` must already have been constructed. Is it possible, when each is initially allocated, to do so with a large enough anticipated size that you can then create subviews to assign to `rhs_` and `lhs_` (of sizes e.g. `rhs.extent(0)` and `lhs.extent(0)`, `lhs.extent(1)` respectively), rather than assigning a newly constructed View?
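The pre-allocation idea above can be sketched without Kokkos. This is only an illustration under stated assumptions, not Nalu or Kokkos code: `Workspace`, `Window`, and `copyInto` are hypothetical names, and `std::vector` stands in for a `Kokkos::View<double*>` that is sized once on the host. With real Views, the non-owning window would presumably be obtained with `Kokkos::subview`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a View that is allocated once, up front.
// All storage is acquired in the constructor, which is meant to run on the
// host before any parallel region starts.
class Workspace {
 public:
  explicit Workspace(std::size_t capacity) : storage_(capacity, 0.0) {}

  // Stand-in for a subview: a non-owning window over the first n entries.
  struct Window {
    double* data;
    std::size_t extent;
  };

  // Never allocates, so it is the only call needed per element.
  Window window(std::size_t n) {
    assert(n <= storage_.size() && "requested extent exceeds capacity");
    return Window{storage_.data(), n};
  }

  std::size_t capacity() const { return storage_.size(); }

 private:
  std::vector<double> storage_;
};

// Copies n values into the workspace without touching the allocator.
inline Workspace::Window copyInto(Workspace& ws, const double* src,
                                  std::size_t n) {
  Workspace::Window w = ws.window(n);
  for (std::size_t i = 0; i < n; ++i) {
    w.data[i] = src[i];
  }
  return w;
}
```

The trade-off is that the anticipated maximum size has to be known, or at least bounded, before the parallel region is entered.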
@ndellingwood, @alanw0 and I will look into the fix... I think we were being lax within the unit test matrix assembly procedure and should be able to resolve this quickly. Thank you for the v_of_v example - it helped my understanding.
Again, apologies for doubling up on this ticket with the Nalu-specific issue. Best of luck with the GEMMA fix. I will certainly keep track to learn more about how others are using Kokkos in apps.
@spdomin let me know how it goes, either on the ticket or offline. In case it's useful, another thought that came to mind: you could decouple the `sumInto` routine by moving the rhs and lhs View allocation steps into a separate routine called prior to `sumInto`, for example (pseudo-code):
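void sizeCheck(const sierra::nalu::SharedMemView<const double*> & rhs,
               const sierra::nalu::SharedMemView<const double**> & lhs) {
  // Host-only: (re)allocate the member Views when the incoming extents change.
  if (rhs_.extent(0) != rhs.extent(0))
    rhs_ = Kokkos::View<double*>("rhs_",rhs.extent(0));
  if (lhs_.extent(0) != lhs.extent(0) || lhs_.extent(1) != lhs.extent(1))
    lhs_ = Kokkos::View<double**>("lhs_",lhs.extent(0), lhs.extent(1));
}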
Call `sizeCheck` from the host prior to the call of `sumInto`.
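The split between a host-side sizing step and an allocation-free `sumInto` can be sketched in plain C++. This is a rough illustration under stated assumptions rather than the actual Nalu fix: `RecordingAssembler` is a hypothetical class, `std::vector` stands in for `Kokkos::View<double*>`, and `std::atomic` stands in for `Kokkos::atomic_add`. Only the rhs part is shown.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the unit-test helper that records the first
// rhs contribution it receives.
class RecordingAssembler {
 public:
  // Host-only step: the single place where storage is (re)allocated.
  void sizeCheck(std::size_t rhsExtent) {
    if (rhs_.size() != rhsExtent) {
      rhs_.assign(rhsExtent, 0.0);
    }
  }

  // Per-element step: copies and counts, never allocates.
  void sumInto(const double* rhs, std::size_t rhsExtent) {
    assert(rhs_.size() == rhsExtent && "call sizeCheck on the host first");
    // fetch_add returns the previous count, so exactly one caller sees 0.
    if (numSumIntoCalls_.fetch_add(1u) == 0u) {
      for (std::size_t i = 0; i < rhsExtent; ++i) {
        rhs_[i] = rhs[i];
      }
    }
  }

  unsigned numSumIntoCalls() const { return numSumIntoCalls_.load(); }
  const std::vector<double>& rhs() const { return rhs_; }

 private:
  std::vector<double> rhs_;
  std::atomic<unsigned> numSumIntoCalls_{0u};
};
```

One deliberate difference from the snippet earlier in the thread: the first-call check uses the value returned by the atomic increment instead of a separate read of the counter, so two concurrent callers cannot both take the recording branch. Whether that matters for the real test is an assumption.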
@glhenni excellent, thanks for posting the output, this line:
deallocating "[unlabeled]" within parallel region "interaction_block_fill"
points to the parallel_* call where a deallocation of a View is attempted, though the View isn't labeled so it will take a bit of checking. I'll contact you offline to see how best I can try to help more
Bug Report
@crtrott it seems that commit a5eb4d4 is causing our application, GEMMA, to hang in the latter portions of the simulation. All we have to offer for diagnosing the problem so far is the stack trace below, obtained from interrupting the code while in the debugger and run through c++filt. Any suggestions on how to find the problem?