Open pgrete opened 1 month ago
It hangs inside Kokkos::Impl::deallocate inside parthenon::Mesh::~Mesh:
[lines deleted]
frame #34: 0x000000010021affc athenaPK`void Kokkos::Impl::deallocate<Kokkos::HostSpace, Kokkos::Impl::ViewValueFunctor<Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, parthenon::BndInfo, false>>(record_ptr=0x0000600001a5ccf0) at Kokkos_SharedAlloc.hpp:382:18 [opt]
[lines deleted]
frame #48: 0x0000000100267294 athenaPK`parthenon::MeshData<double>::~MeshData() [inlined] parthenon::BvarsCache_t::~BvarsCache_t(this=<unavailable>) at bnd_info.hpp:190:8 [opt]
[lines deleted]
frame #65: 0x000000010036abc8 athenaPK`parthenon::Mesh::~Mesh(this=0x000000012281d400) at mesh.cpp:388:1 [opt]
frame #66: 0x00000001003d435c athenaPK`parthenon::ParthenonManager::ParthenonFinalize() [inlined] std::__1::default_delete<parthenon::Mesh>::operator()[abi:v160006](this=<unavailable>, __ptr=<unavailable>) const at unique_ptr.h:65:5 [opt]
frame #67: 0x00000001003d4358 athenaPK`parthenon::ParthenonManager::ParthenonFinalize() [inlined] std::__1::unique_ptr<parthenon::Mesh, std::__1::default_delete<parthenon::Mesh>>::reset[abi:v160006](this=<unavailable>, __p=0x0000000000000000) at unique_ptr.h:297:7 [opt]
frame #68: 0x00000001003d434c athenaPK`parthenon::ParthenonManager::ParthenonFinalize(this=<unavailable>) at parthenon_manager.cpp:232:9 [opt]
frame #69: 0x0000000100002510 athenaPK`main(argc=<unavailable>, argv=<unavailable>) at main.cpp:127:8 [opt]
frame #70: 0x00000001935bc274 dyld`start + 2840
Full backtrace from lldb: backtrace.txt
Appears to be a Kokkos regression introduced in Kokkos 4.4.0 (also present in Kokkos 4.4.01). If I swap out the current Kokkos submodule for Kokkos 4.3.01, it finalizes successfully.
@pgrete Maybe we can revert to Kokkos 4.3.01?
I suspect that this is not a Kokkos regression but something on our end. Any idea, @lroberts36 (as it seems to point to the buffer cache)?
So before changing/downgrading the Kokkos version, I'd like to spend a little time checking whether this can be fixed easily in Parthenon itself.
I just asked on the Kokkos Slack where we should look first.
Slide 47 in https://github.com/kokkos/kokkos-tutorials/blob/main/Other/ReleaseBriefings/release-44.pdf: "Otherwise, you program may hang when you upgrade to 4.4" <- Does that sound familiar?
So it's very likely on us. We were also pointed towards https://kokkos.org/kokkos-core-wiki/ProgrammingGuide/View.html#can-i-make-a-view-of-views and https://github.com/kokkos/kokkos/pull/7229 and https://github.com/kokkos/kokkos-tools/pull/267
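For context, the View-of-Views section of the programming guide linked above describes, as far as I read it, allocating the outer View without initialization (`Kokkos::view_alloc(label, Kokkos::WithoutInitializing)`), constructing each inner View with placement new on the host, and destroying the inner Views by hand outside any parallel region before the outer View goes away. The underlying C++ mechanism can be sketched without Kokkos. Everything named below (`Inner`, `Outer`, `g_live_inner`, `lifetime_demo`) is hypothetical and only stands in for the real types; it is not the Parthenon fix from #1191 and not the Kokkos API.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <vector>

// Counts live Inner objects so the demo can check that every element
// is constructed and destroyed exactly once.
static int g_live_inner = 0;

// Hypothetical stand-in for an inner View: an object that owns a buffer.
struct Inner {
  std::vector<double> data;
  explicit Inner(std::size_t n) : data(n, 0.0) { ++g_live_inner; }
  Inner(const Inner&) = delete;
  Inner& operator=(const Inner&) = delete;
  ~Inner() { --g_live_inner; }
};

// Hypothetical stand-in for the outer View: raw, uninitialized storage whose
// elements are created and destroyed one by one on the calling thread.
class Outer {
 public:
  Outer(std::size_t count, std::size_t inner_size)
      : count_(count),
        elems_(static_cast<Inner*>(::operator new(count * sizeof(Inner)))) {
    // Placement new: construct each element in place, sequentially.
    // (Sketch only: no cleanup if a constructor throws.)
    for (std::size_t k = 0; k < count_; ++k) {
      new (elems_ + k) Inner(inner_size);
    }
  }
  Outer(const Outer&) = delete;
  Outer& operator=(const Outer&) = delete;
  ~Outer() {
    // Destroy the elements explicitly, then release the raw storage.
    for (std::size_t k = count_; k-- > 0;) {
      elems_[k].~Inner();
    }
    ::operator delete(elems_);
  }
  std::size_t size() const { return count_; }
  Inner& operator[](std::size_t k) { return elems_[k]; }

 private:
  std::size_t count_;
  Inner* elems_;
};

// Builds an Outer, checks the element count, and returns the number of
// Inner objects still alive after the Outer has been destroyed.
int lifetime_demo() {
  {
    Outer outer(3, 4);
    assert(g_live_inner == 3);
    assert(outer.size() == 3 && outer[1].data.size() == 4);
  }
  return g_live_inner;
}
```

Calling `lifetime_demo()` from a `main` should leave no `Inner` alive once `Outer` is gone. The point of the pattern is that element teardown is done explicitly by the owner, rather than left to whatever the container does when it is deallocated.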
I won't be able to look at this today. We might be able to coordinate fixing this as part of the hackathon next week (as we're touching stuff around the buffers anyway).
Actually, I just tried it (the view of view debug tool is quite handy -- thanks @dalg24) and pushed a fix to #1191 (https://github.com/parthenon-hpc-lab/parthenon/pull/1191/commits/b4ab05f5f3c059e6a9fa208a3101c6e47bd0d3aa). Let's see what the pipelines say (at least downstream it seemed to work, but I might have missed a view of view).
Observed by @BenWibking on Stampede3 and on a Mac and by myself on a Linux workstation.
Sims run fine and then hang after printing
The last output also seems to have been written completely.