parthenon-hpc-lab / parthenon

Parthenon AMR infrastructure
https://parthenon-hpc-lab.github.io/parthenon/

Parthenon hangs at the end of simulation #1193

Open pgrete opened 1 month ago

pgrete commented 1 month ago

Observed by @BenWibking on Stampede3 and on a Mac and by myself on a Linux workstation.

Simulations run fine and then hang after printing:

Driver completed.
time=1.50e-01 cycle=35
tlim=1.50e-01 nlim=100000

walltime used = 4.23e+00
zone-cycles/wallsecond = 1.49e+06

The last output also seems to have been written completely.

BenWibking commented 1 month ago

It hangs inside Kokkos::Impl::deallocate inside parthenon::Mesh::~Mesh:

[lines deleted]
    frame #34: 0x000000010021affc athenaPK`void Kokkos::Impl::deallocate<Kokkos::HostSpace, Kokkos::Impl::ViewValueFunctor<Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, parthenon::BndInfo, false>>(record_ptr=0x0000600001a5ccf0) at Kokkos_SharedAlloc.hpp:382:18 [opt]
[lines deleted]
    frame #48: 0x0000000100267294 athenaPK`parthenon::MeshData<double>::~MeshData() [inlined] parthenon::BvarsCache_t::~BvarsCache_t(this=<unavailable>) at bnd_info.hpp:190:8 [opt]
[lines deleted]
    frame #65: 0x000000010036abc8 athenaPK`parthenon::Mesh::~Mesh(this=0x000000012281d400) at mesh.cpp:388:1 [opt]
    frame #66: 0x00000001003d435c athenaPK`parthenon::ParthenonManager::ParthenonFinalize() [inlined] std::__1::default_delete<parthenon::Mesh>::operator()[abi:v160006](this=<unavailable>, __ptr=<unavailable>) const at unique_ptr.h:65:5 [opt]
    frame #67: 0x00000001003d4358 athenaPK`parthenon::ParthenonManager::ParthenonFinalize() [inlined] std::__1::unique_ptr<parthenon::Mesh, std::__1::default_delete<parthenon::Mesh>>::reset[abi:v160006](this=<unavailable>, __p=0x0000000000000000) at unique_ptr.h:297:7 [opt]
    frame #68: 0x00000001003d434c athenaPK`parthenon::ParthenonManager::ParthenonFinalize(this=<unavailable>) at parthenon_manager.cpp:232:9 [opt]
    frame #69: 0x0000000100002510 athenaPK`main(argc=<unavailable>, argv=<unavailable>) at main.cpp:127:8 [opt]
    frame #70: 0x00000001935bc274 dyld`start + 2840

Full backtrace from lldb: backtrace.txt

BenWibking commented 1 month ago

Appears to be a Kokkos regression introduced in Kokkos 4.4.0 (also present in Kokkos 4.4.01). If I swap out the current Kokkos submodule for Kokkos 4.3.01, it finalizes successfully.

BenWibking commented 1 month ago

@pgrete Maybe we can revert to Kokkos 4.3.01?

pgrete commented 1 month ago

I suspect that this is not a Kokkos regression but something on our end. Any idea, @lroberts36 (as it seems to point to the buffer cache)?

So before changing/downgrading the Kokkos version, I'd like to spend a little time checking whether this can be fixed easily in Parthenon itself.

pgrete commented 1 month ago

I just asked on the Kokkos Slack where we should look first.

pgrete commented 1 month ago

Slide 47 in https://github.com/kokkos/kokkos-tutorials/blob/main/Other/ReleaseBriefings/release-44.pdf: "Otherwise, you program may hang when you upgrade to 4.4" <- Does that sound familiar?

So it's very likely on us. We were also pointed towards https://kokkos.org/kokkos-core-wiki/ProgrammingGuide/View.html#can-i-make-a-view-of-views and https://github.com/kokkos/kokkos/pull/7229 and https://github.com/kokkos/kokkos-tools/pull/267

I won't be able to look at this today. We might be able to coordinate fixing this as part of the hackathon next week (as we're touching stuff around the buffers anyway).
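
For context, the Programming Guide section linked above describes managing the lifetime of inner views by hand: allocate the outer view without initializing its elements, construct each inner view in place, and destroy each inner view explicitly before the outer view is released. Below is a minimal, Kokkos-free sketch of that lifetime pattern in plain C++. `Inner`, `RawOuter`, `Counts`, and `build_and_tear_down` are hypothetical stand-ins invented for illustration. They are not Parthenon or Kokkos code, and the sketch does not reproduce the hang itself.

```cpp
#include <cstddef>
#include <new>
#include <vector>

// Hypothetical stand-in for an inner view: owns a buffer and counts how many
// instances are currently alive.
struct Inner {
  static int live;
  std::vector<double> data;
  explicit Inner(std::size_t n) : data(n, 0.0) { ++live; }
  Inner(const Inner &) = delete;
  Inner &operator=(const Inner &) = delete;
  ~Inner() { --live; }
};
int Inner::live = 0;

// Hypothetical stand-in for an outer view allocated without initialization:
// raw storage for n Inner objects. No Inner constructor or destructor runs
// here, so element lifetimes are entirely the caller's responsibility.
struct RawOuter {
  std::size_t n;
  Inner *slots;
  explicit RawOuter(std::size_t n_)
      : n(n_), slots(static_cast<Inner *>(::operator new(n_ * sizeof(Inner)))) {}
  RawOuter(const RawOuter &) = delete;
  RawOuter &operator=(const RawOuter &) = delete;
  ~RawOuter() { ::operator delete(slots); }  // releases memory only
};

struct Counts {
  int peak;   // live Inner objects after construction
  int after;  // live Inner objects after the outer storage is gone
};

Counts build_and_tear_down(std::size_t num_outer, std::size_t num_inner) {
  Counts c{0, 0};
  {
    RawOuter outer(num_outer);
    // Construct each element in place, outside of any parallel region.
    for (std::size_t k = 0; k < num_outer; ++k) {
      new (&outer.slots[k]) Inner(num_inner);
    }
    c.peak = Inner::live;
    // Destroy each element explicitly before the outer storage is released.
    for (std::size_t k = 0; k < num_outer; ++k) {
      outer.slots[k].~Inner();
    }
  }  // the outer destructor only frees raw memory; no elements remain
  c.after = Inner::live;
  return c;
}
```

Calling `build_and_tear_down(5, 4)` constructs five elements and leaves none alive afterwards. The point of the pattern is that the outer container's teardown never has to run element destructors itself, which is the step the release briefing warns about for views of views.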

pgrete commented 1 month ago

Actually, I just tried it (the view of view debug tool is quite handy -- thanks @dalg24) and pushed a fix to #1191 (https://github.com/parthenon-hpc-lab/parthenon/pull/1191/commits/b4ab05f5f3c059e6a9fa208a3101c6e47bd0d3aa). Let's see what the pipelines say (at least downstream it seemed to work, but I might have missed a view of view).