parthenon-hpc-lab / parthenon

Parthenon AMR infrastructure
https://parthenon-hpc-lab.github.io/parthenon/

Catch task failures from threads #1049

Closed bprather closed 2 months ago

bprather commented 3 months ago

Previously, because Tasks in a TaskList run on a worker thread, Parthenon would silently drop any exception they threw. The exception would crash the thread, but execution would continue as if the task had finished normally.

This PR checks the future associated with each task to make sure no exception was thrown. As a bonus, we can add a potentially non-fatal failure return value, TaskStatus::fail, which is propagated up to the Driver.
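For readers unfamiliar with the pattern, here is a minimal standalone sketch of checking futures for task failures. It is not the code in this PR: the local `TaskStatus` enum and the `RunTasks` helper are hypothetical stand-ins, and mapping a thrown exception to `TaskStatus::fail` is an assumption made for illustration. The key point is that `std::future::get()` rethrows whatever the task threw on its worker thread.

```cpp
#include <functional>
#include <future>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for Parthenon's TaskStatus; not the real definition.
enum class TaskStatus { complete, incomplete, fail };

// Hypothetical helper: launch each task asynchronously, then inspect
// every future so that no failure on a worker thread is dropped.
TaskStatus RunTasks(const std::vector<std::function<TaskStatus()>> &tasks) {
  std::vector<std::future<TaskStatus>> futures;
  futures.reserve(tasks.size());
  for (const auto &task : tasks) {
    futures.push_back(std::async(std::launch::async, task));
  }

  TaskStatus overall = TaskStatus::complete;
  for (auto &fut : futures) {
    try {
      // get() blocks until the task finishes and rethrows its exception, if any.
      if (fut.get() == TaskStatus::fail) overall = TaskStatus::fail;
    } catch (...) {
      // Assumption for this sketch: report an exception as a failed task.
      overall = TaskStatus::fail;
    }
  }
  return overall;
}
```

A caller such as a driver loop could then test the returned status and decide whether to abort or recover, rather than continuing as if every task had succeeded.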

This has in the past failed the sparse_advection tests, but that may have been due to incidental changes made over the course of debugging and testing. It now seems to pass all tests reliably, 3-4 times in a row (except for a couple of gmg tests, which also fail on my machine even with threading disabled). Should there still turn out to be a rare race condition, I've extended ThreadVector so that vectors of Tasks or TaskLists can be swapped over to be thread-safe transparently.

PR Checklist