Open augustuswm opened 2 years ago
Relatedly, `do_cleanup` does not always run within the allotted time between `SIGTERM` and `SIGKILL`. This leaves a number of sagas in an indeterminate state. These should get picked up by the `refresh-functions` job and marked as `timed_out`, but this is not currently working.
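
To make the expected sweep behaviour concrete, here is a minimal sketch of what a job like `refresh-functions` could do with sagas left behind by a killed instance. It is not the actual implementation: the `Saga` type, the `last_heartbeat` field, the `running` status, and the `max_silence` threshold are all assumptions made for illustration. Only `timed_out` and `cancelled` are statuses named in this issue.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical status values; only "timed_out" and "cancelled" appear in this issue.
RUNNING = "running"
TIMED_OUT = "timed_out"


@dataclass
class Saga:
    id: str
    status: str
    last_heartbeat: datetime  # assumed field, updated while the saga makes progress


def mark_stale_sagas(sagas, now, max_silence):
    """Mark running sagas with no heartbeat within max_silence as timed_out.

    Returns the ids of the sagas whose status was changed.
    """
    changed = []
    for saga in sagas:
        if saga.status == RUNNING and now - saga.last_heartbeat > max_silence:
            saga.status = TIMED_OUT
            changed.append(saga.id)
    return changed
```

Returning the changed ids gives the job something to log, which would help show whether the sweep ran at all or ran and matched nothing.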
There are also clues suggesting that child `reexec` processes are being sent `SIGTERM`, but we need additional logging to confirm this.
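
As a rough idea of the logging that could confirm this, the sketch below installs a `SIGTERM` handler that records the pid and parent pid of whichever process received the signal. The logger name, the `received` list, and both function names are hypothetical. It only illustrates the shape of the logging, not how `cio` or its child processes are actually wired up.

```python
import logging
import os
import signal

logger = logging.getLogger("saga.signals")  # hypothetical logger name

# (signal name, pid, parent pid) tuples, kept so the handler's effect can be inspected.
received = []


def log_termination(signum, frame):
    """Record which process saw the signal; shutdown itself is left to the caller."""
    name = signal.Signals(signum).name
    received.append((name, os.getpid(), os.getppid()))
    logger.warning("received %s pid=%d ppid=%d", name, os.getpid(), os.getppid())


def install_termination_logging():
    # Call once at startup of every process, including re-executed children.
    # signal.signal must be called from the main thread.
    signal.signal(signal.SIGTERM, log_termination)
```

Logging the parent pid alongside the pid makes it possible to tell from the logs alone whether the signal reached a child or only the parent.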
In addition to deploys, scale-up and scale-down events by CloudRun will also result in sagas that are being run by a to-be-terminated instance being cancelled.
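
Whatever triggers the termination, the instance has a fixed window in which to clean up. One option is to have the cleanup budget its own time and report what it could not reach, rather than being cut off partway through. The sketch below assumes a hypothetical `cancel` callback and takes the grace period as a parameter, since the actual window depends on the platform configuration.

```python
import time


def cleanup_with_deadline(sagas, cancel, grace_seconds,
                          reserve_seconds=1.0, clock=time.monotonic):
    """Cancel as many sagas as fit in the grace window.

    Stops starting new cancellations once the deadline (grace period minus a
    reserve) has passed, and returns the sagas that were not reached so they
    can be logged. A single cancel call that blocks past the deadline is not
    interrupted by this function.
    """
    deadline = clock() + grace_seconds - reserve_seconds
    pending = list(sagas)
    while pending and clock() < deadline:
        cancel(pending.pop(0))
    return pending
```

The returned leftovers are exactly the sagas that would otherwise end up in an indeterminate state, so logging them before exit gives a later sweep a list to reconcile against.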
When `cio` is shut down, any long-running sagas are marked as cancelled. This means that any deployment will:

This issue may be helped by work on breaking down some of the long-running sagas, but ideally there is a way to version these jobs such that post-deployment they can either be resumed (if they are still valid) or cancelled (if they are no longer compatible).
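
One possible shape for the versioning idea is sketched below: each saga record carries the version of the job definition that started it, and the new deployment declares which versions it can still resume. Every name here (`SagaRecord`, `schema_version`, `compatible_versions`, `decide_after_deploy`) is hypothetical, and the sketch ignores how a saga's state would actually be persisted and reloaded.

```python
from dataclasses import dataclass

RESUME = "resume"
CANCEL = "cancel"


@dataclass(frozen=True)
class SagaRecord:
    id: str
    kind: str            # hypothetical job name
    schema_version: int  # version of the job definition that started the saga


def decide_after_deploy(record, compatible_versions):
    """Return RESUME if the running code can continue this saga, else CANCEL.

    compatible_versions maps a job kind to the set of versions that the
    current deployment is able to resume. Unknown kinds are cancelled.
    """
    if record.schema_version in compatible_versions.get(record.kind, set()):
        return RESUME
    return CANCEL
```

For example, with `compatible_versions = {"job-a": {2, 3}}`, a `job-a` saga started at version 3 would be resumed, while one started at version 1 would be cancelled. Defaulting unknown kinds to cancel keeps the behaviour no worse than what happens today.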