oxidecomputer / cio

Rust libraries for APIs needed by our automated CIO.
Apache License 2.0

Long running sagas are cancelled on shutdown #161

Open augustuswm opened 2 years ago

augustuswm commented 2 years ago

When cio is shut down, any long running sagas are marked as cancelled. This means that any deployment will:

  1. Interrupt currently running sagas
  2. Prevent those sagas from completing until next run

This issue may be helped by work on breaking down some of the long running sagas, but ideally there is a way to version these jobs such that post-deployment they can either be resumed (if they are still valid), or cancelled (if they are no longer compatible).
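
A minimal sketch of the versioning idea, not cio's actual saga model: it assumes each saga record stores the version of the definition that started it, and that the deployed binary knows the range of versions it can still execute. `SagaRecord`, `SupportedVersions`, `Recovery`, and `recovery_for` are all hypothetical names introduced for illustration.

```rust
/// What to do with a saga that was found unfinished after a deployment.
#[derive(Debug, PartialEq, Eq)]
pub enum Recovery {
    /// The saga is still compatible with the deployed code and can continue.
    Resume,
    /// The saga was started by an incompatible definition and must be cancelled.
    Cancel,
}

/// Hypothetical persisted record for a saga run.
#[derive(Debug, Clone)]
pub struct SagaRecord {
    pub name: String,
    /// Version of the saga definition that started this run (assumed field).
    pub definition_version: u32,
}

/// Hypothetical range of definition versions the running binary can execute.
#[derive(Debug, Clone, Copy)]
pub struct SupportedVersions {
    pub oldest: u32,
    pub current: u32,
}

/// Resume a saga only if its definition version falls inside the supported
/// range; anything older or newer than that range is cancelled.
pub fn recovery_for(record: &SagaRecord, supported: SupportedVersions) -> Recovery {
    if (supported.oldest..=supported.current).contains(&record.definition_version) {
        Recovery::Resume
    } else {
        Recovery::Cancel
    }
}
```

Keeping a range rather than a single version would let a deployment that does not touch a saga's steps leave in-flight runs resumable, while a breaking change only needs to raise `oldest`.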

augustuswm commented 2 years ago

Related: do_cleanup does not always finish in the time allotted between SIGTERM and SIGKILL. This leaves a number of sagas in an indeterminate state. These should get picked up by the refresh-functions job and marked as timed_out, but this is not currently working.
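
For illustration, a sweep that marks stale sagas as timed out could look roughly like the sketch below. This is not the refresh-functions implementation: the `Saga` struct, the `last_heartbeat` field, and `mark_timed_out` are hypothetical, and it assumes each running saga periodically records a timestamp that the sweep can compare against.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Running,
    Completed,
    Cancelled,
    TimedOut,
}

/// Hypothetical saga row; field names are assumptions, not cio's schema.
#[derive(Debug, Clone)]
pub struct Saga {
    pub id: u64,
    pub status: Status,
    /// Seconds since the Unix epoch when the saga last reported progress.
    pub last_heartbeat: u64,
}

/// Marks every saga that is still `Running` but has been silent for longer
/// than `max_silence` as `TimedOut`, and returns the ids that were changed.
pub fn mark_timed_out(sagas: &mut [Saga], now: u64, max_silence: Duration) -> Vec<u64> {
    let mut changed = Vec::new();
    for saga in sagas.iter_mut() {
        // saturating_sub guards against a heartbeat recorded slightly in the future.
        let silent_for = now.saturating_sub(saga.last_heartbeat);
        if saga.status == Status::Running && silent_for > max_silence.as_secs() {
            saga.status = Status::TimedOut;
            changed.push(saga.id);
        }
    }
    changed
}
```

Because the sweep depends only on stored state, it would still catch sagas whose owning instance was killed before do_cleanup could run.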

There are also clues suggesting that child reexec processes are being sent SIGTERM, but we need additional logging to confirm this.

augustuswm commented 2 years ago

In addition to deploys, scale-up and scale-down events by CloudRun will also cause sagas that are being run by a to-be-terminated instance to be cancelled.