The cloud runner is sometimes restarted unexpectedly by K8s. Prior to this PR, this was handled by starting an entirely new pipeline run, with a new id, that resumed where the old one left off. This PR changes the behavior so that the runner instead re-creates its internal state and continues the same run from where it left off. Unlike the previous behavior, this should be transparent to end-users.
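As a rough illustration of the resume-on-restart idea, the runner can persist a small checkpoint as it progresses and, on startup, check for one before beginning fresh. This is a minimal sketch only; `STATE_FILE`, `save_state`, and `load_state` are hypothetical names, not the actual implementation.

```python
import json
import os
import tempfile

# Hypothetical checkpoint location; the real runner would use durable
# storage that survives pod eviction (e.g. a volume or object store).
STATE_FILE = os.path.join(tempfile.gettempdir(), "runner_state.json")

def save_state(run_id: str, completed_steps: list) -> None:
    # Persist just enough state to re-create the runner after a restart.
    with open(STATE_FILE, "w") as f:
        json.dump({"run_id": run_id, "completed": completed_steps}, f)

def load_state():
    # On startup, check whether we are resuming a prior run.
    # Returns None when this is a fresh run.
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)
    return None
```

Crucially, the restored state keeps the original run id, which is what makes the restart invisible to end-users.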
Testing
Disabled the signal handlers in the `StateMachineRunner` so it wouldn't interpret `kubectl delete pod ...` as a cancellation, then performed the following tests, using `kubectl delete pod` on the runner pod to emulate evictions:
- With an entirely inline graph, interrupted mid-execution.
- With standalone functions, interrupted mid-execution.
- With multiple base images, interrupted mid-execution.
- With an implicit `make_list`, interrupted mid-execution.