K8s sometimes needs to move pods from one node to another. When this happens, it first sends the pod a SIGTERM to let it stop gracefully, and force-kills it if it doesn't stop within a grace period. Today the cloud runner calls the Sematic API to mark itself as canceled when it receives this signal, but the correct behavior is to simply exit the process and pick up where it left off once it is rescheduled. The ability to pick up where it left off is already supported; the change needed here is that we should not call the "mark canceled" API.
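To illustrate the intended behavior, here is a minimal sketch of a SIGTERM handler that exits instead of calling the cancel API. The names (`_handle_sigterm`, `mark_run_canceled`) are hypothetical and only stand in for the real Sematic runner internals, which aren't shown in this description:

```python
import signal
import sys

def _handle_sigterm(signum, frame):
    # Old behavior (removed): call the API to mark this run as canceled.
    # mark_run_canceled(run_id)  # hypothetical API call, for illustration only
    #
    # New behavior: just exit. If k8s reschedules the pod, the new runner
    # process resumes from the persisted state; if it doesn't, the cleaner
    # reconciles the metadata on its next pass.
    print("Received SIGTERM; exiting so a rescheduled pod can resume this run.")
    sys.exit(0)

signal.signal(signal.SIGTERM, _handle_sigterm)
```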
You might ask: if we don't mark the runner as canceled when it gets a SIGTERM, aren't we susceptible to the Sematic metadata showing it as still running when the pod is terminated and NOT rescheduled? The answer is that either:
(a) the pod doesn't actually get rescheduled, in which case there is a brief period when the Sematic metadata doesn't reflect the cancellation, but the cleaner should detect the mismatch and mark the runner as canceled the next time it runs (sketched after this list), OR
(b) the pod DOES get rescheduled, in which case it's good that we never marked the runner as canceled.
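To make case (a) concrete, here is a hedged sketch of the kind of reconciliation pass the cleaner performs. The function and field names below are illustrative assumptions, not the actual Sematic cleaner code:

```python
def clean_stale_runs(active_runs, k8s_job_is_alive):
    """Mark runs as canceled when their k8s job no longer exists.

    active_runs: runs the metadata store still considers in progress.
    k8s_job_is_alive: callable(run) -> bool, assumed to query k8s.
    (Both are hypothetical names used only for this sketch.)
    """
    for run in active_runs:
        if not k8s_job_is_alive(run):
            # Case (a): the pod was terminated and never rescheduled, so the
            # metadata is stale. Reconcile it here.
            run.mark_canceled(
                message="Runner job no longer exists; marking run as canceled."
            )
```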
Testing
Launched the testing pipeline with a long sleep time. Manually deleted the runner pod using kubectl (not the job). This simulates a k8s eviction, since k8s sees that the job is still active and schedules a new pod for it. Confirmed that the run picked up where it left off and logged an appropriate message to the runner logs.
Launched the local runner testing pipeline with a long sleep and confirmed that Ctrl+C marked the pipeline as canceled in the Sematic dashboard.
Launched the testing pipeline with a long sleep time. This time, manually deleted the job itself using kubectl. This emulates a scenario where the runner pod was killed outright rather than being evicted for a move. Confirmed that the cleaner eventually marked the run as terminated, with this message in the user console: "Run was still alive despite the resolution being terminated. Run was forced to fail."
Closes #1114