sematic-ai / sematic

An open-source ML pipeline development platform
Other
975 stars 59 forks

Stop CloudRunner from calling "cancel" API when it receives sigterm #1117

Closed augray closed 7 months ago

augray commented 7 months ago

Closes #1114

Kubernetes sometimes needs to move pods from one node to another. When it does, it first sends the pod a SIGTERM so that it can stop gracefully, and force-kills the pod if it has not stopped within a grace period. Currently, the cloud runner responds to the SIGTERM by calling the Sematic API to mark itself as canceled. The correct behavior is to simply exit the process and pick up where it left off once the pod is rescheduled. Resuming this way is already supported; the only change needed is to stop calling the "mark canceled" API.
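As a rough illustration of the intended behavior (not the actual CloudRunner code), a SIGTERM handler that only exits could look like the sketch below. The names `make_sigterm_handler` and `install_sigterm_handler` are hypothetical, and the `128 + signum` exit code is just the usual shell convention, assumed here rather than taken from Sematic.

```python
import signal
import sys
from typing import Callable


def make_sigterm_handler(
    exit_fn: Callable[[int], None] = sys.exit,
) -> Callable[[int, object], None]:
    """Build a handler that exits the process and nothing else.

    Hypothetical helper: it deliberately has no access to an API client,
    so it cannot call a "mark canceled" endpoint.
    """

    def handler(signum: int, frame: object) -> None:
        # No cancel call here. The runner is expected to resume from
        # where it left off if the pod is rescheduled.
        exit_fn(128 + int(signum))  # assumed exit code convention

    return handler


def install_sigterm_handler() -> None:
    """Register the exit-only handler for SIGTERM in the current process."""
    signal.signal(signal.SIGTERM, make_sigterm_handler())
```

Passing `exit_fn` in makes the handler easy to exercise in a unit test without terminating the test process.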

You might ask: if we don't mark the runner as canceled when it gets a SIGTERM, could the Sematic metadata show it as still running when the pod is terminated and NOT rescheduled? There are two cases: (a) the pod does not get rescheduled, in which case there is a brief period when the Sematic metadata doesn't reflect the cancellation, but the cleaner should detect the mismatch and mark the runner as canceled the next time it runs; or (b) the pod DOES get rescheduled, in which case it's good that we never marked the runner as canceled.
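The two cases above can be sketched as a small reconciliation function. This is only an assumed model of what the cleaner decides, not Sematic's implementation: `RunnerRecord`, `reconcile`, and the state strings are hypothetical, and the sketch ignores how the real cleaner determines whether a pod still exists or is waiting to be rescheduled.

```python
from dataclasses import dataclass


@dataclass
class RunnerRecord:
    """Hypothetical view of one runner as the cleaner might see it."""

    run_id: str
    metadata_state: str  # assumed values: "RUNNING" or "CANCELED"
    pod_exists: bool     # whether a pod for this runner is present


def reconcile(record: RunnerRecord) -> str:
    """Return the state the metadata should hold after a cleaner pass."""
    # Case (a): metadata says running, but no pod exists any more.
    # The cleaner fixes the mismatch by marking the runner canceled.
    if record.metadata_state == "RUNNING" and not record.pod_exists:
        return "CANCELED"
    # Case (b) and everything else: a pod exists (for example, it was
    # rescheduled), or the metadata is already terminal. Leave it alone.
    return record.metadata_state
```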

Testing