sematic-ai / sematic

An open-source ML pipeline development platform

Have the cleaner cover cases where the runner pod is dead but Sematic's DB doesn't see that #1100

Closed by augray 1 year ago

augray commented 1 year ago

Closes #1088

We have observed that this can happen if, for example, the runner pod was OOM-killed. We don't want the Sematic dashboard to show such runs as still active when they're really not.
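A minimal sketch of the cleaner idea follows. It is not Sematic's actual implementation: the `find_orphaned_pod_names` helper and the way results would be written back to the DB are hypothetical, but the Kubernetes client calls are real. The point is just that a pod which Kubernetes no longer knows about should not leave its job looking active.

```python
# Sketch only: check whether the pods backing "active" jobs still exist in
# Kubernetes, so the cleaner can mark jobs whose pod is gone as defunct.
from typing import List

from kubernetes import client, config
from kubernetes.client.rest import ApiException


def pod_exists(api: client.CoreV1Api, pod_name: str, namespace: str) -> bool:
    """Return True if the pod is still known to the Kubernetes API server."""
    try:
        api.read_namespaced_pod(name=pod_name, namespace=namespace)
        return True
    except ApiException as e:
        if e.status == 404:
            return False
        raise


def find_orphaned_pod_names(pod_names: List[str], namespace: str) -> List[str]:
    """Return the pod names Kubernetes no longer knows about.

    The cleaner would then mark the corresponding jobs as no longer active in
    the DB instead of letting the dashboard show them as in progress.
    """
    config.load_incluster_config()  # use load_kube_config() when running locally
    api = client.CoreV1Api()
    return [name for name in pod_names if not pod_exists(api, name, namespace)]
```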

Additionally, the runner jobs weren't getting their statuses updated except when they were created and when they were killed. I added intermediate updates by having the jobs' statuses refreshed whenever the resolution object is saved.
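The sketch below illustrates that piggybacking approach under stated assumptions: the `Resolution`, `Job`, and `save_resolution` names are illustrative stand-ins, not Sematic's actual models or API, and persistence is elided.

```python
# Sketch: refresh job statuses as a side effect of saving the resolution,
# so statuses are written more often than only at job creation and kill time.
import datetime
from dataclasses import dataclass, field
from typing import List


@dataclass
class Job:
    name: str
    latest_status: str = "UNKNOWN"
    last_updated_at: datetime.datetime = field(
        default_factory=datetime.datetime.utcnow
    )


@dataclass
class Resolution:
    root_id: str
    status: str
    jobs: List[Job] = field(default_factory=list)


def save_resolution(resolution: Resolution) -> None:
    """Persist the resolution, refreshing its runner jobs' statuses as we go."""
    for job in resolution.jobs:
        job.latest_status = resolution.status  # or re-query Kubernetes here
        job.last_updated_at = datetime.datetime.utcnow()
    # ... write the resolution and its jobs to the DB ...
```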

Finally, there are cases where the runner job looks like it might just be pending on k8s, and that was being mistaken for the job still being active. To catch such cases, I added logic so that if a job still hasn't started within 24 hours (where "started" means k8s has acknowledged the job, for run jobs, or the runner pod has updated the resolution status at its start, for resolution jobs), the job is considered defunct and no longer active.
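A small sketch of that timeout check, assuming a hypothetical `Job` record with `created_at` and `started_at` fields; the field names and `is_defunct` helper are illustrative, not Sematic's actual model.

```python
# Sketch: a job that has never started within the timeout window is treated
# as defunct rather than "still pending and therefore active".
import datetime
from dataclasses import dataclass
from typing import Optional

JOB_START_TIMEOUT = datetime.timedelta(hours=24)


@dataclass
class Job:
    created_at: datetime.datetime
    started_at: Optional[datetime.datetime] = None  # None => never started


def is_defunct(job: Job, now: Optional[datetime.datetime] = None) -> bool:
    """Return True if the job never started within JOB_START_TIMEOUT."""
    now = now or datetime.datetime.utcnow()
    return job.started_at is None and now - job.created_at > JOB_START_TIMEOUT
```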

Testing

Hacked the runner code so it wouldn't respond to signals by calling the cancellation API, and hacked the job creation timeout down from 24 hours to 10 minutes. Then:

Also deployed to dev1, and it cleaned up all the garbage we had there except for stuff from the LocalRunner that hadn't been marked terminal yet. A separate strategy will be needed to address defunct local runners; that's out of scope for this PR.