We have observed that this can happen if, for example, the runner pod OOM'd. We don't want the Sematic dashboard to show such jobs as still active when they're really not.
Additionally, the runner jobs' statuses were only being updated when the jobs were created and when they were killed. I added intermediate updates by refreshing the jobs whenever the resolution object is saved.
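As a rough illustration of that mechanism (the `Job`, `Resolution`, and `save_resolution` names below are simplified stand-ins, not Sematic's actual models), the idea is to piggyback a refresh of the job records onto every resolution save:

```python
# Simplified, self-contained sketch; the models and helper here are
# illustrative stand-ins, not Sematic's actual internals.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List


@dataclass
class Job:
    name: str
    status: str = "pending"
    last_updated_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


@dataclass
class Resolution:
    root_id: str
    status: str
    jobs: List[Job] = field(default_factory=list)


def save_resolution(resolution: Resolution, storage: Dict[str, Resolution]) -> None:
    """Persist the resolution and refresh its jobs' bookkeeping at the same time.

    Previously jobs were only touched at creation and at kill time; updating
    them on every resolution save gives them intermediate status refreshes.
    """
    storage[resolution.root_id] = resolution
    for job in resolution.jobs:
        job.last_updated_at = datetime.now(timezone.utc)
```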
Finally, there are cases where the runner job looked like it might just be pending on k8s, and that was getting mistaken for the job still being active. To catch cases like this, I added logic so that if a job hasn't started within 24 hours (a run job counts as started as soon as k8s acknowledges it; a resolution job as soon as the runner pod updates the resolution status at its start), the job is considered defunct and no longer active.
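A minimal sketch of that timeout check, assuming illustrative names (the function and constant below are not the actual implementation):

```python
# Sketch of the "never started" timeout; names are illustrative only.
from datetime import datetime, timedelta, timezone
from typing import Optional

JOB_START_TIMEOUT = timedelta(hours=24)


def is_job_active(
    has_started: bool,
    created_at: datetime,
    now: Optional[datetime] = None,
) -> bool:
    """Treat a job that never started within the timeout window as defunct.

    "Started" means k8s acknowledged the job (run jobs) or the runner pod
    updated the resolution status at its start (resolution jobs).
    """
    now = now or datetime.now(timezone.utc)
    if has_started:
        # Started jobs are subject to the normal liveness checks instead.
        return True
    return now - created_at < JOB_START_TIMEOUT
```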
Testing
Hacked the runner code so it didn't respond to signals by calling the cancellation API. Hacked the job creation timeout to be 10 minutes rather than 24 hours. Then:
- Had the runner immediately exit without doing anything else. Confirmed that this got detected as garbage and cleaned up.
- Started the testing pipeline with two jobs set to wait for 15 minutes. Once the jobs were actually in progress and at the sleep runs, I killed the runner pod for one. Confirmed that it got cleaned up (but not until AFTER the sleep run had finished its work), while the one whose runner I didn't kill finished successfully.
Also deployed to dev1, and it cleaned up all the garbage we had there except for stuff from the LocalRunner that hadn't been marked terminal yet. A separate strategy will be needed to address defunct local runners; that's out of scope for this PR.
Closes #1088