Abort consoles with more than one pod

ttamimi commented 2 years ago

Currently we are setting a console's Job's restartPolicy to "Never", so if any container within the Pod fails then the Job won't spawn another Pod. We are also setting the Job's backoffLimit to 0, so if the Pod itself fails for any reason then the Job won't spawn another Pod either. These two settings together prevent a second Pod from being launched when a failure occurs.

However, these settings don't cover situations where a console’s Pod is deleted. This can happen due to manual deletion, Pod eviction, or Pod preemption. There are no Job settings to prevent relaunching a Pod that has disappeared in one of these ways.

Launching a subsequent console Pod beyond the first one is problematic, even if there is only one running Pod at any given time. A subsequent Pod causes the workloads controller to enter its state machine logic in a way that it wasn't designed to handle. It also causes the console to remain in a running state for far longer than users expect.

With this change, the workloads controller stops a console and deletes its resources if it detects that more than one Pod belongs (or belonged) to that console.

theobarberbany commented 2 years ago

This doesn't appear to work for eviction :(

edit: it takes ~1-2m for the new pod to be removed after eviction of the old one, this will be waiting for the reconcile loop to be triggered. For non-interactive commands we partially re-run before being deleted.

ttamimi commented 2 years ago

Yes spot on — because Kubernetes controllers are independent, there is no guaranteed way to prevent the job controller from spawning a replacement pod when the initial pod is evicted or preempted. The only solution is for the engineering teams to make their non-interactive consoles idempotent.

As discussed, we should probably re-think our approach of using jobs for consoles. In our code we are making a lot of effort to either prevent or circumvent retries, which kind of defeats the purpose of using a job.

theobarberbany commented 2 years ago

As discussed, we should probably re-think our approach of using jobs for consoles. In our code we are making a lot of effort to either prevent or circumvent retries, which kind of defeats the purpose of using a job.

Yep, it feels like this may require a re-think.

All things considered this will give us what we need for now!

gocardless / theatre

Abort consoles with more than one pod #264