sourcegraph / sourcegraph-public-snapshot

Code AI platform with Code Search & Cody
https://sourcegraph.com
Other
10.1k stars 1.28k forks

executor: Graceful shutdowns for autoscaling #23423

Open eseliger opened 3 years ago

eseliger commented 3 years ago

When the autoscaler wants to remove an instance from the pool, we should make sure it always correctly finishes its currently running tasks first.

efritz commented 3 years ago

Do we know the mechanism by which the autoscaler shuts down the service? I'm assuming a specific signal will be sent, probably followed by a timeout we can't react to (hopefully this can be configured to be our maximum job time).

eseliger commented 3 years ago

This section documents "Preparing to stop", and it suggests using a shutdown script. Shutdown scripts are capped at 90s, though, so I'm not sure how we would actually do that. We could just return the jobs quickly, or let the resetter come by and retry the jobs. In that case we should make sure scaling happens very infrequently, though.

efritz commented 3 years ago

Is it possible to only scale up and then have the executors shut down after a certain amount of time or jobs?

eseliger commented 3 years ago

Yes, that may work. Good idea.

github-actions[bot] commented 2 years ago

Heads up @macraig - the "team/code-intelligence" label was applied to this issue.

eseliger commented 1 year ago

Idea: The Terraform Google executors should be a GCP{Firecracker{Docker}} runtime backed by a backend-based job scheduler, instead of the old pull-based model. That way, we can control the number of VMs on the backend without a Google autoscaling group. The same applies to AWS.