Open eseliger opened 3 years ago
Do we know the mechanism by which the service is shut down by the autoscaler? I'm assuming a specific signal is sent, followed by a timeout if the instance doesn't react (hopefully that timeout can be configured to match our maximum job time).
This section documents "Preparing to stop", and they suggest using a shutdown script. Shutdown scripts are capped to run for 90s max, though, so I'm not sure how we would fit a full drain into that window. We could either return the jobs to the queue quickly, or let the resetter come by and retry the jobs later; in that case we should make sure scale-in happens very infrequently.
Is it possible to only scale up and then have the executors shut down after a certain amount of time or jobs?
Yes that may work. Good idea
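The "only scale up, let executors retire themselves" idea boils down to a small per-instance policy: each executor shuts itself down after a fixed budget of jobs or uptime, so the pool shrinks without the autoscaler ever killing a busy instance. A minimal sketch, where the budget names and values are purely illustrative:

```go
package main

import "fmt"

// shouldExit decides whether an executor should retire itself, given a
// hypothetical per-instance budget of completed jobs and uptime seconds.
// Exiting only between jobs means no running work is ever interrupted.
func shouldExit(jobsDone, maxJobs, uptimeSec, maxUptimeSec int) bool {
	return jobsDone >= maxJobs || uptimeSec >= maxUptimeSec
}

func main() {
	// After 50 jobs the instance retires, even with uptime budget left.
	fmt.Println(shouldExit(50, 50, 100, 3600)) // true
}
```

The executor would evaluate this between jobs and then terminate its own VM, so scale-down never races a running task.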
Heads up @macraig - the "team/code-intelligence" label was applied to this issue.
Idea: the Terraform Google executors should be a GCP{Firecracker{Docker}} runtime backed by a backend-based job scheduler instead of the old pull-based model. That way, we can control the number of VMs from the backend without a Google autoscaling group. The same applies to AWS.
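With scaling decisions moved to the backend, the core of the scheduler is just a function from queue depth to a desired VM count, which the backend then reconciles against the cloud API. A sketch of that decision under assumed parameters (`jobsPerVM` and the clamps are made-up names, not an existing config):

```go
package main

import "fmt"

// desiredVMs is a sketch of the backend-side scaling decision: one VM per
// jobsPerVM queued jobs, clamped to [min, max]. All names are illustrative.
func desiredVMs(queued, jobsPerVM, min, max int) int {
	n := (queued + jobsPerVM - 1) / jobsPerVM // ceil(queued / jobsPerVM)
	if n < min {
		n = min
	}
	if n > max {
		n = max
	}
	return n
}

func main() {
	// 25 queued jobs at 10 jobs per VM, clamped to [1, 8] instances.
	fmt.Println(desiredVMs(25, 10, 1, 8)) // 3
}
```

The backend would run this periodically and create or retire GCP/AWS instances directly, which is what removes the need for a managed autoscaling group.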
When the autoscaler wants to remove an instance from the pool, we should make sure it always correctly finishes its currently running tasks first.
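The finish-before-remove requirement is the classic graceful-drain pattern: on SIGTERM, stop handing out new jobs but let in-flight work run to completion. A minimal sketch (single worker, made-up job IDs; real executors would also need to cap the drain time):

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

// runJobs processes queued jobs one at a time until stop is closed, then
// drains: it takes no new jobs and returns the IDs that completed.
func runJobs(queued []int, stop <-chan struct{}) []int {
	var done []int
	for _, j := range queued {
		select {
		case <-stop:
			return done // drain requested: take no new jobs
		default:
			done = append(done, j) // ... the actual job would run here ...
		}
	}
	return done
}

func main() {
	stop := make(chan struct{})
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM)
	go func() {
		<-sigs // the autoscaler's shutdown signal
		close(stop)
	}()

	fmt.Println("completed:", runJobs([]int{1, 2, 3}, stop))
}
```

The job currently executing when the signal arrives always finishes, which is exactly the guarantee we want from the autoscaler's removal path.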