Closed: rtyley closed this pull request 4 months ago.
On `WaitForStabilization` for culled instances, @jacobwinch comments:
In order for this to work I think it would be desirable to replace the final `WaitForStabilization` task with a new task (`WaitForOldInstancesToTerminate`, or similar).

The current task is checking for an expected number of instances. This is currently an acceptable way to check because Riff-Raff is essentially holding a lock on the desired capacity setting by blocking scaling operations (i.e. we know that the number won't change).

Once we re-enable scaling the desired number of instances becomes a moving target, so I think it would be better for Riff-Raff to check that there are 0 instances with the `Magenta=Terminate` tag still running in the ASG. This would allow us to be sure that all instances are now running the build currently being deployed, which means it is safe/correct to mark the deployment as successful.
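A minimal sketch of the check being described, assuming a toy instance model (the `Instance` case class and `cullComplete` helper here are illustrative, not Riff-Raff's actual API, which queries the ASG via the AWS SDK):

```scala
// Illustrative model only: the real check would inspect live ASG members.
case class Instance(id: String, tags: Map[String, String])

object CullCheck {
  // The cull is complete when zero instances in the ASG still carry the
  // Magenta=Terminate tag, regardless of the ASG's (moving) desired capacity.
  def cullComplete(asgInstances: Seq[Instance]): Boolean =
    !asgInstances.exists(_.tags.get("Magenta").contains("Terminate"))
}
```

The point of keying off the tag rather than the instance count is that the check stays correct even while scaling actions are changing the desired capacity underneath the deploy.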
@jacobwinch, rather than introduce `WaitForOldInstancesToTerminate`, would it be reasonable to update the implementation of `CullInstancesWithTerminationTag` so that it just blocks until all the instances it's been asked to cull actually have terminated?
I think it's probably safe to allow scaling actions to run again as soon as we've asked AWS to terminate the instances running the old build. At this point there is no risk of AWS trying to scale down the 'wrong' instances (i.e. the ones running the new build) and we know it is safe to launch more instances that will run with the newest build (as at least one instance running this build has already passed the health check).
With that in mind, the only downside of the suggested approach is that while we're blocking and waiting for the instances to terminate, we could probably be allowing the app to start scaling again if we were to split the 'waiting work' into a separate task. We're probably talking about a pretty small time window here though, so if it's considerably easier to implement the version you've suggested then I think that would still be a big improvement!
You've convinced me @jacobwinch, so I've added some WIP to this PR to implement a `WaitForCullToComplete` task!

Not quite tidied up yet, but you probably get the idea. Some code from `CullInstancesWithTerminationTag` is factored out into a `CullSummary` class so that `WaitForCullToComplete` can also use it.
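For readers following along, the factored-out logic plausibly has roughly this shape (a hypothetical sketch over a toy instance model; the real `CullSummary` in magenta-lib works against AWS SDK types):

```scala
// Hypothetical sketch of CullSummary, shared by CullInstancesWithTerminationTag
// and WaitForCullToComplete. Names here mirror the PR's description, not
// necessarily the real class's fields.
case class Instance(id: String, tags: Map[String, String])

case class CullSummary(asgInstances: Seq[Instance]) {
  // Instances that have been tagged for termination by the deploy
  val instancesForCull: Seq[Instance] =
    asgInstances.filter(_.tags.get("Magenta").contains("Terminate"))

  // WaitForCullToComplete can poll this until it becomes true
  def isCullComplete: Boolean = instancesForCull.isEmpty
}
```

Sharing one definition of "instance tagged for cull" between the two tasks keeps the culling and the waiting logic from drifting apart.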
I ran one additional test to confirm that automatically scaling up while Riff-Raff is running `WaitForCullToComplete` works successfully.
Magnificent! Great test, and gratifying to see it work correctly! I shall now merge 👍
This aims to address, to an extent, issue https://github.com/guardian/riff-raff/issues/1342 - that in an `autoscaling` deploy, apps cannot auto-scale until the original number of instances in the ASG is capable of successfully handling the level of traffic, as checked by `WaitForStabilization`. On 22nd May 2024, this inability to auto-scale led to a severe outage in the Ophan Tracker, when a large increase in pageview traffic during a deploy meant the original number of instances could never have handled the load.

## Disabling auto-scaling
Ever since https://github.com/guardian/riff-raff/pull/83 in April 2013, Riff Raff has disabled ASG scaling alarms at the start of a deploy (`SuspendAlarmNotifications`), and only re-enabled them at the end of the deploy (`ResumeAlarmNotifications`) once deployment has successfully completed.

In December 2016, with https://github.com/guardian/riff-raff/pull/403, an additional `WaitForStabilization` was added as a penultimate deploy step, which served two purposes:

However, the `WaitForStabilization` step was placed before `ResumeAlarmNotifications` - so the original number of EC2 instances in the ASG must be capable of supporting the latest level of traffic, otherwise `WaitForStabilization` and the entire deploy will fail, leaving auto-scaling indefinitely disabled - potentially waiting hours for human intervention to fix the problem.

## The old-instance cull must complete
By putting `ResumeAlarmNotifications` before `WaitForStabilization`, the Ophan outage would have been shortened from 1 hour to ~2 minutes, but as @jacobwinch points out, making only that change could lead to a race condition:

* as soon as `ResumeAlarmNotifications` executes, a 'low CPU' alarm could immediately decide to reduce the desired size of the ASG, randomly removing new instances
* `WaitForStabilization` could then see that the ASG is at its desired size, even if it happens that the termination of the old servers has not yet completed

Consequently, with this change we are also introducing a new task, `WaitForCullToComplete`, that checks specifically that EC2 instances tagged for termination have been removed from the ASG - more specific than the old assumption with `WaitForStabilization` that the ASG being at desired capacity definitely means all the old instances are gone.

The final sequence of tasks in an `autoscaling` deploy now looks like this:

https://github.com/guardian/riff-raff/blob/0fb840e64ac924ff4b477ccf6d93a80cf0bc56cd/magenta-lib/src/main/scala/magenta/deployment_type/AutoScaling.scala#L199-L215
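As a rough illustration of the ordering the linked code establishes (the task names come from this PR, but the list itself is only a sketch, not the real task types):

```scala
// Sketch of the final task ordering described above - see the linked
// AutoScaling.scala for the authoritative implementation.
val finalTasks: List[String] = List(
  "CullInstancesWithTerminationTag", // ask AWS to terminate old-build instances
  "ResumeAlarmNotifications",        // re-enable auto-scaling as early as is safe
  "WaitForCullToComplete",           // block until no Magenta=Terminate instances remain
  "WaitForStabilization"             // then confirm the ASG is at desired capacity
)
```

The key property is that scaling is re-enabled before the (potentially slow) wait for old instances to terminate, while the deploy is still not marked successful until the cull is verifiably complete.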
Common logic for finding EC2 instances tagged for termination, used by both `CullInstancesWithTerminationTag` & `WaitForCullToComplete`, has been factored out into the new `CullSummary` class.

## Testing
I've deployed this branch to Riff Raff CODE as build 3385, and then used that to deploy the Ophan Dashboard to CODE, successfully! You can see that the new `WaitForCullToComplete` step took place after `ResumeAlarmNotifications`, and the `WaitForStabilization` step completed immediately after that:

cc @guardian/ophan