guardian / riff-raff

The Guardian's deployment platform
Apache License 2.0
266 stars 18 forks source link

`autoscaling` deploy: re-enable ASG scaling before final stabilisation check #1345

Closed rtyley closed 4 months ago

rtyley commented 4 months ago

This aims to address, to an extent, issue https://github.com/guardian/riff-raff/issues/1342 - that in an autoscaling deploy, apps can not auto-scale until the original number of instances in the ASG is capable of successfully handling the level of traffic, as checked by WaitForStabilization. On 22nd May 2024, this inability to auto-scale led to a severe outage in the Ophan Tracker, when a large increase in pageview traffic during a deploy meant the original number of instances could never have handled the load.

image

Disabling auto-scaling

Ever since https://github.com/guardian/riff-raff/pull/83 in April 2013, Riff Raff has disabled ASG scaling alarms at the start of a deploy (SuspendAlarmNotifications), and only re-enabled them at the end of the deploy (ResumeAlarmNotifications) once deployment has successfully completed.

In December 2016, with https://github.com/guardian/riff-raff/pull/403, an additional WaitForStabilization was added as a penultimate deploy step, which served two purposes:

  1. Ensure that the cull of old instances has completed before the deploy ends
  2. Prove that the new boxes can handle their load without the old boxes.

However, the WaitForStabilization step was placed before ResumeAlarmNotifications - so the original number of EC2 instances in the ASG must be capable of supporting the latest level of traffic, otherwise WaitForStabilization and the entire deploy will fail, leaving auto-scaling indefinitely disabled - potentially waiting hours for human intervention to fix the problem.

The old-instance cull must complete

By putting ResumeAlarmNotifications before WaitForStabilization, the Ophan outage would have been shortened from 1 hour to ~2 minutes, but as @jacobwinch points out, making only that change could lead to a race condition:

  1. Once ResumeAlarmNotifications executes, a 'low CPU' alarm could immediately decide to reduce the desired size of the ASG, randomly removing new instances
  2. WaitForStabilization could then see that the ASG is at its desired size, even if it happens that the termination of the old servers has not yet completed
  3. The deploy then reports as 'complete', even though old EC2 instances are potentially still active

Consequently, with this change we are also introducing a new task, WaitForCullToComplete, that checks specifically that EC2 instances tagged for termination have been removed from the ASG - more specific than the old assumption with WaitForStabilization that the ASG being at desired capacity definitely means all the old instances are gone.

The final sequence of tasks in an autoscaling deploy now looks like this:

https://github.com/guardian/riff-raff/blob/0fb840e64ac924ff4b477ccf6d93a80cf0bc56cd/magenta-lib/src/main/scala/magenta/deployment_type/AutoScaling.scala#L199-L215

Common logic for finding EC2 instances tagged for termination, used by both CullInstancesWithTerminationTag & WaitForCullToComplete, has been factored out into the new CullSummary class.

Testing

I've deployed this branch to Riff Raff CODE as build 3385, and then used that to deploy the Ophan Dashboard to CODE, successfully! You can see that the new WaitForCullToComplete step took place after ResumeAlarmNotifications, and the WaitForStabilization step completed immediately after that:

image

cc @guardian/ophan

github-actions[bot] commented 4 months ago

Deploy build 3385 of tools::riffraff to CODE

All deployment options - [Deploy build 3385 of `tools::riffraff` to CODE](https://riffraff.gutools.co.uk/deployment/deployAgain?project=tools%3A%3Ariffraff&build=3385&stage=CODE&updateStrategy=MostlyHarmless&action=deploy) - [Deploy parts of build 3385 to CODE by previewing it first](https://riffraff.gutools.co.uk/preview/yaml?project=tools%3A%3Ariffraff&build=3385&stage=CODE&updateStrategy=MostlyHarmless) - [What's on CODE right now?](https://riffraff.gutools.co.uk/deployment/history?projectName=tools%3A%3Ariffraff&stage=CODE)

From guardian/actions-riff-raff.

rtyley commented 4 months ago

WaitForStabilization for culled instances?

@jacobwinch comments:

In order for this to work I think it would be desirable to replace the final WaitForStabilization task with a new task (WaitForOldInstancesToTerminate, or similar).

The current task is checking for an expected number of instances. This is currently an acceptable way to check because Riff-Raff is essentially holding a lock on the desired capacity setting by blocking scaling operations (i.e. we know that the number won't change).

Once we re-enable scaling the desired number of instances becomes a moving target, so I think it would be better for Riff-Raff to check that there are 0 instances with the Magenta=Terminate tag still running in the ASG. This would allow us to be sure that all instances are now running the build currently being deployed, which means it is safe/correct to mark the deployment as successful.

@jacobwinch, rather than introduce WaitForOldInstancesToTerminate, would it reasonable to update the implementation of CullInstancesWithTerminationTag so that it just blocks until all the instances it's been asked to cull actually have terminated?

jacobwinch commented 4 months ago

would it reasonable to update the implementation of CullInstancesWithTerminationTag so that it just blocks until all the instances it's been asked to cull actually have terminated?

I think it's probably safe to allow scaling actions to run again as soon as we've asked AWS to terminate the instances running the old build. At this point there is no risk of AWS trying to scale down the 'wrong' instances (i.e. the ones running the new build) and we know it is safe to launch more instances that will run with the newest build (as at least one instance running this build has already passed the health check).

With that in mind, the only downside of the suggested approach is that while we're blocking and waiting for the instances to terminate, we could probably be allowing the app to start scaling again if we were to split the 'waiting work' into a separate task. We're probably talking about a pretty small time window here though so if it's considerably easier to implement the version you've suggested then I think that would still be a big improvement!

rtyley commented 4 months ago

You've convinced me @jacobwinch, so I've added some WIP to this PR to implement a WaitForCullToComplete task!

Not quite tidied up yet, but you probably get the idea. Some code from CullInstancesWithTerminationTag is factored out into a CullSummary class so that WaitForCullToComplete can also use it.

rtyley commented 4 months ago

I ran one additional test to confirm that automatically scaling up while Riff-Raff is running WaitForCullToComplete works successfully.

Magnificent! Great test, and gratifying to see it work correctly! I shall now merge 👍