Project Nimble is a tool for executing region evacuation faster and more automatically during an outage.
Evacuation used to take about 50 minutes; with Nimble it takes only 8 minutes.
5 steps of the old method
5 minutes to decide whether to execute the evacuation.
3-5 minutes to provision resources from AWS. It was hard to calculate how much capacity each service needed to absorb the traffic.
25 minutes for services to start up: booting, launching, downloading, making connections, registering with Eureka, configuration, and ELB.
10 minutes to proxy traffic to the destination regions, using Zuul to move traffic in increments.
5 minutes to flip DNS. DNS TTLs generally took 5 minutes to expire.
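As a quick check, the per-step durations listed above add up to roughly the 50-minute total. The sketch below simply sums them, treating the 3-5 minute provisioning step as a (low, high) range. The step labels are shorthand chosen for this example, not terms from the source.

```python
# Durations in minutes for each step of the old evacuation method,
# stored as (low, high) bounds. The labels are shorthand for this sketch.
OLD_STEPS = {
    "decide": (5, 5),
    "provision": (3, 5),
    "startup": (25, 25),
    "proxy": (10, 10),
    "dns": (5, 5),
}


def total_minutes(steps):
    """Return the (low, high) total duration across all steps."""
    low = sum(lo for lo, _ in steps.values())
    high = sum(hi for _, hi in steps.values())
    return low, high


low, high = total_minutes(OLD_STEPS)
print(f"Old method: {low}-{high} minutes")  # Old method: 48-50 minutes
```

Service startup alone accounts for about half of the total, which is the step that pre-started capacity targets.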
First Iteration
The first iteration failed.
It was hard to consult on each of the hundreds of scaling policies.
It was also hard to calculate the desired service size for a given incoming RPS.
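To illustrate why sizing from RPS is hard to get right, here is a naive estimate. It assumes each instance handles a fixed, known RPS and adds a headroom percentage. Both `rps_per_instance` and `headroom_pct` are hypothetical inputs invented for this sketch, not values from the source, and real services rarely scale this linearly.

```python
def desired_instances(incoming_rps, rps_per_instance, headroom_pct=20):
    """Naive instance count needed to absorb incoming_rps with headroom.

    Uses integer ceiling division to avoid floating-point rounding.
    """
    if rps_per_instance <= 0:
        raise ValueError("rps_per_instance must be positive")
    needed = incoming_rps * (100 + headroom_pct)
    capacity = rps_per_instance * 100
    return -(-needed // capacity)  # ceiling division


print(desired_instances(9000, 150))  # 72
```

Even this simple formula needs a trustworthy per-instance throughput figure for every service, which is exactly the input that was hard to obtain across hundreds of services.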
Dark Capacity works
Dark capacity means keeping extra capacity ready to receive traffic.
Instances could be kept hidden away in shadow ASGs.
They automated a way to pluck an instance, push it into the ether, and pop it into a running service group.
In production, it is complex because of CI/CD strategies. They use Edda and Spinnaker to track changes.
They also intervene early in the boot process so that a dark instance's configuration matches the production environment.
To control traffic, they set the dark instance's Eureka health status to STARTING. This means the instance is ready but receives no traffic until the status changes to UP.
While an instance is in STARTING status, Atlas metric reporting is also disabled.
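The dark-capacity flow above can be modeled in memory as follows. This is only an illustrative sketch: the `Instance` and `Group` classes and the `activate_dark_capacity` function are hypothetical names for this example and do not call AWS, Eureka, or Atlas.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    instance_id: str
    status: str = "STARTING"  # booted and ready, but not receiving traffic

    @property
    def reports_metrics(self):
        # In this sketch, metric reporting stays off until the instance is UP.
        return self.status == "UP"


@dataclass
class Group:
    name: str
    instances: list = field(default_factory=list)


def activate_dark_capacity(shadow, service, count):
    """Move up to `count` instances from shadow to service and mark them UP."""
    moved = []
    for _ in range(min(count, len(shadow.instances))):
        inst = shadow.instances.pop()   # pluck from the shadow group
        service.instances.append(inst)  # pop into the running service group
        inst.status = "UP"              # instance may now receive traffic
        moved.append(inst)
    return moved


shadow = Group("api-shadow", [Instance("i-1"), Instance("i-2"), Instance("i-3")])
service = Group("api", [Instance("i-0", status="UP")])
moved = activate_dark_capacity(shadow, service, 2)
print(len(service.instances), len(shadow.instances))  # 3 1
```

Because the instances are already booted and configured, activation reduces to a group move plus a status flip, rather than a full service startup.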
The rollout took only 6 months with a team of 2.
It is a wide-ranging and very impactful project, yet Nimble is simple: other teams don't need to put in any effort for region evacuation.
Nimble can also be used for other features, such as emergency expansion or quick auto-scaling.