Week2 - Project Nimble: Region Evacuation Reimagined

Title

Project Nimble: Region Evacuation Reimagined

Summary

Project Nimble is Netflix's tool for executing a region evacuation faster and more automatically when an AWS region suffers an outage.
A failover used to take about 50 minutes; with Nimble it takes only 8 minutes.
5 steps of the old method
- 5 minutes to decide whether to execute the failover or not.
- 3-5 minutes to provision resources from AWS. It was hard to calculate how much capacity each service needed to absorb the extra traffic.
- 25 minutes for services to start up: booting, launching, downloading, making connections, registering with Eureka, configuration, ELB.
- 10 minutes to proxy traffic to the destination regions, using Zuul to move traffic in increments.
- 5 minutes to flip DNS. DNS TTLs generally meant that devices took about 5 minutes to move.
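As a quick check of the numbers above, the step durations can be added up. This is only a minimal sketch: the step names are labels made up for illustration, and the provisioning step is stored as a range because the notes give it as 3-5 minutes.

```python
# Old failover steps and their durations in minutes, taken from the list above.
# Each value is (low, high); only provisioning is an actual range.
OLD_STEPS = {
    "decide": (5, 5),
    "provision": (3, 5),
    "service_startup": (25, 25),
    "proxy_traffic": (10, 10),
    "flip_dns": (5, 5),
}


def total_minutes(steps):
    """Return the (low, high) total duration across all steps."""
    low = sum(lo for lo, _ in steps.values())
    high = sum(hi for _, hi in steps.values())
    return low, high


low, high = total_minutes(OLD_STEPS)
print(f"old method: {low}-{high} minutes")  # old method: 48-50 minutes
```

Service startup alone accounts for about half of the total, which is why keeping already-started instances around pays off so much.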
First iteration
- It failed.
- It is hard to consult on each of the hundreds of scaling policies.
- It is also hard to calculate the desired service size given the incoming RPS.
Dark capacity works
- Keep extra capacity ready to receive traffic.
- They could keep instances hidden away in shadow ASGs.
- They automated a way to pluck an instance from the dark ASG, push it into the ether, and pop it into a running service group.
- In production it is complex due to CI/CD strategies. They use Edda and Spinnaker to track changes.
- They also interject early in the boot process so that a dark instance's configuration matches the production environment.
- To control traffic, they set the dark instance's Eureka health status to STARTING. This means the instance is ready but waits for traffic until the status is changed to UP.
- While an instance is in STARTING status, Atlas metric reporting is also disabled.
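The activation flow described above (detach from the shadow ASG, attach to the running service group, then flip the discovery status from STARTING to UP) can be sketched roughly as follows. This is not Netflix's implementation. The `autoscaling` argument is assumed to behave like a boto3 Auto Scaling client, whose `detach_instances` and `attach_instances` calls take these keyword arguments; the `Instance` class, `FakeAutoScaling`, and the ASG names are hypothetical stand-ins so the example runs without AWS access.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class Instance:
    instance_id: str
    status: str = "STARTING"  # fully booted, but hidden from traffic


def routable(instances: List[Instance]) -> List[str]:
    """Only instances whose discovery status is UP receive traffic."""
    return [i.instance_id for i in instances if i.status == "UP"]


def activate(autoscaling: Any, instances: List[Instance],
             shadow_asg: str, service_asg: str) -> None:
    """Move dark instances into a service ASG and open them to traffic."""
    ids = [i.instance_id for i in instances]
    # Pluck the instances out of the dark group. Decrementing desired
    # capacity keeps the shadow ASG from launching replacements at once.
    autoscaling.detach_instances(
        InstanceIds=ids,
        AutoScalingGroupName=shadow_asg,
        ShouldDecrementDesiredCapacity=True,
    )
    # Pop them into the running service group. On real AWS the detach is
    # asynchronous, so a real tool would wait for it to finish first.
    autoscaling.attach_instances(
        InstanceIds=ids,
        AutoScalingGroupName=service_asg,
    )
    for inst in instances:
        inst.status = "UP"  # stands in for the Eureka status change


class FakeAutoScaling:
    """Hypothetical in-memory stand-in for the Auto Scaling API."""

    def __init__(self, groups: Dict[str, List[str]]):
        self.groups = groups

    def detach_instances(self, InstanceIds, AutoScalingGroupName,
                         ShouldDecrementDesiredCapacity):
        for iid in InstanceIds:
            self.groups[AutoScalingGroupName].remove(iid)

    def attach_instances(self, InstanceIds, AutoScalingGroupName):
        self.groups[AutoScalingGroupName].extend(InstanceIds)


fake = FakeAutoScaling({"api-dark": ["i-dark1"], "api-v042": ["i-live1"]})
dark = [Instance("i-dark1")]
print(routable(dark))  # [] while the instance is still STARTING
activate(fake, dark, "api-dark", "api-v042")
print(routable(dark))  # ['i-dark1']
```

Because the dark instance has already gone through its whole startup procedure, activation costs only two API calls and a status flip instead of the 25-minute startup step.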
Rolling out took only 6 months with a team of 2.
It is a wide-ranging and very impactful project, but Nimble itself is simple: other teams don't need to put in any effort for region evacuation.
Nimble can also be used for other features such as emergency expansion or quick auto-scaling.

Reference

https://medium.com/netflix-techblog/project-nimble-region-evacuation-reimagined-d0d0568254d4

Words

We are proud to present Nimble
region evacuation an order of magnitude faster
our goal is to be there for our customers whenever they want to come and watch their favorite shows
A lot of the work we do centers around
averting or limiting customer-facing outages
to route traffic away from an AWS region that is unhealthy.
Because Netflix continues to grow quickly
we are now at a point where even short or partial outages affect many of our customers
This article describes
how we re-imagined region failover
from what used to take close to an hour to less than 10 minutes
all while remaining cost neutral.
is captured in three prior articles.
Nimble takes us to the next level by optimizing the way
As part of our project requirements
minimal changes to
no disruptions to work schedules
no onerous maintenance requirements dropped
on other engineering teams at the company
When we set out on this journey
we began by breaking down the time it took then to do a traffic failover
to decide whether we would push the failover button or not.
enormous amounts of
a failover was a somewhat risky, not only slow, path to take.
diurnal patterns of traffic.
where they could absorb the additional traffic they would see
failover needed to include a step of computing
apply any morphing changes specified through our Archaius
we could only do so much to coax them into starting faster
under threat of receiving traffic
To compensate for DNS TTL delays
via back-end tunnels
to gauge the “readiness”
We found that we needed to move traffic in increments
cut over DNS
While the calls to repoint our DNS entries completed within seconds
our DNS TTLs generally meant that
the bulk of our devices would move within about 5 minutes
when the vast majority of devices
were being served by a destination region.
add up to about 50 minutes
which we considered unacceptably long
50 minutes of a broken experience
being able to fail over traffic in less than 10 minutes
In order to hit that kind of speed
regional failover would consist purely of flipping DNS records
letting the network move users over.
homeostatic balance using autoscaling policies
when the average CPU usage crosses a threshold
Given the years since our migration to the cloud
dev teams are now operationally familiar with
is well-understood
we would need to make significant changes
Either we would need to change the signals that teams used for autoscaling into something centralized
giving their services instructions divorced from their normal operation
with some kind of linear (or worse) transformation
to take into account failover absorption needs
The idea of opening targeted consultations
did not seem like a winning strategy.
pinning to a calculated value
this would hide performance regressions in the code
some automated way to frequently calculate a desired service size given incoming RPS and scale the buffer based on this metric
no such mechanism was, as yet, available.
we didn’t want to incur
How, then, would we keep spare instances at the ready
We realized that
we could keep instances hidden away in shadow ASGs
We would have to ensure total isolation
until they were activated
so that on failover it would look like we’d provisioned new instances spontaneously when called upon
we can pluck an instance from the dark autoscaling group
push it into the ether
make a subsequent EC2 API call to pop it into a running service group
It was straightforward to test the detach and attach mechanism with a single ASG
but for a production environment incorporating many ASGs
this service may easily get overwhelmed
which can then have negative downstream and upstream effects on other services and eventually on our customers
in a specific region
a pre-populated set of environment variables
autodetects the ASG that a service is in
sets a number of variables corresponding to this
interject early in the boot process
blissfully unaware of their actual location.
otherwise we’d just created a very elaborate mechanism of pinning services high
our Runtime team helped us devise a library (included in all of our services through our common platform)
they would come up at the ready but wait for
even when not munching on customer traffic
produce an incredible amount of metrics
and to that end
we enlisted the help of the Insight Team
This was the final piece of the puzzle
but had in fact gone through their entire startup procedure
How it turned out
We indeed reached our goal
as opposed to the 50 minutes it used to take
Rolling out all the changes
how a small team can make a big difference fast at Netflix.
For such a wide-ranging and impactful project
touching all of our control plane services
As long as teams have no cross-regional dependencies
need to build in integration
some of which can already be seen in
if a service needs an emergency dose of capacity
engage any failover capacity
