Firedrills and failure simulations

mmcgrana commented 10 years ago

Firedrills, failure simulations, chaos monkeys, and before/after failure testing.

mmcgrana commented 10 years ago

I think there are several buckets here:

GameDay firedrills: A one-time, manual test in which you simulate a failure and attempt to resolve it, and then process the results of that into potential system improvements. Best resource I've found so far on this is: Learning to Embrace Failure (Limoncelli et al.). Would love to find others though.

Fault injection: An ongoing, programmatic test regime in which software injects faults into the system at runtime. The most obvious resource here is the post Chaos Monkey Released Into The Wild (Bennett and Tseitlin), but again would be great to see more.

Before/after failure testing: An important pattern when fixing system problems is: 1) observe the problem (say in production), 2) reproduce the problem (say in a dev environment) in the form of a "failing" simulation, 3) develop the fix, and 4) test the fix against the same simulation that previously failed, to ensure it's actually been fixed. This full flow is often reduced to just 1+3, which is problematic. Would be great to have a resource specifically speaking to this discipline. I don't know of one though.
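To make the four steps above concrete, here is a minimal sketch of the before/after flow. Everything in it is hypothetical and only for illustration: `FlakyBackend` stands in for the dependency that misbehaved in production, `fetch_naive` is the code before the fix, `fetch_with_retry` is one possible fix, and `simulation` is the reproduction from step 2 that gets run against both.

```python
class FlakyBackend:
    """Hypothetical dependency that fails its first `failures` calls."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def fetch(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("simulated fault")
        return "ok"


def fetch_naive(backend):
    """Code before the fix: a single attempt, no error handling."""
    return backend.fetch()


def fetch_with_retry(backend, attempts=3):
    """One possible fix (an assumption): retry a bounded number of times."""
    last_error = None
    for _ in range(attempts):
        try:
            return backend.fetch()
        except ConnectionError as err:
            last_error = err
    raise last_error


def simulation(fetch_fn):
    """Step 2: reproduce the problem. True if the client survives two faults."""
    backend = FlakyBackend(failures=2)
    try:
        return fetch_fn(backend) == "ok"
    except ConnectionError:
        return False


if __name__ == "__main__":
    print("before fix:", simulation(fetch_naive))       # False: simulation fails
    print("after fix:", simulation(fetch_with_retry))   # True: same simulation passes
```

The point of keeping `simulation` unchanged between the two runs is that the "after" result only means something if the "before" result was observed to fail on the same test.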

mmcgrana commented 10 years ago

A paper on production fault injection from Berkeley:

Failure as a Service (FaaS): A Cloud Service for Large-Scale, Online Failure Drills (Gunawi et al.)

chooper commented 10 years ago

GameDay firedrills: I think we'll find a lot more literature for GameDay firedrills by searching for Disaster Recovery. Indeed, a quick search on ACM yields Weathering the Unexpected (Limoncelli again), a paper about how Google performs routine disaster recovery exercises. DR has a ton of academic research around it as well, along with its close cousin business continuity.

Fault injection: What about Allspaw? This is one of his oft-mentioned topics as well. Here's a link to his paper Fault Injection in Production: Making the case for resilience testing. As the title suggests, this paper is about justifying doing such a thing and doesn't really include how it's done in practice.
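For a rough sense of what programmatic fault injection can look like at the smallest scale, here is a sketch that wraps a function so that some fraction of calls fail. This is not how Chaos Monkey or any of the papers above do it; `FaultInjector`, `failure_rate`, and `lookup` are all hypothetical names made up for the example.

```python
import functools
import random


class FaultInjector:
    """Hypothetical helper: makes a fraction of wrapped calls raise an error."""

    def __init__(self, failure_rate, rng=None):
        if not 0.0 <= failure_rate <= 1.0:
            raise ValueError("failure_rate must be between 0 and 1")
        self.failure_rate = failure_rate
        # A seeded random.Random can be passed in for repeatable runs.
        self.rng = rng or random.Random()

    def wrap(self, fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # Decide per call whether to inject a fault instead of calling fn.
            if self.rng.random() < self.failure_rate:
                raise ConnectionError("injected fault")
            return fn(*args, **kwargs)
        return wrapper


def lookup(key):
    """Stand-in for a real service call."""
    return key.upper()


if __name__ == "__main__":
    flaky_lookup = FaultInjector(failure_rate=0.3, rng=random.Random(1)).wrap(lookup)
    for key in ["a", "b", "c", "d", "e"]:
        try:
            print(key, "->", flaky_lookup(key))
        except ConnectionError as err:
            print(key, "->", err)
```

Real systems inject faults at a different level (terminating instances, dropping network traffic), but the shape is similar: a controllable rate, and callers that have to cope with the failure.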

mmcgrana commented 10 years ago

Wow, great suggestions @chooper, I'll definitely check these out!

mmcgrana commented 10 years ago

Added a chunk of links, including the 2 you suggested @chooper.

I think this provides pretty good coverage, except for the "before/after failure testing" case mentioned above.

Will keep this open as we continue to noodle on it / search.

chooper commented 10 years ago

Nice, I'll see what I can dig up for the last category

mmcgrana / services-engineering

Firedrills and failure simulations #31