usds / playbook

The Digital Services Playbook
https://playbook.cio.gov/

Advice for incident response, accident reporting, & post-mortems? #55

Closed mstone closed 10 years ago

mstone commented 10 years ago

Despite the best efforts of thousands of people, all sorts of things go wrong with the design and delivery of large-scale digital services. What advice should we therefore give playbook readers about how to prepare for, respond to, and learn from things going wrong with their digital services?

wvchris commented 10 years ago

True. VA VISTA has had the error trap as a means of trapping errors: it records the symbol table at the time of the failure and the line of code, usually indicating the command that experienced the error. The errors are stored by the time they happened. I created an index in VISTA that summarizes the errors and indexes them by the location in the code where the error happened, along with some detail about the kind of error. This gives the support person a convenient place to link into the error summary and get a profile of the errors as they happened, and gives the programmer a place to store the solution to the error for the next time it happens, or to record whether the problem is really fixed. A lot of the time it is a training issue for the people using the software. I have made no secret that this error trap is open source.
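As a rough illustration of the kind of index described above, here is a minimal Python sketch. It is not the VISTA code: the names `TrappedError`, `ErrorProfile`, and `build_index`, the field layout, and the sample locations are all hypothetical, chosen only to show errors being grouped by code location with a slot for the programmer's resolution note.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class TrappedError:
    """One error-trap entry (hypothetical fields, not the VISTA schema)."""
    when: datetime
    location: str  # where in the code the error happened
    kind: str      # short description of the kind of error
    symbols: dict = field(default_factory=dict)  # variables at failure time


@dataclass
class ErrorProfile:
    """Summary of every trapped error seen at one code location."""
    count: int = 0
    first_seen: Optional[datetime] = None
    last_seen: Optional[datetime] = None
    kinds: set = field(default_factory=set)
    resolution: str = ""  # note left by the programmer for next time


def build_index(errors):
    """Group time-ordered or unordered errors by code location."""
    index = defaultdict(ErrorProfile)
    for err in errors:
        profile = index[err.location]
        profile.count += 1
        profile.kinds.add(err.kind)
        if profile.first_seen is None or err.when < profile.first_seen:
            profile.first_seen = err.when
        if profile.last_seen is None or err.when > profile.last_seen:
            profile.last_seen = err.when
    return index


# Sample data: locations and error kinds are made up for the example.
errors = [
    TrappedError(datetime(2014, 8, 1, 9, 0), "ROUTINE+12", "undefined variable"),
    TrappedError(datetime(2014, 8, 1, 10, 0), "OTHER+3", "divide by zero"),
    TrappedError(datetime(2014, 8, 2, 9, 0), "ROUTINE+12", "undefined variable"),
]
index = build_index(errors)
index["ROUTINE+12"].resolution = "Training issue: required field left blank."
```

Keying the summary on location rather than on time is the point of the sketch: a support person can look up one place in the code and see how often it fails, when it first and last failed, and whether someone has already written down a fix.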

cew821 commented 10 years ago

@wvchris Thanks for these comments. Your experience with VISTA seems interesting. Have you considered writing a blog post or two covering some of the key lessons learned from the experience?

I'd also be interested in checking out the code (I gather it's an open source project from your comment)? Where could we find that?

jallspaw commented 10 years ago

I have a very vested and enthusiastic interest in this topic. "Postmortem" debriefings and retrospective analysis need to be approached carefully, lest they fall into the organizationally damaging and counterproductive methods of traditional "root cause analysis", which should be considered outright harmful in the practice of many complex domains, including software development.

I will draft a pull request that contains suggestions and thoughts on what an evolved "Learning Review" looks like. You can almost certainly expect to see elements of current research on "human error" and systems safety, à la http://codeascraft.com/2012/05/22/blameless-postmortems/.

cew821 commented 10 years ago

Thanks for the feedback! We've added a question to try to suss out how incidents are responded to (both in real time and via post-mortems).

jallspaw commented 7 years ago

FYI, it took roughly two years to follow up on my comment above, but we have just published what I referred to as "suggestions and thoughts on what an evolved 'Learning Review'" looks like, in the form of a generalized Debriefing Facilitation Guide.

It is located here (in both Markdown and PDF), and we hope it can serve as a reference for the field on this topic. I'd like to suggest that the USDS and like-focused groups within the government consider using this approach.

FWIW, there is more background here.

@mdickers47 @pahlkadot @jezhumble @haleyvandyck