shimmerjs / allthingsopen-2017

Information, discussion, notes from All Things Open 2017

10/24: Data-driven Postmortems #8

Open shimmerjs opened 6 years ago

shimmerjs commented 6 years ago

data driven postmortems - datadog

blameless postmortems: when blame is assigned, people obfuscate the truth, which means we can't learn lessons from our failures.

devops is not tools, it's people, it's culture, it's sharing. for postmortems, focus on the sharing and culture aspects.

collecting data is cheap; not having it when you need it can be expensive. instrument EVERYTHING.

requirements on data/metrics

  1. must be well understood
  2. sufficient granularity
  3. tagged and filterable - observability + queryability. be able to ask questions and find answers.
  4. long lived
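
a minimal sketch of what "tagged and filterable" could look like in practice. this is not datadog's API; `Point`, `MetricStore`, the metric name, and the tag keys are all hypothetical names made up for illustration. the idea is only that every data point carries tags, so you can ask narrow questions later.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    """One metric sample. All field names here are illustrative assumptions."""
    name: str
    value: float
    timestamp: float
    tags: frozenset = frozenset()


class MetricStore:
    """Hypothetical in-memory store; a real system would persist and downsample."""

    def __init__(self):
        self._points = []

    def record(self, name, value, timestamp, **tags):
        # Store tags as a frozenset of (key, value) pairs so they can be matched as a set.
        self._points.append(Point(name, value, timestamp, frozenset(tags.items())))

    def query(self, name, start=None, end=None, **tags):
        # A point matches when it has the metric name, falls in [start, end),
        # and carries every requested tag.
        wanted = set(tags.items())
        return [
            p for p in self._points
            if p.name == name
            and wanted <= p.tags
            and (start is None or p.timestamp >= start)
            and (end is None or p.timestamp < end)
        ]


store = MetricStore()
store.record("http.request.latency_ms", 120.0, 1000.0, host="web-1", endpoint="/checkout")
store.record("http.request.latency_ms", 950.0, 1001.0, host="web-2", endpoint="/checkout")
store.record("http.request.latency_ms", 80.0, 1002.0, host="web-2", endpoint="/health")

# "was checkout slow on web-2?" becomes a tag filter instead of a log hunt
slow = [p.value for p in store.query("http.request.latency_ms", host="web-2", endpoint="/checkout")]
```

without the `host` and `endpoint` tags, the three samples would be indistinguishable and the question could not be answered after the fact.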

types of data: work metrics, resource metrics, and events (not a metric, but used to understand metrics)

recurse through metrics until you find a cause: you might get an alert because of a work metric, but it may be caused by a resource metric. there is no single root cause.
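
the recursion can be sketched roughly like this, assuming you already know which metrics depend on which and which ones looked anomalous during the incident. the dependency map, the metric names, and `contributing_causes` are all hypothetical; in a real investigation this walk is usually done by a person clicking through dashboards, not by code.

```python
# Hypothetical map: metric -> the metrics it depends on (work metrics on top,
# resource metrics toward the bottom).
DEPENDS_ON = {
    "checkout.error_rate": ["checkout.latency", "db.connections"],
    "checkout.latency": ["web.cpu", "db.query_time"],
    "db.query_time": ["db.disk_io"],
}

# Hypothetical set of metrics that looked abnormal during the incident window.
ANOMALOUS = {
    "checkout.error_rate",
    "checkout.latency",
    "db.query_time",
    "db.disk_io",
    "db.connections",
}


def contributing_causes(metric, depends_on, anomalous, seen=None):
    """Follow anomalous dependencies downward and return the deepest ones found."""
    if seen is None:
        seen = set()
    if metric in seen:
        return []  # already visited; also guards against cycles in the map
    seen.add(metric)

    anomalous_children = [m for m in depends_on.get(metric, []) if m in anomalous]
    if not anomalous_children:
        # Nothing abnormal underneath, so this is as far down as the data takes us.
        return [metric]

    causes = []
    for child in anomalous_children:
        for cause in contributing_causes(child, depends_on, anomalous, seen):
            if cause not in causes:
                causes.append(cause)
    return causes


causes = contributing_causes("checkout.error_rate", DEPENDS_ON, ANOMALOUS)
```

with this made-up data the walk ends at two resource metrics (`db.disk_io` and `db.connections`), which is the point of the note above: the alert fired on a work metric, and more than one thing underneath contributed.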

if you're still responding, it's not the time for a postmortem. wait until after.

who should be part of a postmortem?

data collection: what?

when?

have people write down their stories instead of having a dialogue about them. include diagrams.

data skew

postmortem template

q/a

additional reading

blameless postmortems by john allspaw

human side of postmortems by dave zwieback

https://bit.ly/postmortem-template

https://bit.ly/post-incident-review