blameless postmortems: when assigning blame, people will obfuscate the truth, which means we can't learn lessons from our failure.
devops is not tools, its people, its culture, its sharing. related to postmortems, focus on the sharing and culture aspect.
collecting data is cheap, not having it when you need it can be expensive. instrument EVERYTHING.
requirements on data/metrics
must be well understood
sufficient granularity
tagged and filterable - observability + queryability. be able to ask questions and find answers.
long lived
work metrics
resource metrics
events - not a metric, but used to understand metrics
recurse through metrics until you find a cause: you might get an alert because of a work metric, but it may be caused by a resource metric. there is no singular root cause.
if you're still responding, its not the time for a postmortem. wait until after.
who should be part of a postmortem?
responders
identifiers
affected users
data collection: what?
their perspective
what they did
what they thought
why they did it
when?
as soon as possible
memory drops within 20 minutes
susceptible to false memory
get your project managers involved
have people write down their stories instead of having a dialogue about them. include diagrams.
data skew
blame
fear of punitive action
people lie, numbers dont
humnan biases
outcome - disconnect things that happen from decisions that people make
hindsight
postmortem template
how was outage detected
did we have metrics that showed
were their monitors on that metrics
how long did it take for us to declare an outage
how did we respond
who was incident owner who else was involved
timeline of events
what went well
what didnt go so well
why did it happen
deep dive into the cause
technical aspect
really get to diggin
how do we prevent it in the future
now
next
later
follow up notes
q/a
try to ship out postmortem in 3 or so days
get draft down within 1 or 2 days, pass it around and everyone needs to sign off
data driven postmortems - datadog
blameless postmortems: when assigning blame, people will obfuscate the truth, which means we can't learn lessons from our failure.
devops is not tools, its people, its culture, its sharing. related to postmortems, focus on the sharing and culture aspect.
collecting data is cheap, not having it when you need it can be expensive. instrument EVERYTHING.
requirements on data/metrics
work metrics resource metrics events - not a metric, but used to understand metrics
recurse through metrics until you find a cause: you might get an alert because of a work metric, but it may be caused by a resource metric. there is no singular root cause.
if you're still responding, its not the time for a postmortem. wait until after.
who should be part of a postmortem?
data collection: what?
when?
have people write down their stories instead of having a dialogue about them. include diagrams.
data skew
postmortem template
how was outage detected
how did we respond
why did it happen
how do we prevent it in the future
q/a
additional reading
blameless postmortems by john allspaw
human side of postmortems by dave zwieback
https://bit.ly/postmortem-template
https://bit.ly/post-incident-review