vz250049 / go-spinner

MIT License

Incident - [Description of incident] #7

Open vz250049 opened 2 years ago

vz250049 commented 2 years ago

Roles

Incident Commander/SRE On-call: This person is responsible for handling the incident end to end.

Scribe: This person handles all communications, reporting to the required channels and to product management, who in turn communicate with customers.

SME (Subject Matter Expert): This person comes from the SRE team for infrastructure-related issues, from the application team for application issues, and sometimes both are needed. Their primary focus is investigating the issue during the incident.

Step-by-step Response

Initial Response - (5-10m)

Once the Incident Commander receives the alert or issue through PagerDuty or any other source (Slack channel, etc.), they should judge (see To Incident, or not to Incident) whether it qualifies as an incident or is a known issue. Once it has been declared an incident, do the following tasks.

Initial Communication - (5-10m)

Investigation and Triage

On a regular basis, the Incident Commander will evaluate the situation, enlist help from SMEs, and page any other teams that are needed. They will also ensure the group stays focused on the incident at hand.

Communications should be sent out every 30 minutes. Add a comment to this issue to reflect when and what communications were sent.

Resolution

At this point the incident has reached a conclusion, at least in the sense that the environment is stable. This phase closes out any final actions or communications for the incident. Do not close the call right away: some details may still need to be noted, or may be required for follow-up actions, before the incident moves to problem management.