Background
In the event of an incident, we should have a playbook of general actions to take to mitigate problems and arrive at an eventual solution.
In addition to this, a "toolkit" should be created that makes diagnosing issues across systems simpler. An example would be a tool that can be run on any system to extract basic information such as module hashes, along with an s3cmd config pre-configured to upload snapshots to our DO S3 buckets, etc.
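As a rough sketch of the snapshot-upload piece, the script below writes a pre-configured s3cmd config for DigitalOcean Spaces and pushes a snapshot archive with it. The region, bucket name (`osmosis-incident-snapshots`), and environment-variable credential names are placeholders for illustration, not our real setup.

```python
#!/usr/bin/env python3
"""Sketch: write a pre-configured s3cmd config and upload a snapshot to DO Spaces.

Assumptions (placeholders, not real infrastructure): region nyc3, bucket
"osmosis-incident-snapshots", credentials in SPACES_KEY / SPACES_SECRET.
"""
import os
import subprocess
import sys
from pathlib import Path

S3CFG_TEMPLATE = """[default]
access_key = {access_key}
secret_key = {secret_key}
host_base = {region}.digitaloceanspaces.com
host_bucket = %(bucket)s.{region}.digitaloceanspaces.com
use_https = True
"""

def write_s3cfg(region: str = "nyc3") -> Path:
    """Render ~/.s3cfg-incident from env credentials so s3cmd needs no prompting."""
    cfg = Path.home() / ".s3cfg-incident"
    cfg.write_text(S3CFG_TEMPLATE.format(
        access_key=os.environ["SPACES_KEY"],
        secret_key=os.environ["SPACES_SECRET"],
        region=region,
    ))
    cfg.chmod(0o600)  # keep credentials out of world-readable files
    return cfg

def upload_snapshot(snapshot_path: str, bucket: str = "osmosis-incident-snapshots") -> None:
    """Upload a snapshot archive with s3cmd using the generated config."""
    cfg = write_s3cfg()
    subprocess.run(
        ["s3cmd", "-c", str(cfg), "put", snapshot_path, f"s3://{bucket}/"],
        check=True,
    )

if __name__ == "__main__":
    upload_snapshot(sys.argv[1])
```

Usage would be something like `python upload_snapshot.py /path/to/snapshot.tar.gz`; in practice the region, bucket, and credential handling should match whatever our actual Spaces setup uses.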
Suggested Design
Basic outline for incident response:
- Could potentially use the most recent incident (the hash mismatch) as the starting playbook.
- Extend this to cover previous issues (the gamm exploits).
- Eventually, once all previous incidents are covered, create playbooks for incidents that have not happened to us but have happened on other chains.
- Finally, create playbooks for incidents that have not happened to us or to other chains but are still possible.

Create a toolkit that can be deployed during an incident to automate some of these manual tasks and make them quicker:
- As explained above, this covers tasks like bringing up a node from a snapshot (using a node-to-node transfer within the same data center for speed), uploading snapshots, exporting module hashes, and exporting block events in a clean format; see the block-events sketch after this list.
- Another part of the toolkit could be an automatically generated Google Sheet of validator addresses and voting power, which is currently put together by hand during every incident; see the validator-export sketch after this list.
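For the "block events in a clean format" item, a minimal sketch is below. It assumes a Tendermint v0.34-style RPC on localhost:26657 (where event attribute keys/values are base64-encoded); newer node versions return plain strings, so the decoder falls back to the raw value. The endpoint and field names are the standard `/block_results` response, but the exact shape varies by version, so treat this as a starting point rather than a finished tool.

```python
#!/usr/bin/env python3
"""Sketch: dump one block's events as flat CSV rows for incident triage.

Assumes a Tendermint v0.34-style RPC at localhost:26657 with base64-encoded
event attributes; falls back to the raw value on newer versions.
"""
import base64
import csv
import json
import sys
import urllib.request

RPC = "http://localhost:26657"  # assumption: local node RPC

def decode(value):
    """Attribute keys/values are base64 on Tendermint 0.34; plain strings later."""
    if value is None:
        return ""
    try:
        return base64.b64decode(value).decode()
    except Exception:
        return value

def dump_block_events(height: int, out_path: str) -> None:
    with urllib.request.urlopen(f"{RPC}/block_results?height={height}") as resp:
        result = json.load(resp)["result"]

    rows = []
    # Begin/end block events plus per-tx events, flattened into one table.
    for source in ("begin_block_events", "end_block_events"):
        for ev in result.get(source) or []:
            for attr in ev.get("attributes") or []:
                rows.append([height, source, ev["type"],
                             decode(attr.get("key")), decode(attr.get("value"))])
    for tx_index, tx in enumerate(result.get("txs_results") or []):
        for ev in tx.get("events") or []:
            for attr in ev.get("attributes") or []:
                rows.append([height, f"tx_{tx_index}", ev["type"],
                             decode(attr.get("key")), decode(attr.get("value"))])

    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["height", "source", "event_type", "key", "value"])
        writer.writerows(rows)

if __name__ == "__main__":
    dump_block_events(int(sys.argv[1]), sys.argv[2])
```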
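For the validator sheet, a sketch of the export step is below. It pulls the current validator set from the node's `/validators` RPC endpoint and writes a CSV of address, voting power, and power share; pushing that CSV into a Google Sheet via the Sheets API is a separate step not shown here. The RPC URL is an assumed local node.

```python
#!/usr/bin/env python3
"""Sketch: export the current validator set (address, voting power) to CSV.

Assumes a node RPC at localhost:26657; the CSV stands in for the Google Sheet
that is currently assembled by hand during incidents.
"""
import csv
import json
import sys
import urllib.request

RPC = "http://localhost:26657"  # assumption: local node RPC

def fetch_validators():
    """Page through /validators (100 per page) and return the full set."""
    validators, page = [], 1
    while True:
        url = f"{RPC}/validators?page={page}&per_page=100"
        with urllib.request.urlopen(url) as resp:
            result = json.load(resp)["result"]
        validators.extend(result["validators"])
        if len(validators) >= int(result["total"]):
            return validators
        page += 1

def export_csv(out_path: str) -> None:
    validators = fetch_validators()
    total_power = sum(int(v["voting_power"]) for v in validators)
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["address", "voting_power", "power_share"])
        for v in sorted(validators, key=lambda v: int(v["voting_power"]), reverse=True):
            power = int(v["voting_power"])
            writer.writerow([v["address"], power, f"{power / total_power:.4%}"])

if __name__ == "__main__":
    export_csv(sys.argv[1] if len(sys.argv) > 1 else "validators.csv")
```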
Acceptance Criteria