Background
In the event of an incident, we should have a playbook of general actions to take to mitigate problems and arrive at an eventual solution.
In addition to this, a "toolkit" should be created that makes diagnosing issues across systems simpler. An example would be a tool that can be run on any system to extract basic information such as module hashes, along with an s3cmd config pre-configured to upload snapshots to our DO S3 buckets, etc.
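As a rough sketch of the snapshot-upload piece, the script below writes a pre-configured s3cmd config for DigitalOcean Spaces and pushes a snapshot archive with it. The region, bucket name (`osmosis-incident-snapshots`), and environment-variable credential names are placeholders for illustration, not our real setup.

```python
#!/usr/bin/env python3
"""Sketch: write a pre-configured s3cmd config and upload a snapshot to DO Spaces.

Assumptions (placeholders, not real infrastructure): region nyc3, bucket
"osmosis-incident-snapshots", credentials in SPACES_KEY / SPACES_SECRET.
"""
import os
import subprocess
import sys
from pathlib import Path

S3CFG_TEMPLATE = """[default]
access_key = {access_key}
secret_key = {secret_key}
host_base = {region}.digitaloceanspaces.com
host_bucket = %(bucket)s.{region}.digitaloceanspaces.com
use_https = True
"""

def write_s3cfg(region: str = "nyc3") -> Path:
    """Render ~/.s3cfg-incident from env credentials so s3cmd needs no prompting."""
    cfg = Path.home() / ".s3cfg-incident"
    cfg.write_text(S3CFG_TEMPLATE.format(
        access_key=os.environ["SPACES_KEY"],
        secret_key=os.environ["SPACES_SECRET"],
        region=region,
    ))
    cfg.chmod(0o600)  # keep credentials out of world-readable files
    return cfg

def upload_snapshot(snapshot_path: str, bucket: str = "osmosis-incident-snapshots") -> None:
    """Upload a snapshot archive with s3cmd using the generated config."""
    cfg = write_s3cfg()
    subprocess.run(
        ["s3cmd", "-c", str(cfg), "put", snapshot_path, f"s3://{bucket}/"],
        check=True,
    )

if __name__ == "__main__":
    upload_snapshot(sys.argv[1])
```

Usage would be something like `python upload_snapshot.py /path/to/snapshot.tar.gz`; in practice the region, bucket, and credential handling should match whatever our actual Spaces setup uses.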
Suggested Design
Basic outline for incident response:
- Could potentially use the most recent incident (the hash mismatch) as the starting playbook.
- Extend this to cover previous issues (the gamm exploits).
- Eventually, once all previous incidents are covered, create playbooks for incidents that have not happened to us but have happened on other chains.
- Finally, create playbooks for incidents that have not happened to us or to other chains but are still possible.

Create a toolkit that can be deployed during an incident to automate some of these manual tasks and make them quicker:
- As explained above, this covers tasks like bringing up a node from a snapshot (using a node-to-node transfer within the same data center for speed), uploading snapshots, exporting module hashes, and exporting block events in a clean format; see the block-events sketch after this list.
- Another part of the toolkit could be an automatically generated Google Sheet of validator addresses and voting power, which is currently put together by hand during every incident; see the validator-export sketch after this list.
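For the "block events in a clean format" item, a minimal sketch is below. It assumes a Tendermint v0.34-style RPC on localhost:26657 (where event attribute keys/values are base64-encoded); newer node versions return plain strings, so the decoder falls back to the raw value. The endpoint and field names are the standard `/block_results` response, but the exact shape varies by version, so treat this as a starting point rather than a finished tool.

```python
#!/usr/bin/env python3
"""Sketch: dump one block's events as flat CSV rows for incident triage.

Assumes a Tendermint v0.34-style RPC at localhost:26657 with base64-encoded
event attributes; falls back to the raw value on newer versions.
"""
import base64
import csv
import json
import sys
import urllib.request

RPC = "http://localhost:26657"  # assumption: local node RPC

def decode(value):
    """Attribute keys/values are base64 on Tendermint 0.34; plain strings later."""
    if value is None:
        return ""
    try:
        return base64.b64decode(value).decode()
    except Exception:
        return value

def dump_block_events(height: int, out_path: str) -> None:
    with urllib.request.urlopen(f"{RPC}/block_results?height={height}") as resp:
        result = json.load(resp)["result"]

    rows = []
    # Begin/end block events plus per-tx events, flattened into one table.
    for source in ("begin_block_events", "end_block_events"):
        for ev in result.get(source) or []:
            for attr in ev.get("attributes") or []:
                rows.append([height, source, ev["type"],
                             decode(attr.get("key")), decode(attr.get("value"))])
    for tx_index, tx in enumerate(result.get("txs_results") or []):
        for ev in tx.get("events") or []:
            for attr in ev.get("attributes") or []:
                rows.append([height, f"tx_{tx_index}", ev["type"],
                             decode(attr.get("key")), decode(attr.get("value"))])

    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["height", "source", "event_type", "key", "value"])
        writer.writerows(rows)

if __name__ == "__main__":
    dump_block_events(int(sys.argv[1]), sys.argv[2])
```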
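For the validator sheet, a sketch of the export step is below. It pulls the current validator set from the node's `/validators` RPC endpoint and writes a CSV of address, voting power, and power share; pushing that CSV into a Google Sheet via the Sheets API is a separate step not shown here. The RPC URL is an assumed local node.

```python
#!/usr/bin/env python3
"""Sketch: export the current validator set (address, voting power) to CSV.

Assumes a node RPC at localhost:26657; the CSV stands in for the Google Sheet
that is currently assembled by hand during incidents.
"""
import csv
import json
import sys
import urllib.request

RPC = "http://localhost:26657"  # assumption: local node RPC

def fetch_validators():
    """Page through /validators (100 per page) and return the full set."""
    validators, page = [], 1
    while True:
        url = f"{RPC}/validators?page={page}&per_page=100"
        with urllib.request.urlopen(url) as resp:
            result = json.load(resp)["result"]
        validators.extend(result["validators"])
        if len(validators) >= int(result["total"]):
            return validators
        page += 1

def export_csv(out_path: str) -> None:
    validators = fetch_validators()
    total_power = sum(int(v["voting_power"]) for v in validators)
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["address", "voting_power", "power_share"])
        for v in sorted(validators, key=lambda v: int(v["voting_power"]), reverse=True):
            power = int(v["voting_power"])
            writer.writerow([v["address"], power, f"{power / total_power:.4%}"])

if __name__ == "__main__":
    export_csv(sys.argv[1] if len(sys.argv) > 1 else "validators.csv")
```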
Acceptance Criteria