opentensor / subtensor

Bittensor Blockchain Layer
The Unlicense

SOP for recovering from outage #438

Open orriin opened 2 months ago

orriin commented 2 months ago

The saying "prevention is better than the cure" overwhelmingly applies when it comes to chain outages, and we must always first and foremost do everything possible to avoid them happening in the first place.

However, at the end of the day we are all humans who make mistakes, and even the largest chains (Solana, Polkadot, Bitcoin) have at times experienced devastating outages that required intervention from developers to get back online.

A chain outage, even though unlikely, is a catastrophic event, so it is imperative that we are prepared for one and have an SOP ready to put into action in the event that we need to roll back the chain.

The SOP should define clear steps undertaken by 3 actors to swiftly restore chain operation, keep the community updated, and eventually publish a post-mortem and ensure steps are put in place to prevent recurrence of the issue.