Open shlomi-noach opened 8 years ago
For situations like this a brake is good.
Consequently I'd be tempted to have a storage setting, "globalAutomaticRecoveryDisabled", which is read every few seconds and which the running/active node takes into consideration before recovering anything. The GUI should also have a way to change this setting ("GlobalAutomaticRecovery: Disabled/Enabled"), which updates this table, and there should be an appropriate CLI entry to query/enable/disable this behaviour, perhaps with a hook to notify people of the change in state.
This is a long list of things I would like to see. It may not seem useful to have all of this but a global failure such as a DC failure may make this sort of brake quite useful.
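To make the brake idea concrete, here is a minimal sketch, not a proposal for the actual implementation. It assumes a single-row table (the name `global_recovery_disable` is made up for illustration) and uses an in-memory SQLite database as a stand-in for the real backend. The class name `GlobalRecoveryBrake` and the 5-second poll interval are likewise assumptions. The active node would call `is_disabled()` before each recovery, and the flag is re-read from storage at most once per poll interval.

```python
import sqlite3
import time


class GlobalRecoveryBrake:
    """Hypothetical helper: a single-row flag in shared storage, cached briefly."""

    def __init__(self, conn, poll_interval_seconds=5.0, clock=time.monotonic):
        self._conn = conn
        self._poll_interval = poll_interval_seconds
        self._clock = clock            # injectable so the cache can be tested
        self._cached_disabled = False
        self._last_read = None         # None means "never read from storage yet"
        # Illustrative schema: exactly one row, id = 1.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS global_recovery_disable ("
            "id INTEGER PRIMARY KEY CHECK (id = 1), "
            "disabled INTEGER NOT NULL)"
        )
        conn.execute(
            "INSERT OR IGNORE INTO global_recovery_disable (id, disabled) "
            "VALUES (1, 0)"
        )
        conn.commit()

    def set_disabled(self, disabled):
        """What a GUI toggle or CLI command would call to flip the brake."""
        self._conn.execute(
            "UPDATE global_recovery_disable SET disabled = ? WHERE id = 1",
            (1 if disabled else 0,),
        )
        self._conn.commit()

    def is_disabled(self):
        """Return the flag, re-reading storage at most once per poll interval."""
        now = self._clock()
        if self._last_read is None or now - self._last_read >= self._poll_interval:
            row = self._conn.execute(
                "SELECT disabled FROM global_recovery_disable WHERE id = 1"
            ).fetchone()
            self._cached_disabled = bool(row[0])
            self._last_read = now
        return self._cached_disabled
```

A recovery path would then start with something like `if brake.is_disabled(): return`. The short cache means a change made through the GUI or CLI takes up to one poll interval to be noticed, which seems an acceptable trade-off for not hitting storage on every check.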
Have a max-recoveries-per-hour limitation or similar. Even across clusters, we may wish to get a human involved in cases where there are just too many things breaking concurrently.
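One way such a limit could look is a sliding-window counter shared across all clusters. This is only a sketch under assumptions: the class name `RecoveryRateLimiter`, the in-process `deque`, and the one-hour default window are all illustrative, and a real deployment would presumably keep the counter in shared storage so that it survives a leader change.

```python
import time
from collections import deque


class RecoveryRateLimiter:
    """Hypothetical sliding-window limit on automated recoveries."""

    def __init__(self, max_recoveries, window_seconds=3600.0, clock=time.monotonic):
        self._max = max_recoveries
        self._window = window_seconds
        self._clock = clock            # injectable for testing
        self._timestamps = deque()     # start times of recent recoveries

    def try_acquire(self):
        """Return True and record a recovery if still under the limit."""
        now = self._clock()
        # Drop recoveries that have aged out of the window.
        while self._timestamps and now - self._timestamps[0] >= self._window:
            self._timestamps.popleft()
        if len(self._timestamps) >= self._max:
            return False               # too many recent recoveries: escalate to a human
        self._timestamps.append(now)
        return True
```

When `try_acquire()` returns `False`, the recovery would be skipped and a human notified instead, which is the behaviour this comment asks for.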