Rate limit orchestrator recoveries even across topologies

For situations like this a brake is good.

So limit automatic recovery if you hit X failures in S seconds.
Releasing this brake should (configurably) require manual action. That is, once the brake is applied you MUST explicitly disable it.
I notice there's a --noop option, but using it requires getting onto the orchestrator server, changing orchestrator.conf.json, and restarting orchestrator. If you're in a cluster, another orchestrator process is likely to take over, so this does not work as immediately as you might hope.

Consequently, I'd be tempted to have a storage setting, "globalAutomaticRecoveryDisabled", which is read every few seconds and which the running/active node takes into consideration. The GUI should also have a way to change this setting ("GlobalAutomaticRecovery: Disabled/Enabled") that updates this table, and there should be an appropriate CLI entry to query/enable/disable this behaviour, perhaps with a hook to notify people of the change in state.
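
As a rough illustration of the polled setting, the Go sketch below caches the flag in the active node and fires a notification hook when the value changes. FlagStore, memoryStore and RecoveryGate are hypothetical names, and the in-memory store merely stands in for the shared backend table; none of this reflects orchestrator's actual code.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// FlagStore is a hypothetical stand-in for the shared backend storage that
// would hold the "globalAutomaticRecoveryDisabled" setting.
type FlagStore interface {
	ReadRecoveryDisabled() (bool, error)
	WriteRecoveryDisabled(disabled bool) error
}

// memoryStore is an in-memory FlagStore used only for this illustration.
type memoryStore struct {
	mu       sync.Mutex
	disabled bool
}

func (s *memoryStore) ReadRecoveryDisabled() (bool, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.disabled, nil
}

func (s *memoryStore) WriteRecoveryDisabled(disabled bool) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.disabled = disabled
	return nil
}

// RecoveryGate caches the stored flag so the recovery path never has to wait
// on storage, and calls onChange whenever the observed value flips.
type RecoveryGate struct {
	store    FlagStore
	onChange func(disabled bool) // notification hook, may be nil
	mu       sync.RWMutex
	disabled bool
}

func NewRecoveryGate(store FlagStore, onChange func(bool)) *RecoveryGate {
	return &RecoveryGate{store: store, onChange: onChange}
}

// Refresh re-reads the flag once. On a read error the last known value is kept.
func (g *RecoveryGate) Refresh() error {
	disabled, err := g.store.ReadRecoveryDisabled()
	if err != nil {
		return err
	}
	g.mu.Lock()
	changed := disabled != g.disabled
	g.disabled = disabled
	g.mu.Unlock()
	if changed && g.onChange != nil {
		g.onChange(disabled)
	}
	return nil
}

// Run polls the store every interval until stop is closed.
func (g *RecoveryGate) Run(interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			_ = g.Refresh() // a real node would log the error
		case <-stop:
			return
		}
	}
}

// RecoveryAllowed is what the active node would check before recovering.
func (g *RecoveryGate) RecoveryAllowed() bool {
	g.mu.RLock()
	defer g.mu.RUnlock()
	return !g.disabled
}

func main() {
	store := &memoryStore{}
	gate := NewRecoveryGate(store, func(disabled bool) {
		fmt.Println("hook: global recovery disabled =", disabled)
	})
	_ = gate.Refresh()
	fmt.Println("recovery allowed:", gate.RecoveryAllowed())
	_ = store.WriteRecoveryDisabled(true) // what a GUI or CLI action would do
	_ = gate.Refresh()                    // normally driven by Run every few seconds
	fmt.Println("recovery allowed:", gate.RecoveryAllowed())
}
```

Since every node reads the same stored value, a node that takes over as active picks up the disabled state at its next poll, without the config edit and restart that --noop requires.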

This is a long list of things I would like to see. It may not seem useful to have all of this, but a global failure such as a DC failure may make this sort of brake quite useful.

outbrain/orchestrator, issue #206