openark / orchestrator

MySQL replication topology management and HA
Apache License 2.0

Recommendations around fencing / STONITH #1275

Open · binwiederhier opened this issue 3 years ago

binwiederhier commented 3 years ago

We are experimenting with Orchestrator and the GitHub-style failover strategies (HAproxy + consul + consul-template) and so far it is working great. Thank you for making this great piece of software, and for describing your failover strategies in such great detail.

I've read various posts and questions around how to fence off a failed master/source host once Orchestrator has begun a failover/recovery, but it's not clear whether you have a recommendation on how to do it. If time permits, @shlomi-noach, I'd like to hear your thoughts on best practices here.

What I've implemented currently in my dev setup: on the host that runs HAproxy, I have a tiny python script that queries Orchestrator's /api/all-instances every 5 seconds and caches the results to a file. I added an external-check to HAproxy that reads that file and fails if the host is marked downtimed. This works, but I don't believe it fences off the old source ideally, because the IsDowntimed flag is not set early enough.
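For illustration, a minimal sketch of that poller + external-check pair could look like the following. The orchestrator URL, cache path, and the Key.Hostname field name are assumptions on my part; HAPROXY_SERVER_ADDR is the environment variable HAproxy passes to external checks.

```python
#!/usr/bin/env python3
# Sketch only: URL, cache path and the exact JSON field names are assumptions.
import json
import os
import sys
import urllib.request

ORCHESTRATOR_API = "http://orchestrator.example.com:3000/api/all-instances"  # assumed endpoint
CACHE_FILE = "/var/cache/orchestrator-instances.json"                        # assumed cache path

def poll():
    """Fetch all instances from orchestrator and cache them to disk.
    Run this in a loop every ~5 seconds (e.g. a small systemd service)."""
    with urllib.request.urlopen(ORCHESTRATOR_API, timeout=3) as resp:
        data = resp.read()
    with open(CACHE_FILE + ".tmp", "wb") as f:
        f.write(data)
    os.replace(CACHE_FILE + ".tmp", CACHE_FILE)  # atomic swap so the check never sees a partial file

def check():
    """HAproxy external-check entry point: exit non-zero if this backend
    server is marked downtimed in the cached orchestrator state."""
    addr = os.environ.get("HAPROXY_SERVER_ADDR", "")
    with open(CACHE_FILE) as f:
        instances = json.load(f)
    for inst in instances:
        if inst["Key"]["Hostname"] == addr and inst.get("IsDowntimed"):
            sys.exit(1)  # downtimed -> fail the health check
    sys.exit(0)

if __name__ == "__main__":
    poll() if "--poll" in sys.argv else check()
```

The check side gets wired into haproxy.cfg with `option external-check` and `external-check command ...` on the backend.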

What I think would be ideal: if Orchestrator could update the KV store the moment it detects the DeadMaster event (or similar) with an empty value, or something else indicating that the host has been marked dead, we could update the consul-template to produce a dead-end-type server entry in HAproxy. Sure, this can likely be implemented via PreFailoverProcesses, but I'd feel much better if it were an Orchestrator feature (potentially opt-in). Heck, if you think it's a good idea I'd even be willing to implement it; it likely wouldn't be that hard, right?
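For reference, a PreFailoverProcesses-based approximation might look roughly like this in the orchestrator configuration; the consul KV path is just a placeholder, and {failedHost}, {failedPort} and {failureClusterAlias} are among the placeholders orchestrator substitutes into hook commands:

```json
{
  "PreFailoverProcesses": [
    "consul kv put mysql/failed/{failureClusterAlias} {failedHost}:{failedPort}"
  ]
}
```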

Again, thanks for your time and this great project!

jrsmiley commented 3 years ago

STONITH has many known reliability issues and is generally not a recommended approach to ensuring that a member leaves a group. The best practice is a lease protocol, in which a node that cannot communicate with the lease arbiter fences itself (SMITH, i.e. "shoot myself in the head").
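A bare-bones sketch of that lease pattern, with the arbiter endpoint and the fencing action as hypothetical placeholders:

```python
# Sketch of the lease/self-fence pattern described above. The arbiter URL and
# the fencing action are hypothetical placeholders, not part of orchestrator.
import time
import urllib.request

ARBITER_URL = "http://lease-arbiter.example.com/renew?node=db1"  # hypothetical lease arbiter
LEASE_TTL = 5.0  # seconds we may keep accepting writes after the last successful renewal

def self_fence():
    # "SMITH": stop taking writes ourselves, e.g. set super_read_only, kill
    # client connections, or shut the MySQL server down entirely.
    raise SystemExit("lease lost - fencing self")

def main():
    last_renewal = time.monotonic()
    while True:
        try:
            urllib.request.urlopen(ARBITER_URL, timeout=1)
            last_renewal = time.monotonic()  # lease renewed
        except OSError:
            pass  # arbiter unreachable; keep trying until the lease expires
        if time.monotonic() - last_renewal > LEASE_TTL:
            self_fence()  # we can no longer prove we hold the lease
        time.sleep(1)

if __name__ == "__main__":
    main()
```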

shlomi-noach commented 3 years ago

> how to fence off a failed master/source host when Orchestrator has begun a failover/recovery

Anything based on orchestrator's hooks, or on a proxy detecting the change, etc., will only reduce the time the failed primary can receive writes. That is, fencing will take place sooner rather than later, but is it soon enough? What is soon enough? If you're OK with letting the old primary receive writes for 1s or for 100ms, you're still letting it diverge from the newly promoted server. So the first thing is to understand what you're trying to accomplish.

If you wish to completely fence the old primary, such that it will not take a single write beyond what's in the newly promoted primary, then semi-sync replication is probably what you're looking for, with an "infinite" (very large) value for rpl_semi_sync_master_timeout. I recently wrote a bit about semi-sync here; see also the links there to blog posts by JFG.
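For concreteness, a classic (pre-8.0.26 naming) semi-sync setup looks roughly like this; the timeout value is illustrative, and anything far larger than your failover time is effectively "infinite":

```sql
-- On the primary (assuming the classic semisync plugins):
INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';
SET GLOBAL rpl_semi_sync_master_enabled = ON;
-- "Infinite" timeout: never silently fall back to asynchronous replication.
SET GLOBAL rpl_semi_sync_master_timeout = 1000000000;  -- milliseconds (~11.5 days)

-- On each replica:
INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';
SET GLOBAL rpl_semi_sync_slave_enabled = ON;
```

With the timeout set this high, a primary that loses its semi-sync replicas blocks commits instead of degrading to async, which is what keeps it from accepting writes the new primary will never see.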