openark / orchestrator

MySQL replication topology management and HA
Apache License 2.0

Relocating a slave - Adding some checks #303

Open tapuhi opened 6 years ago

tapuhi commented 6 years ago

When relocating a slave to a new master, no checks are made to ensure that the slave can connect to that master (iptables, firewalls, Amazon security rules, etc.).

So a relocation can be made that puts the slave in a state where it cannot communicate with the new master, and replication then fails.

Orchestrator cannot do those checks itself (telnet, for example), as it only communicates through the MySQL port, unless orchestrator agent is installed.

A second solution we can consider is to do the relocation and, if the slave then fails to connect to the master, roll back that action. The exception is when the position of the relocated slave has changed, which would mean that at some point it received traffic from the "new" master and then failed.
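The decision described here can be sketched as a small function. This is only an illustration under stated assumptions: `Coordinates`, `Action` and `decideAfterRelocate` are hypothetical names, not orchestrator's API, and `before` is assumed to be the slave's position recorded right after the relocation was issued, with `after` the position observed some time later.

```go
package main

import "fmt"

// Coordinates is a hypothetical stand-in for a replication position
// (binary log file name plus offset) as reported by the relocated slave.
type Coordinates struct {
	LogFile string
	LogPos  int64
}

// Action is what to do once the relocation has been attempted.
type Action int

const (
	Keep        Action = iota // replication is running, nothing to do
	Rollback                  // slave never moved, safe to point it back
	LeaveBroken               // slave consumed events from the new master
)

// decideAfterRelocate applies the rule from the comment above: roll back
// only when the slave is not connected AND its position has not changed.
func decideAfterRelocate(connected bool, before, after Coordinates) Action {
	if connected {
		return Keep
	}
	if before == after {
		return Rollback
	}
	return LeaveBroken
}

func main() {
	pos := Coordinates{LogFile: "mysql-bin.000012", LogPos: 4567}
	moved := Coordinates{LogFile: "mysql-bin.000012", LogPos: 9000}

	fmt.Println(decideAfterRelocate(false, pos, pos) == Rollback)
	fmt.Println(decideAfterRelocate(false, pos, moved) == LeaveBroken)
}
```

Keeping the decision separate from the code that actually talks to MySQL makes the "position changed" exception easy to test on its own.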

WDYT ?

sjmudd commented 6 years ago

I've bumped into this in the past a few times though not frequently.

I think it would be good to validate that the relocate has happened and that replication is working. Under normal circumstances that should not take long, but you won't know exactly how long it will take, so a single check may be unreliable; you would need to keep checking for a few seconds before concluding there's a problem.

That said, if it looks like the relocate has failed and the replication position has not advanced, then I'd also be of the opinion that going back to a "known good state" might be better than leaving the slave "broken", even if implementing this may be a bit fiddly.

tapuhi commented 6 years ago

Another use case for rolling back an unsuccessful failover: Shame on me :-), I was trying to understand why automatic failover didn't work with file-based replication. What happened is that orchestrator began the process by stopping all slaves and only then failed to fail over. So my cluster ended up in a state where the slaves were stopped and no failover was running anymore.