Open sjmudd opened 6 years ago
Clearly the recovery process can take quite some time and when calling external hooks (as we do) this may take "longer". It would be nice to get some feedback on progress in the web ui if that's possible as otherwise it's not clear what is happening.
Checking the configured hooks for intermediate masters I see:
"PostIntermediateMasterFailoverProcesses": [
],
I'm not sure if there's a pre-process but if there is it's not configured. So it looks as if there are no failover hooks here to slow things down or to perform magic which might affect orchestrator.
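One way to double-check which hook lists are actually populated is to load the configuration file and count the entries under each hook key. The following is a minimal sketch, not orchestrator's own tooling: the `configured_hooks` helper is hypothetical, and every key name other than `PostIntermediateMasterFailoverProcesses` is an assumption based on orchestrator's sample configuration that should be verified against the version in use.

```python
import json

# Hook keys assumed from orchestrator's sample configuration; verify them
# against the version you run. Only PostIntermediateMasterFailoverProcesses
# is quoted in this thread.
HOOK_KEYS = [
    "OnFailureDetectionProcesses",
    "PreFailoverProcesses",
    "PostFailoverProcesses",
    "PostUnsuccessfulFailoverProcesses",
    "PostMasterFailoverProcesses",
    "PostIntermediateMasterFailoverProcesses",
]


def configured_hooks(config_text):
    """Return {hook_key: number_of_commands} for every assumed hook key.

    A missing key and an empty list both count as 0 commands.
    """
    config = json.loads(config_text)
    return {key: len(config.get(key) or []) for key in HOOK_KEYS}


if __name__ == "__main__":
    # Hypothetical config fragment mirroring the empty list shown above.
    sample = '{"PostIntermediateMasterFailoverProcesses": []}'
    for key, count in configured_hooks(sample).items():
        print(f"{key}: {count} command(s)")
```

If every count is zero, no external hook can be responsible for a slow or silent recovery.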
So the problem here is different, if I may:
- orchestrator doesn't have a failover plan for UnreachableIntermediateMaster (or else it would run hundreds of needless failovers every day)
- orchestrator didn't tell you that when you pressed "recover".
Let me look into the latter.
By the way, and just FYI, we run with master_heartbeat of 4-5 seconds, which means such a failure case is detected by replicas in less than 10 seconds.
From experimenting locally I actually see a different behavior: I'm causing an UnreachableMaster scenario, then in GUI clicking "recover", and getting a "Recovery not attempted" in the notification bar.
Which, while not too informative, does give me a response.
So I'm unable to reproduce your scenario.
Comment from a colleague (Daniel) seen on orchestrator version: 3.0.1.X (patched 3.0.1 version)
Setup:
On the DC Master block access from the active (or all) orchestrator machines:
sudo iptables -I INPUT 1 -s orchestrator.example.com -p tcp --dport 3306 -j DROP
This results in:
Orchestrator doesn't do anything yet as the slaves are still replicating. This is expected.
Now when I click "Auto (implies running external hooks/processes)" it is busy for a bit but doesn't actually do anything.
From the logs:
Now when I click "Relocate replicas to dbmeta-2001.example.com:3306" it relocates the two boxes almost instantly.
To unblock again after testing:
sudo iptables -D INPUT -s orchestrator.example.com -p tcp --dport 3306 -j DROP
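For repeated testing, the block and unblock commands can be generated from one shared rule specification so that the delete always matches the insert. This is only a sketch: the `iptables_rule` helper is hypothetical, it builds the command lines shown in this comment without running them, and the host name and port are the ones used in the setup above.

```python
import shlex


def iptables_rule(action, source_host, port=3306):
    """Build the iptables argv that blocks or unblocks MySQL traffic.

    action: "block" inserts the DROP rule at position 1 of the INPUT chain,
            "unblock" deletes the rule with the same specification.
    """
    # Shared rule specification: identical for insert and delete.
    rule_spec = ["-s", source_host, "-p", "tcp", "--dport", str(port), "-j", "DROP"]
    if action == "block":
        prefix = ["-I", "INPUT", "1"]
    elif action == "unblock":
        prefix = ["-D", "INPUT"]
    else:
        raise ValueError(f"unknown action: {action!r}")
    return ["sudo", "iptables"] + prefix + rule_spec


if __name__ == "__main__":
    # Print both commands; nothing is executed here.
    for action in ("block", "unblock"):
        print(shlex.join(iptables_rule(action, "orchestrator.example.com")))
```

Keeping the rule specification in one place avoids leaving a stale DROP rule behind when the two commands drift apart.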
I've not yet been able to update orchestrator to the latest code: when I tried this a few weeks ago, it broke when tested in production under our load. I hope to look at that shortly.
If you need further information please let me know.