problems with relocation of intermediate master

sjmudd commented 6 years ago

Comment from a colleague (Daniel) seen on orchestrator version: 3.0.1.X (patched 3.0.1 version)

Local patches in 3.0.1.X do not affect recovery behaviour.

Setup:

Master → DC Master → 2 Slaves
With IM auto recovery enabled.

On the DC Master, block access from the active (or all) orchestrator machines:

Block
sudo iptables -I INPUT 1 -s orchestrator.example.com -p tcp --dport 3306 -j DROP

This results in:

[Screenshot: dbmeta-wont-relocate]

Orchestrator doesn't do anything yet as the slaves are still replicating. This is expected.

Now when I click "Auto (implies running external hooks/processes)" it is busy for a bit but doesn't actually do anything.

From the logs:

2018-04-05 14:24:58 ERROR ReadTopologyInstance(dbmeta-1003.example.com:3306): dial tcp 10.A.B.C:3306: i/o timeout
2018-04-05 14:24:58 WARNING discoverInstance(dbmeta-1003.example.com:3306) instance is nil in 1.013s (Backend: 0.012s, Instance: 1.000s), error=dial tcp 10.A.B.C:3306: i/o timeout
2018-04-05 14:24:58 WARNING executeCheckAndRecoverFunction: ignoring analysisEntry that has no action plan: UnreachableIntermediateMaster; key: dbmeta-1003.example.com:3306
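
To spot this kind of case more quickly, the log can be filtered for analysis entries that orchestrator skipped because it has no plan for them. This is only a rough sketch, assuming the log lines look exactly like the ones quoted above; `skipped_analyses` is a made-up helper name and the log path in the usage line is a placeholder.

```shell
#!/bin/sh
# skipped_analyses (hypothetical helper): reads orchestrator log text on stdin
# and prints one "<analysis code> <instance key>" pair per line for every
# analysis entry that was ignored for lack of an action plan.
# Assumes the WARNING line format quoted in this issue.
skipped_analyses() {
  sed -n 's/.*ignoring analysisEntry that has no action plan: \([A-Za-z]*\); key: \(.*\)$/\1 \2/p' \
    | sort -u
}
```

Usage would be something like `skipped_analyses < /path/to/orchestrator.log`. Any analysis code that shows up there is one that a manual "recover" will not act on.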

Now when I click "Relocate replicas to dbmeta-2001.example.com:3306" it relocates the two boxes almost instantly.

To unblock again after testing:

Unblock
sudo iptables -D INPUT -s orchestrator.example.com -p tcp --dport 3306 -j DROP
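
For repeated testing, the block/unblock pair can be wrapped in two small functions so the same rule text is used for both the insert and the delete. A minimal sketch, not something we run as-is: the `ORC_HOST` variable, the `DRY_RUN` switch, and the function names are assumptions added for illustration.

```shell
#!/bin/sh
# Toggle the test firewall rule from the reproduction steps.
# ORC_HOST, DRY_RUN and the function names are illustrative assumptions.
ORC_HOST="${ORC_HOST:-orchestrator.example.com}"
RULE="-s $ORC_HOST -p tcp --dport 3306 -j DROP"

run() {
  # With DRY_RUN=1, print the command instead of executing it.
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# $RULE is intentionally unquoted so it splits into separate arguments.
block_orchestrator()   { run sudo iptables -I INPUT 1 $RULE; }
unblock_orchestrator() { run sudo iptables -D INPUT $RULE; }
```

Setting `DRY_RUN=1` before calling either function prints the iptables command without touching the firewall, which is a cheap way to check the rule before running it on a DC Master.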

I've not yet been able to update orchestrator to the latest code: when I tried a few weeks ago, it broke under our production load. I hope to look at that shortly.

If you need further information please let me know.

sjmudd commented 6 years ago

Clearly the recovery process can take quite some time, and when it calls external hooks (as ours does) it may take longer still. It would be nice to get some progress feedback in the web UI, if that's possible; otherwise it's not clear what is happening.

Checking the configured hooks for intermediate masters I see:

  "PostIntermediateMasterFailoverProcesses": [
  ],

I'm not sure if there's a pre-process, but if there is, it's not configured. So it looks as if there are no failover hooks here to slow things down or to perform magic which might affect orchestrator.
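
To check all hook lists at once rather than one by one, something like the following could be run against the config file. It is a rough sketch under a few assumptions: the config is the usual JSON file, every hook key ends in `Processes` (as `PostIntermediateMasterFailoverProcesses` above and `PreFailoverProcesses` do), and `grep -o` is available. `list_hooks` is a made-up name.

```shell
#!/bin/sh
# list_hooks (hypothetical helper): for each "...Processes" key in an
# orchestrator JSON config file, print whether its array is empty or not.
# Whitespace is stripped first so that multi-line empty arrays are matched.
list_hooks() {
  tr -d ' \t\n' < "$1" \
    | grep -o '"[A-Za-z]*Processes":\[[^]]*\]' \
    | awk -F: '{
        state = ($2 == "[]") ? "empty" : "configured"
        gsub(/"/, "", $1)          # drop the quotes around the key name
        print $1 ": " state
      }'
}
```

Called as `list_hooks /path/to/orchestrator.conf.json` (path is a placeholder), it prints one line per hook list. It only tells you whether a list has entries, not what they do.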

shlomi-noach commented 6 years ago

So the problem here is different, if I may:

- orchestrator doesn't have a failover plan for UnreachableIntermediateMaster (or else it would run hundreds of needless failovers every day)
- orchestrator didn't tell you that when you pressed "recover".

Let me look into the latter.

By the way, and just FYI, we run with master_heartbeat of 4-5 seconds, which means such a failure case is detected by replicas in less than 10 seconds.

shlomi-noach commented 6 years ago

From experimenting locally I actually see different behavior: I cause an UnreachableMaster scenario, click "recover" in the GUI, and get a "Recovery not attempted" message in the notification bar. While not too informative, that does give me a response.

So I'm unable to reproduce your scenario.

openark / orchestrator

problems with relocation of intermediate master #461