openark / orchestrator

MySQL replication topology management and HA
Apache License 2.0
problems with relocation of intermediate master #461

Open sjmudd opened 6 years ago

sjmudd commented 6 years ago

Comment from a colleague (Daniel) seen on orchestrator version: 3.0.1.X (patched 3.0.1 version)

Setup:

On the DC Master block access from the active (or all) orchestrator machines:

This results in:

[Screenshot: dbmeta-wont-relocate]

Orchestrator doesn't do anything yet as the slaves are still replicating. This is expected.

Now when I click "Auto (implies running external hooks/processes)" it is busy for a bit but doesn't actually do anything.

From the logs:

2018-04-05 14:24:58 ERROR ReadTopologyInstance(dbmeta-1003.example.com:3306): dial tcp 10.A.B.C:3306: i/o timeout
2018-04-05 14:24:58 WARNING discoverInstance(dbmeta-1003.example.com:3306) instance is nil in 1.013s (Backend: 0.012s, Instance: 1.000s), error=dial tcp 10.A.B.C:3306: i/o timeout
2018-04-05 14:24:58 WARNING executeCheckAndRecoverFunction: ignoring analysisEntry that has no action plan: UnreachableIntermediateMaster; key: dbmeta-1003.example.com:3306
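
When triaging a larger log, it can help to pull out every analysis that was skipped this way. The following is a minimal sketch, assuming the message wording matches the single WARNING line quoted above (other orchestrator versions may phrase it differently). The function name `analyses_without_plan` is hypothetical and not part of orchestrator.

```python
import re

# Pattern derived from the one WARNING line quoted in this issue (assumption:
# the wording is stable across the log being scanned).
NO_PLAN = re.compile(
    r"ignoring analysisEntry that has no action plan: "
    r"(?P<analysis>\w+); key: (?P<key>\S+)"
)


def analyses_without_plan(log_text):
    """Return (analysis, instance key) pairs that were skipped for lack of a plan."""
    return [(m.group("analysis"), m.group("key")) for m in NO_PLAN.finditer(log_text)]


# The WARNING line from the log above, split only for readability.
sample = (
    "2018-04-05 14:24:58 WARNING executeCheckAndRecoverFunction: "
    "ignoring analysisEntry that has no action plan: "
    "UnreachableIntermediateMaster; key: dbmeta-1003.example.com:3306"
)

for analysis, key in analyses_without_plan(sample):
    print(analysis, key)
```

Run against the full log, this would list each instance for which a recovery was requested but no action was taken.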

Now when I click "Relocate replicas to dbmeta-2001.example.com:3306" it relocates the two boxes almost instantly.

To unblock again after testing:

I've not yet been able to update orchestrator to the latest code: when I tried a few weeks ago, it broke when tested in production under our load. I hope to look at that shortly.

If you need further information please let me know.

sjmudd commented 6 years ago

Clearly the recovery process can take quite some time, and when calling external hooks (as we do) it may take longer. It would be nice to get some feedback on progress in the web UI, if that's possible; otherwise it's not clear what is happening.

Checking the configured hooks for intermediate masters I see:

  "PostIntermediateMasterFailoverProcesses": [
  ],

I'm not sure if there's a pre-process, but if there is, it's not configured. So it looks as if there are no failover hooks here to slow things down or to perform magic which might affect orchestrator.
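
One way to confirm which hooks are configured is to scan the configuration file rather than check each setting by hand. This is a minimal sketch, assuming the hook settings are top-level JSON keys whose names end in "Processes" and whose values are lists of commands, as with the setting quoted above. The function name `hook_summary` is hypothetical.

```python
import json


def hook_summary(config_text):
    """Map each top-level '*Processes' key to the number of commands it lists."""
    config = json.loads(config_text)
    return {
        name: len(value)
        for name, value in config.items()
        if name.endswith("Processes") and isinstance(value, list)
    }


# Fragment mirroring the setting quoted above, wrapped in braces so it is valid JSON.
sample = '{"PostIntermediateMasterFailoverProcesses": []}'

for name, count in hook_summary(sample).items():
    print(name, "->", count, "command(s)")
```

In practice the input would be the contents of the real configuration file; any hook list reported with zero commands cannot be slowing the recovery down.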

shlomi-noach commented 6 years ago

So the problem here is different, if I may:

Let me look into the latter.

By the way, and just FYI, we run with master_heartbeat of 4-5 seconds, which means such a failure case is detected by replicas in less than 10 seconds.

shlomi-noach commented 6 years ago

From experimenting locally I actually see a different behavior: I'm causing an UnreachableMaster scenario, then in GUI clicking "recover", and getting a "Recovery not attempted" in the notification bar. Which, while not too informative, does give me a response.

So I'm unable to reproduce your scenario.