openark / orchestrator

MySQL replication topology management and HA

returning a downed master to cluster #1293

Open cmusser opened 3 years ago

cmusser commented 3 years ago

Hi,

I have a topology with a master and two replicas where I'm testing master server failures. The recovery process isn't doing what I'd expect. When I shut down MySQL on the master, one of the replicas becomes the master, which is good. But when I start the old master back up, I'd like it to reappear in the cluster and, ideally, begin replicating from the new master.

My Orchestrator version:

3.2.3
7e183c77882bab9c2bf39804328a3409f5ae8ab3

I started with this initial topology:

--- ~ » orchestrator-client -c topology -i dbtest01
dbtest01:3306 (dbtest01)   [0s,ok,5.6.50-90.0-log,rw,ROW,>>,P-GTID]
+ dbtest02:3306 (dbtest02) [0s,ok,5.6.50-90.0-log,ro,ROW,>>,P-GTID]
+ dbtest03:3306 (dbtest03) [0s,ok,5.6.50-90.0-log,ro,ROW,>>,P-GTID]

Next I stopped the MySQL server on dbtest01. The topology now looks like:

--- ~ » orchestrator-client -c topology -i dbtest01
dbtest01:3306 (dbtest01) [unknown,invalid,5.6.50-90.0-log,rw,ROW,>>,P-GTID,downtimed]

--- ~ » orchestrator-client -c topology -i dbtest02
dbtest02:3306 (dbtest02)   [0s,ok,5.6.50-90.0-log,rw,ROW,>>,P-GTID]
+ dbtest03:3306 (dbtest03) [0s,ok,5.6.50-90.0-log,ro,ROW,>>,P-GTID]

There is still just the one cluster (named test-cluster) in the web portal.

Then I restarted the MySQL server on dbtest01 and noticed that dbtest01 appears as its own cluster in the web console. The topology commands now show:

--- ~ » orchestrator-client -c topology -i dbtest01
dbtest01:3306 (dbtest01) [0s,ok,5.6.50-90.0-log,rw,ROW,>>,P-GTID,downtimed]

--- ~ » orchestrator-client -c topology -i dbtest02
dbtest02:3306 (dbtest02)   [0s,ok,5.6.50-90.0-log,rw,ROW,>>,P-GTID]
+ dbtest03:3306 (dbtest03) [0s,ok,5.6.50-90.0-log,ro,ROW,>>,P-GTID]

As it stands, the topology has split into two separate clusters, which are:

--- ~ » orchestrator-client -c clusters-alias
dbtest01:3306,dbtest01:3306
dbtest02:3306,test-cluster

Is that what is supposed to happen? What are the steps for getting dbtest01 back in action as a replica? I did that manually by snapshotting the new master with xtrabackup, restoring that on dbtest01 and restarting replication. But I'd hoped the manual steps wouldn't be needed.

I attached the config and the contents of the metadata table that Orchestrator uses.

meta.txt orchestrator.txt

I can post logs as needed. But I think I'm not understanding something about how this is supposed to work.

shlomi-noach commented 3 years ago

orchestrator does not handle reprovisioning of the old primary; that is left to the user. TL;DR this is outside the scope of orchestrator, and would require an agent running on the host, likely with root access and with rather intimate knowledge of your setup/infrastructure.

there have been two suggestions for making orchestrator coordinate backups/restores; both roughly tripled the codebase and I'm afraid are not within the capacity of this project.

cmusser commented 3 years ago

Ok, that is good to know. I wasn't actually sure what Orchestrator would do in the scenario where the primary server vanishes. The fact that it promoted a replica without intervention is very helpful. Making the code 3x bigger to completely recover doesn't seem worth it, for sure. Too many site-specific aspects to deal with.

When the downed server started back up, Orchestrator put it into its own cluster and we were wondering why that was. We figured that since that server existed in the little metadata database we created for Orchestrator (the one used by the Detect config directives), it would return to the cluster, but in a non-replicating and downtimed state. It is downtimed, but off in a separate cluster, as if Orchestrator retained no knowledge of it. One thing I did see was an ack-cluster-recovery command. Would issuing that command before restarting the dead server allow it to be recognized as part of the cluster?
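
For reference, the metadata arrangement I'm describing looks roughly like this (a sketch; the schema, table and column names are placeholders, not necessarily what's in the attached files):

# create a tiny metadata table on every server in the topology
mysql -e "CREATE DATABASE IF NOT EXISTS meta;
          CREATE TABLE IF NOT EXISTS meta.cluster (
            anchor        TINYINT      NOT NULL PRIMARY KEY,
            cluster_alias VARCHAR(128) NOT NULL
          );
          REPLACE INTO meta.cluster (anchor, cluster_alias) VALUES (1, 'test-cluster');"

# and point Orchestrator at it in orchestrator.conf.json:
#   "DetectClusterAliasQuery": "SELECT cluster_alias FROM meta.cluster WHERE anchor = 1"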

liortamari commented 3 years ago

@shlomi-noach we hit this issue as well. Can you explain why the restarted master is seen as a different cluster when we re-discover it, even though DetectClusterAliasQuery is the same? Perhaps it will help us find the right solution.

@cmusser did you find a way to resolve this issue?

shlomi-noach commented 3 years ago

we hit this issue as well.

@liortamari can you first explain what is the issue you're hitting? The original comment illustrated a scenario, but there was no real issue, other than the user expecting the old primary to now replicate from the new primary.

cmusser commented 3 years ago

I think what @shlomi-noach is saying here is:

  1. If you restart a server, the expected behavior is that it will reappear as its own cluster in Orchestrator.
  2. The administrator must manually restart replication and request that Orchestrator re-discover that server as a replica.

That's my understanding of it anyway.
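
For anyone else hitting this, a rough sketch of what those manual steps can look like (host names are from my example above; the replication user, password and binlog coordinates are placeholders, and if the old master has extra transactions you'd first restore it from a backup of the new master, e.g. with xtrabackup):

# 1. on dbtest01 (the old master), point replication at the new master and start it
mysql -h dbtest01 -e "CHANGE MASTER TO
    MASTER_HOST='dbtest02',
    MASTER_USER='repl',
    MASTER_PASSWORD='secret',
    MASTER_LOG_FILE='mysql-bin.000123',
    MASTER_LOG_POS=4;
  START SLAVE;"

# 2. make sure it is no longer writable
mysql -h dbtest01 -e "SET GLOBAL read_only = 1;"

# 3. ask Orchestrator to re-discover it; it should fold back into test-cluster
orchestrator-client -c discover -i dbtest01:3306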

liortamari commented 3 years ago

@shlomi-noach thank you. The issue I am trying to understand how best to resolve is a scenario where a master restarts due to an error. For example, I have a cluster with 2 instances:

orchestrator-client -c all-instances

mysql-misc-a:3306
mysql-misc-b:3306

orchestrator-client -c clusters-alias

mysql-misc-a:3306,mysql-misc

orchestrator-client -c all-clusters-masters

mysql-misc-a:3306

When the master mysql-misc-a is restarted, the slave mysql-misc-b is promoted to master as expected. And now the orchestrator state shows 2 cluster aliases:

orchestrator-client -c clusters-alias

mysql-misc-a:3306,mysql-misc-a:3306
mysql-misc-b:3306,mysql-misc

orchestrator-client -c all-instances

mysql-misc-a:3306
mysql-misc-b:3306

orchestrator-client -c all-clusters-masters

mysql-misc-b:3306

My question: is there a command I can run to tell orchestrator to take the old master mysql-misc-a as a replication slave under the new master? Or must I configure replication outside of orchestrator's scope, as @cmusser suggested earlier? In general, I was surprised that after the restart I have 2 cluster aliases, because according to DetectClusterAliasQuery it should be the same cluster. So I was hoping to better understand the best way to remedy this, preferably using orchestrator-client only.

shlomi-noach commented 3 years ago

@liortamari thank you for elaborating.

@cmusser is correct about (1). Why does this happen? MySQL-wise, there's actually no such notion as a "cluster". MySQL does not care about clusters (in async/semisync replication), only about one server replicating from another. So a "cluster" is metadata that orchestrator decorates your topology with, and that's done via DetectClusterAliasQuery. So far so good.

Now, a primary failed and promotion took place. 5 minutes or 5 hours later the primary comes back to life. What happens now? MySQL has no insights. It's down to orchestrator to make the best of the situation. Here's what it knows:

  1. There are n instances, all claiming to be in the mysql-misc cluster
  2. n-1 of those instances are connected in a replication graph (something orchestrator is able to identify)
  3. 1 instance is not connected with the rest of them.
  4. Due to the nature of async/semisync replication, it is quite possible that the one instance cannot connect with the rest of them because it has excess transactions (transactions the others never received)
    • With GTID we can actually find that out; more on that later
  5. So, MySQL-wise, there are 2 servers which act as primaries
  6. orchestrator-wise, both claim to be the head of mysql-misc
  7. But obviously, they can't both belong to the same cluster if they are not connected in some replication graph. Which they are not.

This is why you see two clusters.
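
You can see orchestrator's conclusion directly; for example (a sketch using the hosts above):

orchestrator-client -c which-cluster -i mysql-misc-a:3306   # the old primary, now a cluster of its own
orchestrator-client -c which-cluster -i mysql-misc-b:3306   # the surviving mysql-misc cluster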

Now orchestrator needs to figure out which is the "real" cluster, which it does by:

  1. Remembering there was a failover
  2. Marking the old primary as lost-in-recovery (and internal downtime/tag)

That's how orchestrator decides in a post-failover scenario. There can be other scenarios, where multiple clusters all pretend to be the same one, and orchestrator would choose the largest, as a heuristic. But that's orthogonal to our discussion.

Anyway. If the old primary has transactions not present in the new cluster, then there is nothing orchestrator can do. There's just no way to make it a happy replica in the cluster. You will have to e.g. restore the server from backup. Also, it's imperative that orchestrator doesn't do anything, because your business is likely to want to salvage those lost records.

Now, as I mentioned earlier, there is one exception. If:

  1. using MySQL GTID
  2. old primary is found again
  3. and does not have any extra transactions
  4. and you have either configured ReplicationCredentialsQuery, or granted SELECT privileges on mysql.slave_master_info and configured master_info_repository = 'TABLE', so that orchestrator is able to get some idea of how to configure a server as a replica,

Then, it's possible to reconfigure it as a replica and connect it back to the cluster. It's a bit complex, because what happens if that old primary only returns after 5 hours? Do we keep tracking it forever? Anyway, it's an idea.
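
To illustrate condition (4), the two options look roughly like this (a sketch; the orchestrator MySQL account and the meta schema names are placeholders):

# option A: provide a query, in orchestrator.conf.json, that returns the
# replication user and password for orchestrator to use:
#   "ReplicationCredentialsQuery": "SELECT repl_user, repl_password FROM meta.replication_credentials"

# option B: let orchestrator read what MySQL itself stores, which requires the
# master info repository to be a table, plus a grant for orchestrator's account
mysql -e "GRANT SELECT ON mysql.slave_master_info TO 'orchestrator'@'%';"
# and in my.cnf on each server:
#   master_info_repository = TABLE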

liortamari commented 3 years ago

@shlomi-noach thank you, conditions 1-4 are met in my test. I would think the most important thing to do, in case a dead master reappears, is to mark it read-only, right? And I see orchestrator does that. So it seems to me orchestrator does need to keep that tracking forever, even if only for the read_only, is that correct?

shlomi-noach commented 3 years ago

@liortamari "forever" is a strong word. Even if orchestrator does keep checking until the old server reappears, it will do so at intervals. There will be a period of time during which that old server is still writable, before orchestrator turns it read_only. And this is nice to have, but please consider: what problem does this solve?

This does not solve split brains -- it narrows the split brain time a bit. This does not restore the server into your topology. This does not normalize/align the data on the reappearing server.

I think the discussion is digressing. The intent was to "tell orchestrator to take the old master mysql-misc-a as replication slave under the new master".

To avoid split brains, that is, to ensure that the old primary never has more transactions than the newly promoted server, you must use semi-sync and pay the price in commit latency. If you want the same across regions, pay the price of cross-region latency as well.
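
For completeness, enabling semi-sync on a 5.6-era primary and its replicas looks roughly like this (a sketch using the stock semisync plugins; the timeout is just an example value):

# on the primary
mysql -e "INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';
          SET GLOBAL rpl_semi_sync_master_enabled = 1;
          SET GLOBAL rpl_semi_sync_master_timeout = 10000;"   # ms before falling back to async

# on each replica
mysql -e "INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';
          SET GLOBAL rpl_semi_sync_slave_enabled = 1;
          STOP SLAVE IO_THREAD;
          START SLAVE IO_THREAD;"   # restart the IO thread so it registers as semi-sync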

I'm not against orchestrator setting read_only=1 on reappearing servers. Just pointing out that this does not solve any of the questions above.

liortamari commented 3 years ago

@shlomi-noach thanks for the explanation. I mentioned the read_only because I noticed that orchestrator did set read_only=1 to the reappearing master. So I thought that was part of the logic. I now understand it is not.