openark / orchestrator

MySQL replication topology management and HA

RFC: Respond to DC isolation, make isolated master(s) read_only #482

Open shlomi-noach opened 6 years ago

shlomi-noach commented 6 years ago

orchestrator/raft mitigates DC isolation by having orchestrator nodes in each DC, and letting the raft mechanism ensure the leader is always non-isolated.

In a 3-DC setup, for example, DC1 can become network isolated, such that DC2 and DC3 form a consensus. If a master was running in DC1, it will no longer be accessible to the external world. The DC2/DC3 quorum will elect a new master out of DC2/DC3.
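
The arithmetic behind this is just a strict majority over datacenters. A minimal sketch, illustrative only (the function name is made up, this is not orchestrator code):

```go
package main

import "fmt"

// canFormQuorum reports whether the DCs reachable from this side of the
// partition hold a strict majority of all known DCs. This is the property
// raft relies on so that only the non-isolated side keeps (or elects) a leader.
func canFormQuorum(reachableDCs, totalDCs int) bool {
	return reachableDCs*2 > totalDCs
}

func main() {
	// DC1 is isolated and reaches only itself; DC2 and DC3 reach each other.
	fmt.Println(canFormQuorum(1, 3)) // false: DC1's orchestrator node cannot lead
	fmt.Println(canFormQuorum(2, 3)) // true: DC2/DC3 elect the leader and fail over
}
```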

But what happens inside DC1's internal network? Apps in DC1 can still write to the master. True, DC1's old master will eventually be thrown away and recycled, but meanwhile app logic running in DC1 may make incorrect assumptions given that it was able to write to a master. And when DC1's network recovers, DC1's apps may be in an inconsistent state.

Mitigation idea: turn isolated master to read_only

Under the conditions listed above, it should be safe to assume the orchestrator node's DC is experiencing a network partition.

In that case, orchestrator should reach out to its local masters and set them to read_only=1.

Such an operation would kick in within seconds of the failure.
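
To make the proposal concrete, here is a minimal sketch of what such a check-and-demote loop could look like. Everything in it is hypothetical: hasRaftQuorum, localMasterDSNs, the 10-second grace period and the placeholder DSN are illustrations, not orchestrator's actual API or the conditions listed above.

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/go-sql-driver/mysql"
)

// Hypothetical grace period: how long losing the raft quorum is tolerated.
const isolationGrace = 10 * time.Second

// hasRaftQuorum stands in for asking the local raft layer whether this node
// still sees a healthy leader/quorum. Stubbed to "no" for the sketch.
func hasRaftQuorum() bool { return false }

// localMasterDSNs stands in for orchestrator's own inventory of masters
// located in this node's DC. Hard-coded placeholder DSN.
func localMasterDSNs() []string {
	return []string{"orc_topology:secret@tcp(db-master-dc1:3306)/"}
}

// setReadOnly connects to one master and makes it reject writes.
func setReadOnly(dsn string) error {
	db, err := sql.Open("mysql", dsn)
	if err != nil {
		return err
	}
	defer db.Close()
	_, err = db.Exec("SET GLOBAL read_only = 1")
	return err
}

func main() {
	lostSince := time.Now()
	for range time.Tick(time.Second) {
		if hasRaftQuorum() {
			lostSince = time.Now() // quorum is healthy; reset the clock
			continue
		}
		if time.Since(lostSince) < isolationGrace {
			continue // lost quorum, but not for long enough yet
		}
		// Without quorum for the whole grace period: assume this DC is isolated
		// and demote its masters so local apps stop writing to them.
		for _, dsn := range localMasterDSNs() {
			if err := setReadOnly(dsn); err != nil {
				log.Printf("failed to set read_only via %s: %v", dsn, err)
			}
		}
		return
	}
}
```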

Risks

Setting masters to read_only=1 would be very unfortunate if the logic is incorrect.

I'd be happy for reviews of the above conditions. Are they too permissive? Are they subject to some simple scenario where a false positive can happen?

cc @jonahberquist @ggunson @tomkrouper @gtowey @sjmudd @samiahlroos @grypyrg @pboros

igroene commented 6 years ago

I think that is a very nice feature to have, and it would make sense to introduce a new parameter to enable/disable this behaviour. Tungsten Cluster does something similar, and it works pretty well in my experience.

sjmudd commented 6 years ago

If not using raft, the situation is slightly different: orchestrator will only work if the app can reach the backend MySQL store, so in the end only those apps that can reach it will work, and one of those will be chosen to be active. In this case orchestrator won't care if more than 50% of the servers are unavailable; it will do what it can to restore things where it is able. This may or may not be good.

I can see merit in trying to disable writes to the "masters" in the isolated zone/DC, but that could also get triggered if a couple of the apps die for any reason, so there's some risk of this not working as expected if orchestrator, or the systems around orchestrator, fail in any way. That's a concern.

Having seen orchestrator kick in when you have issues on a large number of servers, it's quite stressful: orchestrator reacts much faster than we can hope to, and it takes us time to figure out what's going on. In the few cases I've seen this happen, I've preferred to turn off orchestrator, not because it did anything wrong but because I wasn't sure. Unless the logic is absolutely correct and very well tested, this will bite you. That may be better than writing into two different masters, or it may not, but think carefully about the consequences of enabling something like this.

The balance here is weighing the cost of not being able to write to a valid master (false positive) against the cost of writing into two isolated clusters (doing nothing). To truly know the answer you need to have been bitten by both problems to be sure which is worse...

Everyone thinks it's easy, but as we've seen with orchestrator, things in real life are more complex than they seem.

jonahberquist commented 6 years ago

I like this idea.

One thing not explicitly mentioned in the description, and which I think is worth calling out, is that while orchestrator usually detects failures on individual clusters, this scenario could use orchestrator's understanding of all clusters.

That is, if orchestrator knows about multiple clusters and this happens on only one of them, it shouldn't do anything. If, on the other hand, all clusters exhibit the same DC isolation symptoms, then we have a much higher degree of confidence that it is correct to take action.
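
A sketch of that cross-cluster confidence check, under assumed data structures (clusterState and the symptom flag are made up for illustration; orchestrator's actual detection signals differ):

```go
package main

import "fmt"

// clusterState is a made-up summary of one cluster as seen from this
// orchestrator node's DC.
type clusterState struct {
	name          string
	seemsIsolated bool // does this cluster exhibit the DC-isolation symptoms?
}

// shouldDemoteLocalMasters applies the cross-cluster check suggested above:
// act only if every cluster this node knows about shows the same symptoms;
// a single affected cluster is not enough evidence of DC-wide isolation.
func shouldDemoteLocalMasters(clusters []clusterState) bool {
	if len(clusters) == 0 {
		return false
	}
	for _, c := range clusters {
		if !c.seemsIsolated {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(shouldDemoteLocalMasters([]clusterState{
		{"orders", true}, {"users", false}, // only one cluster affected: do nothing
	}))
	fmt.Println(shouldDemoteLocalMasters([]clusterState{
		{"orders", true}, {"users", true}, // every cluster affected: act
	}))
}
```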

shlomi-noach commented 6 years ago

@sjmudd this is only planned for orchestrator/raft, and at this time I've given no thought to non-raft setups.

I can see merit in trying to disable writes to the "masters" in the isolated zone/DC, but that could also get triggered if a couple of the apps die for any reason, so there's some risk of this not working as expected if orchestrator, or the systems around orchestrator, fail in any way. That's a concern.

As per the list of conditions above, this STONITH won't kick in just because some apps die. It will take a lot more for the isolated orchestrator node to make the call.

grypyrg commented 6 years ago

I like the idea. The validation orchestrator would provide in this suggestion is beyond what any other popular async replication automation tool provides right now. Plenty of checks are proposed to prevent wrong decisions. 👍

There is an old/unpopular solution that does it 'better'. Let me share it here: at Percona, I have done setups with clients using Pacemaker clusters in 3 DCs, running PRM (Percona Replication Manager) and connecting the Pacemaker clusters through the booth protocol (raft). Because Pacemaker was used, nodes relied on Pacemaker's integrated quorum for split-brain prevention/fencing. Across datacenters, booth tickets (e.g. raft leadership election) were issued and a primary datacenter was chosen, which was the only DC that could have a writer/master role. I don't necessarily recommend this setup anymore; it is very complex (Pacemaker is) and has plenty of its own challenges, but it did prevent the isolation this issue aims at reducing (not solving). It required Pacemaker to run on every database node and manage the MySQL servers, which is quite error prone in practice, to be honest. This definitely would not work in larger database environments.

Both Galera and Group Replication do have quorum-based network partition handling integrated by design, but they certainly do not fit all use cases of async replication, and hence cannot be seen as a full drop-in replacement for async replication: there are too many limitations, application behavioural differences, and scale and performance implications.

To summarise my ramblings: while what this proposal describes would make things a lot better out of the box, without complicating the database architecture, it will not prevent all cases of network partitioning and, as @sjmudd mentioned, could also be risky.

That's where, from my point of view, when network partitioning and split brain prevention become critical to the business, you either go with a quorum-based solution on the database level, or, with async replication, have a more complex environment by running a daemon on the database servers itself that would be responsible for network partition handling.

shlomi-noach commented 6 years ago

@grypyrg thank you for the elaborate comment. A couple of questions and comments:

it will not prevent all cases of network partitioning

I do agree, and I'm aiming for a specific case where a DC is completely network partitioned from all others, but I'm curious to hear what other scenarios you had in mind (and I'll refrain from sharing mine).

or with async replication, have a more complex environment, by running a daemon on the database servers itself that would be responsible for network partition handling

This sounds like a safer method, but really isn't: when you run a daemon per database box, quorum is based on the number of servers in the topology, but we're interested in drawing conclusions by data center, not by individual server. As an extreme example, consider what happens when a specific DC has many more servers than the other DCs. If that DC gets network isolated, the servers in it will still form a quorum, leading to the wrong conclusion that everything is fine with that DC. In fact, the servers in the other DCs will start shutting down (or turning read_only). This is the opposite of the action we would like to see.
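
A worked count with made-up fleet sizes (6 servers in DC1, 2 each in DC2 and DC3), showing how per-server quorum and per-DC quorum reach opposite conclusions when DC1 is isolated:

```go
package main

import "fmt"

func main() {
	// Made-up fleet: DC1 is much larger than the other two DCs.
	serversPerDC := map[string]int{"DC1": 6, "DC2": 2, "DC3": 2}

	totalServers := 0
	for _, n := range serversPerDC {
		totalServers += n
	}

	// Per-server quorum: isolated DC1 still holds 6 of 10 servers, a majority,
	// so server-level daemons would wrongly decide DC1 is the healthy side and
	// the other DCs are the minority that should step down.
	fmt.Println("DC1 holds a server majority:", serversPerDC["DC1"]*2 > totalServers)

	// Per-DC quorum (what this proposal reasons about): DC1 is 1 DC out of 3,
	// so it cannot form a majority; DC2+DC3 can, and they get to act.
	fmt.Println("DC1 holds a DC majority:", 1*2 > len(serversPerDC))
}
```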

As you suggested, it is a more complex setup (more servers at play, daemons on all MySQL servers, handling of servers coming in and going out) which makes it undesirable or out of scope for orchestrator.

when network partitioning and split brain prevention becomes critical to the business

The discussion begins by acknowledging this is a non-binary condition.

If split brain prevention is absolutely critical, then I agree that a cross-DC consensus-based protocol on the MySQL level (i.e. GR/Galera/PXC) is the correct solution.

That does come with a price tag: commit latency and throughput are both limited by (multiple) cross-DC network travel time.

However, if we say that a split brain is "quite undesirable", and assuming we're unwilling to pay the price of cross-DC consensus latency, and depending on our apps' behavior, then we can consider the alternatives, as we are doing here. As you point out, the solution offered in this issue does not eliminate split brain; it mitigates it by capping the time during which an isolated DC will continue operating. This means smaller drift on that DC and hopefully less (manual?) labor to fix it.

grypyrg commented 6 years ago

it will not prevent all cases of network partitioning

I do agree, and I'm aiming for a specific case where a DC is completely network partitioned from all others, but I'm curious to hear what other scenarios you had in mind (and I'll refrain from sharing mine).

Some thoughts:

A. When this solution is used with, for example, a 3-node orchestrator/raft setup, each node in a different datacenter:

B. Another item I was thinking about: it was mentioned this would kick in within a couple of seconds. A couple of seconds may create too many false positives when used over a less reliable WAN (it depends on the environment, of course), and the raft leadership timeout setting should perhaps be set to as much as 1-2 minutes. This will of course slow down the failover operation and possibly increase the amount of writes to the isolated master.

C. Same setup: 3 DCs with orchestrator/raft. (I lack hands-on orchestrator experience, which might otherwise answer some of these scenarios.)

D. Can/should this solution be used in an orchestrator/raft setup within a non-multi-datacenter environment? I believe this becomes more difficult, there are more edge cases, and maybe by design it should be prevented from even being an option for the end user. It might work, but it will not prevent all scenarios from successfully 'demoting' a master node to super_read_only=1 (similar to the scenario in A).

or with async replication, have a more complex environment, by running a daemon on the database servers itself that would be responsible for network partition handling

This sounds like a safer method, but really isn't: when you run a daemon per database box, quorum is based on the number of servers in the topology, but we're interested in drawing conclusions by data center, not by individual server. As an extreme example, consider what happens when a specific DC has many more servers than the other DCs. If that DC gets network isolated, the servers in it will still form a quorum, leading to the wrong conclusion that everything is fine with that DC. In fact, the servers in the other DCs will start shutting down (or turning read_only). This is the opposite of the action we would like to see.

Agreed. Most environments I've worked with do work this way: there is a primary and a secondary datacenter, and activating the secondary datacenter (which usually has fewer servers, or at least a <50% quorum weight) is a manual operation. This situation is common in PXC/Galera setups as well.

I likely caused the conversation to drift off into different scenarios while you are only focusing on automatic datacenter failover, and I do see how in practice this will be used in various weird setups.

As you suggested, it is a more complex setup (more servers at play, daemons on all MySQL servers, handling of servers coming in and going out) which makes it undesirable or out of scope for orchestrator.

when network partitioning and split brain prevention becomes critical to the business

The discussion begins by acknowledging this is a non-binary condition.

👍

If split brain prevention is absolutely critical, then I agree that a cross-DC consensus-based protocol on the MySQL level (i.e. GR/Galera/PXC) is the correct solution.

👍

That does come with a price tag: commit latency and throughput are both limited by (multiple) cross-DC network travel time.

👍

However, if we say that a split brain is "quite undesirable", and assuming we're unwilling to pay the price of cross-DC consensus latency, and depending on our apps' behavior, then we can consider the alternatives, as we are doing here. As you point out, the solution offered in this issue does not eliminate split brain; it mitigates it by capping the time during which an isolated DC will continue operating. This means smaller drift on that DC and hopefully less (manual?) labor to fix it.

πŸ‘ Completely agree! I didn't want to sound like this is not a good proposal, but wanted to position it in the "quite undesirable" space. I've worked with plenty of more traditional and/or enterprise customers in the past where the requirement is "consistency" first.

shlomi-noach commented 6 years ago

BTW I should mention shared disk (via synchronous block replication?) as another lossless solution (a la DRBD, or Ceph, or what have you). For vanilla MySQL servers this means a long failover time, since the promoted master (not promoted by orchestrator; this is something orchestrator doesn't do) must run a recovery process.

Alibaba's PolarDB and Amazon Aurora use this approach and improve on the design by having all standby servers actually up and continuously running recovery, with any writes to the file system disabled.

shlomi-noach commented 6 years ago

A. ...

Is the solution to have orchestrator keep on trying to set super_read_only=1 as soon as access has returned? Does it give up after a single try?

Good question. Retries pose a risk and will complicate the algorithm. I guess we will kick start this without retries and then possibly iterate.

B.

Another item I was thinking about: it was mentioned this would kick in within a couple of seconds. A couple of seconds may create too many false positives when used over a less reliable WAN (it depends on the environment, of course), and the raft leadership timeout setting should perhaps be set to as much as 1-2 minutes. This will of course slow down the failover operation and possibly increase the amount of writes to the isolated master.

We can introduce a configuration variable to control the raft timeout.
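
For illustration, assuming the underlying raft implementation is hashicorp/raft (which, to my understanding, orchestrator embeds), such a configuration variable would presumably feed timeouts like these. The variable and the 60-second value are hypothetical, chosen only to reflect the 1-2 minute range mentioned in B:

```go
package main

import (
	"fmt"
	"time"

	"github.com/hashicorp/raft"
)

func main() {
	// Hypothetical configuration value, e.g. read from orchestrator's config file.
	configuredTimeout := 60 * time.Second

	cfg := raft.DefaultConfig()
	// Longer timeouts tolerate WAN blips (fewer false positives) at the cost
	// of slower failover and a longer window of writes to an isolated master.
	cfg.HeartbeatTimeout = configuredTimeout
	cfg.ElectionTimeout = configuredTimeout
	cfg.LeaderLeaseTimeout = configuredTimeout // must not exceed HeartbeatTimeout

	fmt.Printf("heartbeat=%s election=%s lease=%s\n",
		cfg.HeartbeatTimeout, cfg.ElectionTimeout, cfg.LeaderLeaseTimeout)
}
```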

C. ... The network comes back.

   What happens to the nodes in Datacenter A? Does replication get configured back automatically?

There are many scenarios here, since it depends on who was replicating from whom. Say DC1 was network isolated and the master was in DC1. It is likely that all servers in DC1 were only replicating from other servers in DC1 (there is justification for going remote and back, but I have never seen anyone do this). In that case they all replicate from the master via a DC1-only path. The master itself was not a replica, hence when the network comes back it does not replicate. All the servers in DC1 are stale and none receive new data. With Pseudo-GTID it is possible for orchestrator to also determine whether they are corrupted or not (i.e. have more data than the now-master), and it will refuse to put them back into replication if requested to. There will not be an attempt to do so automatically, since these servers were only replicating locally in the first place, so the replicas in DC1 were never "broken" in the eyes of MySQL, and no recovery is made.

Say DC1 was network isolated, the master was in DC1, and a replica in DC1 was replicating via DC1->DC2->DC1. That replica is still valid when DC1 comes back, because DC2's replica is (likely) no longer configured to replicate from DC1 after the failover. I say "likely" because there might be multiple replication paths between the DCs, and due to anti-flapping one will be fixed automatically while the others will wait for a human to acknowledge, or for the block-time to pass. Either way, the replica in DC1 is good to reconnect to DC2 once the network comes back.

Say DC1 was network isolated but the master was in DC2. In this case servers in DC1 are known to replicate from servers in DC2 or DC3, in which case DC1's network isolation caused DC1's replicas to break. When DC1 comes back, MySQL will possibly (depending on connect_retries and connect_timeout and the time DC1 was down) just reconnect them automatically, and all should work.

summary: this should be OK.

       Flapping becomes a problem and requires manual interaction every time this happens (might be acceptable)

explain?

       (and likely this is the gist of this point C) how do we prevent the operator/user from making data inconsistencies as described above? With GTID, some checks can be done and warnings/errors can be provided (if not already).

Pseudo-GTID will refuse connecting two replicas if they have inconsistent entries.

D. ... a non-multi datacenter setup?

orchestrator also has the notion of "physical location", which is a finer granularity than DC. Promotion rules prefer promoting a server that is in both the same DC and the same physical location as the crashed master. This is already well supported by the existing failover algorithms. Likewise, we can expand the network isolation conditions suggested in this issue to incorporate physical locations.
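
orchestrator's actual promotion logic weighs far more than this, but as a toy sketch of the "prefer same DC, then same physical location" preference (hosts, fields and scoring are made up):

```go
package main

import (
	"fmt"
	"sort"
)

// candidate is a made-up view of a promotable replica.
type candidate struct {
	host, dc, physicalLocation string
}

// score prefers candidates sharing both DC and physical location with the
// crashed master, then those sharing only the DC, then everything else.
func score(c candidate, masterDC, masterLoc string) int {
	s := 0
	if c.dc == masterDC {
		s += 2
		if c.physicalLocation == masterLoc {
			s++
		}
	}
	return s
}

func main() {
	masterDC, masterLoc := "DC1", "rack-a"
	candidates := []candidate{
		{"replica-1", "DC2", "rack-x"},
		{"replica-2", "DC1", "rack-b"},
		{"replica-3", "DC1", "rack-a"},
	}
	// Sort best candidate first: replica-3, then replica-2, then replica-1.
	sort.Slice(candidates, func(i, j int) bool {
		return score(candidates[i], masterDC, masterLoc) > score(candidates[j], masterDC, masterLoc)
	})
	fmt.Println("preferred promotion order:", candidates)
}
```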

igroene commented 5 years ago

Hi @shlomi-noach, it's been a while since the "turn isolated master to read_only" possibility was last discussed. I'm wondering if you reached any conclusions or have plans to implement this? I have a customer that wants to run Orchestrator nodes on the same hosts as MySQL (a small 3-node cluster) to handle the decision-making, e.g. when the master is partitioned, the "local" Orchestrator loses quorum and makes the partitioned master read_only. Thanks!

shlomi-noach commented 5 years ago

I have no timeline for this.

It's perhaps again noteworthy that even when this is implemented, split-brain is not prevented -- only mitigated.

Relatedly, we expect to release an "unsplit brain" tool to the public.