vitessio / vitess

Vitess is a database clustering system for horizontal scaling of MySQL.
http://vitess.io
Apache License 2.0

Bug Report: vtorc ERS incorrectly flags replica as having errant GTIDs #16724

Open. deepthi opened this issue 1 month ago

deepthi commented 1 month ago

Overview of the Issue

When ERS is evaluating candidates for promotion, it checks whether any of the candidates has an errant GTID. The way this computation is done can lead to false positives when there are only two candidates. This can lead to ERS choosing to promote a replica that is not actually the most advanced.
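To make the failure mode concrete, here is a minimal, self-contained Go sketch (not the actual Vitess code) of the heuristic described later in this thread, where a transaction is treated as errant unless at least one other candidate has also applied it. With only two candidates, one of which is merely lagging, the more advanced replica's extra transactions get flagged, which is exactly the false positive described above. GTID sets are simplified to a map from source UUID to the highest contiguous sequence; all names here are illustrative.

```go
package main

import "fmt"

// gtidSet is a deliberately simplified representation: "u1:1-100" becomes {"u1": 100}.
type gtidSet map[string]int64

// flaggedErrant mimics the problematic check: a candidate's transactions are
// flagged as errant when no other candidate has also applied them.
func flaggedErrant(candidate gtidSet, others []gtidSet) gtidSet {
	errant := gtidSet{}
	for uuid, seq := range candidate {
		confirmed := false
		for _, o := range others {
			if o[uuid] >= seq {
				confirmed = true
				break
			}
		}
		if !confirmed {
			errant[uuid] = seq
		}
	}
	return errant
}

func main() {
	advanced := gtidSet{"8e166b50": 84} // most advanced candidate, e.g. zone1-100
	lagging := gtidSet{"8e166b50": 78}  // the only other candidate, simply lagging

	// With just these two candidates, the advanced replica is wrongly flagged.
	fmt.Println(flaggedErrant(advanced, []gtidSet{lagging})) // map[8e166b50:84]
}
```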

Reproduction Steps

This is not easy to reproduce, but can probably be done as follows

Binary Version

v18

Operating System and Environment details

Any

Log Fragments

E0905 23:15:34.245238       1 replication.go:126] skipping zone1-100 because we detected errant GTIDs - 8e166b50-d4e3-11ee-9779-e2b8a56b2179:79-84
deepthi commented 1 month ago

I'll leave this open until we resolve https://github.com/vitessio/vitess/pull/16725#issuecomment-2334842525

shlomi-noach commented 1 month ago

Assigned myself to look into the ERS logic and see what scenarios are broken.

shlomi-noach commented 1 month ago

So the current logic (and in particular the logic before #16725) is flawed, and I believe it is contrary to the correct logic.

I'm gonna work on improving the logic and adding some more challenging test scenarios.

GuptaManan100 commented 1 week ago

In order to remedy the situation of errant GTIDs and their detection, we have started by adding errant GTID detection to each vttablet, so that if a tablet finds that it has an errant GTID, it doesn't even join the replication stream. These changes can be found in https://github.com/vitessio/vitess/pull/16833.
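As a rough illustration of the idea (the real implementation is in the PR above; the names and types here are made up for the sketch), a replica can refuse to start replicating when it has executed transactions its source never had:

```go
package main

import "fmt"

// gtidSet is a simplified representation: source UUID -> highest applied sequence.
type gtidSet map[string]int64

// errantAgainstSource returns the transactions the replica has executed but
// its replication source has not; any such transaction is errant by definition.
func errantAgainstSource(replica, source gtidSet) gtidSet {
	errant := gtidSet{}
	for uuid, seq := range replica {
		if source[uuid] < seq {
			errant[uuid] = seq
		}
	}
	return errant
}

// startReplicationIfClean sketches the tablet-side guard: bail out instead of
// joining the replication stream when errant GTIDs are detected.
func startReplicationIfClean(replica, source gtidSet) error {
	if errant := errantAgainstSource(replica, source); len(errant) > 0 {
		return fmt.Errorf("refusing to start replication: errant GTIDs %v", errant)
	}
	// ... otherwise configure and start replication here ...
	return nil
}

func main() {
	source := gtidSet{"u1": 100, "u2": 30}
	replica := gtidSet{"u1": 100, "u2": 31} // u2:31 was never executed on the source
	fmt.Println(startReplicationIfClean(replica, source))
}
```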

Once these changes are merged, the situations that we run into in ERS will be greatly reduced. The following situations would be left -

Situations to Consider for Errant GTID Detection

For this discussion we'll make the following assumptions for the topology - We have 4 tablets, t1, t2, t3 and t4, and their UUIDs are u1, u2, u3 and u4 respectively. Let's say t1 is the primary and is the server that has just crashed, initiating the ERS.

Case 1: We don't have any errant GTIDs.

Within this case, we have a few subcases to consider -

Case 1a: This is the first reparent. A replica is the most advanced.

The UUIDs that we'll see in ERS are as follows -

t2: u1:1-100
t3: u1:1-99
t4: u1:1-100

One of the two replicas could be lagged, but we're guaranteed that at least one semi-sync replica will have all the writes that the primary accepted.

Correct Expectation: In this case, we should select t2 as the new primary since it is the most advanced replica.
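A small sketch of what "most advanced" means here, using the same simplified GTID-set representation (source UUID to highest contiguous sequence). Preferring the replica t2 over the rdonly t4 on a tie is an assumption in this sketch, consistent with the stated expectation:

```go
package main

import "fmt"

type gtidSet map[string]int64 // simplified: "u1:1-100" becomes {"u1": 100}

// covers reports whether a contains every transaction in b.
func covers(a, b gtidSet) bool {
	for uuid, seq := range b {
		if a[uuid] < seq {
			return false
		}
	}
	return true
}

func main() {
	// Case 1a positions.
	positions := map[string]gtidSet{
		"t2": {"u1": 100}, // replica
		"t3": {"u1": 99},  // replica, slightly lagged
		"t4": {"u1": 100}, // rdonly
	}
	// Check candidates in preference order: replicas before the rdonly.
	for _, name := range []string{"t2", "t3", "t4"} {
		mostAdvanced := true
		for other, pos := range positions {
			if other != name && !covers(positions[name], pos) {
				mostAdvanced = false
				break
			}
		}
		if mostAdvanced {
			fmt.Println("promote", name) // prints "promote t2"
			break
		}
	}
}
```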

Case 1b: This is not the first reparent. A replica is the most advanced.

The UUIDs that we'll see in ERS are as follows -

t2: u1:1-100, u2:1-30
t3: u1:1-99, u2:1-30
t4: u1:1-100, u2:1-30

This is very similar to the previous case, but slightly more complex for errant GTID detection because we need to check that the GTIDs for all the tablets match on all UUIDs other than u1. Like before, one of the two replicas could be lagged.

Correct Expectation: In this case, we should select t2 as the new primary since it is the most advanced replica.

Case 1c: This is not the first reparent. An rdonly is the most advanced.

The UUIDs that we'll see in ERS are as follows -

t2: u1:1-100, u2:1-30
t3: u1:1-99, u2:1-30
t4: u1:1-101, u2:1-30

This situation can happen if the primary dies while waiting for a semi-sync ACK for a certain write. That write might be present on the rdonly, t4, but not on the replicas.

Correct Expectation: In this case, we can select t4 as the intermediate source. It is also theoretically correct to select t2 as the new primary without setting t4 as an intermediate source. In that case, we'd mark the extra write on t4 as errant and would have to throw the tablet away.

From this point, we go into more complicated cases -

Case 2: Only 1 tablet is recent and all others are severely lagged

We could be in a situation where only 1 replica tablet is up to date with the primary that just crashed and all the other tablets are severely lagged. This could be because they came up from a backup (for example, to remediate errant GTIDs) or just because they were replicating slowly.

The UUIDs that we'll see in ERS are as follows -

t2: u1:1-100, u2:1-30, u3:1-100
t3: u2:1-30, u3:1-50
t4: u2:1-30

Correct Expectation: We should be selecting t2 as the new primary.

Case 3: All replicas severely lagged (Primary tablet dies with t1: u1:1-100, u2:1-30, u3:1-100)

The only way this happens is if the semi-sync tablet that sent the primary ACKs for the writes it took crashes because of some MySQL issue and has to be brought up from a backup. The other two tablets could just be lagging or could be coming up from a backup themselves.

The UUIDs that we'll see in ERS are as follows -

t2: u2:1-30, u3:1-50
t3: u2:1-30, u3:1-50
t4: u2:1-30

This is a very bad situation for us to be in! There is no tablet that we can promote that won't lose writes! We're in this situation because 2 tablets crashed in a short period of time, a primary tablet, and the replica that sent the semi-sync ACKs.

Correct Expectation: Realize that we are in a situation where we've lost writes, and inform the customer. When we run with 1 semi-sync ACK, we expect only 1 tablet to crash at a time; we always run the risk of multiple tablets failing, and there isn't anything that can be done to save the lost writes, especially if there is no backup that has them. We may want to promote t2 so that we can continue to accept writes, but this is debatable. We could also fail and wait for human intervention.

Case 4: Primary dies and comes back right at the point when a new ERS has started and has an extra UUID.

This case is relatively rare, but it can still happen.

The UUIDs that we'll see in ERS are as follows -

t2: u2:1-31, u3:1-50
t3: u1:1-100, u2:1-30, u3:1-50
t4: u1:1-90, u2:1-30, u3:1-50

Correct Expectation: Mark t2 as errant, and promote t3.

Case 5: Old Primary and a rdonly have errant GTID

This situation can happen if the primary tablet dies while waiting for a semi-sync ACK for a write. That write could be present on the rdonly tablet. If the rdonly tablet is also unavailable during the ERS call, then we would proceed and promote a different tablet without propagating this write. So the old primary and the rdonly will both have this errant GTID.

For this discussion, we'll assume that t2 was previously a primary whose crash introduced errant GTIDs on t2 and t4.

Case 5a: Old primary is permanently lost, comes up from a backup, and the rdonly comes up during ERS

The UUIDs that we'll see in ERS are as follows -

t2: u2:1-20, u3:1-50
t3: u1:1-100, u2:1-30, u3:1-50
t4: u2:1-31, u3:1-50

Correct Expectation: Mark t4 as errant, and promote t3.

Case 5b: Old primary and rdonly come up during ERS

The UUIDs that we'll see in ERS are as follows -

t2: u2:1-31, u3:1-50
t3: u1:1-100, u2:1-30, u3:1-50
t4: u2:1-31, u3:1-50

Correct Expectation: Mark t2, t4 as errant, and promote t3.

Case 6: Errant GTID introduced on a replica server by a write that shouldn't happen

We messed up something in Vitess and a heartbeat write or a schema tracker write went through on a replica. Or the user messed up and accidentally wrote something to a replica, maybe while they were debugging something.

Case 6a: The replica with errant GTID is not the most advanced.

The UUIDs that we'll see in ERS are as follows -

t2: u1:1-99, u2:1-31, u3:1-50
t3: u1:1-100, u2:1-30, u3:1-50
t4: u1:1-100, u2:1-30, u3:1-50

Correct Expectation: Mark t2 as errant, and promote t3.

Case 6b: The replica with errant GTID is the most advanced.

The UUIDs that we'll see in ERS are as follows -

t2: u1:1-101, u2:1-31, u3:1-50
t3: u1:1-100, u2:1-30, u3:1-50
t4: u1:1-100, u2:1-30, u3:1-50

Correct Expectation: This situation is really bad! t2 has a write, u1:101, that we need to keep for correctness's sake, but it also has an errant GTID that could have been added a long time ago, so we don't want to replicate it to the other replicas out of order. The errant GTID might also have been purged from t2's binlogs, in which case the other replicas would fail to replicate from t2 altogether.

Case 7: Both replicas with errant GTIDs

We messed up real bad and now both replicas have errant GTIDs, or the user messed up real bad.

The UUIDs that we'll see in ERS are as follows -

t2: u1:1-100, u2:1-31, u3:1-50
t3: u1:1-100, u2:1-30, u3:1-51
t4: u1:1-90, u2:1-30, u3:1-50

Correct Expectation: If we had another replica that had all the writes from the current primary, then we should promote it. Otherwise, we can't promote any of them. This is a tricky situation because we don't want to lose the writes that the rdonly doesn't have, but at the same time we don't want the errant GTIDs.

Case 8: Some combination of the above scenarios.

I have written up isolated cases of things going wrong above, but we could have multiple of them ☝️ happening together... For example -

Case 8a: Old primary and rdonly have an errant GTID and come up during ERS, and a replica has an errant GTID introduced by the user

The UUIDs that we'll see in ERS are as follows -

t2: u2:1-31, u3:1-50
t3: u1:1-100, u2:1-30, u3:1-51
t4: u2:1-31, u3:1-50

Correct Expectation: All the tablets are messed up. I don't know if we can do anything about this.

GuptaManan100 commented 1 week ago

The current errant GTID code is not sufficient, especially for case 2, where it just marks the most advanced tablet as errant. So, we do need to refactor and fix it. The most correct way, which takes care of all the cases, would be to store the primary's GTID set in the topo server when it is promoted and use that for errant GTID detection in ERS, but that seems very space intensive and, quite frankly, overkill.

As a substitute, we propose a different solution that doesn't have any such storage overhead. Currently, in ERS, we use the positions of the tablets and who they're replicating from as the 2 pieces of information for doing the errant GTID detection. We can augment this by also using the length of the reparent journal. Currently, we use the reparent_journal table as an append-only log that can be used for debugging, but it isn't being used anywhere else.
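For reference, reading a tablet's reparent journal length is just a row count. A minimal sketch with database/sql, assuming the table is reachable as _vt.reparent_journal (the connection handling and the way Vitess itself would query this are simplified here):

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql" // driver used only for this sketch
)

// reparentJournalLength counts the reparent operations this tablet has applied;
// a shorter journal than its peers means the tablet is known to be lagging.
func reparentJournalLength(ctx context.Context, db *sql.DB) (int64, error) {
	var n int64
	err := db.QueryRowContext(ctx, "SELECT COUNT(*) FROM _vt.reparent_journal").Scan(&n)
	return n, err
}

func main() {
	// Placeholder DSN; in practice this would go through the tablet's own
	// MySQL connection handling.
	db, err := sql.Open("mysql", "vt_dba@unix(/path/to/mysql.sock)/")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	n, err := reparentJournalLength(context.Background(), db)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("reparent_journal length:", n)
}
```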

With the new information, we can use the algorithm as follows -

  1. After we stop replication on all the tablets, we first wait for all the relay logs to apply.
  2. Check the reparent_journal length of all the tablets and find the maximum length.
  3. Any tablet that doesn't have this maximum reparent_journal length is known for sure to be lagging behind and cannot be trusted.
  4. From the ones that are left, we run the same algorithm that we currently use, which is to say that any GTID present in 2 tablets is considered to be non-errant. (A sketch of the whole procedure follows below.)
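A minimal, self-contained Go sketch of the proposed procedure, using the simplified GTID-set representation (source UUID to highest contiguous sequence) and illustrative journal lengths. The names, and the handling of a single surviving candidate, are assumptions for the sketch, not Vitess's actual API:

```go
package main

import "fmt"

type gtidSet map[string]int64 // simplified: "u1:1-100" becomes {"u1": 100}

type candidate struct {
	name          string
	position      gtidSet
	journalLength int64 // number of reparent_journal entries on this tablet
}

// trustworthy keeps only candidates whose reparent_journal is as long as the
// longest one seen (steps 2 and 3): anything shorter is known to be lagging.
func trustworthy(cands []candidate) []candidate {
	var maxLen int64
	for _, c := range cands {
		if c.journalLength > maxLen {
			maxLen = c.journalLength
		}
	}
	var out []candidate
	for _, c := range cands {
		if c.journalLength == maxLen {
			out = append(out, c)
		}
	}
	return out
}

// errantGTIDs flags, for one trusted candidate, the transactions that no other
// trusted candidate has (step 4). With no peers left to compare against, this
// sketch flags nothing, assuming the single up-to-date candidate is simply promoted.
func errantGTIDs(c candidate, trusted []candidate) gtidSet {
	var peers []candidate
	for _, o := range trusted {
		if o.name != c.name {
			peers = append(peers, o)
		}
	}
	errant := gtidSet{}
	if len(peers) == 0 {
		return errant
	}
	for uuid, seq := range c.position {
		confirmed := false
		for _, p := range peers {
			if p.position[uuid] >= seq {
				confirmed = true
				break
			}
		}
		if !confirmed {
			errant[uuid] = seq // flagged at source-UUID granularity for brevity
		}
	}
	return errant
}

func main() {
	// Case 2 from this thread: t2 is up to date, t3 and t4 are severely lagged.
	// Journal lengths are illustrative values: the lagged tablets never applied
	// the most recent reparent journal entries.
	case2 := []candidate{
		{"t2", gtidSet{"u1": 100, "u2": 30, "u3": 100}, 3},
		{"t3", gtidSet{"u2": 30, "u3": 50}, 2},
		{"t4", gtidSet{"u2": 30}, 2},
	}
	trusted := trustworthy(case2)
	for _, c := range trusted {
		fmt.Printf("%s errant: %v\n", c.name, errantGTIDs(c, trusted))
	}
	// Prints "t2 errant: map[]": t2 is no longer mis-flagged, because the lagged
	// tablets were filtered out by their shorter reparent journals. Case 3, by
	// contrast, stays undetectable with this information alone.
}
```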

With this, let's see what our new proposed solution would do in all the above listed cases -

Case 1: All three subcases would be handled correctly.
Case 2: This would work correctly too, because we'll know the tablets are lagged from the reparent_journal length.
Case 3: This code wouldn't detect this case. We would just promote tablet t2 without knowing we're losing writes. This is the problem with trying to solve the problem without the full GTID set of the previous primary: there is no way to know that all the tablets we are seeing are lagged.
Case 4: We would do the right thing.
Case 5: We would do the right thing in all cases.
Case 6: In 6a, we would do the right thing; in 6b, we would mark t2 as errant and promote t3. We will lose the u1:101 write, but I don't see any better alternative.
Case 7: We would flag both t2 and t3 as errant. t4 is non-promotable, so we would fail, but like case 6b, we could potentially promote a tablet that is lagging. Again, not sure we can do any better.
Case 8: We would promote t3 despite it having an errant GTID.