deepthi opened this issue 1 month ago
I'll leave this open until we resolve https://github.com/vitessio/vitess/pull/16725#issuecomment-2334842525
Assigned myself to look into the ERS logic and see what scenarios are broken.
So the current logic (and in particular before #16725) is flawed, and I believe contrary to the correct logic: the comparison should be based on `@@gtid_executed` rather than the relay log GTID. One thing to note is whether you necessarily intend to wait for the replica to consume its relay logs (hence, its retrieved_gtid_set). I'll discuss with @GuptaManan100.
I'm gonna work on improving the logic and add some more challenging testing scenarios.
In order to remedy the situation of errant GTIDs and their detection, we have started by adding the capability of errant GTID detection in each vttablet so that if it finds that it has an errant GTID, then it doesn't even join the replication stream. These changes can be found in https://github.com/vitessio/vitess/pull/16833.
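For illustration only (this is not the code from #16833), a minimal Go sketch of that per-tablet check could look like the following. It models a GTID set as a simple map from source UUID to the highest sequence number, and treats any transaction the replica has executed that the primary has not as errant, in which case the tablet would refuse to join the replication stream.

```go
package main

import "fmt"

// GTIDSet is a deliberately simplified model of a MySQL GTID set:
// source UUID -> highest sequence number, assuming gap-free ranges from 1.
type GTIDSet map[string]int64

// hasErrantGTIDs reports whether the replica has executed transactions that
// the primary has never seen. Such transactions cannot have arrived through
// the replication stream from this primary, so they are errant by definition.
func hasErrantGTIDs(replica, primary GTIDSet) bool {
	for uuid, seq := range replica {
		if seq > primary[uuid] {
			return true
		}
	}
	return false
}

func main() {
	primary := GTIDSet{"u1": 100}
	replica := GTIDSet{"u1": 99, "u2": 3} // u2:1-3 never happened on the primary
	// A tablet in this state would refuse to start replicating.
	fmt.Println(hasErrantGTIDs(replica, primary)) // true
}
```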
Once these changes are merged, the situations that we run into in ERS will be greatly reduced. The following situations would be left -
For this discussion we'll make the following assumptions for the topology -
We have 4 tablets, `t1`, `t2`, `t3` and `t4`, and their UUIDs are `u1`, `u2`, `u3` and `u4` respectively. Let's say `t1` is the primary and it is the server that has just crashed, which initiated the ERS.
Case 1: We have a few subcases to consider within this case -
Case 1a: The UUIDs that we'll see in ERS are as follows -
t2: u1:1-100
t3: u1:1-99
t4: u1:1-100
One of the two replicas could be lagged, but we're guaranteed that at least one semi-sync replica will have all the writes that the primary accepted.
Correct Expectation: In this case, we should select `t2` as the new primary since it is the most advanced replica.
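To make "most advanced" concrete, here is a small sketch that reuses the simplified `GTIDSet` model from the earlier comment; `covers` and `mostAdvanced` are illustrative helpers, not Vitess APIs.

```go
// covers reports whether GTID set a contains at least everything in set b.
func covers(a, b GTIDSet) bool {
	for uuid, seq := range b {
		if a[uuid] < seq {
			return false
		}
	}
	return true
}

// mostAdvanced returns a tablet whose GTID set covers every other candidate's
// set, if one exists. For case 1a, t2 (u1:1-100) covers both t3 (u1:1-99) and
// t4 (u1:1-100), so it is a valid most-advanced pick.
func mostAdvanced(candidates map[string]GTIDSet) (string, bool) {
	for name, set := range candidates {
		allCovered := true
		for other, otherSet := range candidates {
			if other != name && !covers(set, otherSet) {
				allCovered = false
				break
			}
		}
		if allCovered {
			return name, true
		}
	}
	return "", false
}
```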
Case 1b: The UUIDs that we'll see in ERS are as follows -
t2: u1:1-100, u2:1-30
t3: u1:1-99, u2:1-30
t4: u1:1-100, u2:1-30
Very similar to the previous case, but slightly more complex for errant GTID detection because we need to check that the GTIDs for all the tablets match on all UUIDs other than `u1` (a small sketch of this check follows after this case).
Like before, one of the two replicas could be lagged.
Correct Expectation: In this case, we should select `t2` as the new primary since it is the most advanced replica.
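A sketch of that check, again with the simplified `GTIDSet` model and illustrative helper names: differences on the dead primary's UUID (`u1`) are treated as replication lag, while any mismatch on another UUID would point to an errant GTID.

```go
// equalExcept reports whether GTID sets a and b are identical on every source
// UUID except skipUUID (the crashed primary's UUID). Differences on skipUUID
// only indicate lag; differences anywhere else indicate an errant GTID.
func equalExcept(a, b GTIDSet, skipUUID string) bool {
	for uuid, seq := range a {
		if uuid != skipUUID && b[uuid] != seq {
			return false
		}
	}
	for uuid, seq := range b {
		if uuid != skipUUID && a[uuid] != seq {
			return false
		}
	}
	return true
}

// allAgreeExcept applies the check pairwise against the first candidate.
func allAgreeExcept(candidates []GTIDSet, skipUUID string) bool {
	for _, c := range candidates {
		if !equalExcept(candidates[0], c, skipUUID) {
			return false
		}
	}
	return true
}
```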
Case 1c: The UUIDs that we'll see in ERS are as follows -
t2: u1:1-100, u2:1-30
t3: u1:1-99, u2:1-30
t4: u1:1-101, u2:1-30
This situation can happen if the primary dies while waiting for a semi-sync ACK for a certain write. That write might be present on the replica.
Correct Expectation: In this case, we can select `t4` as the intermediate source. It is also theoretically correct to select `t2` as the new primary without setting `t4` as an intermediate source; in that case, we'll mark the extra write on `t4` as errant and will have to throw the tablet away.
From this point, we go into more complicated cases -
Case 2: We could be in a situation where only one replica tablet is up to date with the primary that just crashed and all the other tablets are severely lagged. They could be lagged because they came up from a backup (for example, after being rebuilt due to errant GTIDs) or just because they were replicating slowly.
The UUIDs that we'll see in ERS are as follows -
t2: u1:1-100, u2:1-30, u3: 1-100
t3: u2:1-30, u3: 1-50
t4: u2:1-30
Correct Expectation: We should be selecting `t2` as the new primary.
Case 3: No replica has all the writes that the crashed primary had (t1: u1:1-100, u2:1-30, u3:1-100). The only way this happens is if the semi-sync tablet that sent the primary ACKs for the writes it took crashes because of some MySQL issue and has to be brought up from backup. The other two tablets could just be lagging or could be coming up from backup themselves.
The UUIDs that we'll see in ERS are as follows -
t2: u2:1-30, u3: 1-50
t3: u2:1-30, u3: 1-50
t4: u2:1-30
This is a very bad situation for us to be in! There is no tablet that we can promote that won't lose writes! We're in this situation because 2 tablets crashed in a short period of time, a primary tablet, and the replica that sent the semi-sync ACKs.
Correct Expectation: Realize that we are in a situation where we've lost writes and inform the customer. When we run with 1 semi-sync ACK, we expect only one tablet to crash at a time. We always run the risk of multiple tablets failing, and there isn't anything that can be done to save the lost writes, especially if there is no backup that has them.
We may want to promote `t2` to continue to accept writes. This is debatable; we could also fail and wait for human intervention.
Case 4: This case is relatively rare, but it can still happen.
The UUIDs that we'll see in ERS are as follows -
t2: u2:1-31, u3:1-50
t3: u1:1-100, u2:1-30, u3:1-50
t4: u1:1-90, u2:1-30, u3:1-50
Correct Expectation: Mark `t2` as errant, and promote `t3`.
Case 5: This situation can happen if the primary tablet dies while waiting for a semi-sync ACK. The write in question could be present in the rdonly tablet. If the rdonly tablet is also unavailable during the ERS call, then we would proceed further and promote a different tablet without propagating this write, so the old primary and the rdonly will both have this errant GTID.
For this discussion, we'll assume that `t2` was previously a primary that crashed, which introduced errant GTIDs in `t2` and `t4`.
Case 5a: The UUIDs that we'll see in ERS are as follows -
t2: u2:1-20, u3:1-50
t3: u1:1-100, u2:1-30, u3:1-50
t4: u2:1-31, u3:1-50
Correct Expectation: Mark `t4` as errant, and promote `t3`.
Case 5b: The UUIDs that we'll see in ERS are as follows -
t2: u2:1-31, u3:1-50
t3: u1:1-100, u2:1-30, u3:1-50
t4: u2:1-31, u3:1-50
Correct Expectation: Mark `t2` and `t4` as errant, and promote `t3`.
Case 6: We messed up something in Vitess and a heartbeat write or a schema tracker write went through on a replica. Or the user messed up and accidentally wrote something to a replica, maybe while they were debugging something.
Case 6a: The UUIDs that we'll see in ERS are as follows -
t2: u1:1-99, u2:1-31, u3:1-50
t3: u1:1-100, u2:1-30, u3:1-50
t4: u1:1-100, u2:1-30, u3:1-50
Correct Expectation: Mark `t2` as errant, and promote `t3`.
Case 6b: The UUIDs that we'll see in ERS are as follows -
t2: u1:1-101, u2:1-31, u3:1-50
t3: u1:1-100, u2:1-30, u3:1-50
t4: u1:1-100, u2:1-30, u3:1-50
Correct Expectation: This situation is really bad! `t2` has a write `u1:101` that we need to keep for correctness's sake, but it also has an errant GTID that could have been added a long time ago, so we don't want to replicate it to the other replicas out of order. It might also be purged, in which case the replicas would fail to replicate altogether.
Case 7: We messed up real bad and now both replicas have errant GTIDs, or the user messed up real bad.
The UUIDs that we'll see in ERS are as follows -
t2: u1:1-100, u2:1-31, u3:1-50
t3: u1:1-100, u2:1-30, u3:1-51
t4: u1:1-90, u2:1-30, u3:1-50
Correct Expectation: If we had another replica that had all the writes from the current primary, then we should promote it. Otherwise, we can't promote any of them. This is a tricky situation because we don't want to lose the writes that the rdonly doesn't have, but at the same time, we don't want the errant GTIDs.
Case 8: I have written up isolated cases of things going wrong above, but we could have multiple of ☝️ happening together... For example -
The UUIDs that we'll see in ERS are as follows -
t2: u2:1-31, u3:1-50
t3: u1:1-100, u2:1-30, u3:1-51
t4: u2:1-31, u3:1-50
Correct Expectation: All the tablets are messed up. I don't know if we can do anything about this.
The current errant GTID code is not sufficient, especially for case 2, where it just marks the most advanced tablet as errant. So we do need to refactor and fix it. The most correct way, one that takes care of all the cases, would be to store the primary's GTID set in the topo server when it is promoted and use that for errant GTID detection in ERS, but that seems very space intensive and quite frankly overkill.
As a substitute, we propose a different solution that doesn't have any such storage overhead. Currently, in ERS, we use the positions of the tablets and who they're replicating from as the two pieces of information for doing the errant GTID detection. We can augment this by also using the length of the reparent journal. Currently we are using the reparent_journal table as an append-only log that can be used for debugging, but it isn't being used anywhere else.
With the new information, we can use the algorithm as follows -
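As a rough sketch of the core idea (the struct and helper below are illustrative, not actual Vitess types): a tablet whose reparent_journal has fewer rows than the most complete candidate must be lagged, so it is set aside and the errant GTID comparison runs only among the fully caught-up candidates.

```go
// tabletInfo bundles the information ERS would gather per tablet under this
// proposal: its GTID position plus the length of its reparent_journal.
type tabletInfo struct {
	name            string
	gtids           GTIDSet // simplified GTID set model from the sketches above
	reparentJournal int     // number of rows in the tablet's reparent_journal table
}

// splitByJournal separates tablets that have seen every reparent (and can be
// compared against each other for errant GTID detection) from tablets whose
// shorter reparent_journal proves they are simply lagged.
func splitByJournal(tablets []tabletInfo) (current, lagged []tabletInfo) {
	maxRows := 0
	for _, t := range tablets {
		if t.reparentJournal > maxRows {
			maxRows = t.reparentJournal
		}
	}
	for _, t := range tablets {
		if t.reparentJournal < maxRows {
			lagged = append(lagged, t)
		} else {
			current = append(current, t)
		}
	}
	return current, lagged
}
```

Errant GTID detection and candidate selection would then run only over `current`, so a lagged tablet's smaller GTID set no longer makes a more advanced tablet look errant (which is what goes wrong in case 2 today).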
With this, let's see what our new proposed solution would do in all the above listed cases -
Case 1: All the three subcases would be handled correctly.
Case 2: This would work correctly too, because we'll know the tablets are lagged because of the reparent_journal length.
Case 3: This code wouldn't detect this case. We would just promote tablet t2 without knowing we're losing writes. This is the problem with trying to solve the problem without the full GTID set of the previous primary: there is no way to know that all the tablets we are seeing are lagged.
Case 4: We would do the right thing.
Case 5: We would do the right thing in all cases.
Case 6: In 6a, we would do the right thing, and in 6b we would mark `t2` as errant and promote `t3`. We will lose the `u1:101` write, but I don't see any better alternative.
Case 7: We would flag both `t2` and `t3` as errant. `t4` is non-promotable, so we would fail, but like case 6b, we could potentially promote a tablet that is lagging. Again, not sure we can do any better.
Case 8: We would promote `t3` despite it having an errant GTID.
Overview of the Issue
When ERS is evaluating candidates for promotion, it checks whether any of the candidates has an errant GTID. The way this computation is done can produce false positives when there are only two candidates, which can lead ERS to promote a replica that is not actually the most advanced.
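As a toy illustration of how the false positive arises with only two candidates (hypothetical values, reusing the simplified `GTIDSet` model and `hasErrantGTIDs` helper from the sketches in the comments above):

```go
func main() {
	// Only two candidates: t2 is genuinely more advanced, t3 is merely lagged.
	t2 := GTIDSet{"u1": 100}
	t3 := GTIDSet{"u1": 99}

	// With no third source of truth, u1:100 exists on no candidate other than
	// t2, so a naive pairwise check flags it as errant, and ERS can end up
	// promoting t3 and losing the u1:100 write.
	falsePositive := hasErrantGTIDs(t2, t3) // true
	_ = falsePositive
}
```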
Reproduction Steps
This is not easy to reproduce, but can probably be done as follows