We recently observed the following sequence of events:
1. A primary tablet, say A, is running.
2. Something goes wrong (a network partition, for example) and an EmergencyReparentShard (ERS) is triggered.
3. Tablet A ends up with an extra errant GTID that the new primary doesn't have.
4. VTOrc detects that A is not connected to any primary and tries to call SetReplicationSource. It does not run errant GTID detection at this point because there is no primary to compare against.
5. The SetReplicationSource call fails because the vttablet sees that it has an errant GTID.

The problem is that VTOrc only learns about errant GTIDs once a tablet is replicating from another tablet, whereas vttablets run the detection when they start replication. As a result, for a tablet that has an errant GTID and is not replicating from any tablet, VTOrc is unable to run the errant GTID detected recovery.
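To make the gap concrete, here is a minimal, self-contained Go sketch of the two checks described above. It is not Vitess code: the `gtidSet` and `tablet` types and the functions `errantGTIDs`, `orchestratorDetectsErrant`, and `setReplicationSource` are hypothetical names invented for illustration, and GTID sets are modeled as plain sets of sequence numbers rather than intervals. It assumes only what the report states: the orchestrator-side check needs a replication source to compare against, while the tablet-side check runs when replication is being started.

```go
package main

import "fmt"

// gtidSet is a toy GTID set: server UUID -> set of transaction numbers.
// Real GTID sets use intervals; this simplification is an assumption.
type gtidSet map[string]map[int]bool

// errantGTIDs returns the GTIDs present on the tablet but missing from the primary.
func errantGTIDs(tabletSet, primarySet gtidSet) gtidSet {
	errant := gtidSet{}
	for uuid, seqs := range tabletSet {
		for seq := range seqs {
			if !primarySet[uuid][seq] {
				if errant[uuid] == nil {
					errant[uuid] = map[int]bool{}
				}
				errant[uuid][seq] = true
			}
		}
	}
	return errant
}

// tablet is a hypothetical stand-in for a tablet's replication state.
type tablet struct {
	name     string
	executed gtidSet
	source   *tablet // nil when the tablet is not replicating from anyone
}

// orchestratorDetectsErrant models the orchestrator-side check from the report:
// it can only compare GTID sets when the tablet has a replication source.
func orchestratorDetectsErrant(t *tablet) bool {
	if t.source == nil {
		return false // nothing to compare against, so nothing is detected
	}
	return len(errantGTIDs(t.executed, t.source.executed)) > 0
}

// setReplicationSource models the tablet-side check: the tablet refuses to
// start replicating if it has GTIDs the new primary lacks.
func setReplicationSource(t, newPrimary *tablet) error {
	if len(errantGTIDs(t.executed, newPrimary.executed)) > 0 {
		return fmt.Errorf("tablet %s has errant GTIDs relative to %s", t.name, newPrimary.name)
	}
	t.source = newPrimary
	return nil
}

func main() {
	newPrimary := &tablet{name: "B", executed: gtidSet{"uuid-a": {1: true, 2: true}}}
	oldPrimary := &tablet{name: "A", executed: gtidSet{"uuid-a": {1: true, 2: true, 3: true}}}

	// The orchestrator-side check sees no problem because A has no source.
	fmt.Println("detected by orchestrator:", orchestratorDetectsErrant(oldPrimary))
	// The tablet-side check rejects the call, so A never gets a source.
	fmt.Println("set source error:", setReplicationSource(oldPrimary, newPrimary))
}
```

In this model the old primary stays stuck: the only check that could flag the errant GTID requires a source, and the only call that would give it a source is rejected because of the errant GTID.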
Reproduction Steps
Described above
Binary Version
Operating System and Environment details
Log Fragments
No response