vitessio / vitess

Vitess is a database clustering system for horizontal scaling of MySQL.
http://vitess.io
Apache License 2.0

Bug Report: VTOrc is unable to detect errant GTIDs on a recently demoted primary #17254

Open GuptaManan100 opened 2 days ago

GuptaManan100 commented 2 days ago

Overview of the Issue

We recently noticed the following sequence of events:

  1. A primary tablet, let's say A, is running.
  2. Something goes wrong (a network partition, for example), and an ERS is triggered.
  3. Tablet A ends up with an extra errant GTID that the new primary doesn't have.
  4. VTOrc detects that A is not connected to any primary and tries to call SetReplicationSource. It doesn't run errant GTID detection first, because A is not replicating from any primary.
  5. The SetReplicationSource call fails, because the vttablet sees that it has an errant GTID.

The problem is that VTOrc only learns about errant GTIDs once a tablet is replicating from another tablet, whereas vttablets run the detection when they start replication. This means that for a tablet that has an errant GTID and is not replicating from any tablet, VTOrc is unable to run its errant GTID recovery.
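
To make the term concrete, here is a minimal sketch of what "errant GTID" means in the scenario above: a transaction that tablet A has executed but the new primary has not. This is an illustration only and is not Vitess or VTOrc code. The `errant_gtids` helper, the simplified GTID-set model (a dict from server UUID to transaction numbers), and the UUIDs are all assumptions made for the example. MySQL exposes the same set difference as `GTID_SUBTRACT()`.

```python
from typing import Dict, Set

# Simplified model (assumption): a GTID set is {server_uuid: {transaction numbers}}.
GTIDSet = Dict[str, Set[int]]


def errant_gtids(tablet_executed: GTIDSet, primary_executed: GTIDSet) -> GTIDSet:
    """Return the transactions present on the tablet but absent on the primary."""
    errant: GTIDSet = {}
    for uuid, seqs in tablet_executed.items():
        extra = seqs - primary_executed.get(uuid, set())
        if extra:
            errant[uuid] = extra
    return errant


# Hypothetical state after the ERS: A committed transaction 4 that the
# new primary never received, and the new primary has since written its own.
old_primary_a = {"uuid-a": {1, 2, 3, 4}}
new_primary = {"uuid-a": {1, 2, 3}, "uuid-b": {1}}

print(errant_gtids(old_primary_a, new_primary))  # {'uuid-a': {4}}
```

Note that the comparison only needs the two executed GTID sets. It does not require A to be replicating from the new primary, which is the state VTOrc currently waits for before checking.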

Reproduction Steps

Described above

Binary Version

v21 and main

Operating System and Environment details

-

Log Fragments

No response