radondb / xenon

The MySQL Cluster Autopilot Management with GTID and Raft
GNU General Public License v3.0

Follower with less data gets elected as Leader #173

Closed linjianxing closed 9 months ago

linjianxing commented 1 year ago

Brief summary of what happened:

Problems

  1. mysql-1 had been down for an hour and was lagging behind in replication; when it sent a vote request to mysql-0, it should not have been granted the vote.
  2. mysql-1 was never successfully configured as a slave at any point during that hour.

GTID comparison logic

Source code

func (m *Mysql) GTIDGreaterThan(gtid *model.GTID) (bool, model.GTID, error) {
    log := m.log
    this, err := m.GetGTID()
    if err != nil {
        return false, this, err
    }

    // Build "<binlog file>:<zero-padded read position>" strings and compare them
    // lexicographically; the binlog file name dominates the result.
    a := strings.ToUpper(fmt.Sprintf("%s:%016d", this.Master_Log_File, this.Read_Master_Log_Pos))
    b := strings.ToUpper(fmt.Sprintf("%s:%016d", gtid.Master_Log_File, gtid.Read_Master_Log_Pos))
    log.Warning("mysql.gtid.compare.this[%v].from[%v]", this, gtid)
    cmp := strings.Compare(a, b)
    // compare seconds behind master
    if cmp == 0 {
        thislag, err1 := strconv.Atoi(this.Seconds_Behind_Master)
        gtidlag, err2 := strconv.Atoi(gtid.Seconds_Behind_Master)
        if err1 == nil && err2 == nil {
            return (thislag < gtidlag), this, nil
        }
    }
    return cmp > 0, this, nil
}

Problem

mysql-1's binlog file name, mysql-bin.000184, sorts far ahead of mysql-0's mysql-bin.000055. In this situation the comparison above goes wrong: the lexicographic compare of file name and read position decides the vote, even though the logs below show mysql-0's executed GTID set reaching 11feedd0-ee20-11ed-a20f-cafa3766553c:1-13257 while mysql-1 only reaches 1-12278.
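
A minimal, standalone sketch (plain Go using the file names and read positions reported in the logs below, not xenon's own types) of how that comparison comes out:

package main

import (
    "fmt"
    "strings"
)

func main() {
    // Coordinates reported in the vote request / slave status logs below:
    // mysql-1 is the lagging candidate, mysql-0 the voter that holds more data.
    candidate := strings.ToUpper(fmt.Sprintf("%s:%016d", "mysql-bin.000184", 316))   // mysql-1
    voter := strings.ToUpper(fmt.Sprintf("%s:%016d", "mysql-bin.000055", 942795682)) // mysql-0

    // Lexicographic compare: "000184" sorts after "000055", so the candidate
    // "wins" regardless of the positions and regardless of the executed GTID
    // sets (mysql-1 is at ...:1-12278, mysql-0 at ...:1-13257).
    fmt.Println(strings.Compare(candidate, voter) > 0) // true: mysql-0 grants the vote
}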

Detailed log output

2023/05/16 08:06:12.412500 trace.go:37: [ERROR] FOLLOWER[ID:mysql-1.mysql.mysqlcluster:8801, V:226, E:0].mysql.DisableSemiSyncMaster.error[invalid connection]

2023/05/16 08:06:12.412790 trace.go:37: [ERROR] FOLLOWER[ID:mysql-1.mysql.mysqlcluster:8801, V:226, E:0].mysql.StartSlave.error[dial tcp 127.0.0.1:3306: connect: connection refused]

2023/05/16 08:07:54.449018 trace.go:32: [WARNING] FOLLOWER[ID:mysql-0.mysql.mysqlcluster:8801, V:226, E:0].get.heartbeat.from[N:mysql-2.mysql.mysqlcluster:8801, V:228, E:0].update.view

2023/05/16 08:07:56.762338 trace.go:37: [ERROR] FOLLOWER[ID:mysql-1.mysql.mysqlcluster:8801, V:226, E:0].mysql.StartSlave.error[Error 1200: The server is not configured as slave; fix in config file or with CHANGE MASTER TO]

2023/05/16 09:25:49.978911 trace.go:37: [ERROR] FOLLOWER[ID:mysql-1.mysql.mysqlcluster:8801, V:228, E:0].mysql.StartSlave.error[Error 1200: The server is not configured as slave; fix in config file or with CHANGE MASTER TO]

2023/05/16 09:25:51.866905 mysql.go:95: [ERROR] mysql[127.0.0.1:3306].ping.error[invalid connection].downs:0,downslimits:3

2023/05/16 09:25:53.055465 trace.go:37: [ERROR] LEADER[ID:mysql-2.mysql.mysqlcluster:8801, V:228, E:0].mysql.enable.semi-sync.error[dial tcp 127.0.0.1:3306: connect: connection refused]

2023/05/16 09:25:59.982882 trace.go:32: [WARNING] LEADER[ID:mysql-2.mysql.mysqlcluster:8801, V:228, E:0].feel.mysql.down.degrade.to.follower

2023/05/16 09:25:59.982909 trace.go:32: [WARNING] LEADER[ID:mysql-2.mysql.mysqlcluster:8801, V:228, E:0].degrade.to.follower.stop.the.vip...

 2023/05/16 09:25:59.994785 trace.go:37: [ERROR] LEADER[ID:mysql-2.mysql.mysqlcluster:8801, V:228, E:0].leaderStopShellCommand[[-c /xenonchecker leaderStop]].out[bash -c /xenonchecker leaderStoptime="2023-05-16T09:25:59Z" level=info msg="debug flag set to false"

time="2023-05-16T09:25:59Z" level=info msg="leader stop started"

2023/05/16 09:25:59.994835 trace.go:32: [WARNING] FOLLOWER[ID:mysql-2.mysql.mysqlcluster:8801, V:228, E:0].leader.state.machine.exit.done

 2023/05/16 09:25:59.994839 trace.go:27: [INFO] FOLLOWER[ID:mysql-2.mysql.mysqlcluster:8801, V:228, E:0].start.CheckBrainSplit

2023/05/16 09:26:07.984764 trace.go:32: [WARNING] FOLLOWER[ID:mysql-1.mysql.mysqlcluster:8801, V:228, E:0].timeout.to.do.new.election

2023/05/16 09:26:07.987555 trace.go:32: [WARNING] FOLLOWER[ID:mysql-1.mysql.mysqlcluster:8801, V:228, E:0].timeout.and.ping.almost.node.successed.promote.to.candidate

 2023/05/16 09:26:07.987748 trace.go:32: [WARNING] CANDIDATE[ID:mysql-1.mysql.mysqlcluster:8801, V:229, E:0].prepare.send.requestvote.to[mysql-0.mysql.mysqlcluster:8801]

2023/05/16 09:26:07.987749 trace.go:32: [WARNING] CANDIDATE[ID:mysql-1.mysql.mysqlcluster:8801, V:229, E:0].prepare.send.requestvote.to[mysql-2.mysql.mysqlcluster:8801]

 2023/05/16 09:26:07.990584 trace.go:32: [WARNING] FOLLOWER[ID:mysql-0.mysql.mysqlcluster:8801, V:228, E:0].timeout.to.do.new.election

2023/05/16 09:26:07.990644 trace.go:32: [WARNING] CANDIDATE[ID:mysql-1.mysql.mysqlcluster:8801, V:229, E:0].send.requestvote.to.peer[mysql-2.mysql.mysqlcluster:8801].request.gtid[{mysql-bin.000184 316 true true 11feedd0-ee20-11ed-a20f-cafa3766553c:1-12278,
4f732fa9-c310-11ed-b388-42d2b93937c0:1-94,
695d8469-c310-11ed-a761-f6d1a093564e:1-25346104,
8d76a71c-c310-11ed-9fae-be7250e07a8c:1-34854363 0 }]

2023/05/16 09:26:07.991118 trace.go:32: [WARNING] CANDIDATE[ID:mysql-1.mysql.mysqlcluster:8801, V:229, E:0].send.requestvote.to.peer[mysql-0.mysql.mysqlcluster:8801].request.gtid[{mysql-bin.000184 316 true true 11feedd0-ee20-11ed-a20f-cafa3766553c:1-12278,
4f732fa9-c310-11ed-b388-42d2b93937c0:1-94,
695d8469-c310-11ed-a761-f6d1a093564e:1-25346104,
8d76a71c-c310-11ed-9fae-be7250e07a8c:1-34854363 0 }]

2023/05/16 09:26:07.993189 api.go:285: [INFO] mysql.slave.status:&{mysql-bin.000055 942795682 false Connecting true Yes 11feedd0-ee20-11ed-a20f-cafa3766553c:1-13257 11feedd0-ee20-11ed-a20f-cafa3766553c:1-13257,
4f732fa9-c310-11ed-b388-42d2b93937c0:1-94,
695d8469-c310-11ed-a761-f6d1a093564e:1-25346104,
8d76a71c-c310-11ed-9fae-be7250e07a8c:1-34854363 Slave has read all relay log; waiting for more updates error reconnecting to master 'radondb_repl@mysql-2.mysql.mysqlcluster:3306' - retry-time: 60 retries: 1 message: Lost connection to MySQL server at 'reading initial communication packet', system error: 104 }

2023/05/16 09:26:07.993567 trace.go:32: [WARNING] CANDIDATE[ID:mysql-1.mysql.mysqlcluster:8801, V:229, E:0].send.requestvote.done.to[mysql-2.mysql.mysqlcluster:8801]

 2023/05/16 09:26:07.996706 trace.go:32: [WARNING] FOLLOWER[ID:mysql-2.mysql.mysqlcluster:8801, V:228, E:0].get.voterequest.from[mysql-1.mysql.mysqlcluster:8801].request[{mysql-bin.000184 316 true true 11feedd0-ee20-11ed-a20f-cafa3766553c:1-12278,

2023/05/16 09:26:09.993445 trace.go:32: [WARNING] CANDIDATE[ID:mysql-1.mysql.mysqlcluster:8801, V:229, E:0].get.vote.response.from[N:mysql-2.mysql.mysqlcluster:8801, R:FOLLOWER].rsp.gtid[{ 0 false false }].retcode[ErrorMySQLDown]

2023/05/16 09:26:09.993450 trace.go:32: [WARNING] CANDIDATE[ID:mysql-1.mysql.mysqlcluster:8801, V:229, E:0].get.vote.response.from[N:mysql-2.mysql.mysqlcluster:8801, V:228].error[ErrorMySQLDown].peers.number[3]

 2023/05/16 09:26:09.993456 trace.go:32: [WARNING] CANDIDATE[ID:mysql-1.mysql.mysqlcluster:8801, V:229, E:0].get.vote.response.from[N:, R:].rsp.gtid[{ 0 false false }].retcode[ErrorRpcCall]

 2023/05/16 09:26:09.993463 trace.go:32: [WARNING] CANDIDATE[ID:mysql-1.mysql.mysqlcluster:8801, V:229, E:0].get.vote.response.from[N:, V:0].error[ErrorRpcCall].but.not.downgrade.to.follower

 2023/05/16 09:26:09.993523 trace.go:32: [WARNING] CANDIDATE[ID:mysql-1.mysql.mysqlcluster:8801, V:229, E:0].get.voterequest.from[{Raft:{EpochID:0 ViewID:229 Leader: From:mysql-0.mysql.mysqlcluster:8801 To:mysql-1.mysql.mysqlcluster:8801 State:} Repl:{Master_Host: Master_Port:0 Repl_User: Repl_Password: Repl_GTID_Purged:} GTID:{Master_Log_File:mysql-bin.000055 Read_Master_Log_Pos:942795682 Relay_Master_Log_File: Slave_IO_Running:false Slave_IO_Running_Str:No Slave_SQL_Running:true Slave_SQL_Running_Str:Yes Retrieved_GTID_Set:11feedd0-ee20-11ed-a20f-cafa3766553c:1-13257 Executed_GTID_Set:11feedd0-ee20-11ed-a20f-cafa3766553c:1-13257,
4f732fa9-c310-11ed-b388-42d2b93937c0:1-94,
695d8469-c310-11ed-a761-f6d1a093564e:1-25346104,
8d76a71c-c310-11ed-9fae-be7250e07a8c:1-34854363 Seconds_Behind_Master: Slave_SQL_Running_State:Slave has read all relay log; waiting for more updates Last_Error: Last_IO_Error:error reconnecting to master 'radondb_repl@mysql-2.mysql.mysqlcluster:3306' - retry-time: 60 retries: 1 message: Lost connection to MySQL server at 'reading initial communication packet', system error: 104 Last_SQL_Error:} Peers:[] IdlePeers:[]}]

 2023/05/16 09:26:09.998803 trace.go:37: [ERROR] CANDIDATE[ID:mysql-1.mysql.mysqlcluster:8801, V:229, E:0].send.ping.to.peer[mysql-0.mysql.mysqlcluster:8801].client.call.error[rpc.client.call[RaftRPC.Ping].timeout[2000]]

 2023/05/16 09:26:10.013119 trace.go:32: [WARNING] FOLLOWER[ID:mysql-2.mysql.mysqlcluster:8801, V:228, E:0].timeout.to.do.new.election

 2023/05/16 09:26:18.000292 trace.go:32: [WARNING] CANDIDATE[ID:mysql-1.mysql.mysqlcluster:8801, V:230, E:0].get.vote.response.from[N:mysql-2.mysql.mysqlcluster:8801, R:FOLLOWER].rsp.gtid[{ 0 false false }].retcode[ErrorMySQLDown]
 2023/05/16 09:26:18.000342 trace.go:27: [INFO] CANDIDATE[ID:mysql-1.mysql.mysqlcluster:8801, V:230, E:0].get.vote.response.from[N:mysql-0.mysql.mysqlcluster:8801, V:229].ok.votegranted[2].majoyrity[2]
 2023/05/16 09:26:18.003650 api.go:104:      [WARNING]  mysql.gtid.compare.this[{mysql-bin.000055 942795682  false No true Yes 11feedd0-ee20-11ed-a20f-cafa3766553c:1-13257 11feedd0-ee20-11ed-a20f-cafa3766553c:1-13257,
4f732fa9-c310-11ed-b388-42d2b93937c0:1-94,
695d8469-c310-11ed-a761-f6d1a093564e:1-25346104,
8d76a71c-c310-11ed-9fae-be7250e07a8c:1-34854363  Slave has read all relay log; waiting for more updates  error reconnecting to master 'radondb_repl@mysql-2.mysql.mysqlcluster:3306' - retry-time: 60 retries: 1 message: Lost connection to MySQL server at 'reading initial communication packet', system error: 104 }].from[&{mysql-bin.000184 316  true  true   11feedd0-ee20-11ed-a20f-cafa3766553c:1-12278,
4f732fa9-c310-11ed-b388-42d2b93937c0:1-94,
695d8469-c310-11ed-a761-f6d1a093564e:1-25346104,
8d76a71c-c310-11ed-9fae-be7250e07a8c:1-34854363 0    }]
2023/05/16 09:26:18.003659 trace.go:32:      [WARNING]  FOLLOWER[ID:mysql-0.mysql.mysqlcluster:8801, V:229, E:0].do.updateViewID[FROM:229 TO:230]
2023/05/16 09:26:18.003664 trace.go:32:      [WARNING]  FOLLOWER[ID:mysql-0.mysql.mysqlcluster:8801, V:230, E:0].get.requestvote.from[N:mysql-1.mysql.mysqlcluster:8801, V:230, E:0].vote.for.this.candidate
 2023/05/16 09:26:23.000342 trace.go:32:         [WARNING]  CANDIDATE[ID:mysql-1.mysql.mysqlcluster:8801, V:230, E:0].get.enough.votes[2]/members[3].become.leader
zjj1002 commented 1 year ago

For comparison, the corresponding logic in MHA works like this (a rough code sketch follows the list):

  1. Relay log position: MHA checks the relay log position of each slave. The relay log is temporary storage for the replication events received from the master. The slave with the highest relay log position is likely to have the most recent data.
  2. Replication lag: MHA considers the replication lag of each slave. Replication lag is the delay between the master making a change and that change being applied to the slave. MHA prefers a slave with lower replication lag, since that indicates more up-to-date data.
  3. Heartbeat status: MHA continuously monitors the heartbeat status of each slave. Heartbeat is the mechanism MHA uses to check the availability and connectivity of the slaves. If a slave's heartbeat is not received within a certain timeout period, it may be considered less reliable and not be chosen as the new master.
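
A hypothetical Go sketch of that selection order (the type and field names are my own illustration, not MHA code): drop slaves whose heartbeat has timed out, prefer the highest relay log coordinates, and break ties with the lower replication lag.

package main

import (
    "fmt"
    "sort"
    "time"
)

// Candidate is an illustrative view of one slave's state; the fields are
// hypothetical, not MHA's actual data structures.
type Candidate struct {
    Name          string
    RelayLogFile  string        // master binlog file recorded in the relay log, e.g. "mysql-bin.000184"
    RelayLogPos   uint64        // read position within that file
    ReplicaLag    time.Duration // seconds behind master
    LastHeartbeat time.Time
}

// pickNewMaster filters out slaves whose heartbeat has timed out, then sorts the
// rest by relay log coordinates (descending) and replication lag (ascending).
func pickNewMaster(cands []Candidate, heartbeatTimeout time.Duration, now time.Time) (Candidate, bool) {
    var alive []Candidate
    for _, c := range cands {
        if now.Sub(c.LastHeartbeat) <= heartbeatTimeout {
            alive = append(alive, c)
        }
    }
    if len(alive) == 0 {
        return Candidate{}, false
    }
    sort.Slice(alive, func(i, j int) bool {
        // Fixed-width "mysql-bin.NNNNNN" names compare correctly as strings.
        if alive[i].RelayLogFile != alive[j].RelayLogFile {
            return alive[i].RelayLogFile > alive[j].RelayLogFile
        }
        if alive[i].RelayLogPos != alive[j].RelayLogPos {
            return alive[i].RelayLogPos > alive[j].RelayLogPos
        }
        return alive[i].ReplicaLag < alive[j].ReplicaLag
    })
    return alive[0], true
}

func main() {
    now := time.Now()
    best, _ := pickNewMaster([]Candidate{
        {Name: "mysql-0", RelayLogFile: "mysql-bin.000055", RelayLogPos: 942795682, ReplicaLag: 0, LastHeartbeat: now},
        {Name: "mysql-1", RelayLogFile: "mysql-bin.000055", RelayLogPos: 942700000, ReplicaLag: 3600 * time.Second, LastHeartbeat: now},
    }, 5*time.Second, now)
    fmt.Println("new master:", best.Name) // mysql-0: higher relay log position
}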