taosdata / TDengine

High-performance, scalable time-series database designed for Industrial IoT (IIoT) scenarios
https://tdengine.com
GNU Affero General Public License v3.0

If the TDengine 3 leader crashes, a new leader cannot be elected in K8s #20399

Open edisonX-sudo opened 1 year ago

edisonX-sudo commented 1 year ago

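Bug Description TDengine 3.0.2.5 was deployed with the official K8s scripts and configured with 3 mnodes. Once the machine hosting the leader goes down, the election breaks and no leader is ever elected, so the whole cluster stops working.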

To Reproduce

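  1. Use the official K8s scripts (https://docs.taosdata.com/deployment/k8s/) to create a TDengine cluster with 3 replicas.
  2. Log in to the cluster and run create mnode on dnode ; so that all 3 replicas become mnodes.
  3. Delete the leader's replica (pod) with a kubectl command in K8s; the replica (pod) is restarted automatically afterwards.
  4. Even after the automatic restart completes, running show mnodes; with the taos CLI on the second replica (pod) shows that no leader is ever elected.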
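
For anyone trying to reproduce this, steps 2 to 4 can be sketched as a small script. This is only a sketch under assumptions: the namespace experimentb and the pod names tdengine-0/1/2 are read off the host names in the logs in this issue, the dnode id is passed as a parameter because the issue does not state it, and the helper names (run, taos_sql, create_mnode, delete_leader_pod, show_mnodes) are made up for illustration. By default it only prints the commands instead of running them.

```shell
#!/usr/bin/env bash
# Sketch of reproduction steps 2-4. NAMESPACE and the pod names are assumptions
# taken from the host names in the logs; adjust them to your cluster.
set -euo pipefail

NAMESPACE="${NAMESPACE:-experimentb}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# run CMD...: print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# taos_sql POD SQL: run one SQL statement through the taos CLI inside a pod.
taos_sql() {
  run kubectl exec "$1" -n "$NAMESPACE" -- taos -s "$2"
}

# Step 2: create_mnode POD DNODE_ID (the dnode id is supplied by the caller).
create_mnode() {
  taos_sql "$1" "create mnode on dnode $2;"
}

# Step 3: delete_leader_pod POD; the StatefulSet recreates the pod afterwards.
delete_leader_pod() {
  run kubectl delete pod "$1" -n "$NAMESPACE"
}

# Step 4: show_mnodes POD, to inspect the mnode roles from another pod.
show_mnodes() {
  taos_sql "$1" "show mnodes;"
}
```

With the default DRY_RUN=1, calling delete_leader_pod tdengine-0 only prints the kubectl delete command; set DRY_RUN=0 to run it against a real cluster. Which pod actually hosts the leader has to be checked with show mnodes; first.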

Expected Behavior The election succeeds and a new leader is chosen.

Screenshots image

Environment (please complete the following information):

Additional Context Relevant logs from the pod of the original leader (mnode_id=1) after it restarted

03/10 02:04:12.273101 00000090 SYN vgId:1, begin election, sync:candidate, term:12, commit-index:10, first-ver:0, last-ver:26, min:-1, snap:10, snap-term:1, elect-times:9, as-leader-times:0, cfg-ch-times:0, hb-slow:0, hbr-slow:0, aq-items:-1, snaping:-1, replicas:3, last-cfg:-1, chging:0, restore:0, quorum:2, elect-lc-timer:10, hb:0, buffer:[10 10 26, 27), repl-mgrs:{0:0 [0 0, 0), 1:0 [0 0, 0), 2:0 [0 0, 0)}, members:{num:3, as:0, [tdengine-0.taosd.experimentb.svc.cluster.local:6030, tdengine-1.taosd.experimentb.svc.cluster.local:6030, tdengine-2.taosd.experimentb.svc.cluster.local:6030]}, hb:{0:1678413805684,1:1678413805684,2:1678413805684}, hb-reply:{0:1678413805684,1:1678413805684,2:1678413805684}
03/10 02:04:12.292439 00000090 SYN vgId:1, succeed to write raft store file:/var/lib/taos//mnode/sync/raft_store.json, term:13
03/10 02:04:12.311026 00000090 SYN vgId:1, succeed to write raft store file:/var/lib/taos//mnode/sync/raft_store.json, term:13
03/10 02:04:12.329363 00000090 SYN vgId:1, succeed to write raft store file:/var/lib/taos//mnode/sync/raft_store.json, term:13
03/10 02:04:13.545751 00000098 DND ERROR failed to send status req since Sync not leader, epSet:{tdengine-2.taosd.experimentb.svc.cluster.local:6030, tdengine-0.taosd.experimentb.svc.cluster.local:6030, tdengine-1.taosd.experimentb.svc.cluster.local:6030}, inUse:0
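
The election log line above reports replicas:3 and quorum:2, which is consistent with the usual Raft majority rule. A minimal sketch of that arithmetic, assuming the sync module uses a simple majority (the helper names quorum and can_elect are illustrative, not TDengine APIs):

```shell
# quorum N: votes needed to win an election among N voting replicas,
# assuming the standard majority rule floor(N/2) + 1.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# can_elect N ALIVE: succeeds if ALIVE replicas are enough to form a majority.
can_elect() {
  [ "$2" -ge "$(quorum "$1")" ]
}
```

Under this assumption, quorum 3 gives 2, so the two surviving mnodes should be able to elect a leader on their own while the third pod is restarting, which is why the stuck election looks like a bug rather than an expected loss of quorum.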

yu285 commented 1 year ago

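Delete the leader's replica (pod) with a kubectl command in K8s; the replica (pod) is restarted automatically afterwards.

What exact command did you run for this step?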

edisonX-sudo commented 1 year ago

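Delete the leader's replica (pod) with a kubectl command in K8s; the replica (pod) is restarted automatically afterwards.

What exact command did you run for this step?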

kubectl delete pod 'pod_name' -n 'namespace'