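In a K8S deployment, I want to run MNode in a high-availability setup. After initializing a cluster with the default configuration, I created mnodes with the create mnode on datanode x command; after the cluster was restarted, it failed to recover.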
Bug Description
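1. Deploy a tdengine-3.0.3.0 cluster in a k8s environment (3mnode-3dnodes-1replica).
After deleting the pod tdengine3-0 (mnode) and waiting for the new pod to be created, querying the database with the client shows tdengine3-1 and tdengine3-2 in the offline state.
Querying data fails with "DB error: Fail to get table info, error: Sync not leader".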
04/25 09:42:48.114977 00000101 DND ERROR failed to send status req since Sync not leader, epSet:{tdengine-test-0.taosd-test.iot-middleware.svc.cluster.local:6030, tdengine-test-1.taosd-test.iot-middleware.svc.cluster.local:6030, tdengine-test-2.taosd-test.iot-middleware.svc.cluster.local:6030}, inUse:0
Cluster recovery fails and all three dnodes go offline.
04/25 09:45:34.278141 00000095 SYN ERROR vgId:1, sync send msg by id error, epset:(nil) dnode:0 addr:0 err:0x800009ff
04/25 09:45:34.459281 00000118 MND dnode:1, in offline state
04/25 09:45:34.459334 00000118 MND dnode:2, in offline state
04/25 09:45:34.459342 00000118 MND dnode:3, in offline state
04/25 09:45:34.866014 00000114 MND dnode:3, from offline to online, memory avail:18916853351 total:21037367296 cores:4.00
04/25 09:45:35.009223 00000114 MND dnode:2, mnode syncState from leader to follower, restoreState from 1 to 1
04/25 09:45:35.009253 00000114 MND dnode:2, from offline to online, memory avail:15051386471 total:16742404096 cores:4.00
04/25 09:45:35.308041 00000095 SYN ERROR vgId:1, sync send msg by id error, epset:(nil) dnode:0 addr:0 err:0x800009ff
04/25 09:45:35.461796 00000114 MND tq timer, rebalance counter old val:0
04/25 09:45:35.461854 00000114 MND mq rebalance finished, no modification
04/25 09:45:35.461862 00000114 MND rebalance trans end, rebalance counter:0
04/25 09:45:36.287979 00000095 SYN ERROR vgId:1, sync send msg by id error, epset:(nil) dnode:0 addr:0 err:0x800009ff
04/25 09:45:36.579493 00000116 SYN vgId:1, succeed to write raft store file:/var/lib/taos//mnode/sync/raft_store.json, term:46
04/25 09:45:36.579559 00000116 MND vgId:1, become follower
04/25 09:45:36.579583 00000116 SYN vgId:1, reset sync log buffer. buffer: [22 278 278, 279)
04/25 09:45:36.587952 00000116 SYN vgId:1, succeed to write raft store file:/var/lib/taos//mnode/sync/raft_store.json, term:46
04/25 09:45:36.595919 00000116 SYN vgId:1, succeed to write raft store file:/var/lib/taos//mnode/sync/raft_store.json, term:46
04/25 09:45:36.595976 00000116 SYN vgId:1, recv sync-request-vote from dnode:3, {term:46, last-index:278, last-term:45}, granted:1, sync:follower, term:46, commit-index:278, first-ver:0, last-ver:278, min:-1, snap:199, snap-term:2, elect-times:9, as-leader-times:8, cfg-ch-times:1, hb-slow:0, hbr-slow:0, aq-items:-1, snaping:-1, replicas:3, last-cfg:-1, chging:0, restore:1, quorum:2, elect-lc-timer:2734, hb:0, buffer:[22 278 278, 279), repl-mgrs:{0:0 [0 0, 0), 1:0 [0 0, 0), 2:0 [0 0, 0)}, members:{num:3, as:1, [tdengine-test-0.taosd-test.iot-middleware.svc.cluster.local:6030, tdengine-test-1.taosd-test.iot-middleware.svc.cluster.local:6030, tdengine-test-2.taosd-test.iot-middleware.svc.cluster.local:6030]}, hb:{0:1682386935555,1:1682386015536,2:1682387128640}, hb-reply:{0:1682386015536,1:1682386015536,2:1682387132270}
04/25 09:45:36.596001 00000116 SYN vgId:1, send sync-request-vote-reply to dnode:3 {term:46, grant:1}, , sync:follower, term:46, commit-index:278, first-ver:0, last-ver:278, min:-1, snap:199, snap-term:2, elect-times:9, as-leader-times:8, cfg-ch-times:1, hb-slow:0, hbr-slow:0, aq-items:-1, snaping:-1, replicas:3, last-cfg:-1, chging:0, restore:1, quorum:2, elect-lc-timer:2734, hb:0, buffer:[22 278 278, 279), repl-mgrs:{0:0 [0 0, 0), 1:0 [0 0, 0), 2:0 [0 0, 0)}, members:{num:3, as:1, [tdengine-test-0.taosd-test.iot-middleware.svc.cluster.local:6030, tdengine-test-1.taosd-test.iot-middleware.svc.cluster.local:6030, tdengine-test-2.taosd-test.iot-middleware.svc.cluster.local:6030]}, hb:{0:1682386935555,1:1682386015536,2:1682387128640}, hb-reply:{0:1682386015536,1:1682386015536,2:1682387132270}
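The dnode state transitions in the log excerpt above can be tallied with a small script. This is a minimal Python sketch, not part of the original report: the regular expression is an assumption based only on the two line shapes visible in this excerpt ("MND dnode:N, in offline state" and "MND dnode:N, from offline to online, ..."), and `dnode_states` is a hypothetical helper name.

```python
import re

# Assumed line shapes, taken from the excerpt above:
#   04/25 09:45:34.459281 00000118 MND dnode:1, in offline state
#   04/25 09:45:34.866014 00000114 MND dnode:3, from offline to online, memory avail:...
STATE_RE = re.compile(
    r"^(?P<ts>\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d+) \d+ MND dnode:(?P<id>\d+), "
    r"(?:in (?P<state>\w+) state|from (?P<old>\w+) to (?P<new>\w+))"
)


def dnode_states(lines):
    """Return {dnode_id: (last_seen_state, timestamp)} from MND status lines."""
    states = {}
    for line in lines:
        m = STATE_RE.match(line.strip())
        if not m:
            continue  # e.g. the "mnode syncState from leader to follower" line
        state = m.group("state") or m.group("new")
        states[int(m.group("id"))] = (state, m.group("ts"))
    return states


# Sample lines copied verbatim from the log above.
SAMPLE = """\
04/25 09:45:34.459281 00000118 MND dnode:1, in offline state
04/25 09:45:34.459334 00000118 MND dnode:2, in offline state
04/25 09:45:34.459342 00000118 MND dnode:3, in offline state
04/25 09:45:34.866014 00000114 MND dnode:3, from offline to online, memory avail:18916853351 total:21037367296 cores:4.00
04/25 09:45:35.009223 00000114 MND dnode:2, mnode syncState from leader to follower, restoreState from 1 to 1
04/25 09:45:35.009253 00000114 MND dnode:2, from offline to online, memory avail:15051386471 total:16742404096 cores:4.00
""".splitlines()

for dnode_id, (state, ts) in sorted(dnode_states(SAMPLE).items()):
    print(f"dnode:{dnode_id} {state} (last seen {ts})")
```

On these sample lines the sketch reports dnode 2 and dnode 3 as online and dnode 1 as offline: within this excerpt, dnode 1 never logs a transition back to online.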
4. Restart all nodes in the cluster
tdengine-test-0 fails to start
04/25 09:49:23.291385 00000061 MND sdb table:stream is cleaned up
04/25 09:49:23.291393 00000061 MND sdb table:subscribe is cleaned up
04/25 09:49:23.291401 00000061 MND sdb table:consumer is cleaned up
04/25 09:49:23.291409 00000061 MND sdb table:topic is cleaned up
04/25 09:49:23.291417 00000061 MND sdb table:vgroup is cleaned up
04/25 09:49:23.291425 00000061 MND sdb table:sma is cleaned up
04/25 09:49:23.291433 00000061 MND sdb table:stb is cleaned up
04/25 09:49:23.291441 00000061 MND sdb table:db is cleaned up
04/25 09:49:23.291449 00000061 MND sdb table:func is cleaned up
04/25 09:49:23.291457 00000061 MND sdb table:idx is cleaned up
04/25 09:49:23.291464 00000061 MND sdb is cleaned up
04/25 09:49:23.291470 00000061 MND mnode-wal will cleanup
04/25 09:49:23.302234 00000061 MND ERROR failed to open mnode since Invalid host name
04/25 09:49:23.302270 00000061 MND start to close mnode
04/25 09:49:23.302279 00000061 MND mnode is closed
04/25 09:49:23.302287 00000061 DND ERROR failed to open mnode since Invalid host name
04/25 09:49:23.302296 00000061 DND ERROR node:mnode, failed to open since Invalid host name
04/25 09:49:23.302303 00000061 DND ERROR node:mnode, failed to open since Invalid host name
04/25 09:49:23.302309 00000061 DND ERROR failed to open nodes since Invalid host name
04/25 09:49:23.302317 00000061 DND shutting down the service
04/25 09:49:23.682260 00000061 WAL wal module is cleaned up
04/25 09:49:23.682298 00000061 UDF udfd start to stop, need cleanup:1, spawn err:0
04/25 09:49:23.683144 00000061 UDF udfd is cleaned up
04/25 09:49:23.768299 00000061 DND dnode env is cleaned up
04/25 09:49:57.103861 00000035 TAOS_ADAPTER info "init plugin influxdb/v1" model=plugin
[GIN-debug] POST /influxdb/v1/write --> github.com/taosdata/taosadapter/v3/plugin/influxdb.(*Influxdb).write-fm (7 handlers)
04/25 09:49:57.104093 00000035 TAOS_ADAPTER info "init plugin node_exporter/v1" model=plugin
04/25 09:49:57.106273 00000035 TAOS_ADAPTER info "node_exporter disabled" model=NodeExporter
04/25 09:49:57.106292 00000035 TAOS_ADAPTER info "init plugin opentsdb/v1" model=plugin
[GIN-debug] POST /opentsdb/v1/put/json/:db --> github.com/taosdata/taosadapter/v3/plugin/opentsdb.(*Plugin).insertJson-fm (7 handlers)
[GIN-debug] POST /opentsdb/v1/put/telnet/:db --> github.com/taosdata/taosadapter/v3/plugin/opentsdb.(*Plugin).insertTelnet-fm (7 handlers)
04/25 09:49:57.106734 00000035 TAOS_ADAPTER info "all plugin init finish" model=plugin
04/25 09:49:57.106743 00000035 TAOS_ADAPTER info "all plugin start finish" model=plugin
04/25 09:49:57.106870 00000035 TAOS_ADAPTER info "Running in terminal." model=main
04/25 09:49:57.109020 00000035 TAOS_ADAPTER info "server on : 6041" model=main
2: service ok
execute create dnode
Welcome to the TDengine Command Line Interface, Client Version:3.0.3.0
Copyright (c) 2022 by TDengine, all rights reserved.
failed to connect to server, reason: Sync not leader
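Since tdengine-test-0 aborts with "Invalid host name", one thing worth checking is whether the peer host names resolve from inside the pod at startup. The sketch below is only a diagnostic aid under that assumption; the report does not confirm that name resolution is the cause. The endpoints are copied from the epSet in the logs above, and `unresolved_hosts` is a hypothetical helper name.

```python
import socket

# Endpoints copied from the epSet printed in the logs above.
ENDPOINTS = [
    "tdengine-test-0.taosd-test.iot-middleware.svc.cluster.local:6030",
    "tdengine-test-1.taosd-test.iot-middleware.svc.cluster.local:6030",
    "tdengine-test-2.taosd-test.iot-middleware.svc.cluster.local:6030",
]


def unresolved_hosts(endpoints, resolve=socket.gethostbyname):
    """Return the host names in `endpoints` (host:port) that fail to resolve.

    `resolve` is injectable so the check can be exercised without real DNS.
    """
    failed = []
    for endpoint in endpoints:
        host = endpoint.rsplit(":", 1)[0]
        try:
            resolve(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            failed.append(host)
    return failed
```

Calling `unresolved_hosts(ENDPOINTS)` from inside each pod right after a full restart would show whether any of the three names is unresolvable at that moment. An empty list would point away from DNS as the cause.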
Additional Context
To Reproduce
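1. Create a new cluster
2. Create the mnodes
3. Restart the mnode leader node, tdengine-test-0
Cluster leader election fails
Cluster recovery fails and all three dnodes go offline
4. Restart all nodes in the cluster
tdengine-test-0 fails to start
The tdengine-test-1 and tdengine-test-2 nodes also fail to start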
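Steps 2 to 4 above can be sketched as a sequence of kubectl commands. This is an illustrative Python sketch that only builds and prints the commands. It rests on several assumptions not stated in the report: the namespace `iot-middleware` is inferred from the FQDNs in the logs, the dnode ids 2 and 3 are guesses, "restart" is taken to mean deleting the pod so the StatefulSet recreates it, and the SQL uses the TDengine 3.x form `create mnode on dnode <id>` rather than the wording in the report. Step 1 (cluster creation) is not covered.

```python
import shlex

NAMESPACE = "iot-middleware"  # assumption: inferred from the FQDNs in the logs
PODS = ["tdengine-test-0", "tdengine-test-1", "tdengine-test-2"]
LEADER_POD = PODS[0]


def taos_cmd(pod, sql):
    """Build a kubectl command that runs one SQL statement via the taos CLI in a pod."""
    return ["kubectl", "exec", "-n", NAMESPACE, pod, "--", "taos", "-s", sql]


def repro_commands():
    """Return the commands for steps 2-4 as argument lists (nothing is executed)."""
    cmds = []
    # Step 2: create mnodes on the other two dnodes (ids 2 and 3 are assumed).
    cmds += [taos_cmd(LEADER_POD, f"create mnode on dnode {i}") for i in (2, 3)]
    # Step 3: restart the mnode leader by deleting its pod, then inspect dnode status.
    cmds.append(["kubectl", "delete", "pod", "-n", NAMESPACE, LEADER_POD])
    cmds.append(taos_cmd(PODS[1], "show dnodes"))
    # Step 4: restart every node in the cluster.
    cmds.append(["kubectl", "delete", "pod", "-n", NAMESPACE, *PODS])
    return cmds


for cmd in repro_commands():
    print(shlex.join(cmd))
```

Printing rather than executing keeps the sketch safe to run anywhere; each printed line can be pasted into a shell that has access to the cluster.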
k8s statefulset configuration