taosdata / TDengine

High-performance, scalable time-series database designed for Industrial IoT (IIoT) scenarios
https://tdengine.com
GNU Affero General Public License v3.0

TDengine on K8s: after creating multiple mnodes and restarting a node, cluster recovery fails #21063

Closed manmao closed 1 year ago

manmao commented 1 year ago

Additional Context

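In a K8s deployment, I want to run the mnodes in a high-availability setup. After initializing a cluster with the default configuration, I created additional mnodes with the `create mnode on dnode x` command. After restarting the cluster, it failed to recover.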

Bug Description

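1. Deployed a tdengine-3.0.3.0 cluster on K8s (3mnode-3dnodes-1replica). After deleting the pod tdengine3-0 (an mnode) and waiting for the new pod to be created, I queried the database from a client and found that tdengine3-1 and tdengine3-2 were offline. Data queries failed with "DB error: Fail to get table info, error: Sync not leader".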

To Reproduce

1. Create a new cluster


taos> show dnodes;
     id      |            endpoint            | vnodes | support_vnodes |   status   |       create_time       |       reboot_time       |              note    |
===========================================================================================================================================================================
           1 | tdengine-test-0.taosd-test.... |      0 |              8 | ready      | 2023-04-24 23:51:14.869 | 2023-04-25 09:19:25.071 |    |
           2 | tdengine-test-1.taosd-test.... |      0 |              8 | ready      | 2023-04-24 23:51:20.326 | 2023-04-25 09:19:30.672 |    |
           3 | tdengine-test-2.taosd-test.... |      0 |              8 | ready      | 2023-04-24 23:51:26.918 | 2023-04-25 09:19:37.003 |    |
Query OK, 3 row(s) in set (0.016799s)

2. Create the mnodes

taos> create mnode on dnode 2;
Create OK, 0 row(s) affected (5.143560s)
taos> create mnode on dnode 3;
Create OK, 0 row(s) affected (1.634405s)
taos> show mnodes;
     id      |            endpoint            |     role     |  status   |       create_time       |       reboot_time       |
==============================================================================================================================
           1 | tdengine-test-0.taosd-test.... | leader       | ready     | 2023-04-24 23:51:14.880 | 2023-04-25 09:26:52.700 |
           2 | tdengine-test-1.taosd-test.... | follower     | ready     | 2023-04-25 09:26:47.138 | 2023-04-25 09:26:52.578 |
           3 | tdengine-test-2.taosd-test.... | follower     | ready     | 2023-04-25 09:26:55.334 | 2023-04-25 09:26:57.915 |
Query OK, 3 row(s) in set (0.005573s)
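
The healthy state shown in the `show mnodes` output above (one leader, every mnode `ready`) can be checked mechanically, which makes it easier to compare against the state after a restart. The following is a minimal sketch that assumes the pipe-delimited table layout printed by the taos CLI in this issue; `parse_taos_table` and `leader_summary` are hypothetical helper names, not part of TDengine.

```python
def parse_taos_table(text):
    """Parse a pipe-delimited table as printed by the taos CLI into a list of dicts."""
    rows = []
    header = None
    for line in text.splitlines():
        if "|" not in line:
            continue  # skips the prompt, the ===== separator, and "Query OK" lines
        cells = [cell.strip() for cell in line.split("|")]
        if cells and cells[-1] == "":
            cells = cells[:-1]  # drop the empty cell after the trailing pipe
        if header is None:
            header = cells  # the first table line holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def leader_summary(rows):
    """Return (number of leaders, endpoints whose status is not 'ready')."""
    leaders = sum(1 for row in rows if row.get("role") == "leader")
    not_ready = [row.get("endpoint") for row in rows if row.get("status") != "ready"]
    return leaders, not_ready
```

Feeding it the text captured from `show mnodes` before and after the restart would show whether the cluster ever gets back to exactly one leader.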

3. Restart the mnode leader node, tdengine-test-0

Cluster leader election fails:
 04/25 09:42:48.114977 00000101 DND ERROR failed to send status req since Sync not leader, epSet:{tdengine-test-0.taosd-test.iot-middleware.svc.cluster.local:6030, tdengine-test-1.taosd-test.iot-middleware.svc.cluster.local:6030, tdengine-test-2.taosd-test.iot-middleware.svc.cluster.local:6030}, inUse:0
Cluster recovery fails and all three dnodes go offline:
 04/25 09:45:34.278141 00000095 SYN ERROR vgId:1, sync send msg by id error, epset:(nil) dnode:0 addr:0 err:0x800009ff
 04/25 09:45:34.459281 00000118 MND dnode:1, in offline state
 04/25 09:45:34.459334 00000118 MND dnode:2, in offline state
 04/25 09:45:34.459342 00000118 MND dnode:3, in offline state
 04/25 09:45:34.866014 00000114 MND dnode:3, from offline to online, memory avail:18916853351 total:21037367296 cores:4.00
 04/25 09:45:35.009223 00000114 MND dnode:2, mnode syncState from leader to follower, restoreState from 1 to 1
 04/25 09:45:35.009253 00000114 MND dnode:2, from offline to online, memory avail:15051386471 total:16742404096 cores:4.00
 04/25 09:45:35.308041 00000095 SYN ERROR vgId:1, sync send msg by id error, epset:(nil) dnode:0 addr:0 err:0x800009ff
 04/25 09:45:35.461796 00000114 MND tq timer, rebalance counter old val:0
 04/25 09:45:35.461854 00000114 MND mq rebalance finished, no modification
 04/25 09:45:35.461862 00000114 MND rebalance trans end, rebalance counter:0
 04/25 09:45:36.287979 00000095 SYN ERROR vgId:1, sync send msg by id error, epset:(nil) dnode:0 addr:0 err:0x800009ff
 04/25 09:45:36.579493 00000116 SYN vgId:1, succeed to write raft store file:/var/lib/taos//mnode/sync/raft_store.json, term:46
 04/25 09:45:36.579559 00000116 MND vgId:1, become follower
 04/25 09:45:36.579583 00000116 SYN vgId:1, reset sync log buffer. buffer: [22 278 278, 279)
 04/25 09:45:36.587952 00000116 SYN vgId:1, succeed to write raft store file:/var/lib/taos//mnode/sync/raft_store.json, term:46
 04/25 09:45:36.595919 00000116 SYN vgId:1, succeed to write raft store file:/var/lib/taos//mnode/sync/raft_store.json, term:46
 04/25 09:45:36.595976 00000116 SYN vgId:1, recv sync-request-vote from dnode:3, {term:46, last-index:278, last-term:45}, granted:1, sync:follower, term:46, commit-index:278, first-ver:0, last-ver:278, min:-1, snap:199, snap-term:2, elect-times:9, as-leader-times:8, cfg-ch-times:1, hb-slow:0, hbr-slow:0, aq-items:-1, snaping:-1, replicas:3, last-cfg:-1, chging:0, restore:1, quorum:2, elect-lc-timer:2734, hb:0, buffer:[22 278 278, 279), repl-mgrs:{0:0 [0 0, 0), 1:0 [0 0, 0), 2:0 [0 0, 0)}, members:{num:3, as:1, [tdengine-test-0.taosd-test.iot-middleware.svc.cluster.local:6030, tdengine-test-1.taosd-test.iot-middleware.svc.cluster.local:6030, tdengine-test-2.taosd-test.iot-middleware.svc.cluster.local:6030]}, hb:{0:1682386935555,1:1682386015536,2:1682387128640}, hb-reply:{0:1682386015536,1:1682386015536,2:1682387132270}

 04/25 09:45:36.596001 00000116 SYN vgId:1, send sync-request-vote-reply to dnode:3 {term:46, grant:1}, , sync:follower, term:46, commit-index:278, first-ver:0, last-ver:278, min:-1, snap:199, snap-term:2, elect-times:9, as-leader-times:8, cfg-ch-times:1, hb-slow:0, hbr-slow:0, aq-items:-1, snaping:-1, replicas:3, last-cfg:-1, chging:0, restore:1, quorum:2, elect-lc-timer:2734, hb:0, buffer:[22 278 278, 279), repl-mgrs:{0:0 [0 0, 0), 1:0 [0 0, 0), 2:0 [0 0, 0)}, members:{num:3, as:1, [tdengine-test-0.taosd-test.iot-middleware.svc.cluster.local:6030, tdengine-test-1.taosd-test.iot-middleware.svc.cluster.local:6030, tdengine-test-2.taosd-test.iot-middleware.svc.cluster.local:6030]}, hb:{0:1682386935555,1:1682386015536,2:1682387128640}, hb-reply:{0:1682386015536,1:1682386015536,2:1682387132270}
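
The SYN log lines above carry a long status dump, and the fields that matter for following the election (role, term, commit index, last log version, restore flag, quorum) are easy to lose in it. A minimal sketch for pulling those fields out of one line, assuming the `key:value` layout seen in these logs; `sync_state` is a hypothetical helper and the field list is only the subset chosen here.

```python
import re

# The status dump starts at the "sync:<role>" token; anything before it belongs
# to the message itself (for example the term carried by a vote request).
DUMP_START = re.compile(r"(?<![\w-])sync:")

# The lookbehind keeps "term" from matching inside "last-term" or "snap-term".
FIELD = re.compile(r"(?<![\w-])(sync|term|commit-index|last-ver|restore|quorum):(\w+)")


def sync_state(line):
    """Extract selected sync fields from one SYN log line; {} if there is no dump."""
    start = DUMP_START.search(line)
    if start is None:
        return {}
    state = {}
    for key, value in FIELD.findall(line[start.start():]):
        # Keep the first occurrence of each key and convert plain numbers to int.
        state.setdefault(key, int(value) if value.isdigit() else value)
    return state
```

Applied to the lines from each node, this makes it straightforward to compare terms and commit indexes across restarts.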

4. Restart all nodes in the cluster

tdengine-test-0 fails to start:
 04/25 09:49:23.291385 00000061 MND sdb table:stream is cleaned up

 04/25 09:49:23.291393 00000061 MND sdb table:subscribe is cleaned up

 04/25 09:49:23.291401 00000061 MND sdb table:consumer is cleaned up

 04/25 09:49:23.291409 00000061 MND sdb table:topic is cleaned up

 04/25 09:49:23.291417 00000061 MND sdb table:vgroup is cleaned up

 04/25 09:49:23.291425 00000061 MND sdb table:sma is cleaned up

 04/25 09:49:23.291433 00000061 MND sdb table:stb is cleaned up

 04/25 09:49:23.291441 00000061 MND sdb table:db is cleaned up

 04/25 09:49:23.291449 00000061 MND sdb table:func is cleaned up

 04/25 09:49:23.291457 00000061 MND sdb table:idx is cleaned up

 04/25 09:49:23.291464 00000061 MND sdb is cleaned up

 04/25 09:49:23.291470 00000061 MND mnode-wal will cleanup

 04/25 09:49:23.302234 00000061 MND ERROR failed to open mnode since Invalid host name

 04/25 09:49:23.302270 00000061 MND start to close mnode

 04/25 09:49:23.302279 00000061 MND mnode is closed

 04/25 09:49:23.302287 00000061 DND ERROR failed to open mnode since Invalid host name

 04/25 09:49:23.302296 00000061 DND ERROR node:mnode, failed to open since Invalid host name

 04/25 09:49:23.302303 00000061 DND ERROR node:mnode, failed to open since Invalid host name

 04/25 09:49:23.302309 00000061 DND ERROR failed to open nodes since Invalid host name

 04/25 09:49:23.302317 00000061 DND shutting down the service

 04/25 09:49:23.682260 00000061 WAL wal module is cleaned up

 04/25 09:49:23.682298 00000061 UDF udfd start to stop, need cleanup:1, spawn err:0

 04/25 09:49:23.683144 00000061 UDF udfd is cleaned up

 04/25 09:49:23.768299 00000061 DND dnode env is cleaned up
 04/25 09:51:27.870581 00000090 DND ERROR failed to send status req since Sync not leader, epSet:{tdengine-test-0.taosd-test.iot-middleware.svc.cluster.local:6030, tdengine-test-1.taosd-test.iot-middleware.svc.cluster.local:6030, tdengine-test-2.taosd-test.iot-middleware.svc.cluster.local:6030}, inUse:0

 04/25 09:51:29.361793 00000084 SYN vgId:1, begin election, sync:candidate, term:103, commit-index:210, first-ver:0, last-ver:258, min:-1, snap:210, snap-term:2, elect-times:22, as-leader-times:0, cfg-ch-times:0, hb-slow:0, hbr-slow:0, aq-items:-1, snaping:-1, replicas:3, last-cfg:-1, chging:0, restore:0, quorum:2, elect-lc-timer:23, hb:0, buffer:[210 210 258, 259), repl-mgrs:{0:0 [0 0, 0), 1:0 [0 0, 0), 2:0 [0 0, 0)}, members:{num:3, as:0, [tdengine-test-0.taosd-test.iot-middleware.svc.cluster.local:6030, tdengine-test-1.taosd-test.iot-middleware.svc.cluster.local:6030, tdengine-test-2.taosd-test.iot-middleware.svc.cluster.local:6030]}, hb:{0:1682387382122,1:1682387382122,2:1682387382122}, hb-reply:{0:1682387382122,1:1682387382122,2:1682387382122}
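
The startup log above reports `Invalid host name` while opening the mnode, and the thread does not establish the cause. One check that may help narrow it down is whether every endpoint listed in the `epSet` resolves from inside the pod at startup time. This is a minimal sketch, assuming Python is available wherever the check runs (not verified for the `tdengine/tdengine:3.0.3.0` image); `unresolved_hosts` is a hypothetical helper, and its `resolve` parameter exists only so the lookup can be stubbed.

```python
import socket


def unresolved_hosts(endpoints, resolve=socket.getaddrinfo):
    """Return the "host:port" endpoints whose host part fails DNS resolution."""
    failed = []
    for endpoint in endpoints:
        host, _, port = endpoint.rpartition(":")
        try:
            resolve(host, int(port))
        except OSError:  # socket.gaierror is a subclass of OSError
            failed.append(endpoint)
    return failed


# Endpoints copied from the epSet in the log above.
MNODE_ENDPOINTS = [
    "tdengine-test-0.taosd-test.iot-middleware.svc.cluster.local:6030",
    "tdengine-test-1.taosd-test.iot-middleware.svc.cluster.local:6030",
    "tdengine-test-2.taosd-test.iot-middleware.svc.cluster.local:6030",
]
```

Calling `unresolved_hosts(MNODE_ENDPOINTS)` in each pod while the others are still starting would show whether name resolution is a factor. An empty result would rule it out for that moment only.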
Nodes tdengine-test-1 and tdengine-test-2 fail to start:
 04/25 09:49:57.103861 00000035 TAOS_ADAPTER info "init plugin influxdb/v1" model=plugin

 [GIN-debug] POST   /influxdb/v1/write        --> github.com/taosdata/taosadapter/v3/plugin/influxdb.(*Influxdb).write-fm (7 handlers)

 04/25 09:49:57.104093 00000035 TAOS_ADAPTER info "init plugin node_exporter/v1" model=plugin

 04/25 09:49:57.106273 00000035 TAOS_ADAPTER info "node_exporter disabled" model=NodeExporter

 04/25 09:49:57.106292 00000035 TAOS_ADAPTER info "init plugin opentsdb/v1" model=plugin

 [GIN-debug] POST   /opentsdb/v1/put/json/:db --> github.com/taosdata/taosadapter/v3/plugin/opentsdb.(*Plugin).insertJson-fm (7 handlers)

 [GIN-debug] POST   /opentsdb/v1/put/telnet/:db --> github.com/taosdata/taosadapter/v3/plugin/opentsdb.(*Plugin).insertTelnet-fm (7 handlers)

 04/25 09:49:57.106734 00000035 TAOS_ADAPTER info "all plugin init finish" model=plugin

 04/25 09:49:57.106743 00000035 TAOS_ADAPTER info "all plugin start finish" model=plugin

 04/25 09:49:57.106870 00000035 TAOS_ADAPTER info "Running in terminal." model=main

 04/25 09:49:57.109020 00000035 TAOS_ADAPTER info "server on : 6041" model=main

 2: service ok

 execute create dnode
 Welcome to the TDengine Command Line Interface, Client Version:3.0.3.0
 Copyright (c) 2022 by TDengine, all rights reserved.
 failed to connect to server, reason: Sync not leader

K8s StatefulSet configuration

kind: StatefulSet
apiVersion: apps/v1
metadata:
  name: tdengine-test
  namespace: middleware
  labels:
    app: tdengine-test
  annotations:
    kubesphere.io/creator: admin
spec:
  replicas: 3
  selector:
    matchLabels:
      app: tdengine-test
  template:
    metadata:
      name: tdengine-test
      creationTimestamp: null
      labels:
        app: tdengine-test
      annotations:
        kubesphere.io/creator: admin
        kubesphere.io/restartedAt: '2023-04-24T15:49:54.742Z'
    spec:
      containers:
        - name: tdengine-test
          image: 'tdengine/tdengine:3.0.3.0'
          ports:
            - name: tcp6030
              containerPort: 6030
              protocol: TCP
            - name: tcp6041
              containerPort: 6041
              protocol: TCP
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.name
            - name: SERVICE_NAME
              value: taosd-test
            - name: STS_NAME
              value: tdengine-test
            - name: STS_NAMESPACE
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.namespace
            - name: TZ
              value: Asia/Shanghai
            - name: TAOS_SERVER_PORT
              value: '6030'
            - name: TAOS_FIRST_EP
              value: >-
                $(STS_NAME)-0.$(SERVICE_NAME).$(STS_NAMESPACE).svc.cluster.local:$(TAOS_SERVER_PORT)
            - name: TAOS_SECOND_EP
              value: >-
                $(STS_NAME)-1.$(SERVICE_NAME).$(STS_NAMESPACE).svc.cluster.local:$(TAOS_SERVER_PORT)
            - name: TAOS_FQDN
              value: $(POD_NAME).$(SERVICE_NAME).$(STS_NAMESPACE).svc.cluster.local
          resources:
            limits:
              cpu: 200m
              memory: 1000Mi
            requests:
              cpu: 200m
              memory: 512Mi
          volumeMounts:
            - name: taosdata-test
              mountPath: /var/lib/taos
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      dnsPolicy: ClusterFirst
      securityContext: {}
      schedulerName: default-scheduler
  volumeClaimTemplates:
    - kind: PersistentVolumeClaim
      apiVersion: v1
      metadata:
        name: taosdata-test
        creationTimestamp: null
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 1Gi
        storageClassName: nfs-sc
        volumeMode: Filesystem
      status:
        phase: Pending
  serviceName: taosd-test
  podManagementPolicy: OrderedReady
  updateStrategy:
    type: RollingUpdate
  revisionHistoryLimit: 10
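
The manifest above builds `TAOS_FQDN`, `TAOS_FIRST_EP`, and `TAOS_SECOND_EP` from `$(VAR)` references, which Kubernetes expands using variables defined earlier in the same `env` list. The sketch below imitates that expansion so the resulting names can be compared with the endpoints in the logs. It is a simplification (it ignores the `$$(VAR)` escape), `expand_env` is a hypothetical helper, and the `POD_NAME` and `STS_NAMESPACE` values are assumptions, since the real ones come from `fieldRef` at runtime. The namespace follows the manifest (`middleware`), whereas the logs above show `iot-middleware`.

```python
import re

VAR_REF = re.compile(r"\$\(([A-Za-z_][A-Za-z0-9_]*)\)")


def expand_env(env_list):
    """Expand $(VAR) references in order; unknown references are left unchanged."""
    resolved = {}
    for name, value in env_list:
        resolved[name] = VAR_REF.sub(
            lambda match: resolved.get(match.group(1), match.group(0)), value
        )
    return resolved


ENV = [
    ("POD_NAME", "tdengine-test-0"),    # assumed: metadata.name of the first pod
    ("SERVICE_NAME", "taosd-test"),
    ("STS_NAME", "tdengine-test"),
    ("STS_NAMESPACE", "middleware"),    # assumed: metadata.namespace from the manifest
    ("TAOS_SERVER_PORT", "6030"),
    ("TAOS_FIRST_EP",
     "$(STS_NAME)-0.$(SERVICE_NAME).$(STS_NAMESPACE).svc.cluster.local:$(TAOS_SERVER_PORT)"),
    ("TAOS_FQDN", "$(POD_NAME).$(SERVICE_NAME).$(STS_NAMESPACE).svc.cluster.local"),
]
```

Printing `expand_env(ENV)["TAOS_FQDN"]` for each pod name gives the names each dnode would announce under these assumptions. These are also the names the headless Service `taosd-test` would have to resolve.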
manmao commented 1 year ago

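Deployment on K8s is currently unstable and has problems. I recommend deploying on virtual machines instead, where this problem does not occur.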

doramingo commented 11 months ago

Same problem here.

ohhhyuan commented 7 months ago

May I ask how you migrated data for the TDengine cluster you deployed on K8s?

ghost8publisment commented 6 months ago

Has this been resolved?