pingcap / tidb-operator

TiDB operator creates and manages TiDB clusters running in Kubernetes.
https://docs.pingcap.com/tidb-in-kubernetes/
Apache License 2.0

Should additional stores be released after their recovery? #364

Closed: zyguan closed this 5 years ago

zyguan commented 5 years ago

When a TiKV store has been down for longer than maxStoreDownTime, the operator creates a new store to replace it. However, the failed store may recover after the failover. In that case, the number of Up stores exceeds the expected number of replicas. Here is an example:

[Screenshot: 2019-04-03_234149]

apiVersion: pingcap.com/v1alpha1
kind: TidbCluster
metadata:
  creationTimestamp: 2019-04-03T14:37:32Z
  generation: 1
  labels:
    app.kubernetes.io/component: tidb-cluster
    app.kubernetes.io/instance: test
    app.kubernetes.io/managed-by: Tiller
    app.kubernetes.io/name: tidb-cluster
    helm.sh/chart: tidb-cluster-0.1.0
  name: demo
  namespace: tidb
  resourceVersion: "1213017"
  selfLink: /apis/pingcap.com/v1alpha1/namespaces/tidb/tidbclusters/demo
  uid: 03e2e3a5-561e-11e9-ae1d-4cd98f58aac2
spec:
  pd:
    image: pingcap/pd:v2.1.7
    imagePullPolicy: IfNotPresent
    limits: {}
    nodeSelectorRequired: true
    replicas: 3
    requests:
      storage: 1Gi
    storageClassName: local-storage
  pvReclaimPolicy: Retain
  schedulerName: tidb-scheduler
  services:
  - name: pd
    type: ClusterIP
  tidb:
    image: pingcap/tidb:v2.1.7
    imagePullPolicy: IfNotPresent
    limits: {}
    maxFailoverCount: 3
    nodeSelector:
      kubernetes.io/hostname: 172.16.4.61,172.16.4.62,172.16.4.63
    nodeSelectorRequired: true
    replicas: 3
    requests: {}
    slowLogTailer:
      image: busybox:1.26.2
      imagePullPolicy: IfNotPresent
      limits:
        cpu: 100m
        memory: 50Mi
      requests:
        cpu: 20m
        memory: 5Mi
  tikv:
    image: pingcap/tikv:v2.1.7
    imagePullPolicy: IfNotPresent
    limits: {}
    nodeSelectorRequired: true
    replicas: 3
    requests:
      storage: 10Gi
    storageClassName: local-storage
  tikvPromGateway:
    image: prom/pushgateway:v0.3.1
    imagePullPolicy: IfNotPresent
    limits: {}
    requests: {}
  timezone: UTC
status:
  clusterID: "6675677409353421762"
  pd:
    leader:
      clientURL: http://demo-pd-0.demo-pd-peer.tidb.svc:2379
      health: true
      id: "13229517333287924650"
      lastTransitionTime: 2019-04-03T14:38:08Z
      name: demo-pd-0
    members:
      demo-pd-0:
        clientURL: http://demo-pd-0.demo-pd-peer.tidb.svc:2379
        health: true
        id: "13229517333287924650"
        lastTransitionTime: 2019-04-03T14:38:08Z
        name: demo-pd-0
      demo-pd-1:
        clientURL: http://demo-pd-1.demo-pd-peer.tidb.svc:2379
        health: true
        id: "12896584759218299636"
        lastTransitionTime: 2019-04-03T14:38:08Z
        name: demo-pd-1
      demo-pd-2:
        clientURL: http://demo-pd-2.demo-pd-peer.tidb.svc:2379
        health: true
        id: "1946055682103402276"
        lastTransitionTime: 2019-04-03T14:38:08Z
        name: demo-pd-2
    phase: Normal
    statefulSet:
      collisionCount: 0
      currentReplicas: 3
      currentRevision: demo-pd-54994fd4b4
      observedGeneration: 1
      readyReplicas: 3
      replicas: 3
      updateRevision: demo-pd-54994fd4b4
      updatedReplicas: 3
    synced: true
  tidb:
    members:
      demo-tidb-0:
        health: true
        lastTransitionTime: 2019-04-03T14:38:56Z
        name: demo-tidb-0
      demo-tidb-1:
        health: true
        lastTransitionTime: 2019-04-03T14:39:03Z
        name: demo-tidb-1
      demo-tidb-2:
        health: true
        lastTransitionTime: 2019-04-03T14:39:03Z
        name: demo-tidb-2
    phase: Normal
    statefulSet:
      collisionCount: 0
      currentReplicas: 3
      currentRevision: demo-tidb-c44f8d566
      observedGeneration: 1
      readyReplicas: 3
      replicas: 3
      updateRevision: demo-tidb-c44f8d566
      updatedReplicas: 3
  tikv:
    failureStores:
      "1":
        podName: demo-tikv-2
        storeID: "1"
      "4":
        podName: demo-tikv-0
        storeID: "4"
      "60":
        podName: demo-tikv-3
        storeID: "60"
    phase: Normal
    statefulSet:
      collisionCount: 0
      currentReplicas: 6
      currentRevision: demo-tikv-5c55f6546c
      observedGeneration: 4
      readyReplicas: 4
      replicas: 6
      updateRevision: demo-tikv-5c55f6546c
      updatedReplicas: 6
    stores:
      "1":
        id: "1"
        ip: demo-tikv-2.demo-tikv-peer.tidb.svc
        lastHeartbeatTime: 2019-04-03T14:48:21Z
        lastTransitionTime: 2019-04-03T14:50:43Z
        leaderCount: 0
        podName: demo-tikv-2
        state: Down
      "4":
        id: "4"
        ip: demo-tikv-0.demo-tikv-peer.tidb.svc
        lastHeartbeatTime: 2019-04-03T15:41:34Z
        lastTransitionTime: 2019-04-03T15:12:43Z
        leaderCount: 0
        podName: demo-tikv-0
        state: Up
      "5":
        id: "5"
        ip: demo-tikv-1.demo-tikv-peer.tidb.svc
        lastHeartbeatTime: 2019-04-03T15:41:41Z
        lastTransitionTime: 2019-04-03T14:38:46Z
        leaderCount: 10
        podName: demo-tikv-1
        state: Up
      "60":
        id: "60"
        ip: demo-tikv-3.demo-tikv-peer.tidb.svc
        lastHeartbeatTime: 2019-04-03T15:41:35Z
        lastTransitionTime: 2019-04-03T15:10:43Z
        leaderCount: 1
        podName: demo-tikv-3
        state: Up
      "77":
        id: "77"
        ip: demo-tikv-4.demo-tikv-peer.tidb.svc
        lastHeartbeatTime: 2019-04-03T15:41:37Z
        lastTransitionTime: 2019-04-03T14:59:13Z
        leaderCount: 3
        podName: demo-tikv-4
        state: Up
    synced: true

Stores "4" and "60" are Up, but they are still listed in failureStores. spec.tikv.replicas is 3, yet there are 4 Up TiKV stores. demo-tikv-5 is Pending because of a resource limitation (there are only 5 nodes).
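The numbers in the status above can be checked with a short script. This is only a sketch, not operator code: the `Store` class and `summarize` function are hypothetical names, and the rule "StatefulSet replicas = spec.tikv.replicas + number of failureStores" is an assumption inferred from this dump (3 + 3 = 6), not something confirmed in this thread.

```python
from dataclasses import dataclass


@dataclass
class Store:
    """Hypothetical stand-in for one entry of status.tikv.stores."""
    store_id: str
    pod_name: str
    state: str  # "Up" or "Down", as reported in the status


def summarize(spec_replicas, stores, failure_store_ids):
    """Compare the Up stores against spec replicas and failureStores."""
    up = [s for s in stores if s.state == "Up"]
    # Stores that were failed over but have since come back Up.
    recovered = [s for s in up if s.store_id in failure_store_ids]
    return {
        # Assumption: one extra replica is requested per failure store.
        "statefulset_replicas": spec_replicas + len(failure_store_ids),
        "up_stores": len(up),
        "surplus_up_stores": max(0, len(up) - spec_replicas),
        "recovered_failure_stores": sorted(s.pod_name for s in recovered),
    }


# Values copied from the status dump above.
stores = [
    Store("1", "demo-tikv-2", "Down"),
    Store("4", "demo-tikv-0", "Up"),
    Store("5", "demo-tikv-1", "Up"),
    Store("60", "demo-tikv-3", "Up"),
    Store("77", "demo-tikv-4", "Up"),
]
report = summarize(3, stores, {"1", "4", "60"})
print(report)
```

Under that assumption, the script reports 6 requested replicas, 4 Up stores (one more than spec.tikv.replicas), and two recovered stores, demo-tikv-0 and demo-tikv-3, that still count as failures.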

zyguan commented 5 years ago

I see it's a known issue. Can we stop the pending store once it’s not needed?

weekface commented 5 years ago

This is the expected behavior. The operator should not automatically delete the additional store; removing it requires manual intervention.