pingcap / tidb-operator

TiDB operator creates and manages TiDB clusters running in Kubernetes.
https://docs.pingcap.com/tidb-in-kubernetes/
Apache License 2.0

TiDB Operator does not delete the original ConfigMap after the user changes the config in the CR, causing a resource leak #5741

Open kos-team opened 1 month ago

kos-team commented 1 month ago

Bug Report

What version of Kubernetes are you using?
Client Version: v1.31.0
Kustomize Version: v5.4.2
Server Version: v1.29.1

What version of TiDB Operator are you using?
v1.6.0

What's the status of the TiDB cluster pods?
All pods are in the Running state.

What did you do?
We updated the spec.tikv.config field to a different non-empty value.

How to reproduce

  1. Deploy a TiDB cluster, for example:

    apiVersion: pingcap.com/v1alpha1
    kind: TidbCluster
    metadata:
      name: test-cluster
    spec:
      configUpdateStrategy: RollingUpdate
      enableDynamicConfiguration: true
      helper:
        image: alpine:3.16.0
      pd:
        baseImage: pingcap/pd
        config: "[dashboard]\n  internal-proxy = true\n"
        maxFailoverCount: 0
        mountClusterClientSecret: true
        replicas: 3
        requests:
          storage: 10Gi
      pvReclaimPolicy: Retain
      ticdc:
        baseImage: pingcap/ticdc
        replicas: 3
      tidb:
        baseImage: pingcap/tidb
        config: "[performance]\n  tcp-keep-alive = true\ngraceful-wait-before-shutdown = 30\n"
        maxFailoverCount: 0
        replicas: 3
        service:
          externalTrafficPolicy: Local
          type: NodePort
      tiflash:
        baseImage: pingcap/tiflash
        replicas: 3
        storageClaims:
        - resources:
            requests:
              storage: 10Gi
      tikv:
        baseImage: pingcap/tikv
        config: |
          [raftdb]
            max-open-files = 256
          [rocksdb]
            max-open-files = 256
        maxFailoverCount: 0
        mountClusterClientSecret: true
        replicas: 3
        requests:
          storage: 100Gi
      timezone: UTC
      version: v8.1.0
  2. Change the spec.tikv.config to another non-empty value (here, rocksdb max-open-files is changed from 256 to 128), e.g.:

    apiVersion: pingcap.com/v1alpha1
    kind: TidbCluster
    metadata:
      name: test-cluster
    spec:
      configUpdateStrategy: RollingUpdate
      enableDynamicConfiguration: true
      helper:
        image: alpine:3.16.0
      pd:
        baseImage: pingcap/pd
        config: "[dashboard]\n  internal-proxy = true\n"
        maxFailoverCount: 0
        mountClusterClientSecret: true
        replicas: 3
        requests:
          storage: 10Gi
      pvReclaimPolicy: Retain
      ticdc:
        baseImage: pingcap/ticdc
        replicas: 3
      tidb:
        baseImage: pingcap/tidb
        config: "[performance]\n  tcp-keep-alive = true\ngraceful-wait-before-shutdown = 30\n"
        maxFailoverCount: 0
        replicas: 3
        service:
          externalTrafficPolicy: Local
          type: NodePort
      tiflash:
        baseImage: pingcap/tiflash
        replicas: 3
        storageClaims:
        - resources:
            requests:
              storage: 10Gi
      tikv:
        baseImage: pingcap/tikv
        config: |
          [raftdb]
            max-open-files = 256
          [rocksdb]
            max-open-files = 128
        maxFailoverCount: 0
        mountClusterClientSecret: true
        replicas: 3
        requests:
          storage: 100Gi
      timezone: UTC
      version: v8.1.0

What did you expect to see?
We expected the unused ConfigMaps to be garbage collected by the TiDB Operator. Otherwise the operator keeps generating new ConfigMaps on every config change and adds more and more objects to etcd.

What did you see instead?
The operator created a new ConfigMap for TiKV but left the old ConfigMap undeleted. We observed the same behavior when updating spec.tiflash.config, which suggests that all TiDB components are likely affected by this issue.
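For reference, the leftover ConfigMaps can be enumerated with a small client-go program along these lines. This is only a sketch: the default namespace, the test-cluster instance name, and the app.kubernetes.io label selector are assumptions rather than anything taken from the operator's code; adjust them to whatever labels your cluster actually shows.

    // list_leftover_configmaps.go: enumerate the TiKV ConfigMaps that pile up
    // after each config change. Namespace and label selector are assumptions.
    package main

    import (
        "context"
        "fmt"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        // Load the local kubeconfig (~/.kube/config).
        cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
        if err != nil {
            panic(err)
        }
        client := kubernetes.NewForConfigOrDie(cfg)

        // Assumed label selector for the TiKV ConfigMaps of the cluster above.
        cms, err := client.CoreV1().ConfigMaps("default").List(context.Background(), metav1.ListOptions{
            LabelSelector: "app.kubernetes.io/instance=test-cluster,app.kubernetes.io/component=tikv",
        })
        if err != nil {
            panic(err)
        }
        // Print each ConfigMap with its creation time to show the accumulation.
        for _, cm := range cms.Items {
            fmt.Println(cm.Name, cm.CreationTimestamp.Time)
        }
    }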

csuzhangxc commented 1 month ago

Currently, we generate a new ConfigMap for the RollingUpdate ConfigUpdateStrategy. It may be better to keep only a few recent ConfigMaps and delete the older ones.
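A minimal sketch of that idea, written against client-go rather than the operator's actual controller code; the label selector, namespace, and the number of revisions to keep are illustrative assumptions. It lists the ConfigMaps for one component, sorts them newest first, and deletes everything beyond the most recent few, so that pods still in the middle of a rolling update can keep mounting the revision they reference.

    // prune_old_configmaps.go: keep only the newest `keep` ConfigMaps matching a
    // label selector and delete the rest. A sketch, not tidb-operator's code.
    package main

    import (
        "context"
        "fmt"
        "sort"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/clientcmd"
    )

    func pruneOldConfigMaps(ctx context.Context, client kubernetes.Interface, ns, selector string, keep int) error {
        cms, err := client.CoreV1().ConfigMaps(ns).List(ctx, metav1.ListOptions{LabelSelector: selector})
        if err != nil {
            return err
        }
        items := cms.Items
        // Sort newest first by creation timestamp.
        sort.Slice(items, func(i, j int) bool {
            return items[i].CreationTimestamp.Time.After(items[j].CreationTimestamp.Time)
        })
        // Delete everything past the `keep` most recent revisions.
        for i := keep; i < len(items); i++ {
            if err := client.CoreV1().ConfigMaps(ns).Delete(ctx, items[i].Name, metav1.DeleteOptions{}); err != nil {
                return err
            }
            fmt.Println("deleted", items[i].Name)
        }
        return nil
    }

    func main() {
        cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
        if err != nil {
            panic(err)
        }
        client := kubernetes.NewForConfigOrDie(cfg)
        // Namespace, selector, and keep count are assumptions for illustration.
        selector := "app.kubernetes.io/instance=test-cluster,app.kubernetes.io/component=tikv"
        if err := pruneOldConfigMaps(context.Background(), client, "default", selector, 2); err != nil {
            panic(err)
        }
    }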