pingcap / tidb-operator

TiDB operator creates and manages TiDB clusters running in Kubernetes.
https://docs.pingcap.com/tidb-in-kubernetes/
Apache License 2.0

TiDB Operator fails to start the TiProxy servers if spec.tiproxy.version is not provided #5833

Open kos-team opened 3 weeks ago

kos-team commented 3 weeks ago

Bug Report

What version of Kubernetes are you using? Client Version: v1.31.1 Kustomize Version: v5.4.2

What version of TiDB Operator are you using? v1.6.0

What's the status of the TiDB cluster pods? The TiProxy pods are in the CrashLoopBackOff state.

What did you do? We deployed a cluster with TiProxy.

How to reproduce

  1. Deploy a TiDB cluster with TiProxy enabled, for example:

    apiVersion: pingcap.com/v1alpha1
    kind: TidbCluster
    metadata:
      name: test-cluster
    spec:
      configUpdateStrategy: RollingUpdate
      enableDynamicConfiguration: true
      helper:
        image: alpine:3.16.0
      pd:
        baseImage: pingcap/pd
        config: |
          [dashboard]
            internal-proxy = true
        maxFailoverCount: 0
        mountClusterClientSecret: true
        replicas: 3
        requests:
          storage: 10Gi
      pvReclaimPolicy: Retain
      ticdc:
        baseImage: pingcap/ticdc
        replicas: 3
      tidb:
        baseImage: pingcap/tidb
        config: |
          [performance]
            tcp-keep-alive = true
          graceful-wait-before-shutdown = 30
        maxFailoverCount: 0
        replicas: 3
        service:
          externalTrafficPolicy: Local
          type: NodePort
      tiflash:
        baseImage: pingcap/tiflash
        replicas: 3
        storageClaims:
        - resources:
            requests:
              storage: 10Gi
      tikv:
        baseImage: pingcap/tikv
        config: |
          log-level = "info"
        maxFailoverCount: 0
        mountClusterClientSecret: true
        replicas: 3
        requests:
          storage: 100Gi
        scalePolicy:
          scaleOutParallelism: 5
      timezone: UTC
      tiproxy:
        replicas: 5
        sslEnableTiDB: true
      version: v8.1.0

What did you expect to see? TiProxy pods should start successfully and be in the Healthy state.

What did you see instead? The TiProxy pods kept crashing and stayed in the CrashLoopBackOff state due to ErrImagePull.

Root Cause The root cause is that we set spec.version to v8.1.0, which is used for all components when pulling their images. However, there is no pingcap/tiproxy:v8.1.0 image available on Docker Hub, so the image pull fails for TiProxy.
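
To make this concrete, here is a sketch of how the image references end up being built in this scenario (the resolution logic is simplified and the default baseImage values are assumed):

    # With spec.version: v8.1.0 and no per-component version override,
    # each component image is resolved roughly as <baseImage>:<spec.version>:
    #
    #   pd      -> pingcap/pd:v8.1.0
    #   tikv    -> pingcap/tikv:v8.1.0
    #   tidb    -> pingcap/tidb:v8.1.0
    #   tiproxy -> pingcap/tiproxy:v8.1.0   # this tag does not exist on Docker Hub, so the pull fails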

How to fix Since the image tags for TiProxy follow a different versioning scheme than other components such as TiKV and TiFlash, we recommend setting a default value of main for spec.tiproxy.version. This ensures the TiDB Operator overrides the version tag for TiProxy and pulls an image that actually exists.
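
In the meantime, a workaround is to pin the TiProxy version explicitly so the operator does not fall back to spec.version. A minimal sketch (the tag shown is only an example; check Docker Hub for the TiProxy tags that actually exist):

    spec:
      tiproxy:
        baseImage: pingcap/tiproxy
        version: main        # or a published TiProxy release tag; TiProxy does not use v8.x tags
        replicas: 5
        sslEnableTiDB: true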

csuzhangxc commented 3 weeks ago

the main tag may not be stable, and we recommend that users try the newest release version (vx.y.z) instead

kos-team commented 3 weeks ago

@csuzhangxc The main usability issue here is that TiProxy follows a different version numbering scheme than the other TiDB components. If we set version v8.1.0 in the spec.version property, all TiDB components use v8.1.0 as their version. This works for all the other components, such as TiFlash and TiKV. However, since TiProxy does not share the same version numbers as the rest of the components, it fails.

csuzhangxc commented 3 weeks ago

> @csuzhangxc The main usability issue here is that TiProxy follows a different version numbering scheme than the other TiDB components. If we set version v8.1.0 in the spec.version property, all TiDB components use v8.1.0 as their version. This works for all the other components, such as TiFlash and TiKV. However, since TiProxy does not share the same version numbers as the rest of the components, it fails.

I know. I mean it's hard to choose a default value for TiProxy, as we always recommend that users use the newest version

kos-team commented 3 weeks ago

We also reported a related issue to the tidb upstream repo, https://github.com/pingcap/tidb/issues/56643, about the latest tag not pointing to the actual latest version. It seems that these upstream images do not have a reliable tag that always tracks the latest version. It would be nice if they had a tag that could be used as the default value here.

kos-team commented 3 weeks ago

To make the deployment safer, perhaps spec.tiproxy.image could be made a required property of the spec.tiproxy object. This would force users to specify a TiProxy version when they enable it, since TiProxy cannot use the default value from spec.version anyway.
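
For illustration only, a rough sketch of how that could be expressed in the CRD's validation schema (the real tidbclusters CRD is generated from the Go API types, so this fragment is hypothetical and heavily abbreviated):

    # Hypothetical, abbreviated fragment of the tidbclusters.pingcap.com CRD schema
    openAPIV3Schema:
      properties:
        spec:
          properties:
            tiproxy:
              type: object
              required:
              - image          # force an explicit TiProxy image (and tag) whenever TiProxy is enabled
              properties:
                image:
                  type: string
                version:
                  type: string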