pingcap / tidb-operator

TiDB operator creates and manages TiDB clusters running in Kubernetes.
https://docs.pingcap.com/tidb-in-kubernetes/
Apache License 2.0

tidb crash loop when enabling binlog #4945

Open hoyhbx opened 1 year ago

hoyhbx commented 1 year ago

Bug Report

What version of Kubernetes are you using?

Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.1", GitCommit:"86ec240af8cbd1b60bcc4c03c20da9b98005b92e", GitTreeState:"clean", BuildDate:"2021-12-16T11:41:01Z", GoVersion:"go1.17.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.9", GitCommit:"6df4433e288edc9c40c2e344eb336f63fad45cd2", GitTreeState:"clean", BuildDate:"2022-05-19T19:53:08Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}

What version of TiDB Operator are you using?

pingcap/tidb-operator:v1.3.2

What storage classes exist in the Kubernetes cluster and what are used for PD/TiKV pods?

NAME                       STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
pd-test-cluster-pd-0       Bound    pvc-b279f523-e7f7-40d2-a04e-a234dd33b454   10Gi       RWO            standard       43m
pd-test-cluster-pd-1       Bound    pvc-3ed54aef-dbfa-40ef-840e-6bc905819c14   10Gi       RWO            standard       43m
pd-test-cluster-pd-2       Bound    pvc-497a34c5-2c6f-4fe6-b03c-d6a4ad33826d   10Gi       RWO            standard       43m
tikv-test-cluster-tikv-0   Bound    pvc-8098c781-6802-43d3-af6c-407bf78b79b0   100Gi      RWO            standard       43m
tikv-test-cluster-tikv-1   Bound    pvc-3cd2460a-6d2a-4ca3-894f-7136132a28d0   100Gi      RWO            standard       43m
tikv-test-cluster-tikv-2   Bound    pvc-75807bb7-8034-4c4b-86b3-de76fedfa620   100Gi      RWO            standard       43m

What's the status of the TiDB cluster pods?

NAME                                      READY   STATUS             RESTARTS      AGE   IP           NODE           NOMINATED NODE   READINESS GATES
test-cluster-discovery-779bb58fc7-wbkcs   1/1     Running            0             33m   10.244.4.7   kind-worker4   <none>           <none>
test-cluster-pd-0                         1/1     Running            0             33m   10.244.1.7   kind-worker3   <none>           <none>
test-cluster-pd-1                         1/1     Running            0             33m   10.244.4.8   kind-worker4   <none>           <none>
test-cluster-pd-2                         1/1     Running            0             33m   10.244.2.6   kind-worker    <none>           <none>
test-cluster-tidb-0                       2/2     Running            0             33m   10.244.2.7   kind-worker    <none>           <none>
test-cluster-tidb-1                       2/2     Running            0             33m   10.244.1.9   kind-worker3   <none>           <none>
test-cluster-tidb-2                       1/2     CrashLoopBackOff   11 (7s ago)   31m   10.244.3.7   kind-worker2   <none>           <none>
test-cluster-tikv-0                       2/2     Running            0             33m   10.244.1.8   kind-worker3   <none>           <none>
test-cluster-tikv-1                       2/2     Running            0             33m   10.244.4.9   kind-worker4   <none>           <none>
test-cluster-tikv-2                       2/2     Running            0             33m   10.244.3.6   kind-worker2   <none>           <none>

What did you do?

We enabled binlog in TiDB by setting spec.tidb.binlogEnabled to true.

What did you expect to see?

Binlog is enabled in TiDB.

What did you see instead?

One TiDB replica (test-cluster-tidb-2) keeps crashing. The error log indicates that the TiDB node cannot find a pump in the cluster. Enabling binlog for TiDB seems to depend implicitly on pump being deployed. Maybe the operator should reject the request to enable binlog if the pump does not exist.

csuzhangxc commented 1 year ago

Maybe the operator should reject the request to enable binlog if the pump does not exist.

This could be done with a webhook.
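
For illustration, the check such a webhook might run can be sketched in Go. This is only a sketch under stated assumptions: the types below are simplified, hypothetical stand-ins for the TidbCluster spec (only spec.tidb.binlogEnabled comes from this report), and validateBinlog is an invented helper name, not an existing TiDB Operator function.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical, simplified stand-ins for the TidbCluster spec types.
// The real types in tidb-operator carry many more fields.
type TiDBSpec struct {
	BinlogEnabled *bool // mirrors spec.tidb.binlogEnabled
}

type PumpSpec struct {
	Replicas int32
}

type TidbClusterSpec struct {
	TiDB *TiDBSpec
	Pump *PumpSpec // assumed to be nil when no pump is configured
}

var errBinlogWithoutPump = errors.New(
	"spec.tidb.binlogEnabled is true but no pump is configured")

// validateBinlog (hypothetical helper) rejects a spec that enables
// binlog on TiDB without configuring a pump component.
func validateBinlog(spec TidbClusterSpec) error {
	// Binlog unset or disabled: nothing to check.
	if spec.TiDB == nil || spec.TiDB.BinlogEnabled == nil || !*spec.TiDB.BinlogEnabled {
		return nil
	}
	if spec.Pump == nil {
		return errBinlogWithoutPump
	}
	return nil
}

func main() {
	enabled := true

	// Binlog enabled, no pump: should be rejected.
	bad := TidbClusterSpec{TiDB: &TiDBSpec{BinlogEnabled: &enabled}}
	fmt.Println("without pump:", validateBinlog(bad))

	// Binlog enabled with a pump: should be accepted.
	good := TidbClusterSpec{
		TiDB: &TiDBSpec{BinlogEnabled: &enabled},
		Pump: &PumpSpec{Replicas: 1},
	}
	fmt.Println("with pump:", validateBinlog(good))
}
```

Rejecting the update at admission time would surface the missing dependency as an API error instead of a crash loop in the TiDB pod.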

hoyhbx commented 1 year ago

I can try to implement the fix if you could point me to where the validation should be implemented. I think the problem could also be avoided by using the CEL support in CRDs, so that the validation is done in the API server.

csuzhangxc commented 1 year ago

TiDB Operator doesn't have a well-implemented webhook right now, so using CEL may be a good solution.
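
As a rough sketch of the CEL approach, the rule could be attached to the spec type with a controller-gen XValidation marker, which is rendered into x-kubernetes-validations in the generated CRD. The types are simplified, hypothetical stand-ins, and the rule assumes the pump is configured under a pump field next to tidb; neither is taken from the TiDB Operator source. Note also that CRD validation rules were introduced as alpha in Kubernetes 1.23 and enabled by default from 1.25, so they would not take effect on the v1.22.9 server in this report.

```go
package sketch

// Hypothetical, simplified stand-ins for the TidbCluster spec types.

// TidbClusterSpec carries a CEL rule that the API server evaluates on
// create and update: binlog may be enabled only when pump is set.
//
// +kubebuilder:validation:XValidation:rule="!has(self.tidb) || !has(self.tidb.binlogEnabled) || !self.tidb.binlogEnabled || has(self.pump)",message="spec.tidb.binlogEnabled requires spec.pump to be set"
type TidbClusterSpec struct {
	// +optional
	TiDB *TiDBSpec `json:"tidb,omitempty"`
	// +optional
	Pump *PumpSpec `json:"pump,omitempty"`
}

type TiDBSpec struct {
	// +optional
	BinlogEnabled *bool `json:"binlogEnabled,omitempty"`
}

type PumpSpec struct {
	Replicas int32 `json:"replicas"`
}
```

The has() guards keep the rule from failing on specs where tidb or binlogEnabled is omitted, since both are optional in this sketch.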