pingcap / tidb-operator

TiDB operator creates and manages TiDB clusters running in Kubernetes.
https://docs.pingcap.com/tidb-in-kubernetes/
Apache License 2.0

tidb failed to connect to pump #441

Closed: xiaojingchen closed this issue 5 years ago

xiaojingchen commented 5 years ago

Bug Report

What version of Kubernetes are you using?

Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.2", GitCommit:"17c77c7898218073f14c8d573582e8d2313dc740", GitTreeState:"clean", BuildDate:"2018-10-24T06:54:59Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.2", GitCommit:"17c77c7898218073f14c8d573582e8d2313dc740", GitTreeState:"clean", BuildDate:"2018-10-24T06:43:59Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}

What version of TiDB Operator are you using? The latest version.

What storage classes exist in the Kubernetes cluster and what are used for PD/TiKV pods?

[root@172.16.4.173 ~]# kubectl get sc
NAME            PROVISIONER                    AGE
local-storage   kubernetes.io/no-provisioner   15d

What's the status of the TiDB cluster pods?

[root@172.16.4.173 ~]# kubectl get po -n stability-cluster1 -o wide
NAME                                            READY   STATUS      RESTARTS   AGE    IP               NODE           NOMINATED NODE
cluster-restore-discovery-5f8b865766-n2zv4      1/1     Running     0          117m   10.233.77.81     172.16.4.174   <none>
cluster-restore-monitor-86bc854b45-2zvzt        2/2     Running     0          117m   10.233.71.213    172.16.4.177   <none>
cluster-restore-pd-0                            1/1     Running     1          117m   10.233.102.10    172.16.4.175   <none>
cluster-restore-pd-1                            1/1     Running     0          117m   10.233.72.92     172.16.4.178   <none>
cluster-restore-pd-2                            1/1     Running     0          117m   10.233.71.238    172.16.4.177   <none>
cluster-restore-restore-backup-xzb2l            0/1     Completed   0          112m   10.233.72.106    172.16.4.178   <none>
cluster-restore-tidb-0                          1/1     Running     0          70m    10.233.102.132   172.16.4.173   <none>
cluster-restore-tidb-1                          1/1     Running     0          116m   10.233.77.67     172.16.4.174   <none>
cluster-restore-tidb-initializer-mmvp9          0/1     Completed   4          117m   10.233.72.83     172.16.4.178   <none>
cluster-restore-tikv-0                          2/2     Running     0          116m   10.233.71.205    172.16.4.177   <none>
cluster-restore-tikv-1                          2/2     Running     0          116m   10.233.103.111   172.16.4.176   <none>
cluster-restore-tikv-2                          2/2     Running     2          116m   10.233.102.14    172.16.4.175   <none>
get-backup-dir                                  1/1     Running     1          91m    10.233.72.119    172.16.4.178   <none>
stability-cluster1-backup-vlhp7                 0/1     Completed   0          115m   10.233.72.118    172.16.4.178   <none>
stability-cluster1-discovery-7b9f7db988-kztrw   1/1     Running     0          132m   10.233.77.121    172.16.4.174   <none>
stability-cluster1-drainer-0                    1/1     Running     0          90m    10.233.71.246    172.16.4.177   <none>
stability-cluster1-monitor-69b8c6b68f-xrl6s     2/2     Running     0          132m   10.233.77.123    172.16.4.174   <none>
stability-cluster1-pd-0                         1/1     Running     0          121m   10.233.103.78    172.16.4.176   <none>
stability-cluster1-pd-1                         1/1     Running     0          121m   10.233.71.211    172.16.4.177   <none>
stability-cluster1-pd-2                         1/1     Running     0          122m   10.233.72.77     172.16.4.178   <none>
stability-cluster1-pump-0                       1/1     Running     0          70m    10.233.102.156   172.16.4.173   <none>
stability-cluster1-tidb-0                       1/1     Running     0          89m    10.233.71.255    172.16.4.177   <none>
stability-cluster1-tidb-1                       1/1     Running     0          90m    10.233.77.73     172.16.4.174   <none>
stability-cluster1-tidb-initializer-bn24h       0/1     Completed   4          132m   10.233.71.229    172.16.4.177   <none>
stability-cluster1-tikv-0                       2/2     Running     1          119m   10.233.103.90    172.16.4.176   <none>
stability-cluster1-tikv-1                       2/2     Running     1          120m   10.233.72.127    172.16.4.178   <none>
stability-cluster1-tikv-2                       2/2     Running     1          121m   10.233.77.65     172.16.4.174   <none>

What did you do? Ran a failover test, during which the pump pod was restarted.

What did you see instead? The tidb server became unavailable. The tidb log:

2019/04/30 09:22:10.119 client.go:288: [warning] [pumps client] write binlog to pump stability-cluster1-pump-0:8250 (type: Commit, start ts: 408057515149361211, commit ts: 408057515214897172, length: 45) error rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp: lookup stability-cluster1-pump-0.stability-cluster1-pump on 10.233.0.10:53: no such host"
2019/04/30 09:22:10.119 client.go:288: [warning] [pumps client] write binlog to pump stability-cluster1-pump-0:8250 (type: Commit, start ts: 408057515175837720, commit ts: 408057515228528648, length: 45) error rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp: lookup stability-cluster1-pump-0.stability-cluster1-pump on 10.233.0.10:53: no such host"
2019/04/30 09:22:10.119 client.go:288: [warning] [pumps client] write binlog to pump stability-cluster1-pump-0:8250 (type: Commit, start ts: 408057515175837736, commit ts: 408057515202052160, length: 45) error rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp: lookup stability-cluster1-pump-0.stability-cluster1-pump on 10.233.0.10:53: no such host"
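The address tidb dials here is the pump pod's per-pod headless-service domain, which only resolves while the pod exists. As a minimal sketch (illustrative Go, not part of tidb), the failing lookup can be reproduced from any pod in the same namespace:

package main

import (
	"fmt"
	"net"
)

func main() {
	// The exact address tidb tries to dial, taken from the log above.
	host := "stability-cluster1-pump-0.stability-cluster1-pump"
	addrs, err := net.LookupHost(host)
	if err != nil {
		// Reproduces the "no such host" error while the pod/service is gone.
		fmt.Printf("lookup %s failed: %v\n", host, err)
		return
	}
	fmt.Printf("%s resolves to %v\n", host, addrs)
}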
weekface commented 5 years ago

pump start args:

/pump -L=info -advertise-addr=stability-cluster1-pump-0.stability-cluster1-pump:8250 -config=/etc/pump/pump.toml -log-file=
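The -advertise-addr value is the StatefulSet pod's per-pod DNS name under its headless service. A hedged sketch of how such an address is composed (names taken from this issue; pumpAdvertiseAddr is a hypothetical helper, not operator code):

package main

import "fmt"

// pumpAdvertiseAddr builds the per-pod DNS name a StatefulSet pod gets
// behind its headless service; it resolves only inside the cluster DNS.
func pumpAdvertiseAddr(clusterName string, ordinal int) string {
	pod := fmt.Sprintf("%s-pump-%d", clusterName, ordinal) // stability-cluster1-pump-0
	svc := fmt.Sprintf("%s-pump", clusterName)             // headless service name
	return fmt.Sprintf("%s.%s:8250", pod, svc)
}

func main() {
	fmt.Println(pumpAdvertiseAddr("stability-cluster1", 0))
	// Output: stability-cluster1-pump-0.stability-cluster1-pump:8250
}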
xiaojingchen commented 5 years ago

The pump itself is normal. The pump log:

2019/04/30 11:24:42 server.go:506: [info] writeBinlogCount: 0, alivePullerCount: 0,  maxCommitTS: 408059505503895553
2019/04/30 11:24:52 storage.go:355: [info] DBStats: {WriteDelayCount:0 WriteDelayDuration:0s WritePaused:false AliveSnapshots:0 AliveIterators:0 IOWrite:897117 IORead:1358741 BlockCacheSize:21751 OpenedTablesCount:1 LevelSizes:[467062] LevelTablesCounts:[1] LevelRead:[0] LevelWrite:[0] LevelDurations:[0s]}
2019/04/30 11:24:52 server.go:506: [info] writeBinlogCount: 0, alivePullerCount: 0,  maxCommitTS: 408059508649623553
2019/04/30 11:25:02 storage.go:355: [info] DBStats: {WriteDelayCount:0 WriteDelayDuration:0s WritePaused:false AliveSnapshots:0 AliveIterators:0 IOWrite:897694 IORead:1358741 BlockCacheSize:21751 OpenedTablesCount:1 LevelSizes:[467062] LevelTablesCounts:[1] LevelRead:[0] LevelWrite:[0] LevelDurations:[0s]}
2019/04/30 11:25:02 server.go:506: [info] writeBinlogCount: 0, alivePullerCount: 0,  maxCommitTS: 408059511008919553
2019/04/30 11:25:12 storage.go:355: [info] DBStats: {WriteDelayCount:0 WriteDelayDuration:0s WritePaused:false AliveSnapshots:0 AliveIterators:0 IOWrite:898186 IORead:1358741 BlockCacheSize:21751 OpenedTablesCount:1 LevelSizes:[467062] LevelTablesCounts:[1] LevelRead:[0] LevelWrite:[0] LevelDurations:[0s]}
2019/04/30 11:25:12 server.go:506: [info] writeBinlogCount: 0, alivePullerCount: 0,  maxCommitTS: 408059513368215553
2019/04/30 11:25:22 storage.go:355: [info] DBStats: {WriteDelayCount:0 WriteDelayDuration:0s WritePaused:false AliveSnapshots:0 AliveIterators:0 IOWrite:898757 IORead:1358741 BlockCacheSize:21751 OpenedTablesCount:1 LevelSizes:[467062] LevelTablesCounts:[1] LevelRead:[0] LevelWrite:[0] LevelDurations:[0s]}
2019/04/30 11:25:22 server.go:506: [info] writeBinlogCount: 0, alivePullerCount: 0,  maxCommitTS: 408059516513943553
2019/04/30 11:25:32 storage.go:355: [info] DBStats: {WriteDelayCount:0 WriteDelayDuration:0s WritePaused:false AliveSnapshots:0 AliveIterators:0 IOWrite:899334 IORead:1358741 BlockCacheSize:21751 OpenedTablesCount:1 LevelSizes:[467062] LevelTablesCounts:[1] LevelRead:[0] LevelWrite:[0] LevelDurations:[0s]}
2019/04/30 11:25:32 server.go:506: [info] writeBinlogCount: 0, alivePullerCount: 0,  maxCommitTS: 408059518886346753
2019/04/30 11:25:42 storage.go:355: [info] DBStats: {WriteDelayCount:0 WriteDelayDuration:0s WritePaused:false AliveSnapshots:0 AliveIterators:0 IOWrite:899826 IORead:1358741 BlockCacheSize:21751 OpenedTablesCount:1 LevelSizes:[467062] LevelTablesCounts:[1] LevelRead:[0] LevelWrite:[0] LevelDurations:[0s]}
2019/04/30 11:25:42 server.go:506: [info] writeBinlogCount: 0, alivePullerCount: 0,  maxCommitTS: 408059521245642753
2019/04/30 11:25:52 storage.go:355: [info] DBStats: {WriteDelayCount:0 WriteDelayDuration:0s WritePaused:false AliveSnapshots:0 AliveIterators:0 IOWrite:900397 IORead:1358741 BlockCacheSize:21751 OpenedTablesCount:1 LevelSizes:[467062] LevelTablesCounts:[1] LevelRead:[0] LevelWrite:[0] LevelDurations:[0s]}
2019/04/30 11:25:52 server.go:506: [info] writeBinlogCount: 0, alivePullerCount: 0,  maxCommitTS: 408059524391370753
2019/04/30 11:26:02 storage.go:355: [info] DBStats: {WriteDelayCount:0 WriteDelayDuration:0s WritePaused:false AliveSnapshots:0 AliveIterators:0 IOWrite:900974 IORead:1358741 BlockCacheSize:21751 OpenedTablesCount:1 LevelSizes:[467062] LevelTablesCounts:[1] LevelRead:[0] LevelWrite:[0] LevelDurations:[0s]}
2019/04/30 11:26:02 server.go:506: [info] writeBinlogCount: 0, alivePullerCount: 0,  maxCommitTS: 408059526750666753
2019/04/30 11:26:12 storage.go:355: [info] DBStats: {WriteDelayCount:0 WriteDelayDuration:0s WritePaused:false AliveSnapshots:0 AliveIterators:0 IOWrite:901466 IORead:1358741 BlockCacheSize:21751 OpenedTablesCount:1 LevelSizes:[467062] LevelTablesCounts:[1] LevelRead:[0] LevelWrite:[0] LevelDurations:[0s]}
2019/04/30 11:26:12 server.go:506: [info] writeBinlogCount: 0, alivePullerCount: 0,  maxCommitTS: 408059529109962753
2019/04/30 11:26:22 storage.go:355: [info] DBStats: {WriteDelayCount:0 WriteDelayDuration:0s WritePaused:false AliveSnapshots:0 AliveIterators:0 IOWrite:902037 IORead:1358741 BlockCacheSize:21751 OpenedTablesCount:1 LevelSizes:[467062] LevelTablesCounts:[1] LevelRead:[0] LevelWrite:[0] LevelDurations:[0s]}
2019/04/30 11:26:22 server.go:506: [info] writeBinlogCount: 0, alivePullerCount: 0,  maxCommitTS: 408059532255690753
weekface commented 5 years ago

@july2993 ptal

xiaojingchen commented 5 years ago

Why does tidb use the incomplete domain stability-cluster1-pump-0? The correct domain is stability-cluster1-pump-0.stability-cluster1-pump.

july2993 commented 5 years ago

What are the tidb and pump versions? I tested on the latest version and it works.

weekface commented 5 years ago

@xiaojingchen please supply the tidb and pump version.

xiaojingchen commented 5 years ago
/ # /pump -V
2019/05/05 08:05:25 version.go:18: [info] Git Commit Hash: 50f5b19d55371b28840f56b778db48420cff761a
2019/05/05 08:05:25 version.go:19: [info] Build TS: 2019-01-26 11:47:07
2019/05/05 08:05:25 version.go:20: [info] Go Version: go1.11.2
2019/05/05 08:05:25 version.go:21: [info] Go OS/Arch: linuxamd64
/ # /tidb-server -V
Release Version: v2.1.4
Git Commit Hash: a1d10c17039ec289ef490481b25c7091dba7ac31
Git Branch: release-2.1
UTC Build Time: 2019-02-15 06:24:33
GoVersion: go version go1.11.2 linux/amd64
Race Enabled: false
TiKV Min Version: 2.1.0-alpha.1-ff3dd160846b7d1aed9079c389fc188f7f5ea13e
Check Table Before Drop: false
tennix commented 5 years ago

@july2993 Is this fixed in TiDB v2.1.8?

july2993 commented 5 years ago
2019/04/30 09:22:10.119 client.go:288: [warning] [pumps client] write binlog to pump stability-cluster1-pump-0:8250 (type: Commit, start ts: 408057515149361211, commit ts: 408057515214897172, length: 45) error rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp: lookup stability-cluster1-pump-0.stability-cluster1-pump on 10.233.0.10:53: no such host"

I checked the code: in the log line write binlog to pump stability-cluster1-pump-0:8250, the stability-cluster1-pump-0:8250 part is the NodeID of the pump, which by default is derived from the hostname.

But the client does dial stability-cluster1-pump-0.stability-cluster1-pump, and it fails with: error: desc = "transport: Error while dialing dial tcp: lookup stability-cluster1-pump-0.stability-cluster1-pump on 10.233.0.10:53: no such host"
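In other words, the NodeID printed in the log and the address actually dialed are different strings. A minimal Go sketch of that distinction (defaultNodeID is hypothetical, for illustration only, not the tidb-binlog source):

package main

import (
	"fmt"
	"os"
)

// defaultNodeID mimics deriving the pump NodeID from the pod hostname:
// inside pod "stability-cluster1-pump-0" this yields
// "stability-cluster1-pump-0:8250", while clients must still dial the
// full headless-service domain shown in the error above.
func defaultNodeID(port int) (string, error) {
	host, err := os.Hostname()
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("%s:%d", host, port), nil
}

func main() {
	id, err := defaultNodeID(8250)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("NodeID:", id)
}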

I checked the tidb log: the errors all happened before 2019/04/30 09:28:08.891. But the pump logs (kubectl logs stability-cluster1-pump-0 -n stability-cluster1 > out) show that the last start time is:

2019/04/30 09:34:51 version.go:18: [info] Git Commit Hash: 50f5b19d55371b28840f56b778db48420cff761a
2019/04/30 09:34:51 version.go:19: [info] Build TS: 2019-01-26 11:47:07
2019/04/30 09:34:51 version.go:20: [info] Go Version: go1.11.2
2019/04/30 09:34:51 version.go:21: [info] Go OS/Arch: linuxamd64

So maybe the pump had not started yet at that time, which is why the writes failed.
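Comparing the two timestamps quoted above bears that out (a trivial Go check using the times from this thread):

package main

import (
	"fmt"
	"time"
)

func main() {
	layout := "2006/01/02 15:04:05"
	// Last tidb write-binlog error vs. the pump's most recent start.
	lastErr, _ := time.Parse(layout, "2019/04/30 09:28:08")
	pumpStart, _ := time.Parse(layout, "2019/04/30 09:34:51")
	fmt.Println("errors predate pump start:", lastErr.Before(pumpStart)) // true
}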

xiaojingchen commented 5 years ago

@july2993 Why does tidb still fail after the pump has started? I think tidb should reconnect to the pump once the pump has started successfully.

xiaojingchen commented 5 years ago

Pump must be started before tidb, and for data safety tidb has to become unavailable while pump is down; at that point tidb must be restarted manually. At present there is no good solution to this problem, so it is suggested to put it on hold for the time being.
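For reference, a minimal Go sketch of the reconnect-with-backoff behavior requested above (hypothetical, not the real tidb-binlog pumps client; the absence of this behavior is exactly the limitation described):

package main

import (
	"errors"
	"fmt"
	"time"
)

// writeWithRetry retries a failed binlog write with exponential backoff
// instead of requiring a manual tidb restart.
func writeWithRetry(write func() error, attempts int) error {
	backoff := time.Second
	var err error
	for i := 0; i < attempts; i++ {
		if err = write(); err == nil {
			return nil
		}
		fmt.Printf("write binlog failed (attempt %d), retrying in %s: %v\n", i+1, backoff, err)
		time.Sleep(backoff)
		backoff *= 2
	}
	return err
}

func main() {
	calls := 0
	err := writeWithRetry(func() error {
		calls++
		if calls < 3 {
			return errors.New("no such host") // simulate the DNS failure
		}
		return nil
	}, 5)
	fmt.Println("final result:", err)
}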