pingcap / tidb-operator

TiDB operator creates and manages TiDB clusters running in Kubernetes.
https://docs.pingcap.com/tidb-in-kubernetes/
Apache License 2.0

TiDB/pump pods keep running when they're unable to respond #605

Closed. sokada1221 closed this issue 4 years ago.

sokada1221 commented 5 years ago

Bug Report

What version of Kubernetes are you using?

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.1", GitCommit:"b7394102d6ef778017f2ca4046abbaa23b88c290", GitTreeState:"clean", BuildDate:"2019-04-19T22:12:47Z", GoVersion:"go1.12.4", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.6-eks-d69f1b", GitCommit:"d69f1bf3669bf00b7f4a758e978e0e7a1e3a68f7", GitTreeState:"clean", BuildDate:"2019-02-28T20:26:10Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}

What version of TiDB Operator are you using?

$ kubectl exec -n tidb-admin tidb-controller-manager-545d6c854d-7sbpr -- tidb-controller-manager -V
TiDB Operator Version: version.Info{TiDBVersion:"2.1.0", GitVersion:"v1.0.0-beta.3", GitCommit:"6257dfaad68f55f745f20f6f5d19b10bea2b0bea", GitTreeState:"clean", BuildDate:"2019-06-06T09:51:04Z", GoVersion:"go1.12", Compiler:"gc", Platform:"linux/amd64"}

What storage classes exist in the Kubernetes cluster and what are used for PD/TiKV pods?

$ kubectl get sc
NAME            PROVISIONER                    AGE
ebs-gp2         kubernetes.io/aws-ebs          5d2h
gp2 (default)   kubernetes.io/aws-ebs          5d12h
local-storage   kubernetes.io/no-provisioner   5d2h
$ kubectl get pvc -n tidb
NAME                     STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS    AGE
data-stgtidb-drainer-0   Bound    pvc-309ca526-945e-11e9-aa2c-0ec2fc371712   10Gi       RWO            ebs-gp2         3d19h
data-stgtidb-pump-0      Bound    pvc-30a29fa3-945e-11e9-aa2c-0ec2fc371712   10Gi       RWO            ebs-gp2         3d19h
pd-stgtidb-pd-0          Bound    local-pv-67a5cf5                           183Gi      RWO            local-storage   5d2h
stgtidb-monitor          Bound    pvc-2fbd2b63-935b-11e9-b73a-0699f94e5a34   500Gi      RWO            ebs-gp2         5d2h
tikv-stgtidb-tikv-0      Bound    local-pv-34d3afd2                          183Gi      RWO            local-storage   5d2h

What's the status of the TiDB cluster pods?

$ kubectl get po -n tidb -o wide
NAME                                 READY   STATUS    RESTARTS   AGE    IP           NODE                                            NOMINATED NODE
stgtidb-discovery-6c7fc47b94-qn2sm   1/1     Running   0          5d2h   10.0.2.135   ip-10-0-2-65.ap-northeast-1.compute.internal    <none>
stgtidb-drainer-0                    1/1     Running   3          20h    10.0.3.80    ip-10-0-3-81.ap-northeast-1.compute.internal    <none>
stgtidb-monitor-77bdf6c857-59xtl     2/2     Running   0          22h    10.0.2.178   ip-10-0-2-65.ap-northeast-1.compute.internal    <none>
stgtidb-pd-0                         1/1     Running   0          5d2h   10.0.1.34    ip-10-0-1-115.ap-northeast-1.compute.internal   <none>
stgtidb-pump-0                       1/1     Running   4          20h    10.0.3.204   ip-10-0-3-81.ap-northeast-1.compute.internal    <none>
stgtidb-tidb-0                       1/1     Running   0          19h    10.0.1.139   ip-10-0-1-103.ap-northeast-1.compute.internal   <none>
stgtidb-tikv-0                       1/1     Running   0          5d2h   10.0.3.189   ip-10-0-3-13.ap-northeast-1.compute.internal    <none>

What did you do? Installed tidb-operator and tidb-cluster according to the installation guide. Additionally, pump & drainer are deployed as you can see from the list of pods above.

What did you expect to see? When TiDB/pump cannot respond to any request, the liveness probe should fail and k8s should try to restart pods automatically.
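
For reference, which probes the operator actually configures on the pump pod, and how many times it has been restarted, can be checked with something like the following (a sketch; pod and namespace names are taken from the output above):

$ kubectl -n tidb get pod stgtidb-pump-0 -o jsonpath='{.spec.containers[0].livenessProbe}'
$ kubectl -n tidb describe pod stgtidb-pump-0 | grep -E 'Liveness|Readiness|Restart Count'

If the first command prints nothing, no liveness probe is configured for that container.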

What did you see instead? MySQL clients cannot connect with the following error, i.e. TiDB can no longer serve clients

$ mysql -h master -P 4000 -u root
ERROR 2003 (HY000): Can't connect to MySQL server on 'master' (111)

The TiDB pod log is filled with the following message towards the end:

[2019/06/25 15:05:16.191 +00:00] [ERROR] [server.go:365] ["listener stopped, waiting for manual kill."] [stack="github.com/pingcap/tidb/server.(*Server).Run\n\t/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb/server/server.go:365\nmain.runServer\n\t/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb/tidb-server/main.go:564\nmain.main\n\t/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/tidb/tidb-server/main.go:173\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:200"]

I've encountered this issue twice so far, and I believe it happens when the pump service is unstable. This time, for instance, pump ran out of disk space:

[2019/06/25 15:24:13.905 +00:00] [ERROR] [storage.go:664] ["write batch failed"] [error="write /data/kv/000056.log: no space left on device"]
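
The disk usage on the pump data volume can be confirmed directly with something like the following (a sketch; the /data directory comes from the log above and the PVC name from the earlier output):

$ kubectl -n tidb exec stgtidb-pump-0 -- df -h /data
$ kubectl -n tidb get pvc data-stgtidb-pump-0
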
suzaku commented 5 years ago

Could you please search for "write binlog failed" in TiDB logs so that we can confirm that it's the broken pump that caused the issue?

WangXiangUSTC commented 5 years ago

It may be because TiDB failed to write binlogs to pump; you need to search the TiDB log for binlog errors as @suzaku said. If you find such entries, please check pump's status. You can also provide pump's log and I will help you check it.

sokada1221 commented 5 years ago

@suzaku Yes, the log also contained a lot of these:

[2019/06/25 11:52:07.335 +00:00] [WARN] [client.go:288] ["[pumps client] write binlog to pump failed"] [NodeID=stgtidb-pump-0:8250] ["binlog type"=Prewrite] ["start ts"=409328293240111125] ["commit ts"=0] [length=2795] [error="rpc error: code = Unknown desc = unable to write to log file: /data/value/000022.vlog: write /data/value/000022.vlog: no space left on device"]

@WangXiangUSTC Pump's status was Running, as you can see from the output above. Sorry, I re-deployed a new cluster so I don't have the full pump log anymore. If I reproduce it again, I'll post it here. Thanks.

WangXiangUSTC commented 5 years ago

The error log shows no space left on device, which means your pump's server has run out of disk space.

suzaku commented 5 years ago

@shinnosuke-okada

Here's what we know right now:

  1. There's only 1 Pump deployed
  2. The Pump service is running
  3. There's not enough disk space left for the Pump service to save new binlogs

The pump_client used in TiDB automatically tries other Pumps when more than one is available. Since there was only one running Pump here and it wasn't writable, TiDB itself refused new requests to avoid losing binlogs.

Because the Pump service is running but not writable due to the no space left on device error, this isn't the kind of problem that can be solved by k8s automatically restarting the service.

weekface commented 5 years ago

@shinnosuke-okada

We can scale out the pump service to get more space, for example:
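
A sketch, assuming pump was deployed with the tidb-cluster Helm chart; the binlog.pump.replicas value name and the pingcap chart repo alias are assumptions here, so check your own values.yaml and repo setup:

# hypothetical: raise the pump replica count in values.yaml, e.g. binlog.pump.replicas: 3
$ helm upgrade <release-name> pingcap/tidb-cluster -f values.yaml

Since pump is a StatefulSet with its own volume claim template, each added replica gets a fresh PV, which adds binlog capacity.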

Do we need to restart all the tidb pods when we add more pump pods?

sokada1221 commented 5 years ago

Thank you all for your input! Yes, I understand how pump is running out of disk space, and how that's causing tidb to fail.

My concern in this ticket is rather: shouldn't the pods' state reflect that they cannot respond to any request? As @suzaku says, restarting won't fix the problem, but after reaching the retry threshold the state should at least reflect that they're essentially dead (e.g. ERROR).

suzaku commented 5 years ago

Yes, from the perspective of k8s, we only know that the pump process is running but we don't know for sure if it's ready for requests.

pump_client sends empty binlogs periodically to check whether any of the failed pump instances has resumed. But that won't work in this case: "writing" empty binlogs succeeds even when the disk is full, because they never actually get saved.

We need to come up with a better way to healthcheck the pump services.
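
One possible direction (just a sketch, not something pump or the operator provides today) would be an exec-style probe run inside the pump container that fails when the data volume is nearly full:

# fail (exit 1) if /data has less than ~1GiB available
$ df -P /data | awk 'NR==2 { exit ($4 < 1048576) ? 1 : 0 }'

This only catches the disk-full case, though, not other ways pump could become unable to accept binlogs.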

suzaku commented 5 years ago

@shinnosuke-okada Currently, the best known way to monitor the status of pump services may be to use the metrics collected in Prometheus to create alerts.
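
For example, if a Prometheus instance in the cluster collects kubelet volume metrics, the free space of the pump PVC can be queried and alerted on. A sketch (the kubelet_volume_stats_available_bytes metric being scraped and the placeholder service names are assumptions about your monitoring setup):

$ kubectl -n <prometheus-namespace> port-forward svc/<your-prometheus> 9090:9090 &
$ curl -sG 'http://localhost:9090/api/v1/query' \
    --data-urlencode 'query=kubelet_volume_stats_available_bytes{persistentvolumeclaim="data-stgtidb-pump-0"}'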

sokada1221 commented 5 years ago

@suzaku Good call! I'll work on the alerts after getting the major parts working in our PoC cluster. Thanks.

aylei commented 4 years ago

This behavior is as designed, so I'm closing the issue. Feel free to reopen it if you have any questions @shinnosuke-okada