Could you please search for "write binlog failed" in TiDB logs so that we can confirm that it's the broken pump that caused the issue?
It may be because TiDB failed to write binlog to pump; you need to search the TiDB log as @suzaku said. If you find these log entries, please check the pump's status. You can also provide the pump's log and I will help you check it.
@suzaku Yes, the log also contained a lot of these:
[2019/06/25 11:52:07.335 +00:00] [WARN] [client.go:288] ["[pumps client] write binlog to pump failed"] [NodeID=stgtidb-pump-0:8250] ["binlog type"=Prewrite] ["start ts"=409328293240111125] ["commit ts"=0] [length=2795] [error="rpc error: code = Unknown desc = unable to write to log file: /data/value/000022.vlog: write /data/value/000022.vlog: no space left on device"]
@WangXiangUSTC Pump's status was running as you can see from the output above. Sorry, I re-deployed a new cluster so I don't have the full pump log anymore. If I reproduce again, I'll post it here. Thanks.
The error log shows "no space left on device", which means there is no space left on your pump server's disk.
@shinnosuke-okada Here's what we know right now:

The pump_client used in TiDB would automatically try different Pumps if there's more than one of them available. Since there's only one running Pump here and it wasn't writable, TiDB itself refused new requests to avoid losing binlogs.

The Pump service is running but not writable because of the "no space left on device" error; it seems that this is not the kind of problem that can be solved by k8s automatically restarting the service.
@shinnosuke-okada we can scale the pump service to get more space: increase the pump replica count in charts/tidb-cluster/values.yaml to 2 or more, and then run helm upgrade <releaseName> charts/tidb-cluster to scale out the pump service.
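For example (a sketch; the exact key layout in values.yaml can differ between chart versions, so verify against the chart you deployed):

```yaml
# charts/tidb-cluster/values.yaml (excerpt; key names are illustrative,
# check your chart version for the exact structure)
binlog:
  pump:
    replicas: 2      # run 2 or more Pumps so the client can fail over
    storage: 20Gi    # optionally request a larger data volume per Pump
# then apply the change:
#   helm upgrade <releaseName> charts/tidb-cluster
```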
@suzaku do we need to restart all the tidb pods when we add more pump pods?
Thank you all for your input! Yes, I understand how pump is running out of disk space, and how that's causing tidb to fail.
My concern in this ticket is rather: shouldn't the pods' state reflect how they cannot respond to any request? As @suzaku says, restarting won't fix the problem, but after reaching the retry threshold the state should at least reflect that they're essentially dead (e.g. ERROR).
Yes, from the perspective of k8s, we only know that the pump process is running but we don't know for sure if it's ready for requests.
pump_client sends empty binlogs periodically to check if any of the failed pump instances has resumed. But it won't work in this case: "writing" empty binlogs will succeed even when the disk is full because they never actually get saved.
We need to come up with a better way to health-check the pump services.
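For illustration only, a probe along these lines could catch the disk-full case (a sketch, not a tested fix; the /data mount path is an assumption taken from the error message earlier in this thread):

```yaml
# Sketch of a writability-based liveness probe for the Pump container.
# /data is assumed from the "no space left on device" error above;
# adjust it to wherever the Pump data volume is actually mounted.
livenessProbe:
  exec:
    command:
      - sh
      - -c
      # An empty write can still "succeed" on a full disk, so write a
      # non-empty file and fail the probe if the write is rejected.
      - 'echo probe > /data/.liveness && rm -f /data/.liveness'
  initialDelaySeconds: 30
  periodSeconds: 30
  failureThreshold: 3
```

Note that, as discussed above, restarting won't free disk space; such a probe would mainly make the broken state visible in the pod status rather than fix it.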
@shinnosuke-okada Currently, the best known way to monitor the status of pump services may be to use the metrics collected in Prometheus to create alerts.
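For example, a generic free-space alert could look like this (a sketch assuming node_exporter metrics are being scraped; the mountpoint label value and the 10% threshold are placeholders):

```yaml
# Sketch of a Prometheus alerting rule using standard node_exporter
# metrics; the mountpoint label and the threshold are placeholders.
groups:
  - name: pump-disk
    rules:
      - alert: PumpDiskAlmostFull
        expr: |
          node_filesystem_avail_bytes{mountpoint="/data"}
            / node_filesystem_size_bytes{mountpoint="/data"} < 0.10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pump data volume has less than 10% free space"
```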
@suzaku Good call! I'll work on the alerts after getting the major parts working in our PoC cluster. Thanks.
The behavior is as designed; closing. Feel free to reopen this issue if you have any questions @shinnosuke-okada
Bug Report
What version of Kubernetes are you using?
What version of TiDB Operator are you using?
What storage classes exist in the Kubernetes cluster and what are used for PD/TiKV pods?
What's the status of the TiDB cluster pods?
What did you do? Installed tidb-operator and tidb-cluster according to the installation guide. Additionally, pump & drainer are deployed as you can see from the list of pods above.
What did you expect to see? When TiDB/pump cannot respond to any request, the liveness probe should fail and k8s should try to restart pods automatically.
What did you see instead? MySQL clients cannot connect with the following error, i.e. TiDB cannot serve clients anymore
TiDB pod log is filled with the following message towards the end:
I encountered this issue twice so far, and I believe it happens when pump service is unstable. For instance, this time, pump ran out of disk space.