tiancaiamao opened 4 years ago
PTAL @AilinKid
@AilinKid Any update?
I think this is a duplicate of https://github.com/pingcap/tidb/issues/10260 - please close both if you fix it :-)
I've investigated for a while.

1. The domain has some work to do before TiDB closes, such as cleaning up registered information in PD. The etcd campaign loop needs the pdclient.done() signal to break out of its loop. However, the PD client is closed in the store layer, which happens after the domain close. Moving the PD client close ahead of the domain close seems valid but not elegant, because skipping the cleanup is bad manners.
2. Besides, there are many sub-goroutines that need an oracle TS to push their jobs forward. With PD gone, their RPC requests fail, and the backoff context (constructed with context.Background) is the only channel that can break their loop.

It seems hard to find an elegant way to exit without PD...
We tried:

```go
func (do *Domain) Close() {
	if do == nil {
		return
	}
	startTime := time.Now()
	// If you are sure the linked PD is dead, you can move the etcd client
	// close action forward to here, because the campaign loop / DDL owner
	// checker will use this client to send requests and will block unless
	// you actively cancel the context inside the etcd client.
	// After pressing Ctrl+C in TiDB you still need to wait about 3 minutes,
	// because these background goroutines have retry counts and keep trying
	// to reconnect. Then the domain closes, then the store closes, and
	// finally TiDB exits.
	if do.etcdClient != nil {
		terror.Log(errors.Trace(do.etcdClient.Close()))
	}
	if do.ddl != nil {
		terror.Log(do.ddl.Stop())
	}
	// ...
}
```
But closing the etcd client first is not always a good choice, because you cannot tell whether the hang is actually caused by PD's death. If PD is healthy, this code skips a lot of cleanup of registered information in etcd, stats-related info won't be stored in TiKV, and so on.

Judging whether PD is dead is inherently a guess. You can send PD a request as a probe, but a "connection refused" can also be caused by unstable network isolation.

So I think that, for this extreme case (PD is dead), letting the user close TiDB forcibly with kill -9 PID is more understandable. The information lost in the TiDB instance is then also expected by the user.
PTAL @morgo @bb7133 @wjhuang2016 @tiancaiamao
The use case I was thinking about is that the pd-server is dead because of an incorrect shutdown order, i.e. you want to shut down all components, but pd-server shuts down first. tidb-server and tikv-server already correctly handle the case of incorrect startup order (they will spin waiting for pd-server). In a distributed system it is hard to ensure order, so it's nice if shutdown can have the same properties as startup.
Makes sense. Maybe we can change this issue into a feature request and try to figure out an elegant way to do this.
I've decided to make this a feature request instead of a bug.
Bug Report
Please answer these questions before submitting your issue. Thanks!
1. Minimal reproduce step (Required)
Run a cluster, kill pd, then kill tidb-server (Ctrl-C).
2. What did you expect to see? (Required)
tidb-server exits.
3. What did you see instead (Required)
The process prints a lot of error logs and never exits.
Use kill -USR1 pid to get the goroutine stack: the process is blocked in domain.Close, waiting for ownerManager to exit. However, ownerManager is running its CampaignOwner loop, and it seems this loop never ends...
4. Affected version (Required)
master f31298f5bb55d0c37dcd95c30d0253deef6b850e
5. Root Cause Analysis