pingcap / tidb

TiDB - the open-source, cloud-native, distributed SQL database designed for modern applications.
https://pingcap.com
Apache License 2.0

Make TiDB server shutdown gracefully when PD is dead #18336

Open tiancaiamao opened 4 years ago

tiancaiamao commented 4 years ago

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

Run a cluster, kill pd, then kill tidb-server (Ctrl - C)

2. What did you expect to see? (Required)

tidb-server exit

3. What did you see instead (Required)

The process prints a lot of error logs and never exits.

Run kill -USR1 <pid> to get the goroutine stack:

goroutine 1 [semacquire, 17 minutes]:
sync.runtime_Semacquire(0xc0004798b0)
        /media/genius/OS/project/go/src/runtime/sema.go:56 +0x42
sync.(*WaitGroup).Wait(0xc0004798a8)
        /media/genius/OS/project/go/src/sync/waitgroup.go:130 +0x64
github.com/pingcap/tidb/owner.(*ownerManager).Cancel(0xc000479830)
        /media/genius/OS/project/src/github.com/pingcap/tidb/owner/manager.go:121 +0x3d
github.com/pingcap/tidb/ddl.(*ddl).close(0xc0001b7f20)
        /media/genius/OS/project/src/github.com/pingcap/tidb/ddl/ddl.go:365 +0xbc
github.com/pingcap/tidb/ddl.(*ddl).Stop(0xc0001b7f20, 0x0, 0x0)
        /media/genius/OS/project/src/github.com/pingcap/tidb/ddl/ddl.go:296 +0x95
github.com/pingcap/tidb/domain.(*Domain).Close(0xc0005d2900)
        /media/genius/OS/project/src/github.com/pingcap/tidb/domain/domain.go:608 +0x2bf
main.closeDomainAndStorage()
        /media/genius/OS/project/src/github.com/pingcap/tidb/tidb-server/main.go:718 +0x3f
main.cleanup()
        /media/genius/OS/project/src/github.com/pingcap/tidb/tidb-server/main.go:730 +0x80
main.main()
        /media/genius/OS/project/src/github.com/pingcap/tidb/tidb-server/main.go:188 +0x1f1

It blocks on domain.Close, waiting for the ownerManager to exit. However, the ownerManager is still running its CampaignOwner loop, and it seems this loop never ends ...

go.etcd.io/etcd/clientv3.(*txn).Commit(0xc000bff4d0, 0x0, 0x0, 0x0)
        /media/genius/OS/project/pkg/mod/go.etcd.io/etcd@v0.5.0-alpha.5.0.20191023171146-3cf2f69b5738/clientv3/txn.go:146 +0x16f
go.etcd.io/etcd/clientv3/concurrency.(*Election).Resign(0xc000690540, 0x3adc4a0, 0xc0004d3f00, 0xc000397fe0, 0x13)
        /media/genius/OS/project/pkg/mod/go.etcd.io/etcd@v0.5.0-alpha.5.0.20191023171146-3cf2f69b5738/clientv3/concurrency/election.go:138 +0x3b6
go.etcd.io/etcd/clientv3/concurrency.(*Election).Campaign(0xc000690540, 0x3adc4a0, 0xc000690500, 0xc000052c90, 0x24, 0xc000397fe0, 0x13)
        /media/genius/OS/project/pkg/mod/go.etcd.io/etcd@v0.5.0-alpha.5.0.20191023171146-3cf2f69b5738/clientv3/concurrency/election.go:98 +0x798
github.com/pingcap/tidb/owner.(*ownerManager).campaignLoop(0xc000479830, 0xc000678570)
        /media/genius/OS/project/src/github.com/pingcap/tidb/owner/manager.go:274 +0x6fe
created by github.com/pingcap/tidb/owner.(*ownerManager).CampaignOwner
        /media/genius/OS/project/src/github.com/pingcap/tidb/owner/manager.go:194 +0x343

4. Affected version (Required)

master f31298f5bb55d0c37dcd95c30d0253deef6b850e

5. Root Cause Analysis

zimulala commented 4 years ago

PTAL @AilinKid

wjhuang2016 commented 4 years ago

@AilinKid Any update?

morgo commented 3 years ago

I think this is a duplicate of https://github.com/pingcap/tidb/issues/10260 - please close both if you fix it :-)

AilinKid commented 3 years ago

I've investigated for a while. 1: The domain has some work to do before TiDB closes, such as cleaning up registered information in PD. The etcd campaign needs the pdclient.done() signal to break its loop. However, the PD client is closed in the store layer, which happens after the domain close action. Moving it ahead seems valid but not elegant, because skipping the cleanup is not good practice.

2: Besides, there are many sub-goroutines that need an oracle ts to push their jobs forward. With PD lost, their RPC requests fail, and the backoff ctx (constructed with context.Background) is the only channel that could break their loops.

It seems hard to find an elegant way to exit without PD...

AilinKid commented 3 years ago

Notice: after trial and error, we found this occurs on master, v4.0, v3.0... (perhaps this behavior has existed from the very beginning).

We tried:

func (do *Domain) Close() {
    if do == nil {
        return
    }
    startTime := time.Now()

    // If you are sure the linked PD is dead, you can move the etcd client
    // close action forward to here, because the campaign loop, DDL owner
    // checker, etc. use this client to send requests and will stay blocked
    // unless you actively cancel the context inside the etcd client.
    // Even after pressing Ctrl+C in TiDB, you still have to wait about
    // 3 minutes, because these background goroutines have retry budgets
    // and keep trying to reconnect. Only then does the domain close, then
    // the store, and finally TiDB exits.
    if do.etcdClient != nil {
        terror.Log(errors.Trace(do.etcdClient.Close()))
    }

    if do.ddl != nil {
        terror.Log(do.ddl.Stop())
    }

AilinKid commented 3 years ago

But closing the etcd client first is not always a good choice, because you cannot tell whether the hang is caused by PD being dead. If PD is healthy, this code skips a lot of information cleanup in etcd, stats-related info won't be stored in TiKV, and ...

Judging whether PD is dead is subjective. You can send PD a request to test, but a connection-refused error can also be caused by a transient network partition.

AilinKid commented 3 years ago

So I think, for this subjective extreme case (PD is dead), letting the user close TiDB forcibly with kill -9 <pid> is understandable. The information lost in the TiDB instance is also expected by the user in that case.

AilinKid commented 3 years ago

PTAL @morgo @bb7133 @wjhuang2016 @tiancaiamao

morgo commented 3 years ago

The use case I was thinking about is that the pd-server is dead because of an incorrect shutdown order, i.e. you want to shut down all components, but pd-server shuts down first. tidb-server and tikv-server already handle the corresponding startup case correctly (they will spin waiting for pd-server).

In a distributed system it is hard to ensure ordering, so it's nice if shutdown has the same properties as startup.

AilinKid commented 3 years ago

Makes sense. Maybe we can change this issue into a feature request and try to figure out an elegant way to do this.

bb7133 commented 2 years ago

I've decided to treat this as a feature request instead of a bug.