pingcap / dm

Data Migration Platform
Apache License 2.0

evict leader can't take effect #1983

Closed lance6716 closed 3 years ago

lance6716 commented 3 years ago

Bug Report

Re-explanation of https://github.com/pingcap/dm/issues/1212

Please answer these questions before submitting your issue. Thanks!

  1. What did you do? If possible, provide a recipe for reproducing the error.

evict leader

  2. What did you expect to see?

evict success

  3. What did you see instead?

leader can't be evicted

  4. Root cause

dm-ci.log.zip

In the log we can see that the current election key (consisting of a user-specified prefix and the etcd session lease ID in hex) is /dm-master/leader/690e7b3305c06104:

[2021/08/11 10:23:46.449 +08:00] [DEBUG] [interceptor.go:181] ["request stats"] [component="embed etcd"] ["start time"=2021/08/11 10:23:46.441 +08:00] ["time spent"=7.229401ms] [remote=127.0.0.1:50072] ["response type"=/etcdserverpb.KV/Txn] ["request count"=1] ["request size"=90] ["response count"=0] ["response size"=38] ["request content"="compare:<target:CREATE key:\"/dm-master/leader/690e7b3305c06104\" create_revision:0 > success:<request_put:<key:\"/dm-master/leader/690e7b3305c06104\" value_size:40 lease:7570123482726424836 >> failure:<request_range:<key:\"/dm-master/leader/690e7b3305c06104\" > >"]
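As a quick check of how the key relates to the lease in the log line above, here is a minimal sketch that formats the lease ID as hex and appends it to the prefix. The helper name `electionKey` and passing the prefix with its trailing slash are assumptions made for illustration; only the lease value and the resulting key come from the log.

```go
package main

import "fmt"

// electionKey is a hypothetical helper: it joins a prefix (already ending
// in "/") with the lease ID rendered as lowercase hex.
func electionKey(prefix string, leaseID int64) string {
	return fmt.Sprintf("%s%x", prefix, leaseID)
}

func main() {
	// Lease ID taken from the "lease:" field of the Txn request in the log.
	fmt.Println(electionKey("/dm-master/leader/", 7570123482726424836))
}
```

The printed key is /dm-master/leader/690e7b3305c06104, the same key that appears in the Txn request.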

The previous election round ends gracefully:

[2021/08/11 10:23:53.483 +08:00] [INFO] [election.go:284] ["fail to campaign"] [component=election] ["current member"="{\"id\":\"master3\",\"addr\":\"localhost:8461\"}"] [error="context canceled"]

A new campaign starts:

[2021/08/11 10:23:53.483 +08:00] [DEBUG] [election.go:266] ["begin to campaign"] [component=election] ["current member"="{\"id\":\"master3\",\"addr\":\"localhost:8461\"}"]

This new campaign fails, and the election key recorded internally is empty:

[2021/08/11 10:24:00.488 +08:00] [DEBUG] [election.go:271] ["before manually resign"] [component=election] ["current election key"=] ["current election header"=<nil>]
[2021/08/11 10:24:00.488 +08:00] [DEBUG] [election.go:279] ["after manually resign"] [component=election] ["current election key"=] ["current election header"=<nil>]
[2021/08/11 10:24:00.488 +08:00] [INFO] [election.go:284] ["fail to campaign"] [component=election] ["current member"="{\"id\":\"master3\",\"addr\":\"localhost:8461\"}"] [error="etcdserver: request timed out"]

This is caused by the following code in etcd:

https://github.com/etcd-io/etcd/blob/ea24fb850762ce38155738aff5ae71368eadb904/client/v3/concurrency/election.go#L69-L81

Campaign returns at line 79, so e.leaderKey is never assigned. Another clue (needs checking) is that the error is etcdserver: request timed out, which means it is undetermined whether the commit succeeded.
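The early return can be illustrated with a small in-memory model. This is only a sketch under assumptions: `fakeStore`, `election`, `campaign` and `resign` are hypothetical stand-ins written for this example, not the etcd client code. The store applies the write but reports a timeout, which is one possible outcome of a timed-out request.

```go
package main

import (
	"errors"
	"fmt"
)

// errTimeout stands in for "etcdserver: request timed out".
var errTimeout = errors.New("etcdserver: request timed out")

// fakeStore is a hypothetical in-memory stand-in for etcd.
type fakeStore struct {
	keys        map[string]string
	timeoutNext bool // when true, the next write is applied but reports a timeout
}

// putIfAbsent mimics the create-if-not-exists transaction seen in the log.
func (s *fakeStore) putIfAbsent(key, val string) error {
	if _, ok := s.keys[key]; !ok {
		s.keys[key] = val
	}
	if s.timeoutNext {
		s.timeoutNext = false
		return errTimeout // the write landed, but the caller cannot know that
	}
	return nil
}

// election is a simplified model that tracks only the leader key.
type election struct {
	store     *fakeStore
	key       string
	leaderKey string
}

func (e *election) campaign(val string) error {
	if err := e.store.putIfAbsent(e.key, val); err != nil {
		return err // early return: leaderKey is never assigned
	}
	e.leaderKey = e.key
	return nil
}

func (e *election) resign() {
	if e.leaderKey == "" {
		return // nothing recorded, so nothing is deleted
	}
	delete(e.store.keys, e.leaderKey)
	e.leaderKey = ""
}

func main() {
	s := &fakeStore{keys: map[string]string{}, timeoutNext: true}
	e := &election{store: s, key: "/dm-master/leader/690e7b3305c06104"}

	fmt.Println("campaign error:", e.campaign("master3"))
	fmt.Printf("recorded leader key: %q\n", e.leaderKey)
	e.resign()
	_, exists := s.keys[e.key]
	fmt.Println("key still in store after resign:", exists)
}
```

In this model the key stays in the store after `resign`, because the election never recorded that it owns it.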

Soon afterwards we observed the election key, so the commit above did succeed:

[2021/08/11 10:24:05.325 +08:00] [INFO] [election.go:306] ["get response from election observe"] [component=election] [key=/dm-master/leader/690e7b3305c06104] [value="{\"id\":\"master3\",\"addr\":\"localhost:8461\"}"]
[2021/08/11 10:24:05.325 +08:00] [INFO] [election.go:337] ["become leader"] [component=election] ["current member"="{\"id\":\"master3\",\"addr\":\"localhost:8461\"}"]

But the current campaign is considered failed because of etcdserver: request timed out, so we can't delete that key with Resign when evicting the leader.

The next time we enter the campaign loop, we skip Campaign because e.evictLeader.Load() is true, so we have no chance to inherit the election key from last time and then delete it with Resign. The orphaned election key therefore never gets deleted, which causes this DM-master to always become leader.

https://github.com/pingcap/dm/blob/4868d4e011f445c7bc89fc3168483862b51b6302/pkg/election/election.go#L261-L268
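The difference between skipping Campaign and campaigning once more can be sketched with the same kind of in-memory model. Everything here is hypothetical and simplified for illustration (`election`, `newOrphaned`, `evictWithSkip`, `evictAfterRecampaign`); it is not the DM or etcd code, and the second path is not a proposed fix, only a demonstration of what "inheriting" the key would allow.

```go
package main

import (
	"errors"
	"fmt"
)

var errTimeout = errors.New("request timed out")

// election is a hypothetical model: kv plays the role of etcd.
type election struct {
	kv        map[string]string
	key       string
	leaderKey string
}

// campaign creates the key if absent. With timeout=true the write lands but
// an error is reported, so leaderKey stays empty.
func (e *election) campaign(val string, timeout bool) error {
	if _, ok := e.kv[e.key]; !ok {
		e.kv[e.key] = val
	}
	if timeout {
		return errTimeout
	}
	// Also reached when the key already existed: the key is "inherited".
	e.leaderKey = e.key
	return nil
}

func (e *election) resign() {
	if e.leaderKey != "" {
		delete(e.kv, e.leaderKey)
		e.leaderKey = ""
	}
}

// newOrphaned returns an election whose key was written by a timed-out campaign.
func newOrphaned() *election {
	e := &election{kv: map[string]string{}, key: "/dm-master/leader/690e7b3305c06104"}
	_ = e.campaign("master3", true)
	return e
}

// evictWithSkip models the reported behavior: resign, then skip campaign.
// It reports whether the orphaned key is still present.
func evictWithSkip(e *election) bool {
	e.resign() // no-op, leaderKey is empty
	_, still := e.kv[e.key]
	return still
}

// evictAfterRecampaign campaigns once more to inherit the key, then resigns.
func evictAfterRecampaign(e *election) bool {
	if err := e.campaign("master3", false); err != nil {
		return true
	}
	e.resign()
	_, still := e.kv[e.key]
	return still
}

func main() {
	fmt.Println("orphan remains when campaign is skipped:", evictWithSkip(newOrphaned()))
	fmt.Println("orphan remains after re-campaign:", evictAfterRecampaign(newOrphaned()))
}
```

In this model the first path leaves the key in place and the second removes it, which matches the reasoning above about why the orphaned key is never cleaned up.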

lance6716 commented 3 years ago

cc @gozssky

The etcd API may be reasonable to some extent, because if we call Campaign again next time, we can inherit the orphaned election key.