tikv / pd

Placement driver for TiKV
Apache License 2.0

TSO primary election took 14 minutes after the PD(API) pods were deleted at the same time and the PD(API) leader was re-elected #6554

Closed binshi-bing closed 11 months ago

binshi-bing commented 1 year ago

Enhancement Task

What did you do? In the dev env:

  1. kubectl delete pod serverless-cluster-pd-0 serverless-cluster-pd-1 serverless-cluster-pd-2 -n tidb-serverless --force --grace-period=0
  2. tso_0 (the secondary):
    [2023/06/02 03:06:31.612 +00:00] [ERROR] [register.go:82] ["keep alive failed"] [key=/ms/7187976276065784319/tso/registry/http://pd-tso-server-0.tso-service.tidb-serverless.svc:2379]
    **repeat the following 4 lines every 1s**
    [2023/06/02 03:06:46.624 +00:00] [WARN] [retry_interceptor.go:62] ["retrying of unary invoker failed"] [target=endpoint://client-bf54c660-20ab-44b5-ab7b-5bc8cbf44d8b/serverless-cluster-pd.tidb-serverless.svc:2379] [attempt=0] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
    [2023/06/02 03:06:46.624 +00:00] [WARN] [etcdutil.go:136] ["kv gets too slow"] [request-key=/ms/7187976276065784319/tso/00000/dc-location] [cost=10.000772311s] [error="context deadline exceeded"]
    [2023/06/02 03:06:46.624 +00:00] [ERROR] [etcdutil.go:141] ["load from etcd meet error"] [key=/ms/7187976276065784319/tso/00000/dc-location] [error="[PD:etcd:ErrEtcdKVGet]context deadline exceeded: context deadline exceeded"]
    [2023/06/02 03:06:46.624 +00:00] [ERROR] [allocator_manager.go:848] ["get cluster dc-locations from etcd failed"] [] [error="[PD:etcd:ErrEtcdKVGet]context deadline exceeded: context deadline exceeded"]
    ...
    [2023/06/02 03:20:46.623 +00:00] [WARN] [etcdutil.go:136] ["kv gets too slow"] [request-key=/ms/7187976276065784319/tso/00000/dc-location] [cost=10.000127145s] [error="context deadline exceeded"]
    [2023/06/02 03:20:46.623 +00:00] [ERROR] [etcdutil.go:141] ["load from etcd meet error"] [key=/ms/7187976276065784319/tso/00000/dc-location] [error="[PD:etcd:ErrEtcdKVGet]context deadline exceeded: context deadline exceeded"]
    [2023/06/02 03:20:46.623 +00:00] [ERROR] [allocator_manager.go:848] ["get cluster dc-locations from etcd failed"] [] [error="[PD:etcd:ErrEtcdKVGet]context deadline exceeded: context deadline exceeded"]
    [2023/06/02 03:20:46.623 +00:00] [WARN] [retry_interceptor.go:62] ["retrying of unary invoker failed"] [target=endpoint://client-bf54c660-20ab-44b5-ab7b-5bc8cbf44d8b/serverless-cluster-pd.tidb-serverless.svc:2379] [attempt=0] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
    [2023/06/02 03:21:46.624 +00:00] [WARN] [etcdutil.go:136] ["kv gets too slow"] [request-key=/ms/7187976276065784319/tso/00000/dc-location] [cost=10.000692425s] [error="context deadline exceeded"]
    [2023/06/02 03:21:46.624 +00:00] [ERROR] [etcdutil.go:141] ["load from etcd meet error"] [key=/ms/7187976276065784319/tso/00000/dc-location] [error="[PD:etcd:ErrEtcdKVGet]context deadline exceeded: context deadline exceeded"]
    [2023/06/02 03:21:46.624 +00:00] [ERROR] [allocator_manager.go:848] ["get cluster dc-locations from etcd failed"] [] [error="[PD:etcd:ErrEtcdKVGet]context deadline exceeded: context deadline exceeded"]
    [2023/06/02 03:21:46.624 +00:00] [WARN] [retry_interceptor.go:62] ["retrying of unary invoker failed"] [target=endpoint://client-bf54c660-20ab-44b5-ab7b-5bc8cbf44d8b/serverless-cluster-pd.tidb-serverless.svc:2379] [attempt=0] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
    **[2023/06/02 03:21:53.550 +00:00] [WARN] [retry_interceptor.go:62] ["retrying of unary invoker failed"] [target=endpoint://client-bf54c660-20ab-44b5-ab7b-5bc8cbf44d8b/serverless-cluster-pd.tidb-serverless.svc:2379] [attempt=0] [error="rpc error: code = Unavailable desc = transport is closing"]**
    [2023/06/02 03:21:53.564 +00:00] [WARN] [leadership.go:194] ["required revision has been compacted, use the compact revision"] [required-revision=1800213] [compact-revision=2353567]
    [2023/06/02 03:21:53.804 +00:00] [INFO] [leadership.go:211] ["current leadership is deleted"] [leader-key=/ms/7187976276065784319/tso/00000/primary] [purpose="keyspace group primary election"]
    [2023/06/02 03:21:53.804 +00:00] [INFO] [global_allocator.go:553] ["the tso primary has changed, try to re-campaign a primary"] []
    [2023/06/02 03:21:53.804 +00:00] [INFO] [global_allocator.go:562] ["start to campaign the primary"] [] [campaign-tso-primary-name=http://pd-tso-server-0.tso-service.tidb-serverless.svc:2379-00000]
    [2023/06/02 03:21:53.806 +00:00] [INFO] [lease.go:66] ["lease granted"] [lease-id=7473592164536214571] [lease-timeout=3] [purpose="keyspace group primary election"]
    [2023/06/02 03:21:53.809 +00:00] [INFO] [leadership.go:122] ["check campaign resp"] [resp="{\"header\":{\"cluster_id\":10920548605668718190,\"member_id\":13485048042260555703,\"revision\":2481213,\"raft_term\":23},\"succeeded\":true,\"responses\":[{\"Response\":{\"ResponsePut\":{\"header\":{\"revision\":2481213}}}}]}"]
    [2023/06/02 03:21:53.809 +00:00] [INFO] [leadership.go:131] ["write leaderData to leaderPath ok"] [leaderPath=/ms/7187976276065784319/tso/00000/primary] [purpose="keyspace group primary election"]
    [2023/06/02 03:21:53.809 +00:00] [INFO] [global_allocator.go:595] ["campaign tso primary ok"] [] [campaign-tso-primary-name=http://pd-tso-server-0.tso-service.tidb-serverless.svc:2379-00000]
    [2023/06/02 03:21:53.809 +00:00] [INFO] [global_allocator.go:606] ["initializing the global tso allocator"]
    [2023/06/02 03:21:53.809 +00:00] [INFO] [lease.go:143] ["start lease keep alive worker"] [interval=1s] [purpose="keyspace group primary election"]
    [2023/06/02 03:21:53.931 +00:00] [INFO] [tso.go:185] ["sync and save timestamp"] [last=2023/06/02 03:06:29.715 +00:00] [save=2023/06/02 03:21:56.918 +00:00] [next=2023/06/02 03:21:53.918 +00:00]
    [2023/06/02 03:21:53.931 +00:00] [INFO] [global_allocator.go:625] ["tso primary is ready to serve"] [] [tso-primary-name=http://pd-tso-server-0.tso-service.tidb-serverless.svc:2379-00000]
  3. tso_1 (the primary at the time):
    [2023/06/02 03:06:29.675 +00:00] [INFO] [lease.go:126] ["lease timeout"] [expire=2023/06/02 03:06:31.673 +00:00] [purpose="keyspace group primary election"]
    [2023/06/02 03:06:29.675 +00:00] [WARN] [lease.go:158] ["lease keep alive failed"] [purpose="keyspace group primary election"] [error="context canceled"]
    [2023/06/02 03:06:29.675 +00:00] [INFO] [lease.go:172] ["stop lease keep alive worker"] [purpose="keyspace group primary election"]
    [2023/06/02 03:06:29.675 +00:00] [WARN] [lease.go:158] ["lease keep alive failed"] [purpose="keyspace group primary election"] [error="context canceled"]
    [2023/06/02 03:06:29.675 +00:00] [WARN] [lease.go:158] ["lease keep alive failed"] [purpose="keyspace group primary election"] [error="context canceled"]
    [2023/06/02 03:06:30.999 +00:00] [ERROR] [register.go:82] ["keep alive failed"] [key=/ms/7187976276065784319/tso/registry/http://pd-tso-server-1.tso-service.tidb-serverless.svc:2379]
    [2023/06/02 03:06:31.681 +00:00] [INFO] [global_allocator.go:636] ["no longer a primary because lease has expired, the tso primary will step down"] []
    [2023/06/02 03:06:32.682 +00:00] [WARN] [retry_interceptor.go:62] ["retrying of unary invoker failed"] [target=endpoint://client-13874eab-e8c6-4950-b46f-f708fdd4ca9d/serverless-cluster-pd.tidb-serverless.svc:2379] [attempt=0] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
    [2023/06/02 03:06:32.682 +00:00] [INFO] [tso.go:384] ["reset the timestamp in memory"]
    [2023/06/02 03:06:39.716 +00:00] [WARN] [retry_interceptor.go:62] ["retrying of unary invoker failed"] [target=endpoint://client-13874eab-e8c6-4950-b46f-f708fdd4ca9d/serverless-cluster-pd.tidb-serverless.svc:2379] [attempt=0] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
    [2023/06/02 03:06:39.716 +00:00] [WARN] [etcdutil.go:136] ["kv gets too slow"] [request-key=/pd/7187976276065784319/timestamp] [cost=10.00049442s] [error="context deadline exceeded"]
    [2023/06/02 03:06:39.716 +00:00] [ERROR] [etcdutil.go:141] ["load from etcd meet error"] [key=/pd/7187976276065784319/timestamp] [error="[PD:etcd:ErrEtcdKVGet]context deadline exceeded: context deadline exceeded"]
    [2023/06/02 03:06:39.716 +00:00] [WARN] [tso.go:322] ["save timestamp failed"] [dc-location=global] [timestamp-path=timestamp] [error="[PD:etcd:ErrEtcdKVGet]context deadline exceeded: context deadline exceeded"]
    [2023/06/02 03:06:39.716 +00:00] [WARN] [allocator_manager.go:801] ["failed to update allocator's timestamp"] [] [dc-location=global] [name=http://pd-tso-server-1.tso-service.tidb-serverless.svc:2379-00000] [error="[PD:etcd:ErrEtcdKVGet]context deadline exceeded: context deadline exceeded"]
    [2023/06/02 03:06:39.716 +00:00] [INFO] [tso.go:384] ["reset the timestamp in memory"]
    [2023/06/02 03:06:42.683 +00:00] [WARN] [retry_interceptor.go:62] ["retrying of unary invoker failed"] [target=endpoint://client-13874eab-e8c6-4950-b46f-f708fdd4ca9d/serverless-cluster-pd.tidb-serverless.svc:2379] [attempt=0] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
    [2023/06/02 03:06:42.683 +00:00] [WARN] [etcdutil.go:136] ["kv gets too slow"] [request-key=/ms/7187976276065784319/tso/00000/primary] [cost=10.001048119s] [error="context deadline exceeded"]
    [2023/06/02 03:06:42.683 +00:00] [ERROR] [etcdutil.go:141] ["load from etcd meet error"] [key=/ms/7187976276065784319/tso/00000/primary] [error="[PD:etcd:ErrEtcdKVGet]context deadline exceeded: context deadline exceeded"]
    [2023/06/02 03:06:42.683 +00:00] [ERROR] [participant.go:213] ["getting the leader meets error"] [error="[PD:etcd:ErrEtcdKVGet]context deadline exceeded: context deadline exceeded"]
    ...
    the 4 log lines above repeated every 10 or 11 seconds.
    [2023/06/02 03:06:52.885 +00:00] [WARN] [etcdutil.go:136] ["kv gets too slow"] [request-key=/ms/7187976276065784319/tso/00000/primary] [cost=10.001082809s] [error="context deadline exceeded"]
    [2023/06/02 03:06:52.885 +00:00] [ERROR] [etcdutil.go:141] ["load from etcd meet error"] [key=/ms/7187976276065784319/tso/00000/primary] [error="[PD:etcd:ErrEtcdKVGet]context deadline exceeded: context deadline exceeded"]
    [2023/06/02 03:06:52.885 +00:00] [ERROR] [participant.go:213] ["getting the leader meets error"] [error="[PD:etcd:ErrEtcdKVGet]context deadline exceeded: context deadline exceeded"]
    [2023/06/02 03:06:52.885 +00:00] [WARN] [retry_interceptor.go:62] ["retrying of unary invoker failed"] [target=endpoint://client-13874eab-e8c6-4950-b46f-f708fdd4ca9d/serverless-cluster-pd.tidb-serverless.svc:2379] [attempt=0] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
    ...
    [2023/06/02 03:22:00.220 +00:00] [WARN] [retry_interceptor.go:62] ["retrying of unary invoker failed"] [target=endpoint://client-13874eab-e8c6-4950-b46f-f708fdd4ca9d/serverless-cluster-pd.tidb-serverless.svc:2379] [attempt=0] [error="rpc error: code = Unavailable desc = transport is closing"]
    [2023/06/02 03:22:00.220 +00:00] [WARN] [retry_interceptor.go:62] ["retrying of unary invoker failed"] [target=endpoint://client-13874eab-e8c6-4950-b46f-f708fdd4ca9d/serverless-cluster-pd.tidb-serverless.svc:2379] [attempt=0] [error="rpc error: code = Unavailable desc = transport is closing"]
    [2023/06/02 03:22:00.248 +00:00] [WARN] [etcdutil.go:136] ["kv gets too slow"] [request-key=/ms/7187976276065784319/tso/00000/primary] [cost=9.450242803s] []
    [2023/06/02 03:22:00.249 +00:00] [INFO] [global_allocator.go:547] ["start to watch the primary"] [] [campaign-tso-primary-name=http://pd-tso-server-1.tso-service.tidb-serverless.svc:2379-00000] [tso-primary="name:\"http://pd-tso-server-0.tso-service.tidb-serverless.svc:2379-00000\" id:10272136419596665458 listen_urls:\"http://pd-tso-server-0.tso-service.tidb-serverless.svc:2379\" "]

What did you expect to see? TSO primary election within 15 seconds

What did you see instead? TSO primary election took 14 minutes

What version of PD are you using (pd-server -V)? tidbcloud/pd-cse release-6.6-keyspace 9e1e2de5194655c2e1224ce94057e7f7fb5f7eb4 tikv/pd master

lhy1024 commented 1 year ago

[etcdutil.go:136] ["kv gets too slow"] [request-key=/ms/7187976276065784319/tso/00000/dc-location] [cost=10.000692425s]

Maybe the etcd client is still using the old connection.

rleungx commented 1 year ago

@lhy1024 Any progress on this issue?

lhy1024 commented 1 year ago

@lhy1024 Any progress on this issue?

I am trying to reproduce it.

lhy1024 commented 1 year ago

This issue involves at least two parts; one of them is a watch issue on the secondary.

Regarding whether the delete event was missed: "current leadership is deleted" appears after "required revision has been compacted", indicating that the delete event came after the previously watched revision 1800213. So the delete event was not missed.

However, we cannot rule out the watcher's handling of the compact revision. The PD log shows that the compaction to 2353567 was executed at 2:26, PD was restarted at 3:07, and the TSO secondary only received the compaction from 1800213 to 2353567 at 3:21. That compaction should have been observed half an hour earlier.

lhy1024 commented 1 year ago

The other part is about connections.

I found a similar issue in which requests failed for 15 minutes after a pod update and then recovered; the comments there show it is related to TCP_USER_TIMEOUT.

When the keepalive mechanism is not triggered, tcp_retries2 alone may not tear down the connection effectively: the documentation says the timeout may be anywhere between 13 and 30 minutes, and that TCP_USER_TIMEOUT can be used to configure it.
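
As a back-of-the-envelope estimate of where that 13-30 minute window comes from (assuming the usual exponential RTO backoff capped at TCP_RTO_MAX = 120s; the exact kernel accounting may differ slightly):

package sketch

import "time"

// totalRetransmitTime estimates how long the kernel keeps retransmitting an
// unacknowledged segment: each retry doubles the RTO, capped at TCP_RTO_MAX.
func totalRetransmitTime(initialRTO time.Duration, retries int) time.Duration {
	const rtoMax = 120 * time.Second
	total := time.Duration(0)
	rto := initialRTO
	for i := 0; i < retries; i++ {
		total += rto
		rto *= 2
		if rto > rtoMax {
			rto = rtoMax
		}
	}
	return total
}

// With the default tcp_retries2=15 and a 200ms initial RTO this comes to
// roughly 13 minutes; a larger initial RTO (already backed off, or a high-RTT
// link) pushes it toward the 30-minute end.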

According to https://github.com/grpc/proposal/blob/master/A18-tcp-user-timeout.md, since https://github.com/grpc/grpc-go/pull/2307 TCP_USER_TIMEOUT can be configured via KeepaliveParams.
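
For illustration, a minimal sketch of what configuring this on a grpc-go client could look like (the endpoint and the 10s/3s values are placeholders, not PD's actual settings):

package main

import (
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/keepalive"
)

func main() {
	// Since grpc-go#2307, the keepalive Timeout below is also applied as
	// TCP_USER_TIMEOUT on the underlying socket (per proposal A18), so a dead
	// peer is detected within seconds instead of waiting out the kernel's
	// tcp_retries2 window.
	conn, err := grpc.Dial(
		"serverless-cluster-pd.tidb-serverless.svc:2379", // placeholder endpoint
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                10 * time.Second, // ping when the connection is idle for 10s
			Timeout:             3 * time.Second,  // declare the connection dead after 3s without an ack
			PermitWithoutStream: true,             // keep pinging even with no active RPCs
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
}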

But unfortunately, my local attempts with iptables and tcpkill did not reproduce this timeout.

lhy1024 commented 1 year ago

We can consider temporarily using multiendpoint + keepalive again to avoid this problem (see the sketch below), but note that this will cause pd-leader-io-hang to always fail.
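
A rough sketch of what I mean by multiendpoint + keepalive, against the go.etcd.io/etcd/client/v3 package (the endpoint names and durations are placeholders only):

package main

import (
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Give the client every PD address instead of a single service name, and
	// enable dial keepalive; clientv3 forwards these durations to grpc-go's
	// keepalive.ClientParameters, which in turn sets TCP_USER_TIMEOUT.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints: []string{ // placeholder per-pod addresses
			"http://serverless-cluster-pd-0.<peer-service>.tidb-serverless.svc:2379",
			"http://serverless-cluster-pd-1.<peer-service>.tidb-serverless.svc:2379",
			"http://serverless-cluster-pd-2.<peer-service>.tidb-serverless.svc:2379",
		},
		DialTimeout:          3 * time.Second,
		DialKeepAliveTime:    10 * time.Second,
		DialKeepAliveTimeout: 3 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
}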

In the meantime, before investigating the watch problem on the secondary further, we can also introduce WithRequireLeader, handle closeErr, and resume the watch from revision = wresp.Header.Revision + 1, roughly as sketched below.
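
Roughly what I have in mind for the watch loop, as a sketch only (the function and variable names are mine, not the actual PD code):

package watchsketch

import (
	"context"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// watchKey keeps watching key from the given revision, re-creating the watcher
// on errors, resuming from the compact revision, and advancing the revision
// past events it has already handled.
func watchKey(ctx context.Context, cli *clientv3.Client, key string, revision int64) {
	for {
		watcher := clientv3.NewWatcher(cli)
		// WithRequireLeader makes the watch fail fast when the member serving
		// it loses contact with the etcd leader.
		wch := watcher.Watch(clientv3.WithRequireLeader(ctx), key, clientv3.WithRev(revision))
		for wresp := range wch {
			if wresp.CompactRevision != 0 {
				// The required revision has been compacted: resume from the
				// compact revision instead of retrying the stale one forever.
				revision = wresp.CompactRevision
				break
			}
			if err := wresp.Err(); err != nil {
				// closeErr / canceled watch: break out and re-create the watcher.
				break
			}
			for _, ev := range wresp.Events {
				_ = ev // handle PUT/DELETE events here, e.g. the primary key being deleted
			}
			// Resume right after the last observed revision so already-handled
			// events are not replayed.
			revision = wresp.Header.Revision + 1
		}
		watcher.Close()
		select {
		case <-ctx.Done():
			return
		case <-time.After(time.Second): // small backoff before retrying
		}
	}
}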

lhy1024 commented 1 year ago

This issue involves at least two parts; one of them is a watch issue on the secondary.

Regarding whether the delete event was missed: "current leadership is deleted" appears after "required revision has been compacted", indicating that the delete event came after the previously watched revision 1800213. So the delete event was not missed.

However, we cannot rule out the watcher's handling of the compact revision. The PD log shows that the compaction to 2353567 was executed at 2:26, PD was restarted at 3:07, and the TSO secondary only received the compaction from 1800213 to 2353567 at 3:21. That compaction should have been observed half an hour earlier.

Maybe there is no problem with the handling of the compact revision after all; I checked the log again.

[2023/06/01 13:53:24.673 +00:00] [INFO] [server.go:1786] ["update tso primary"] [primary=http://pd-tso-server-1.tso-service.tidb-serverless.svc:2379]
[2023/06/01 13:53:24.673 +00:00] [INFO] [leadership.go:122] ["check campaign resp"] [resp="{\"header\":{\"cluster_id\":10920548605668718190,\"member_id\":13485048042260555703,\"revision\":1800213,\"raft_term\":20},\"succeeded\":true,\"responses\":[{\"Response\":{\"ResponsePut\":{\"header\":{\"revision\":1800213}}}}]}"]

**{... no "update tso primary" entries in the log ...}**

[2023/06/01 15:26:50.423 +00:00] [INFO] [kvstore_compaction.go:56] ["finished scheduled compaction"] [compact-revision=1825171] [took=1.166314444s]

**{... more log entries about compaction ...}**

[2023/06/02 02:26:50.588 +00:00] [INFO] [kvstore_compaction.go:56] ["finished scheduled compaction"] [compact-revision=2353567] [took=1.179783116s]

[2023/06/02 03:06:35.932 +00:00] [INFO] [versioninfo.go:89] ["Welcome to Placement Driver (API SERVICE)"](pd-0)
[2023/06/02 03:06:36.507 +00:00] [INFO] [versioninfo.go:89] ["Welcome to Placement Driver (API SERVICE)"](pd-2)
[2023/06/02 03:06:42.576 +00:00] [INFO] [server.go:1786] ["update tso primary"] [primary=http://pd-tso-server-1.tso-service.tidb-serverless.svc:2379](pd-0)
[2023/06/02 03:06:49.009 +00:00] [INFO] [server.go:1786] ["update tso primary"] [primary=http://pd-tso-server-1.tso-service.tidb-serverless.svc:2379](pd-2)

[2023/06/02 03:21:53.564 +00:00] [WARN] [leadership.go:194] ["required revision has been compacted, use the compact revision"] [required-revision=1800213] [compact-revision=2353567]

[2023/06/02 03:21:53.808 +00:00] [INFO] [server.go:1786] ["update tso primary"] [primary=http://pd-tso-server-0.tso-service.tidb-serverless.svc:2379]

These logs show that the TSO primary was last updated to tso-1 at 13:53, that updates to other keys triggered multiple compactions afterwards, and that after PD restarted at 03:06 it still reported tso-1 as the primary until the re-election later.

To verify this guess, I implemented a simple unit test that updates other keys and then compacts, to see whether the watcher receives any message.


// the watcher is watching "TestWatcherBreak"
// some other operations
suite.put("TestWatcherBreak", "3")

// put other keys
for i := 0; i < 1000; i++ {
    suite.put(fmt.Sprintf("TestWatcherBreak/%d", i), fmt.Sprintf("4_%d", i))
}

// the revision is now greater than 1000
resp, err := EtcdKVGet(suite.client, "TestWatcherBreak")
suite.NoError(err)
suite.Greater(int(resp.Header.Revision), 1000)

// compact; the watcher does not receive any message
revision := resp.Header.Revision
resp2, err := suite.etcd.Server.Compact(suite.ctx, &etcdserverpb.CompactionRequest{Revision: revision})
suite.NoError(err)
suite.Equal(revision, resp2.Header.Revision)

// put the watched key again and the watcher updates its revision
time.Sleep(time.Second * 10)
suite.put("TestWatcherBreak", "5")
This test shows that if only a compaction happens, the watcher does not receive any message until the key it is watching is updated again.

lhy1024 commented 1 year ago

I also tried to reproduce it in k8s:

  1. Not a graceful shutdown, restarted immediately: QPS recovered within 2 minutes (screenshot attached).
  2. Waited 10 minutes before restarting and checked QPS after 2 minutes: QPS dropped 27% only once, and the tidb 8027 error occurred twice (screenshot attached).