pingcap / tiflow

This repo maintains DM (a data migration platform) and TiCDC (change data capture for TiDB)
Apache License 2.0

TiKV initial scan task not cancelled/terminated when changefeed paused or failed #11638

Open fubinzh opened 2 weeks ago

fubinzh commented 2 weeks ago

What did you do?

  1. 07:51 - 07:55 Create 4 CDC clusters (each with 32C CDC nodes); for each CDC cluster create one changefeed, each changefeed replicating ~40k tables (~160k tables overall)
  2. 07:58 Pause changefeeds and update config
  3. 08:12 Resume all changefeeds
  4. 09:38 Pause all changefeeds
  5. 10:33 Delete changefeeds (from the TiDB Cloud UI)
  6. 10:07 - 10:33 Manually kill and restart all TiKV nodes
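
For reference, the pause/resume operations above map to TiCDC open API calls; the paths below are taken from the middleware logs later in this thread, while the TiCDC address and changefeed ID are placeholders (the reporter drove these actions from the TiDB Cloud UI). A minimal Go sketch:

```go
package main

import (
	"fmt"
	"net/http"
)

func main() {
	// Placeholder TiCDC server address and changefeed ID; replace with real values.
	base := "http://127.0.0.1:8300"
	cf := "blackhole-test"

	// Pause the changefeed (v1 API path, as seen in the middleware logs in this thread).
	resp, err := http.Post(base+"/api/v1/changefeeds/"+cf+"/pause", "application/json", nil)
	if err != nil {
		panic(err)
	}
	fmt.Println("pause status:", resp.Status)
	resp.Body.Close()

	// Resume the changefeed (v2 API path, as seen in the middleware logs in this thread).
	resp, err = http.Post(base+"/api/v2/changefeeds/"+cf+"/resume", "application/json", nil)
	if err != nil {
		panic(err)
	}
	fmt.Println("resume status:", resp.Status)
	resp.Body.Close()
}
```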

What did you expect to see?

What did you see instead?

After step 1, initialization of one changefeed had finished (all tables were added to the changefeed). After step 3, the table count of that initialized changefeed stayed at 0 for 1.4h+ before it was paused. After step 4, the number of pending initial scan tasks decreased, but very slowly. After step 6, the pending tasks disappeared.


Versions of the cluster

v6.5.9

fubinzh commented 2 weeks ago

Encountered a similar issue when a changefeed failed.

The changefeed failed at 10/03 00:00 during the initial scan, but the initial scan continued (and sorter disk usage actually increased) until I deleted all changefeeds at 10/03 09:00.



fubinzh commented 2 weeks ago

/severity major

3AceShowHand commented 1 week ago

Pausing or removing a changefeed closes the kv client, which should also close the gRPC connections to TiKV-CDC and cancel the pending connections.
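
As an illustration of the expected teardown path, here is a minimal sketch only, not tiflow's actual kv client code, with all names hypothetical: closing the client cancels a shared context, which should in turn abort every per-region stream.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// kvClient is a hypothetical stand-in for the per-changefeed kv client.
type kvClient struct {
	ctx    context.Context
	cancel context.CancelFunc
	wg     sync.WaitGroup
}

func newKVClient() *kvClient {
	ctx, cancel := context.WithCancel(context.Background())
	return &kvClient{ctx: ctx, cancel: cancel}
}

// startRegionStream models one stream to TiKV for a region's initial scan.
func (c *kvClient) startRegionStream(regionID int) {
	c.wg.Add(1)
	go func() {
		defer c.wg.Done()
		for {
			select {
			case <-c.ctx.Done():
				fmt.Printf("region %d: stream cancelled\n", regionID)
				return
			case <-time.After(100 * time.Millisecond):
				// Pretend to receive and process a batch of events.
			}
		}
	}()
}

// Close is what pause/remove is expected to trigger: cancel the shared
// context so every region stream exits, then wait for them to finish.
func (c *kvClient) Close() {
	c.cancel()
	c.wg.Wait()
}

func main() {
	c := newKVClient()
	for r := 1; r <= 3; r++ {
		c.startRegionStream(r)
	}
	time.Sleep(300 * time.Millisecond)
	c.Close() // pausing or removing the changefeed should end up here
	fmt.Println("kv client closed, all streams cancelled")
}
```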

3AceShowHand commented 1 week ago

I am trying to reproduce this scenario in my local test environment; here are some observations:

On the TiCDC side:

On the TiKV side:

[2024/10/12 22:03:07.265 +08:00] [INFO] [middleware.go:48] [/api/v2/changefeeds/blackhole-test/resume] [status=200] [method=POST] [path=/api/v2/changefeeds/blackhole-test/resume] [query=] [ip=10.2.7.26] [user-agent=Go-http-client/1.1] [client-version=v6.5.9] [] [duration=146.190585ms]

[2024/10/12 22:04:38.215 +08:00] [INFO] [middleware.go:48] [/api/v1/changefeeds/blackhole-test/pause] [status=202] [method=POST] [path=/api/v1/changefeeds/blackhole-test/pause] [query=] [ip=10.2.7.26] [user-agent=Go-http-client/1.1] [client-version=v6.5.9] [] [duration=78.699473ms]

[2024/10/12 22:04:41.591 +08:00] [INFO] [client.go:613] ["region failed"]
3AceShowHand commented 1 week ago
[2024/10/02 01:38:47.996 +00:00] [INFO] [feed_state_manager.go:243] ["handle admin job"] [namespace=default] [changefeed=changefeed-1841264655229501440] [job="{\"CfID\":{\"Namespace\":\"default\",\"ID\":\"changefeed-1841264655229501440\"},\"Type\":1,\"Error\":null,\"OverwriteCheckpointTs\":0}"]

[2024/10/02 01:38:47.996 +00:00] [INFO] [ddl_puller.go:692] ["close the ddl puller"] [namespace=default] [changefeed=changefeed-1841264655229501440_ddl_puller]

[2024/10/02 01:38:48.091 +00:00] [INFO] [changefeed.go:723] ["changefeed closed"] [namespace=default] [changefeed=changefeed-1841264655229501440]

[2024/10/02 01:38:48.098 +00:00] [INFO] [manager.go:132] ["Closing source manager"] [namespace=default] [changefeed=changefeed-1841264655229501440]

[2024/10/02 01:38:48.098 +00:00] [INFO] [manager.go:140] ["All pullers have been closed"] [namespace=default] [changefeed=changefeed-1841264655229501440] [cost=181ns]

[2024/10/02 01:38:48.098 +00:00] [INFO] [manager.go:147] ["Closed source manager"] [namespace=default] [changefeed=changefeed-1841264655229501440] [cost=44.248µs]

From the logs above, we know that both the changefeed and the processor were closed.

But some logs still indicate that the kv client keeps fetching data: a region reports an error, and the client then restarts the connection to that region.

[2024/10/02 01:38:48.099 +00:00] [INFO] [client.go:613] ["region failed"] [span="[748000fffffffffffffe5f728000000000ff1f11290000000000fa, 748000fffffffffffffe5f728000000000ff1f1a570000000000fa)"] [regionId=1504] [error="[CDC:ErrPendingRegionCancel]pending region cancelled due to stream disconnecting"]
[2024/10/02 01:38:48.146 +00:00] [INFO] [client.go:1462] ["event feed starts to check locked regions"] [namespace=default] [changefeed=changefeed-1841264655229501440] [tableID=67502] 
...
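
These reconnect attempts are the flip side of the cancellation sketch above: a retry loop that keeps re-establishing the region stream until its context is cancelled. A minimal sketch of that pattern (hypothetical names, not tiflow's actual code); if the cancellation never reaches this loop, the region keeps being retried even though the changefeed is already closed.

```go
package main

import (
	"context"
	"errors"
	"log"
	"time"
)

// retryRegion models an event-feed loop that reconnects to a region on error.
// It only stops when ctx is cancelled; if closing the changefeed never cancels
// ctx, the reconnect attempts continue after the changefeed is closed.
func retryRegion(ctx context.Context, regionID uint64, connect func() error) {
	for {
		select {
		case <-ctx.Done():
			log.Printf("region %d: retry loop stopped", regionID)
			return
		default:
		}
		if err := connect(); err != nil {
			log.Printf("region %d failed, reconnecting: %v", regionID, err)
			time.Sleep(200 * time.Millisecond) // hypothetical backoff
			continue
		}
		return
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	go retryRegion(ctx, 1504, func() error {
		return errors.New("pending region cancelled due to stream disconnecting")
	})
	time.Sleep(time.Second)
	cancel() // only this stops the retries
	time.Sleep(300 * time.Millisecond)
}
```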

The logs below indicate that the processor was blocked for a long time; this may be the root cause of the issue.


[2024/10/01 23:59:09.085 +00:00] [WARN] [client.go:271] ["etcd client outCh blocking too long, the etcdWorker may be stuck"] [duration=10.00036425s] [role=processor]

[2024/10/02 01:38:48.093 +00:00] [WARN] [client.go:271] ["etcd client outCh blocking too long, the etcdWorker may be stuck"] [duration=1h39m44.999948235s] [role=processor]
[2024/10/02 01:38:48.194 +00:00] [INFO] [puller.go:293] ["table puller has been stucked"] [namespace=default] [changefeed=changefeed-1841264655229501440] [tableID=82046] [tableName=`db_218`.`dh_pay_customer_0`] [resolvedTs=452818737561600000] [slowestRangeTs=452818737561600000] [range="[7480000000000140ff7e5f72ffffffffffffffffff0000000000fb, 7480000000000140ff7e5f730000000000fa)"]

[2024/10/02 00:00:58.937 +00:00] [INFO] [client.go:1485] ["event feed finds a uninitialized slow region"] [namespace=default] [changefeed=changefeed-1841264655229501440] [tableID=22030] [tableName=`db_134`.`dh_user_withdraw_order_0`] [slowRegion="{\"RegionID\":2176379,\"CheckpointTs\":452818737561600000,\"Initialized\":false,\"Created\":\"2024-10-01T23:54:58.945274175Z\"}"]
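
The "outCh blocking too long" warning above is the signature of a watchdog around a blocking channel send: while the send has not completed, a timer periodically logs how long it has been stuck. A minimal sketch of that pattern, with hypothetical names rather than the actual etcd client wrapper (the 10-second tick mirrors the ~10s duration in the first warning):

```go
package main

import (
	"log"
	"time"
)

// sendWithWatchdog tries to deliver ev on outCh and warns periodically
// while the receiver is not draining the channel.
func sendWithWatchdog(outCh chan<- string, ev string) {
	start := time.Now()
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case outCh <- ev:
			return
		case <-ticker.C:
			log.Printf("outCh blocking too long, the consumer may be stuck, duration=%s",
				time.Since(start))
		}
	}
}

func main() {
	outCh := make(chan string) // unbuffered: the send blocks until someone receives
	go func() {
		time.Sleep(25 * time.Second) // simulate a stuck consumer
		<-outCh
	}()
	sendWithWatchdog(outCh, "etcd event")
	log.Println("event delivered")
}
```

A duration of 1h39m in the second warning therefore means the downstream consumer (here, the processor) did not drain the channel for the whole time the changefeed was supposedly closing.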
3AceShowHand commented 1 week ago
[2024/10/01 23:58:50.650 +00:00] [INFO] [processor.go:1242] ["Processor try to close source manager"] [namespace=default] [changefeed=changefeed-1841264655229501440]
[2024/10/01 23:58:50.650 +00:00] [INFO] [manager.go:132] ["Closing source manager"] [namespace=default] [changefeed=changefeed-1841264655229501440]
...
[2024/10/02 00:20:13.166 +00:00] [INFO] [manager.go:140] ["All pullers have been closed"] [namespace=default] [changefeed=changefeed-1841264655229501440] [cost=21m22.516489165s]

One log indicates that closing the pullers took more than 21 minutes. Another processor does not have such a log and was blocked for more than 2.5 hours.

[2024/10/02 02:32:39.092 +00:00] [WARN] [client.go:271] ["etcd client outCh blocking too long, the etcdWorker may be stuck"] [duration=2h33m35.999487376s] [role=processor]
3AceShowHand commented 1 week ago

The root cause of the issue is that the kv client cannot unregister its event handler from the worker pool immediately.

When a puller is closed, its kv client is closed as well, and the kv client has to unregister its event handler from the worker pool first. This step takes too long, so the pullers cannot be closed promptly. With a large number of tables (one puller per table), some pullers remain open and keep working as usual.

This issue should not exist after v7.5, since the kv client was refactored and the worker pool is no longer used.

[2024/10/02 02:04:51.764 +00:00] [WARN] [pool_impl.go:444] ["synchronize is taking too long, report a bug"] [duration=10.000434501s]
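
To illustrate the mechanism described above, here is a rough sketch only, with hypothetical names and arbitrary timings (the pre-v7.5 kv client is more involved than this): if every puller's close must synchronously drain its handler out of a shared worker pool, and the source manager closes pullers one at a time, total shutdown time grows roughly as the number of tables times the per-handler drain time.

```go
package main

import (
	"fmt"
	"time"
)

// workerPool is a hypothetical stand-in for the old worker pool: unregistering
// a handler must wait ("synchronize") until its in-flight events are drained.
type workerPool struct{ drain time.Duration }

func (p *workerPool) unregister(handlerID int) {
	time.Sleep(p.drain) // models the slow synchronize step
}

// puller owns one table's kv client; closing it unregisters its handler first.
type puller struct {
	table int
	pool  *workerPool
}

func (pl *puller) close() {
	pl.pool.unregister(pl.table) // blocks until the pool has drained this handler
}

func main() {
	pool := &workerPool{drain: 10 * time.Millisecond}

	// One puller per table; with tens of thousands of tables the serial
	// close below takes minutes even if each unregister is only ~10ms.
	pullers := make([]*puller, 0, 1000)
	for t := 0; t < 1000; t++ {
		pullers = append(pullers, &puller{table: t, pool: pool})
	}

	start := time.Now()
	for _, pl := range pullers {
		pl.close() // pullers are closed one by one
	}
	fmt.Printf("closed %d pullers in %s\n", len(pullers), time.Since(start))
}
```

With ~40k tables per changefeed, as in step 1 of the report, even a small per-handler drain time adds up to the 20-minute-plus close times and multi-hour blocking seen in the logs above.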