pingcap / tiflow

This repo maintains DM (a data migration platform) and TiCDC (change data capture for TiDB)
Apache License 2.0

TiKV initial scan task not cancelled/terminated when changefeed paused or failed #11638

Open fubinzh opened 2 weeks ago

fubinzh commented 2 weeks ago

What did you do?

  1. 07:51 - 07:55 Create 4 CDC clusters (each with 32C CDC nodes); for each CDC cluster create one changefeed, each changefeed replicating ~40k tables (~160k tables overall)
  2. 07:58 Pause changefeeds and update config
  3. 08:12 Resume all changefeeds
  4. 09:38 Pause all changefeeds
  5. 10:33 Delete changefeeds (from the TiDB Cloud UI)
  6. 10:07 - 10:33 Manually kill and restart all TiKV nodes
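
For reference, the pause/resume operations above map to TiCDC open API calls; the paths below are taken from the middleware logs later in this thread, while the TiCDC address and changefeed ID are placeholders (the reporter drove these actions from the TiDB Cloud UI). A minimal Go sketch:

```go
package main

import (
	"fmt"
	"net/http"
)

func main() {
	// Placeholder TiCDC server address and changefeed ID; replace with real values.
	base := "http://127.0.0.1:8300"
	cf := "blackhole-test"

	// Pause the changefeed (v1 API path, as seen in the middleware logs in this thread).
	resp, err := http.Post(base+"/api/v1/changefeeds/"+cf+"/pause", "application/json", nil)
	if err != nil {
		panic(err)
	}
	fmt.Println("pause status:", resp.Status)
	resp.Body.Close()

	// Resume the changefeed (v2 API path, as seen in the middleware logs in this thread).
	resp, err = http.Post(base+"/api/v2/changefeeds/"+cf+"/resume", "application/json", nil)
	if err != nil {
		panic(err)
	}
	fmt.Println("resume status:", resp.Status)
	resp.Body.Close()
}
```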

What did you expect to see?

What did you see instead?

After step 1, initialization of one changefeed had finished (all tables were added to the changefeed). After step 3, the table count of that initialized changefeed stayed at 0 for 1.4h+ before it was paused. After step 4, the number of pending initial scan tasks decreased, but very slowly. After step 6, the pending tasks disappeared.


Versions of the cluster

v6.5.9

fubinzh commented 2 weeks ago

Encountered a similar issue when a changefeed failed.

The changefeed failed at 10/03 00:00 during the initial scan, but the initial scan continued (and sorter disk usage actually increased) until I deleted all changefeeds at 10/03 09:00.



fubinzh commented 2 weeks ago

/severity major

3AceShowHand commented 1 week ago

Pausing or removing a changefeed closes the kv client, which should also close the gRPC connections to TiKV-CDC and cancel the pending connections.
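
As an illustration of the expected teardown path, here is a minimal sketch only, not tiflow's actual kv client code, with all names hypothetical: closing the client cancels a shared context, which should in turn abort every per-region stream.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// kvClient is a hypothetical stand-in for the per-changefeed kv client.
type kvClient struct {
	ctx    context.Context
	cancel context.CancelFunc
	wg     sync.WaitGroup
}

func newKVClient() *kvClient {
	ctx, cancel := context.WithCancel(context.Background())
	return &kvClient{ctx: ctx, cancel: cancel}
}

// startRegionStream models one stream to TiKV for a region's initial scan.
func (c *kvClient) startRegionStream(regionID int) {
	c.wg.Add(1)
	go func() {
		defer c.wg.Done()
		for {
			select {
			case <-c.ctx.Done():
				fmt.Printf("region %d: stream cancelled\n", regionID)
				return
			case <-time.After(100 * time.Millisecond):
				// Pretend to receive and process a batch of events.
			}
		}
	}()
}

// Close is what pause/remove is expected to trigger: cancel the shared
// context so every region stream exits, then wait for them to finish.
func (c *kvClient) Close() {
	c.cancel()
	c.wg.Wait()
}

func main() {
	c := newKVClient()
	for r := 1; r <= 3; r++ {
		c.startRegionStream(r)
	}
	time.Sleep(300 * time.Millisecond)
	c.Close() // pausing or removing the changefeed should end up here
	fmt.Println("kv client closed, all streams cancelled")
}
```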

3AceShowHand commented 1 week ago

I am trying to reproduce this scenario in my local test environment; here are some observations:

On the TiCDC side:

On the TiKV side:

[2024/10/12 22:03:07.265 +08:00] [INFO] [middleware.go:48] [/api/v2/changefeeds/blackhole-test/resume] [status=200] [method=POST] [path=/api/v2/changefeeds/blackhole-test/resume] [query=] [ip=10.2.7.26] [user-agent=Go-http-client/1.1] [client-version=v6.5.9] [] [duration=146.190585ms]

[2024/10/12 22:04:38.215 +08:00] [INFO] [middleware.go:48] [/api/v1/changefeeds/blackhole-test/pause] [status=202] [method=POST] [path=/api/v1/changefeeds/blackhole-test/pause] [query=] [ip=10.2.7.26] [user-agent=Go-http-client/1.1] [client-version=v6.5.9] [] [duration=78.699473ms]

[2024/10/12 22:04:41.591 +08:00] [INFO] [client.go:613] ["region failed"]
3AceShowHand commented 1 week ago
[2024/10/02 01:38:47.996 +00:00] [INFO] [feed_state_manager.go:243] ["handle admin job"] [namespace=default] [changefeed=changefeed-1841264655229501440] [job="{\"CfID\":{\"Namespace\":\"default\",\"ID\":\"changefeed-1841264655229501440\"},\"Type\":1,\"Error\":null,\"OverwriteCheckpointTs\":0}"]

[2024/10/02 01:38:47.996 +00:00] [INFO] [ddl_puller.go:692] ["close the ddl puller"] [namespace=default] [changefeed=changefeed-1841264655229501440_ddl_puller]

[2024/10/02 01:38:48.091 +00:00] [INFO] [changefeed.go:723] ["changefeed closed"] [namespace=default] [changefeed=changefeed-1841264655229501440]

[2024/10/02 01:38:48.098 +00:00] [INFO] [manager.go:132] ["Closing source manager"] [namespace=default] [changefeed=changefeed-1841264655229501440]

[2024/10/02 01:38:48.098 +00:00] [INFO] [manager.go:140] ["All pullers have been closed"] [namespace=default] [changefeed=changefeed-1841264655229501440] [cost=181ns]

[2024/10/02 01:38:48.098 +00:00] [INFO] [manager.go:147] ["Closed source manager"] [namespace=default] [changefeed=changefeed-1841264655229501440] [cost=44.248µs]

From the logs above, we know that both the changefeed and the processor were closed.

But some logs still indicate that the kv client keeps fetching data: a region reports an error, and the client then restarts the connection to that region.

[2024/10/02 01:38:48.099 +00:00] [INFO] [client.go:613] ["region failed"] [span="[748000fffffffffffffe5f728000000000ff1f11290000000000fa, 748000fffffffffffffe5f728000000000ff1f1a570000000000fa)"] [regionId=1504] [error="[CDC:ErrPendingRegionCancel]pending region cancelled due to stream disconnecting"]
[2024/10/02 01:38:48.146 +00:00] [INFO] [client.go:1462] ["event feed starts to check locked regions"] [namespace=default] [changefeed=changefeed-1841264655229501440] [tableID=67502] 
...
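
These reconnect attempts are the flip side of the cancellation sketch above: a retry loop that keeps re-establishing the region stream until its context is cancelled. A minimal sketch of that pattern (hypothetical names, not tiflow's actual code); if the cancellation never reaches this loop, the region keeps being retried even though the changefeed is already closed.

```go
package main

import (
	"context"
	"errors"
	"log"
	"time"
)

// retryRegion models an event-feed loop that reconnects to a region on error.
// It only stops when ctx is cancelled; if closing the changefeed never cancels
// ctx, the reconnect attempts continue after the changefeed is closed.
func retryRegion(ctx context.Context, regionID uint64, connect func() error) {
	for {
		select {
		case <-ctx.Done():
			log.Printf("region %d: retry loop stopped", regionID)
			return
		default:
		}
		if err := connect(); err != nil {
			log.Printf("region %d failed, reconnecting: %v", regionID, err)
			time.Sleep(200 * time.Millisecond) // hypothetical backoff
			continue
		}
		return
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	go retryRegion(ctx, 1504, func() error {
		return errors.New("pending region cancelled due to stream disconnecting")
	})
	time.Sleep(time.Second)
	cancel() // only this stops the retries
	time.Sleep(300 * time.Millisecond)
}
```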

The logs below indicate that the processor was blocked for a long time; this may be the root cause of the issue.


[2024/10/01 23:59:09.085 +00:00] [WARN] [client.go:271] ["etcd client outCh blocking too long, the etcdWorker may be stuck"] [duration=10.00036425s] [role=processor]

[2024/10/02 01:38:48.093 +00:00] [WARN] [client.go:271] ["etcd client outCh blocking too long, the etcdWorker may be stuck"] [duration=1h39m44.999948235s] [role=processor]
[2024/10/02 01:38:48.194 +00:00] [INFO] [puller.go:293] ["table puller has been stucked"] [namespace=default] [changefeed=changefeed-1841264655229501440] [tableID=82046] [tableName=`db_218`.`dh_pay_customer_0`] [resolvedTs=452818737561600000] [slowestRangeTs=452818737561600000] [range="[7480000000000140ff7e5f72ffffffffffffffffff0000000000fb, 7480000000000140ff7e5f730000000000fa)"]

[2024/10/02 00:00:58.937 +00:00] [INFO] [client.go:1485] ["event feed finds a uninitialized slow region"] [namespace=default] [changefeed=changefeed-1841264655229501440] [tableID=22030] [tableName=`db_134`.`dh_user_withdraw_order_0`] [slowRegion="{\"RegionID\":2176379,\"CheckpointTs\":452818737561600000,\"Initialized\":false,\"Created\":\"2024-10-01T23:54:58.945274175Z\"}"]
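
The "outCh blocking too long" warning above is the signature of a watchdog around a blocking channel send: while the send has not completed, a timer periodically logs how long it has been stuck. A minimal sketch of that pattern, with hypothetical names rather than the actual etcd client wrapper (the 10-second tick mirrors the ~10s duration in the first warning):

```go
package main

import (
	"log"
	"time"
)

// sendWithWatchdog tries to deliver ev on outCh and warns periodically
// while the receiver is not draining the channel.
func sendWithWatchdog(outCh chan<- string, ev string) {
	start := time.Now()
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case outCh <- ev:
			return
		case <-ticker.C:
			log.Printf("outCh blocking too long, the consumer may be stuck, duration=%s",
				time.Since(start))
		}
	}
}

func main() {
	outCh := make(chan string) // unbuffered: the send blocks until someone receives
	go func() {
		time.Sleep(25 * time.Second) // simulate a stuck consumer
		<-outCh
	}()
	sendWithWatchdog(outCh, "etcd event")
	log.Println("event delivered")
}
```

A duration of 1h39m in the second warning therefore means the downstream consumer (here, the processor) did not drain the channel for the whole time the changefeed was supposedly closing.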
3AceShowHand commented 1 week ago
[2024/10/01 23:58:50.650 +00:00] [INFO] [processor.go:1242] ["Processor try to close source manager"] [namespace=default] [changefeed=changefeed-1841264655229501440]
[2024/10/01 23:58:50.650 +00:00] [INFO] [manager.go:132] ["Closing source manager"] [namespace=default] [changefeed=changefeed-1841264655229501440]
...
[2024/10/02 00:20:13.166 +00:00] [INFO] [manager.go:140] ["All pullers have been closed"] [namespace=default] [changefeed=changefeed-1841264655229501440] [cost=21m22.516489165s]

One log indicates that closing the pullers took more than 21 minutes. Another processor does not have such a log and was blocked for more than 2.5 hours.

[2024/10/02 02:32:39.092 +00:00] [WARN] [client.go:271] ["etcd client outCh blocking too long, the etcdWorker may be stuck"] [duration=2h33m35.999487376s] [role=processor]
3AceShowHand commented 1 week ago

The root cause of the issue is that the kv client cannot unregister its event handler from the worker pool immediately.

When a puller is closed, its kv client is closed as well, and the kv client has to unregister its event handler from the worker pool first. This step takes too long, so the pullers cannot be closed promptly. With a large number of tables (one puller per table), some pullers remain open and keep working as usual.

This issue should not exist after v7.5, since the kv client was refactored and the worker pool is no longer used.

[2024/10/02 02:04:51.764 +00:00] [WARN] [pool_impl.go:444] ["synchronize is taking too long, report a bug"] [duration=10.000434501s]
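
To illustrate the mechanism described above, here is a rough sketch only, with hypothetical names and arbitrary timings (the pre-v7.5 kv client is more involved than this): if every puller's close must synchronously drain its handler out of a shared worker pool, and the source manager closes pullers one at a time, total shutdown time grows roughly as the number of tables times the per-handler drain time.

```go
package main

import (
	"fmt"
	"time"
)

// workerPool is a hypothetical stand-in for the old worker pool: unregistering
// a handler must wait ("synchronize") until its in-flight events are drained.
type workerPool struct{ drain time.Duration }

func (p *workerPool) unregister(handlerID int) {
	time.Sleep(p.drain) // models the slow synchronize step
}

// puller owns one table's kv client; closing it unregisters its handler first.
type puller struct {
	table int
	pool  *workerPool
}

func (pl *puller) close() {
	pl.pool.unregister(pl.table) // blocks until the pool has drained this handler
}

func main() {
	pool := &workerPool{drain: 10 * time.Millisecond}

	// One puller per table; with tens of thousands of tables the serial
	// close below takes minutes even if each unregister is only ~10ms.
	pullers := make([]*puller, 0, 1000)
	for t := 0; t < 1000; t++ {
		pullers = append(pullers, &puller{table: t, pool: pool})
	}

	start := time.Now()
	for _, pl := range pullers {
		pl.close() // pullers are closed one by one
	}
	fmt.Printf("closed %d pullers in %s\n", len(pullers), time.Since(start))
}
```

With ~40k tables per changefeed, as in step 1 of the report, even a small per-handler drain time adds up to the 20-minute-plus close times and multi-hour blocking seen in the logs above.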