fubinzh opened 2 weeks ago
Encountered a similar issue when a changefeed failed.
The changefeed failed at 10/03 00:00 during the initial scan, but the initial scan continued (and sorted data on disk actually kept growing) until I deleted all changefeeds at 10/03 09:00.
/severity major
Pausing and removing the changefeed closes the kv client; this should also close the gRPC connections to TiKV-CDC and cancel the pending connections.
I am trying to reproduce this scenario in my local test environment. Here are some observations:
On the TiCDC side: there are many "region failed" errors, the last one at 22:04:41.591. The pause request returned at 22:04:38.215, which means it takes around 3 seconds for the KV client to fully close. There are 14954 such errors handled in those 3 seconds, around 5000/s.
On the TiKV side: a lot of "cdc register region" logs are found, i.e. a lot of regions were registered.
[2024/10/12 22:03:07.265 +08:00] [INFO] [middleware.go:48] [/api/v2/changefeeds/blackhole-test/resume] [status=200] [method=POST] [path=/api/v2/changefeeds/blackhole-test/resume] [query=] [ip=10.2.7.26] [user-agent=Go-http-client/1.1] [client-version=v6.5.9] [] [duration=146.190585ms]
[2024/10/12 22:04:38.215 +08:00] [INFO] [middleware.go:48] [/api/v1/changefeeds/blackhole-test/pause] [status=202] [method=POST] [path=/api/v1/changefeeds/blackhole-test/pause] [query=] [ip=10.2.7.26] [user-agent=Go-http-client/1.1] [client-version=v6.5.9] [] [duration=78.699473ms]
[2024/10/12 22:04:41.591 +08:00] [INFO] [client.go:613] ["region failed"]
[2024/10/02 01:38:47.996 +00:00] [INFO] [feed_state_manager.go:243] ["handle admin job"] [namespace=default] [changefeed=changefeed-1841264655229501440] [job="{\"CfID\":{\"Namespace\":\"default\",\"ID\":\"changefeed-1841264655229501440\"},\"Type\":1,\"Error\":null,\"OverwriteCheckpointTs\":0}"]
[2024/10/02 01:38:47.996 +00:00] [INFO] [ddl_puller.go:692] ["close the ddl puller"] [namespace=default] [changefeed=changefeed-1841264655229501440_ddl_puller]
[2024/10/02 01:38:48.091 +00:00] [INFO] [changefeed.go:723] ["changefeed closed"] [namespace=default] [changefeed=changefeed-1841264655229501440]
[2024/10/02 01:38:48.098 +00:00] [INFO] [manager.go:132] ["Closing source manager"] [namespace=default] [changefeed=changefeed-1841264655229501440]
[2024/10/02 01:38:48.098 +00:00] [INFO] [manager.go:140] ["All pullers have been closed"] [namespace=default] [changefeed=changefeed-1841264655229501440] [cost=181ns]
[2024/10/02 01:38:48.098 +00:00] [INFO] [manager.go:147] ["Closed source manager"] [namespace=default] [changefeed=changefeed-1841264655229501440] [cost=44.248µs]
From the logs above, we know that both the changefeed and the processor were closed.
But some logs still indicate that the kv client keeps fetching data: a region reports an error, and then the client reconnects to that region.
[2024/10/02 01:38:48.099 +00:00] [INFO] [client.go:613] ["region failed"] [span="[748000fffffffffffffe5f728000000000ff1f11290000000000fa, 748000fffffffffffffe5f728000000000ff1f1a570000000000fa)"] [regionId=1504] [error="[CDC:ErrPendingRegionCancel]pending region cancelled due to stream disconnecting"]
[2024/10/02 01:38:48.146 +00:00] [INFO] [client.go:1462] ["event feed starts to check locked regions"] [namespace=default] [changefeed=changefeed-1841264655229501440] [tableID=67502]
...
These logs indicate that the processor was blocked for a long time, which may be the root cause of the issue.
[2024/10/01 23:59:09.085 +00:00] [WARN] [client.go:271] ["etcd client outCh blocking too long, the etcdWorker may be stuck"] [duration=10.00036425s] [role=processor]
[2024/10/02 01:38:48.093 +00:00] [WARN] [client.go:271] ["etcd client outCh blocking too long, the etcdWorker may be stuck"] [duration=1h39m44.999948235s] [role=processor]
[2024/10/02 01:38:48.194 +00:00] [INFO] [puller.go:293] ["table puller has been stucked"] [namespace=default] [changefeed=changefeed-1841264655229501440] [tableID=82046] [tableName=`db_218`.`dh_pay_customer_0`] [resolvedTs=452818737561600000] [slowestRangeTs=452818737561600000] [range="[7480000000000140ff7e5f72ffffffffffffffffff0000000000fb, 7480000000000140ff7e5f730000000000fa)"]
[2024/10/02 00:00:58.937 +00:00] [INFO] [client.go:1485] ["event feed finds a uninitialized slow region"] [namespace=default] [changefeed=changefeed-1841264655229501440] [tableID=22030] [tableName=`db_134`.`dh_user_withdraw_order_0`] [slowRegion="{\"RegionID\":2176379,\"CheckpointTs\":452818737561600000,\"Initialized\":false,\"Created\":\"2024-10-01T23:54:58.945274175Z\"}"]
[2024/10/01 23:58:50.650 +00:00] [INFO] [processor.go:1242] ["Processor try to close source manager"] [namespace=default] [changefeed=changefeed-1841264655229501440]
[2024/10/01 23:58:50.650 +00:00] [INFO] [manager.go:132] ["Closing source manager"] [namespace=default] [changefeed=changefeed-1841264655229501440]
...
[2024/10/02 00:20:13.166 +00:00] [INFO] [manager.go:140] ["All pullers have been closed"] [namespace=default] [changefeed=changefeed-1841264655229501440] [cost=21m22.516489165s]
One log indicates that closing the pullers took more than 21 minutes. Another processor does not have such a log at all, and was blocked for more than 2.5 hours.
[2024/10/02 02:32:39.092 +00:00] [WARN] [client.go:271] ["etcd client outCh blocking too long, the etcdWorker may be stuck"] [duration=2h33m35.999487376s] [role=processor]
The root cause of the issue is that the kv client cannot unregister its event handler from the worker pool immediately.
When a puller is closed, its kv client is closed too, and the kv client has to unregister its event handler from the worker pool first. This takes too long, so each puller cannot be closed promptly. When there are many tables, each table has its own puller, so some pullers are not yet closed and keep working normally.
This issue should not exist after v7.5, since the kv client was refactored and the workerpool is no longer used.
[2024/10/02 02:04:51.764 +00:00] [WARN] [pool_impl.go:444] ["synchronize is taking too long, report a bug"] [duration=10.000434501s]
What did you do?
What did you expect to see?
What did you see instead?
After step 1, the initialization of one changefeed finished (all tables were added to the changefeed).
After step 3, for the initialized changefeed, the table count stayed at 0 for more than 1.4 hours before it was paused.
After step 4, the number of pending initial scan tasks decreased, but very slowly.
After step 6, the pending tasks disappeared.
Versions of the cluster
v6.5.9