pingcap / tidb

TiDB is an open-source, cloud-native, distributed, MySQL-compatible database for elastic scale and real-time analytics.

stale lock during tidb rolling restart results in resolved ts lag and cdc lag increase #52108

Open fubinzh opened 5 months ago

fubinzh commented 5 months ago

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

  1. TiDB cluster with Titan enabled, 30 TiKV nodes and 10 TiDB nodes; cluster size ~70 TB, throughput ~40 MB/s; 3 CDC changefeeds running to sync 4000 tables.
  2. Rolling restart TiDB

2. What did you expect to see? (Required)

CDC lag should not exceed 10s.

3. What did you see instead (Required)

TiKV resolved ts lag increases, and CDC lag increases as a result.

[Screenshot: TiKV resolved-ts lag and CDC lag increase during the TiDB rolling restart]

The CDC log indicates that CDC tries to resolve the locks:

[2024/03/26 07:56:35.689 +00:00] [INFO] [lock_resolver.go:56] ["resolve lock starts"] [regionID=751118934] [maxVersion=448643672832999424] [namespace=default] [changefeed=jxxu-all]
[2024/03/26 08:02:25.688 +00:00] [INFO] [lock_resolver.go:56] ["resolve lock starts"] [regionID=317512213] [maxVersion=448643763796967424] [namespace=default] [changefeed=jxxu-all]

4. What is your TiDB version? (Required)

Release Version: v8.0.0-alpha-629-g75c8347
Edition: Community
Git Commit Hash: 75c834728e82f9efe7cc12263ff7d33cd810acbf
Git Branch: HEAD
UTC Build Time: 2024-03-26 06:01:54
GoVersion: go1.21.4
Race Enabled: false
Check Table Before Drop: false
Store: unistore
fubinzh commented 4 months ago

/remove-type question /type bug

fubinzh commented 4 months ago

/severity major

fubinzh commented 4 months ago

/assign @bb7133

YangKeao commented 3 weeks ago

/assign @YangKeao

After doing some experiments and investigating the logic of TiDB's graceful shutdown, here is a brief introduction to the procedure and how to avoid this issue. The shutdown process has three stages:

  1. TiDB announces itself as unhealthy, so the load balancer stops creating new connections to it. This stage lasts for graceful-wait-before-shutdown seconds (0 by default). During this stage, TiDB itself can still accept new connections and execute new queries / transactions.
  2. Stop the listener and enter shutdown mode. In this stage, TiDB cannot accept new connections and cannot begin new transactions or execute new commands (including PREPARE/EXECUTE/QUERY/..., all commands). The client gets an error if it runs new commands on an existing connection. This stage lasts 15s and is not configurable.
  3. Kill all remaining queries, then stop the internal services, and finally stop the process.

If we don't want to leak locks, we have two choices:

  1. Avoid killing queries / transactions; let the running queries / transactions stop by themselves.
  2. Avoid leaking locks even when the queries / transactions are killed.

The second goal is relatively hard to reach. However, the first one should be able to handle most cases.

  1. For short-connection scenarios, we can increase graceful-wait-before-shutdown (see the config sketch after this list) so that all connections end their lifetime during stage 1. The TiDB instance then ends up with no connections, so no lock is leaked.
  2. For long-connection scenarios with small transactions, all transactions / running queries finish normally (ending with COMMIT or ROLLBACK), and subsequent commands return an error to the client. TiDB also ends up with no connections (at least no running transactions), so no lock leaks during the reboot. However, there is a small issue that auto-commit queries are not waited for: https://github.com/pingcap/tidb/issues/55464. I'm fixing it right now :beers: .
  3. For long connections (> graceful-wait-before-shutdown) with big transactions (> 15s), TiDB cannot handle this well yet :facepalm:. It's still under investigation and I'm not sure how difficult it will be to fix.
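
For option 1, the knob is the graceful-wait-before-shutdown item in the TiDB configuration file. A minimal sketch, assuming a standard tidb.toml (the value 30 is only an illustration, not a recommendation):

# tidb.toml
# Keep reporting unhealthy to the load balancer for 30 seconds before the
# listener stops, so existing connections can drain (default: 0).
graceful-wait-before-shutdown = 30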

And finally, no matter which option you choose, make sure to also increase terminationGracePeriodSeconds if you are using Kubernetes, or TiDB will not have enough time to shut down gracefully.
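
For reference, with a plain Kubernetes pod spec that field looks like the sketch below (values are illustrative only; the grace period needs to cover graceful-wait-before-shutdown plus the 15s shutdown mode with some headroom, and operator-managed deployments typically have their own place for the same setting):

# excerpt of a TiDB pod spec (illustrative values only)
spec:
  terminationGracePeriodSeconds: 60
  containers:
  - name: tidb
    image: pingcap/tidb:v8.0.0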

YangKeao commented 2 weeks ago

After fixing the auto-commit issue, I found that the background async-commit goroutines are not waited for either. Therefore, if most of the workload is async-commit transactions, TiDB may exit too early, killing the background async-commit goroutines and leaking their locks. (Committing secondary keys also happens in the background, so transactions that touch more than one key can cause similar issues.)

I've submitted PRs to wait for the goroutines that perform async commit or commit secondary keys: ref https://github.com/tikv/client-go/pull/1432 and https://github.com/pingcap/tidb/pull/55608.
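
For illustration, the general shape of such a fix is the standard Go pattern of registering background work in a sync.WaitGroup and draining it on close. This is only a self-contained sketch; the names below (store, commitSecondaryKeysAsync) are made up for the example and are not the actual client-go API:

package main

import (
    "fmt"
    "sync"
    "time"
)

// store tracks background commit work so Close can wait for it before the
// process exits: don't exit while async-commit / secondary-key commits are
// still in flight, otherwise their locks are left behind.
type store struct {
    wg sync.WaitGroup
}

// commitSecondaryKeysAsync stands in for the background goroutine that
// commits secondary keys after the primary key has been committed.
func (s *store) commitSecondaryKeysAsync(txnID int) {
    s.wg.Add(1)
    go func() {
        defer s.wg.Done()
        time.Sleep(50 * time.Millisecond) // pretend to talk to TiKV
        fmt.Printf("txn %d: secondary keys committed\n", txnID)
    }()
}

// Close blocks until every background commit goroutine has finished, so no
// lock is leaked when the process shuts down.
func (s *store) Close() {
    s.wg.Wait()
}

func main() {
    s := &store{}
    for i := 0; i < 3; i++ {
        s.commitSecondaryKeysAsync(i)
    }
    s.Close() // graceful shutdown waits here instead of leaking locks
    fmt.Println("all background commits drained, safe to exit")
}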