Open fubinzh opened 5 months ago
/remove-type question /type bug
/severity major
/assign @bb7133
/assign @YangKeao
After doing some experiments and investigating the logic of TiDB's graceful shutdown, here is a brief introduction to the routine and how to avoid this issue. The shutdown process has three stages:

1. Wait for `graceful-wait-before-shutdown` seconds (0 by default). During this stage, TiDB can still accept new connections and execute new queries and transactions.
2. `shutdown-mode`. In this stage, TiDB cannot accept new connections, and cannot begin new transactions or execute new commands (all commands, including PREPARE/EXECUTE/QUERY/...). The client gets an error if it runs a new command on an existing connection. This stage lasts 15s and is not configurable.

If we don't want to leak locks, we have two choices:
1. Keep the lifetime of every client connection shorter than `graceful-wait-before-shutdown`. Therefore, all connections will end their lifetime during stage 1, this TiDB will finally have no connections, and no lock will be leaked.
2. Keep every transaction shorter than the 15s of stage 2, so that each running transaction ends (`COMMIT` or `ROLLBACK`) and the client gets an error for anything it runs afterwards. TiDB will then also have no connections (at least, no running transactions), so no lock will leak during the reboot.

The second goal is relatively hard to reach. However, the first one should be able to handle most cases.
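For the first choice, a client written in Go could derive its pool's connection lifetime from the server's `graceful-wait-before-shutdown` value. The sketch below is only an illustration: `connMaxLifetimeSeconds`, `configurePool`, and the margin parameter are hypothetical names, and only `(*sql.DB).SetConnMaxLifetime` is a real `database/sql` call.

```go
package main

import (
	"database/sql"
	"errors"
	"fmt"
	"time"
)

// connMaxLifetimeSeconds returns a client-side connection lifetime that is
// strictly shorter than TiDB's graceful-wait-before-shutdown, leaving
// marginSeconds of slack. Both parameters are hypothetical knobs for this sketch.
func connMaxLifetimeSeconds(gracefulWaitSeconds, marginSeconds int) (int, error) {
	lifetime := gracefulWaitSeconds - marginSeconds
	if marginSeconds <= 0 || lifetime <= 0 {
		return 0, errors.New("graceful-wait-before-shutdown must exceed a positive margin")
	}
	return lifetime, nil
}

// configurePool applies the computed lifetime to a database/sql pool so that
// pooled connections are retired before stage 1 of the shutdown ends.
func configurePool(db *sql.DB, gracefulWaitSeconds, marginSeconds int) error {
	lifetime, err := connMaxLifetimeSeconds(gracefulWaitSeconds, marginSeconds)
	if err != nil {
		return err
	}
	db.SetConnMaxLifetime(time.Duration(lifetime) * time.Second)
	return nil
}

func main() {
	// Assumed server setting: graceful-wait-before-shutdown = 60.
	lifetime, err := connMaxLifetimeSeconds(60, 10)
	fmt.Println(lifetime, err)
}
```

With the default `graceful-wait-before-shutdown` of 0 the helper returns an error, which mirrors the point that the default leaves no room for connections to expire on their own. Note that `database/sql` retires an expired connection only when it is back in the pool, so a connection held by a long transaction can still outlive the limit.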
However, there is a tiny issue: auto-commit queries are not waited for (https://github.com/pingcap/tidb/issues/55464). I'm fixing it right now :beers:.

For long connections (> `graceful-wait-before-shutdown`) and big transactions (> 15s), TiDB cannot do well now :facepalm:. It's still under investigation and I'm not sure how difficult it is to fix.

And finally, no matter which option you choose, make sure to also increase `terminationGracePeriodSeconds` if you are using Kubernetes, or TiDB will not have enough time to shut down gracefully.
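As a rough way to size `terminationGracePeriodSeconds`, one can add up the stages described above. This is a sketch under the assumption that the pod needs at least the stage 1 wait plus the fixed 15s of stage 2 plus some slack; the function names and the margin are hypothetical and not taken from TiDB or Kubernetes.

```go
package main

import "fmt"

// shutdownModeSeconds is the fixed length of stage 2 as described in this thread.
const shutdownModeSeconds = 15

// minTerminationGracePeriodSeconds estimates the smallest grace period that
// still covers stage 1 and stage 2, plus marginSeconds of slack for whatever
// happens afterwards. The formula is an assumption of this sketch.
func minTerminationGracePeriodSeconds(gracefulWaitSeconds, marginSeconds int) int {
	return gracefulWaitSeconds + shutdownModeSeconds + marginSeconds
}

// stageAt reports which shutdown stage TiDB would be in elapsedSeconds after
// the shutdown starts: 1 (still serving), 2 (shutdown-mode), or 3 (afterwards).
func stageAt(elapsedSeconds, gracefulWaitSeconds int) int {
	switch {
	case elapsedSeconds < gracefulWaitSeconds:
		return 1
	case elapsedSeconds < gracefulWaitSeconds+shutdownModeSeconds:
		return 2
	default:
		return 3
	}
}

func main() {
	// Assumed setting: graceful-wait-before-shutdown = 30, with 5s of slack.
	fmt.Println(minTerminationGracePeriodSeconds(30, 5))
	fmt.Println(stageAt(40, 30))
}
```

Kubernetes defaults `terminationGracePeriodSeconds` to 30, so with a 30s graceful wait the default would already expire before stage 2 finishes.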
After fixing the auto-commit issue, I found that the background async-commit goroutines are not waited for. Therefore, if most of the workload is async-commit transactions, TiDB may exit too early, kill the background async-commit goroutines, and leak locks. (Also, committing secondary keys happens in the background, so if there are many transactions with more than one key, it'll cause similar issues.)
I've submitted PRs to wait for the goroutines which are used to async commit or commit secondary keys: ref https://github.com/tikv/client-go/pull/1432 and https://github.com/pingcap/tidb/pull/55608
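The general idea of waiting for background commit goroutines before exiting can be sketched with a `sync.WaitGroup`. This is not the code from the PRs above: the `committer` type and its methods are hypothetical, and the sleep only simulates the latency of committing secondary keys.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// committer is a hypothetical stand-in for a component that commits
// secondary keys in background goroutines.
type committer struct {
	wg        sync.WaitGroup
	mu        sync.Mutex
	committed int
}

// commitSecondariesAsync starts a tracked background goroutine that
// "commits" the given number of keys after a simulated delay.
func (c *committer) commitSecondariesAsync(keys int) {
	c.wg.Add(1)
	go func() {
		defer c.wg.Done()
		time.Sleep(10 * time.Millisecond) // simulated RPC latency
		c.mu.Lock()
		c.committed += keys
		c.mu.Unlock()
	}()
}

// committedKeys returns how many keys have been committed so far.
func (c *committer) committedKeys() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.committed
}

// shutdown waits for all tracked goroutines, up to timeoutMillis.
// It returns true if every background commit finished in time.
func (c *committer) shutdown(timeoutMillis int) bool {
	done := make(chan struct{})
	go func() {
		c.wg.Wait()
		close(done)
	}()
	select {
	case <-done:
		return true
	case <-time.After(time.Duration(timeoutMillis) * time.Millisecond):
		return false
	}
}

func main() {
	c := &committer{}
	for i := 0; i < 3; i++ {
		c.commitSecondariesAsync(2)
	}
	fmt.Println(c.shutdown(1000), c.committedKeys())
}
```

Without the wait in `shutdown`, the process could exit while the goroutines are still running, which corresponds to the leaked locks described above.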
Bug Report
Please answer these questions before submitting your issue. Thanks!
1. Minimal reproduce step (Required)
2. What did you expect to see? (Required)
CDC lag should be <10s
3. What did you see instead (Required)
TiKV resolved ts lag increases, and CDC lag increases as a result.
The CDC log indicates that CDC tries to resolve locks.
4. What is your TiDB version? (Required)