Open Tema opened 1 month ago
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: Once this PR has been reviewed and has the lgtm label, please assign jackysp for approval. For more information see the Code Review Process.
The full list of commands accepted by this bot can be found here.
@Tema Thanks helping with the improvements.
It's ok to introduce a rate-limiting mechanism for the kv-client. Finding the optimal balance between error handling, retry success, and avoiding overloading PD could be challenging, or selecting a suitable default value that works for most scenarios.
Another approach is similar to TiKV health control feedback (as discussed in tikv/tikv#16297), where some processing capacity information is carried in PD responses and fed back to the KV client. Based on this feedback, the kv-client can then decide its concurrency control and rate-limiting strategy accordingly.
Thanks cfzjywxk for the comment. I think it is not always possible for PD to reply to provide this information to TiDB in case it is completely overloaded. https://github.com/tikv/pd/issues/8678 proposes a more sophisticated solution to cover that case as well. Maybe it is worth to see if it could be incorporated into https://github.com/tikv/tikv/issues/16297 which you mentioned. Anyways, these referenced solutions looks too heavy and take some time to productionize. While this PR is more like a simple stop bleeding fix to prevent this problem asap.
Current backoff policy does not help much to limit TiDB reties to retrieve Region from PD when there are issues with Region Metadata in PD:
This PR adds an ability to configure global retry limiter to Backoff function per each config. It also creates a new Backoff config dedicated to PD Region Metadata calls which will be used in TiDB in separate PR:
The config above allows a single retry per each 10 previous successful call (0.1), but limit overall retry budget to 10. It always start with full budget of retries.