Issue: Long SQL Causing TiKV gRPC to Hang

WalterWj commented 2 months ago

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

Background:

The SQL query is select * from t where id in (?), with many values in the IN clause, such as 10,000 entries. The SQL length is several MBs. During execution, TiKV experiences leader drop incidents, and monitoring shows server report failures. There is a noticeable leader drop.

After optimizing the SQL by reducing the contents of the IN clause, the issue no longer occurs. This effectively reduces the request size between TiDB and TiKV.

Problem Description:

This is a known issue. During interactions between TiDB and TiKV, if the request is too large, TiKV's gRPC threads become busy with frequent deserialization of large messages, causing the threads to hang. This affects the sending of other messages, such as heartbeat packets between the leader and followers of a Region.

This issue is caused by business reasons and can be reproduced in older versions, indicating that it exists across versions. The upgrade triggered this phenomenon. Many factors, including data and leader distribution, can affect the requests sent to TiKV, making it difficult to pinpoint specific changes. Restarting might also be a factor. However, the root cause is clear: this is an edge case. Future considerations may include enhancements to address it.

2. What did you expect to see? (Required)

3. What did you see instead (Required)

4. What is your TiDB version? (Required)

seiya-annie commented 2 months ago

/report customer

LykxSassinator commented 1 week ago

As previously noted, this scenario represents a corner case that highlights the importance of users selecting an appropriate request size.

Given that, it is more appropriate to label this issue as an enhancement for tracking the enhancement of tackling this case both on TiKV and TiDB sides.

pingcap / tidb