
There is a 7% performance regression in the Taobench benchmark #51852

Open · Yui-Song opened this issue 7 months ago

Yui-Song commented 7 months ago

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

  1. Deploy a TiDB cluster with 3 TiDB nodes and 3 TiKV nodes.
  2. Run workload_a of the Taobench benchmark.

2. What did you expect to see? (Required)

No performance regression

3. What did you see instead (Required)

Update on 2024-07-09: workloads such as select_random_ranges/select_random_points, which involve many coprocessor operations, are also affected by the overhead of the TiDB runtime events mentioned below.

Taobench QPS: baseline v7.5.0, QPS = 27240

4. What is your TiDB version? (Required)

Yui-Song commented 7 months ago

/type performance
/type regression
/sig execution
/severity critical

Yui-Song commented 7 months ago

/remove-label may-affects-7.5
/remove-label may-affects-7.1
/remove-label may-affects-6.5
/remove-label may-affects-6.1
/remove-label may-affects-5.4

XuHuaiyu commented 7 months ago

Comparing the CPU profiles before and after, there is a significant difference in the SendReqCtx section, especially a noticeable increase in the proportion of CPU overhead attributed to sync.Map.Load().
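
For reference, here is a minimal, hypothetical Go benchmark (not TiDB's actual metrics code) contrasting a sync.Map.Load performed on every request with a value that is resolved once and reused. This is the shape of overhead that shows up as sync.Map.Load in a hot path such as updateTiKVSendReqHistogram:

```go
package metricbench

import (
	"sync"
	"testing"
)

// counters mimics a label-keyed metrics registry, similar in spirit to
// looking up a histogram for every request sent to TiKV.
var counters sync.Map // map[string]*int64

func init() {
	v := int64(0)
	counters.Store("tikv-send-req", &v)
}

// BenchmarkLoadPerRequest performs the sync.Map.Load on every
// iteration, the pattern that shows up in the CPU profile.
func BenchmarkLoadPerRequest(b *testing.B) {
	for i := 0; i < b.N; i++ {
		v, _ := counters.Load("tikv-send-req")
		p := v.(*int64)
		*p++
	}
}

// BenchmarkLoadOnce resolves the counter once and reuses it, taking the
// map lookup out of the hot path.
func BenchmarkLoadOnce(b *testing.B) {
	v, _ := counters.Load("tikv-send-req")
	p := v.(*int64)
	for i := 0; i < b.N; i++ {
		*p++
	}
}
```

Running `go test -bench=.` on such a sketch shows how much of the per-request cost is the map lookup itself, independent of the histogram update.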

XuHuaiyu commented 7 months ago

This seems to be a client-go-related problem; I'll change the sig label from sig/execution to sig/transaction.

zyguan commented 7 months ago

The overhead seems to be caused by some kind of runtime event (e.g. GC assist work). It can happen anywhere, not just in SendReqCtx -> updateTiKVSendReqHistogram -> runtime.newstack, and we cannot reproduce the issue (the high overhead of updateTiKVSendReqHistogram). Thus the root cause might not be related to SendReqCtx directly; further investigation is required.
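
One way to sanity-check the GC-assist theory is to sample Go's runtime/metrics around a benchmark window. The sketch below is illustrative only (not part of TiDB) and assumes Go 1.20+ for these metric names; it compares GC mark-assist CPU time against total CPU time:

```go
package main

import (
	"fmt"
	"runtime/metrics"
	"time"
)

// readCPU returns the cumulative GC mark-assist CPU seconds and total
// CPU seconds as estimated by the Go runtime.
func readCPU() (assist, total float64) {
	samples := []metrics.Sample{
		{Name: "/cpu/classes/gc/mark/assist:cpu-seconds"},
		{Name: "/cpu/classes/total:cpu-seconds"},
	}
	metrics.Read(samples)
	return samples[0].Value.Float64(), samples[1].Value.Float64()
}

func main() {
	a0, t0 := readCPU()
	time.Sleep(30 * time.Second) // drive the workload during this window
	a1, t1 := readCPU()
	fmt.Printf("GC mark assist: %.2fs of %.2fs total CPU (%.1f%%)\n",
		a1-a0, t1-t0, 100*(a1-a0)/(t1-t0))
}
```

If the assist share rises noticeably between the good and bad builds, the overhead is coming from the runtime rather than from SendReqCtx itself.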

Yui-Song commented 7 months ago

https://github.com/pingcap/tidb/pull/50650 caused a regression in compile duration, which resulted in a 1.5% QPS regression in Taobench. [benchmark screenshot]

Yui-Song commented 7 months ago

https://github.com/pingcap/tidb/pull/49900 caused a 2% QPS regression in Taobench. [benchmark screenshot]

tiancaiamao commented 6 months ago

> The overhead seems to be caused by some kind of runtime event (e.g. GC assist work). It can happen anywhere, not just in SendReqCtx -> updateTiKVSendReqHistogram -> runtime.newstack, and we cannot reproduce the issue (the high overhead of updateTiKVSendReqHistogram). Thus the root cause might not be related to SendReqCtx directly; further investigation is required.

I once wrote a blog post about how to handle this kind of issue, but the domain service provider is down, so zenlife.tk is not available any more. Here are some links that may be related:
https://github.com/tiancaiamao/gp
http://107.173.155.134:8080/goroutine-pool.md
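
For context, the idea behind a goroutine pool like the one linked above is to keep a small set of long-lived workers whose stacks have already grown, so hot paths do not pay goroutine creation and stack growth (runtime.newstack) per task. A minimal sketch of that idea (not the actual tiancaiamao/gp API):

```go
package gpsketch

// Pool reuses a bounded set of long-lived goroutines so that hot paths
// do not pay goroutine creation and stack growth for every task.
type Pool struct {
	tasks chan func()
}

// New starts n worker goroutines that execute submitted tasks.
func New(n int) *Pool {
	p := &Pool{tasks: make(chan func())}
	for i := 0; i < n; i++ {
		go func() {
			for task := range p.tasks {
				task()
			}
		}()
	}
	return p
}

// Go runs f on an idle pooled worker, falling back to a fresh goroutine
// when all workers are busy so callers never block.
func (p *Pool) Go(f func()) {
	select {
	case p.tasks <- f:
	default:
		go f()
	}
}

// Close stops the workers once the queued tasks have been picked up.
func (p *Pool) Close() { close(p.tasks) }
```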

hawkingrei commented 5 months ago

> > The overhead seems to be caused by some kind of runtime event (e.g. GC assist work). It can happen anywhere, not just in SendReqCtx -> updateTiKVSendReqHistogram -> runtime.newstack, and we cannot reproduce the issue (the high overhead of updateTiKVSendReqHistogram). Thus the root cause might not be related to SendReqCtx directly; further investigation is required.
>
> I once wrote a blog post about how to handle this kind of issue, but the domain service provider is down, so zenlife.tk is not available any more. Here are some links that may be related: https://github.com/tiancaiamao/gp http://107.173.155.134:8080/goroutine-pool.md

@you06 is working on improving this with a global goroutine pool:

https://github.com/pingcap/tidb/pull/53299
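
As a rough illustration of what goroutine reuse can save (a hypothetical micro-benchmark, unrelated to the PR's actual implementation), the following compares spawning a fresh goroutine per task, which pays goroutine creation and initial stack setup every time, with dispatching the same deep-stack task to one long-lived worker:

```go
package poolbench

import "testing"

// deepCall recurses with a sizable stack frame so the goroutine's stack
// has to grow beyond its initial allocation.
func deepCall(depth int) int {
	var pad [256]byte
	pad[0] = byte(depth)
	if depth == 0 {
		return int(pad[0])
	}
	return deepCall(depth-1) + int(pad[0])
}

// BenchmarkSpawnPerTask starts a new goroutine for every task, paying
// goroutine creation and stack setup each time.
func BenchmarkSpawnPerTask(b *testing.B) {
	done := make(chan struct{})
	for i := 0; i < b.N; i++ {
		go func() {
			deepCall(64)
			done <- struct{}{}
		}()
		<-done
	}
}

// BenchmarkReusedWorker sends every task to a single long-lived
// goroutine; its stack stays grown after the first deep call.
func BenchmarkReusedWorker(b *testing.B) {
	tasks := make(chan struct{})
	done := make(chan struct{})
	go func() {
		for range tasks {
			deepCall(64)
			done <- struct{}{}
		}
	}()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		tasks <- struct{}{}
		<-done
	}
	close(tasks)
}
```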

Yui-Song commented 4 months ago

/unassign @bb7133
/assign @you06