pingcap / tidb

TiDB is an open-source, cloud-native, distributed, MySQL-Compatible database for elastic scale and real-time analytics. Try AI-powered Chat2Query free at : https://www.pingcap.com/tidb-serverless/
https://pingcap.com
Apache License 2.0
36.94k stars 5.81k forks

TiDB memory keeps increasing and QPS drops by more than 60% after simulating a PD leader network partition #45045

Open Lily2025 opened 1 year ago

Lily2025 commented 1 year ago

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

1. Start a PITR task and a TiCDC changefeed task.
2. Run TPC-C (1000 warehouses).
3. After 10 min, simulate a network partition of the PD leader (PD elects a new leader normally).

2. What did you expect to see? (Required)

1. The memory of TiDB should not keep increasing.
2. QPS should recover within 2 min after the simulated PD leader network partition.

3. What did you see instead (Required)

1. The memory of TiDB keeps increasing. image

[2023/06/28 21:40:21.394 +08:00] [WARN] [servermemorylimit.go:148] ["global memory controller tries to kill the top1 memory consumer"] [conn=83760795803975679] ["sql digest"=bc2d161067078a11d7fc1e202182dd6eb547f1f5a2ecc14998779df9c3131e70] ["sql text"="analyze table tpcc.history"] [tidb_server_memory_limit=27487790640] ["heap inuse"=27493703680] ["sql memory usage"=19233776553]
[2023/06/28 21:40:22.355 +08:00] [WARN] [tracker.go:464] ["global memory controller, NeedKill signal is received successfully"] [conn=83760795803975679]
[2023/06/28 21:40:22.355 +08:00] [WARN] [expensivequery.go:145] ["memory exceeds quota"] [cost_time=938.657750086s] [conn=83760795803975679] [txn_start_ts=0] [mem_max="19338634209 Bytes (18.0 GB)"] [sql="analyze table tpcc.history"]
[2023/06/28 21:40:22.355 +08:00] [ERROR] [analyze_col_v2.go:645] ["analyze worker panicked"] [recover="Your query has been cancelled due to exceeding the allowed memory limit for the tidb-server instance and this query is currently using the most memory. Please try narrowing your query scope or increase the tidb_server_memory_limit and try again.[conn=83760795803975679]"] [stack="github.com/pingcap/tidb/executor.(AnalyzeColumnsExecV2).subBuildWorker.func1\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/analyze_col_v2.go:645\nruntime.gopanic\n\t/usr/local/go/src/runtime/panic.go:884\ngithub.com/pingcap/tidb/util/memory.(PanicOnExceed).Action\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/util/memory/action.go:167\ngithub.com/pingcap/tidb/util/memory.(Tracker).Consume.func2\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/util/memory/tracker.go:455\ngithub.com/pingcap/tidb/util/memory.(Tracker).Consume\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/util/memory/tracker.go:467\ngithub.com/pingcap/tidb/statistics.BuildHistAndTopN.func1\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/statistics/builder.go:246\nruntime.gopanic\n\t/usr/local/go/src/runtime/panic.go:884\ngithub.com/pingcap/tidb/util/memory.(PanicOnExceed).Action\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/util/memory/action.go:167\ngithub.com/pingcap/tidb/util/memory.(Tracker).Consume.func2\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/util/memory/tracker.go:455\ngithub.com/pingcap/tidb/util/memory.(Tracker).Consume\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/util/memory/tracker.go:467\ngithub.com/pingcap/tidb/util/memory.(Tracker).BufferedConsume\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/util/memory/tracker.go:497\ngithub.com/pingcap/tidb/statistics.BuildHistAndTopN.func2\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/statistics/builder.go:257\ngithub.com/pingcap/tidb/statistics.BuildHistAndTopN\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/statistics/builder.go:371\ngithub.com/pingcap/tidb/executor.(AnalyzeColumnsExecV2).subBuildWorker\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/analyze_col_v2.go:779\ngithub.com/pingcap/tidb/executor.(AnalyzeColumnsExecV2).buildSamplingStats.func3\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/analyze_col_v2.go:348\ngithub.com/pingcap/tidb/executor.(notifyErrorWaitGroupWrapper).Run.func1\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/analyze_utils.go:127"]
[2023/06/28 21:40:22.355 +08:00] [ERROR] [analyze.go:311] ["analyze failed"] [error="Your query has been cancelled due to exceeding the allowed memory limit for the tidb-server instance and this query is currently using the most memory. Please try narrowing your query scope or increase the tidb_server_memory_limit and try again.[conn=83760795803975679]%!(EXTRA errors.fundamental=analyze panic due to memory quota exceeds, please try with smaller samplerate(refer to 110000/count))"]
[2023/06/28 21:40:22.359 +08:00] [INFO] [analyze.go:589] ["analyze table tpcc.history has failed"] [partition=] ["job info"="auto analyze table all columns with 256 buckets, 500 topn, 0.3516556587789237 samplerate"] ["start time"=2023/06/28 21:24:53.839 +08:00] ["end time"=2023/06/28 21:40:22.355 +08:00] [cost=15m28.51620183s]
[2023/06/28 21:40:22.359 +08:00] [INFO] [tidb.go:285] ["rollbackTxn called due to ddl/autocommit failure"]
[2023/06/28 21:40:22.359 +08:00] [WARN] [session.go:2284] ["run statement failed"] [schemaVersion=48] [error="Your query has been cancelled due to exceeding the allowed memory limit for the tidb-server instance and this query is currently using the most memory. Please try narrowing your query scope or increase the tidb_server_memory_limit and try again.[conn=83760795803975679]%!(EXTRA errors.fundamental=analyze panic due to memory quota exceeds, please try with smaller samplerate(refer to 110000/count))"] [session="{\n \"currDBName\": \"\",\n \"id\": 83760795803975679,\n \"status\": 2,\n \"strictMode\": true,\n \"user\": null\n}"]
[2023/06/28 21:40:22.359 +08:00] [ERROR] [update.go:1315] ["[stats] auto analyze failed"] [sql="analyze table tpcc.history"] [cost_time=15m38.661931978s] [error="Your query has been cancelled due to exceeding the allowed memory limit for the tidb-server instance and this query is currently using the most memory. Please try narrowing your query scope or increase the tidb_server_memory_limit and try again.[conn=83760795803975679]%!(EXTRA errors.fundamental=analyze panic due to memory quota exceeds, please try with smaller samplerate(refer to 110000/count))"]
[2023/06/28 21:40:22.361 +08:00] [INFO] [update.go:1189] ["[stats] auto analyze triggered"] [sql="analyze table tpcc.history"] [reason="table unanalyzed"]

2. The memory of TiDB keeps increasing and QPS drops by more than 60%, lasting for 20 min. image image
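
For readers unfamiliar with the first WARN entry in the log above: it reports heap in use (27493703680 bytes) just above tidb_server_memory_limit (27487790640 bytes), after which the controller targets the statement using the most memory. The Go sketch below illustrates that selection rule only. The `session` type and `pickVictim` function are hypothetical names invented for this example, the second session is made up, and the real controller in servermemorylimit.go likely applies further conditions not shown here.

```go
package main

import "fmt"

// session is a hypothetical stand-in for a connection tracked by the
// memory controller; it is not a TiDB type.
type session struct {
	connID   uint64
	memBytes int64
}

// pickVictim returns the session using the most memory, but only when
// heapInUse exceeds limit. ok is false if nothing should be killed.
func pickVictim(heapInUse, limit int64, sessions []session) (victim session, ok bool) {
	if heapInUse <= limit || len(sessions) == 0 {
		return session{}, false
	}
	victim = sessions[0]
	for _, s := range sessions[1:] {
		if s.memBytes > victim.memBytes {
			victim = s
		}
	}
	return victim, true
}

func main() {
	sessions := []session{
		// Values copied from the first WARN entry in the report.
		{connID: 83760795803975679, memBytes: 19233776553},
		// Made-up second session, for illustration only.
		{connID: 1, memBytes: 512 << 20},
	}
	victim, ok := pickVictim(27493703680, 27487790640, sessions)
	fmt.Println("kill:", ok, "conn:", victim.connID)
}
```

With the logged numbers, the auto analyze session on tpcc.history is the one selected, which matches the "analyze worker panicked" entry that follows.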

4. What is your TiDB version? (Required)

["Welcome to TiDB."] ["Release Version"=v7.3.0-alpha] [Edition=Community] ["Git Commit Hash"=a7b54adfede165328fab966e288d2d9402943d7c] ["Git Branch"=heads/refs/tags/v7.3.0-alpha] ["UTC Build Time"="2023-06-28 11:14:13"] [GoVersion=go1.20.5] ["Race Enabled"=false] ["Check Table Before Drop"=false]

Lily2025 commented 1 year ago

/type bug /severity critical /assign chrysan

chrysan commented 1 year ago
image

It looks like, while the PD network partition was being simulated, TiDB got a wrong approximate count from PD and therefore calculated a larger sample rate.

nolouch commented 1 year ago

I feel that we cannot rely heavily on PD information. For example, a leader switch can leave PD's statistics stale rather than real-time.

Maybe change this to an improvement issue.