Open Lily2025 opened 1 year ago
/type bug /severity critical /assign chrysan
It looks like, while the PD network partition was being simulated, TiDB got a wrong approximate count from PD and therefore calculated a much larger sample rate.
I feel that we cannot rely too heavily on PD information. For example, a leader switch can leave PD's statistics temporarily out of date.
Maybe change this to an improvement issue.
Bug Report
Please answer these questions before submitting your issue. Thanks!
1. Minimal reproduce step (Required)
(1) Start a PITR task and a TiCDC changefeed task.
(2) Run TPC-C (1000 warehouses).
(3) After 10 minutes, simulate a network partition of the PD leader (PD leader election proceeds normally).
2. What did you expect to see? (Required)
(1) TiDB memory should not keep increasing.
(2) QPS should recover within 2 minutes after the PD leader network partition is simulated.
3. What did you see instead (Required)
(1) TiDB memory keeps increasing:
[2023/06/28 21:40:21.394 +08:00] [WARN] [servermemorylimit.go:148] ["global memory controller tries to kill the top1 memory consumer"] [conn=83760795803975679] ["sql digest"=bc2d161067078a11d7fc1e202182dd6eb547f1f5a2ecc14998779df9c3131e70] ["sql text"="analyze table `tpcc`.`history`"] [tidb_server_memory_limit=27487790640] ["heap inuse"=27493703680] ["sql memory usage"=19233776553]
[2023/06/28 21:40:22.355 +08:00] [WARN] [tracker.go:464] ["global memory controller, NeedKill signal is received successfully"] [conn=83760795803975679]
[2023/06/28 21:40:22.355 +08:00] [WARN] [expensivequery.go:145] ["memory exceeds quota"] [cost_time=938.657750086s] [conn=83760795803975679] [txn_start_ts=0] [mem_max="19338634209 Bytes (18.0 GB)"] [sql="analyze table `tpcc`.`history`"]
[2023/06/28 21:40:22.355 +08:00] [ERROR] [analyze_col_v2.go:645] ["analyze worker panicked"] [recover="Your query has been cancelled due to exceeding the allowed memory limit for the tidb-server instance and this query is currently using the most memory. Please try narrowing your query scope or increase the tidb_server_memory_limit and try again.[conn=83760795803975679]"] [stack="github.com/pingcap/tidb/executor.(AnalyzeColumnsExecV2).subBuildWorker.func1\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/analyze_col_v2.go:645\nruntime.gopanic\n\t/usr/local/go/src/runtime/panic.go:884\ngithub.com/pingcap/tidb/util/memory.(PanicOnExceed).Action\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/util/memory/action.go:167\ngithub.com/pingcap/tidb/util/memory.(Tracker).Consume.func2\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/util/memory/tracker.go:455\ngithub.com/pingcap/tidb/util/memory.(Tracker).Consume\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/util/memory/tracker.go:467\ngithub.com/pingcap/tidb/statistics.BuildHistAndTopN.func1\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/statistics/builder.go:246\nruntime.gopanic\n\t/usr/local/go/src/runtime/panic.go:884\ngithub.com/pingcap/tidb/util/memory.(PanicOnExceed).Action\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/util/memory/action.go:167\ngithub.com/pingcap/tidb/util/memory.(Tracker).Consume.func2\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/util/memory/tracker.go:455\ngithub.com/pingcap/tidb/util/memory.(Tracker).Consume\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/util/memory/tracker.go:467\ngithub.com/pingcap/tidb/util/memory.(Tracker).BufferedConsume\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/util/memory/tracker.go:497\ngithub.com/pingcap/tidb/statistics.BuildHistAndTopN.func2\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/statistics/builder.go:257\ngithub.com/pingcap/tidb/statistics.BuildHistAndTopN\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/statistics/builder.go:371\ngithub.com/pingcap/tidb/executor.(AnalyzeColumnsExecV2).subBuildWorker\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/analyze_col_v2.go:779\ngithub.com/pingcap/tidb/executor.(AnalyzeColumnsExecV2).buildSamplingStats.func3\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/analyze_col_v2.go:348\ngithub.com/pingcap/tidb/executor.(notifyErrorWaitGroupWrapper).Run.func1\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/analyze_utils.go:127"]
[2023/06/28 21:40:22.355 +08:00] [ERROR] [analyze.go:311] ["analyze failed"] [error="Your query has been cancelled due to exceeding the allowed memory limit for the tidb-server instance and this query is currently using the most memory. Please try narrowing your query scope or increase the tidb_server_memory_limit and try again.[conn=83760795803975679]%!(EXTRA errors.fundamental=analyze panic due to memory quota exceeds, please try with smaller samplerate(refer to 110000/count))"]
[2023/06/28 21:40:22.359 +08:00] [INFO] [analyze.go:589] ["analyze table `tpcc`.`history` has failed"] [partition=] ["job info"="auto analyze table all columns with 256 buckets, 500 topn, 0.3516556587789237 samplerate"] ["start time"=2023/06/28 21:24:53.839 +08:00] ["end time"=2023/06/28 21:40:22.355 +08:00] [cost=15m28.51620183s]
[2023/06/28 21:40:22.359 +08:00] [INFO] [tidb.go:285] ["rollbackTxn called due to ddl/autocommit failure"]
[2023/06/28 21:40:22.359 +08:00] [WARN] [session.go:2284] ["run statement failed"] [schemaVersion=48] [error="Your query has been cancelled due to exceeding the allowed memory limit for the tidb-server instance and this query is currently using the most memory. Please try narrowing your query scope or increase the tidb_server_memory_limit and try again.[conn=83760795803975679]%!(EXTRA errors.fundamental=analyze panic due to memory quota exceeds, please try with smaller samplerate(refer to 110000/count))"] [session="{\n \"currDBName\": \"\",\n \"id\": 83760795803975679,\n \"status\": 2,\n \"strictMode\": true,\n \"user\": null\n}"]
[2023/06/28 21:40:22.359 +08:00] [ERROR] [update.go:1315] ["[stats] auto analyze failed"] [sql="analyze table `tpcc`.`history`"] [cost_time=15m38.661931978s] [error="Your query has been cancelled due to exceeding the allowed memory limit for the tidb-server instance and this query is currently using the most memory. Please try narrowing your query scope or increase the tidb_server_memory_limit and try again.[conn=83760795803975679]%!(EXTRA errors.fundamental=analyze panic due to memory quota exceeds, please try with smaller samplerate(refer to 110000/count))"]
[2023/06/28 21:40:22.361 +08:00] [INFO] [update.go:1189] ["[stats] auto analyze triggered"] [sql="analyze table `tpcc`.`history`"] [reason="table unanalyzed"]
(2) TiDB memory keeps increasing, and QPS drops by more than 60% for 20 minutes.
4. What is your TiDB version? (Required)
["Welcome to TiDB."] ["Release Version"=v7.3.0-alpha] [Edition=Community] ["Git Commit Hash"=a7b54adfede165328fab966e288d2d9402943d7c] ["Git Branch"=heads/refs/tags/v7.3.0-alpha] ["UTC Build Time"="2023-06-28 11:14:13"] [GoVersion=go1.20.5] ["Race Enabled"=false] ["Check Table Before Drop"=false]