time-and-fate opened this issue 2 months ago
This corresponds to two places that need to be improved:

- `Selectivity()` can't correctly match the stats to the filters. It should prefer the stats on (iabc) for estimation, but finally chooses the stats on (ib).
- `DataSource.stats.RowCount`, together with the logic here, brings the overestimation into the index range scan of index (iabc), which then causes the bad plan to be chosen.
https://github.com/pingcap/tidb/blob/854a4e3303003e3f8c1151da27e539337dcf8277/pkg/planner/core/logical_plans.go#L1599-L1601

A simple Rust script to reproduce it:
```rust
#!/usr/bin/env -S cargo +nightly -Zscript
---cargo
[dependencies]
sqlx = { version = "0.7", features = ["mysql", "runtime-tokio-native-tls"] }
tokio = { version = "1", features = ["full"] }
rand = "0.8"
---
use rand::Rng;
use sqlx::mysql::MySqlPool;
use std::error::Error;

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    // Replace with your MySQL connection string
    let pool = MySqlPool::connect("mysql://root@localhost:4000/test").await?;

    // Create table
    sqlx::query(
        "CREATE TABLE IF NOT EXISTS t(
            a INT,
            b INT,
            c INT,
            d INT,
            INDEX iabc(a,b,c),
            INDEX ib(b)
        )",
    )
    .execute(&pool)
    .await?;

    // Function to generate random data
    fn generate_data() -> (i32, i32, i32, i32) {
        let mut rng = rand::thread_rng();
        (
            rng.gen_range(0..100000),
            rng.gen_range(0..10),
            rng.gen_range(0..1000),
            rng.gen_range(0..1000),
        )
    }

    // Insert initial data
    for _ in 0..200 {
        let data: Vec<_> = (0..20).map(|_| generate_data()).collect();
        for (a, b, c, d) in data {
            sqlx::query("INSERT INTO t (a, b, c, d) VALUES (?, ?, ?, ?)")
                .bind(a)
                .bind(b)
                .bind(c)
                .bind(d)
                .execute(&pool)
                .await?;
        }
    }

    // Double the data multiple times
    for _ in 0..5 { // Adjust this number based on how much data you want
        sqlx::query("INSERT INTO t SELECT * FROM t")
            .execute(&pool)
            .await?;
    }

    // Analyze table
    sqlx::query("ANALYZE TABLE t")
        .execute(&pool)
        .await?;

    println!("Script completed successfully.");
    Ok(())
}
```
- In this case, `Selectivity()` can't correctly match the stats to the filters. It should prefer the stats on (iabc) for estimation, but finally chooses the stats on (ib):
```go
// We greedy select the stats info based on:
// (1): The stats type, always prefer the primary key or index.
// (2): The number of expression that it covers, the more the better.
// (3): The number of columns that it contains, the less the better.
// (4): The selectivity of the covered conditions, the less the better.
// The rationale behind is that lower selectivity tends to reflect more functional dependencies
// between columns. It's hard to decide the priority of this rule against rule 2 and 3, in order
// to avoid massive plan changes between tidb-server versions, I adopt this conservative strategy
// to impose this rule after rule 2 and 3.
if (bestTp == ColType && set.Tp != ColType) ||
	bestCount < bits ||
	(bestCount == bits && bestNumCols > set.numCols) ||
	(bestCount == bits && bestNumCols == set.numCols && bestSel > set.Selectivity) {
	bestID, bestCount, bestTp, bestNumCols, bestMask, bestSel = i, bits, set.Tp, set.numCols, curMask, set.Selectivity
}
```
This is because the stats on (ib) contain fewer columns than the stats on (iabc), so rule (3) makes the greedy selection choose the stats on (ib) first.
The steps:

1. `GetUsableSetsByGreedy` uses incorrect statistics to estimate the selectivity of the indexes.
2. `deriveIndexPathStats` employs an imperfect algorithm to update the `CountAfterAccess`.
3. `skylinePruning` prunes the index ib as a result of `compareCandidates`.
4. `compareTaskCost` chooses the full table scan due to the overestimated `CountAfterAccess`.
**Reproduce**

There is a big overestimation