inaccurate cardinality estimation on multi-column indexes when stats-ver2 is enabled

qw4990 commented 3 years ago

Development Task

mysql> set tidb_analyze_version=2;
Query OK, 0 rows affected (0.00 sec)

mysql> create table t (a int, b int, key(a, b));
Query OK, 0 rows affected (0.00 sec)

mysql> insert into t values (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (7, 7), (8, 8);
Query OK, 8 rows affected (0.01 sec)
Records: 8  Duplicates: 0  Warnings: 0

mysql> analyze table t with 2 topn, 3 buckets;
Query OK, 0 rows affected (0.02 sec)

mysql> explain select * from t where a=1;
+------------------------+---------+-----------+------------------------+-------------------------------+
| id                     | estRows | task      | access object          | operator info                 |
+------------------------+---------+-----------+------------------------+-------------------------------+
| IndexReader_6          | 1.33    | root      |                        | index:IndexRangeScan_5        |
| └─IndexRangeScan_5     | 1.33    | cop[tikv] | table:t, index:a(a, b) | range:[1,1], keep order:false |
+------------------------+---------+-----------+------------------------+-------------------------------+
2 rows in set (0.00 sec)

The estRows should be 1.00 instead of 1.33;

eurekaka commented 3 years ago

0.33 of the row count is from histogram, while the remained 1.00 is from the TopN of the index. TopN includes (1,1) and (2,2), but the bounds of the histogram are (1,1),(4,4),(5,5),(8,8), looks like the histogram does not exclude those items in TopN. Also, the NDV of the histogram is 8, which is not as expected.

eurekaka commented 3 years ago

analyze table t, analyze table t with 2 topn and analyze table t with 3 buckets leads to correct results, while analyze table t with 2 topn, 3 buckets leads to wrong results.

eurekaka commented 3 years ago

The histograms are collected individually in tikv and are merged in tidb. After we extract the TopN items from the merged histogram, the bucket intervals can only be narrowed down with careful examination of the discrete values around the bucket boundary. For this case, the first bucket [1,4) can be narrowed down to [3,4) since a is of integer type, and 1 and 2 are extracted to the TopN, we can do this optimization but it is pretty ad-hoc and cannot be applied to other data types.

pingcap / tidb

inaccurate cardinality estimation on multi-column indexes when stats-ver2 is enabled #22532

Development Task