pingcap / tidb

TiDB is an open-source, cloud-native, distributed, MySQL-Compatible database for elastic scale and real-time analytics. Try AI-powered Chat2Query free at : https://www.pingcap.com/tidb-serverless/
https://pingcap.com
Apache License 2.0
37.01k stars 5.82k forks source link

Tracking Issue for Predicate Columns #53567

Open Rustin170506 opened 4 months ago

Rustin170506 commented 4 months ago

Feature Request

Is your feature request related to a problem? Please describe: The ANALYZE statement would collect the statistics of all columns currently. If the table is big and wide, executing ANALYZE would consume lots of time, memory, and CPU. See https://github.com/pingcap/tidb/issues/27358 for details. However, only the statistics of some columns are used in creating query plans, while the statistics of others are not. Predicate columns are those columns whose statistics are used in query plans, usually in where conditions, join conditions, and so on. If ANALYZE only collects statistics for predicate columns and indexed columns (statistics of indexed columns are important for index selection), the cost of ANALYZE can be reduced.

Describe the feature you'd like: See: https://github.com/pingcap/tidb/pull/53511 Describe alternatives you've considered: None Teachability, Documentation, Adoption, Migration Strategy: None

Rustin170506 commented 2 months ago

How do we deal with the outdated stats?

  1. Create a tidb cluster with the latest master code: tiup playground nightly --db.binpath /Volumes/t7/code/tidb/bin/tidb-server
  2. Create a table with three columns:
    
    CREATE TABLE example_table
    (
    id   INT PRIMARY KEY,
    name VARCHAR(255),
    age  INT
    );

INSERT INTO example_table (id, name, age) VALUES (1, 'Alice', 30), (2, 'Bob', 25), (3, 'Charlie', 35), (4, 'Diana', 28), (5, 'Evan', 40), (6, 'Fiona', 22), (7, 'George', 33), (8, 'Hannah', 29), (9, 'Ian', 31), (10, 'Jenny', 27);

3. Wait for the stats meta:
```sql
SELECT * FROM mysql.stats_meta;
+------------------+--------+------------+-----+--------+
|version           |table_id|modify_count|count|snapshot|
+------------------+--------+------------+-----+--------+
|451225860582932481|104     |10          |10   |0       |
+------------------+--------+------------+-----+--------+
  1. Disable tidb_persist_analyze_options: set global tidb_persist_analyze_options='OFF'
  2. Analyze table with all columns: ANALYZE TABLE example_table ALL COLUMNS
  3. Check stats:
    
    SELECT * FROM mysql.stats_meta;
    +------------------+--------+------------+-----+------------------+
    |version           |table_id|modify_count|count|snapshot          |
    +------------------+--------+------------+-----+------------------+
    |451225906812289034|104     |0           |10   |451225906799181830|
    +------------------+--------+------------+-----+------------------+
    +--------+--------+-------+--------------+----------+------------+------------+------------------+---------+---------+----+--------------------+----------------+
    |table_id|is_index|hist_id|distinct_count|null_count|tot_col_size|modify_count|version           |cm_sketch|stats_ver|flag|correlation         |last_analyze_pos|

SELECT * FROM mysql.stats_histograms; +--------+--------+-------+--------------+----------+------------+------------+------------------+---------+---------+----+--------------------+----------------+ |104 |0 |1 |10 |0 |80 |0 |451225906812289034|null |2 |1 |1 |null | |104 |0 |2 |10 |0 |59 |0 |451225906812289034|null |2 |1 |1 |null | |104 |0 |3 |10 |0 |80 |0 |451225906812289034|null |2 |1 |-0.05454545454545454|null | +--------+--------+-------+--------------+----------+------------+------------+------------------+---------+---------+----+--------------------+----------------+

7. Insert more data:
```sql
INSERT INTO example_table (id, name, age)
VALUES (1, 'Alice', 30),
       (2, 'Bob', 25),
       (3, 'Charlie', 35),
       (4, 'Diana', 28),
       (5, 'Evan', 40),
       (6, 'Fiona', 22),
       (7, 'George', 33),
       (8, 'Hannah', 29),
       (9, 'Ian', 31),
       (10, 'Jenny', 27);
  1. Select data with column a: SELECT * FROM example_table WHERE name = '';
  2. Wait for the column usage:
    SELECT * FROM mysql.column_stats_usage;
    +--------+---------+-------------------+-------------------+
    |table_id|column_id|last_used_at       |last_analyzed_at   |
    +--------+---------+-------------------+-------------------+
    |104     |1        |null               |2024-07-18 16:10:32|
    |104     |2        |2024-07-18 16:14:19|2024-07-18 16:10:32|
    |104     |3        |null               |2024-07-18 16:10:32|
    +--------+---------+-------------------+-------------------+
  3. Analyze the table without all columns syntax: ANALYZE TABLE example_table;
  4. Check the stats again:
    SELECT * FROM mysql.stats_histograms;
    +--------+--------+-------+--------------+----------+------------+------------+------------------+---------+---------+----+--------------------+----------------+
    |table_id|is_index|hist_id|distinct_count|null_count|tot_col_size|modify_count|version           |cm_sketch|stats_ver|flag|correlation         |last_analyze_pos|
    +--------+--------+-------+--------------+----------+------------+------------+------------------+---------+---------+----+--------------------+----------------+
    |104     |0       |3      |10            |0         |80          |0           |451225906812289034|null     |2        |1   |-0.05454545454545454|null            |
    |104     |0       |1      |10            |0         |80          |0           |451226051173679118|null     |2        |1   |1                   |null            |
    |104     |0       |2      |10            |0         |59          |0           |451226051173679118|null     |2        |1   |1                   |null            |
    +--------+--------+-------+--------------+----------+------------+------------+------------------+---------+---------+----+--------------------+----------------+

    As you can see column 3 still retains the old stats.

  5. Check we don't delete the old TopN for column 3:
    SELECT * FROM mysql.stats_top_n WHERE hist_id =3;
    +--------+--------+-------+--------------------+-----+
    |table_id|is_index|hist_id|value               |count|
    +--------+--------+-------+--------------------+-----+
    |104     |0       |3      |0x038000000000000028|1    |
    |104     |0       |3      |0x038000000000000023|1    |
    |104     |0       |3      |0x038000000000000021|1    |
    |104     |0       |3      |0x03800000000000001F|1    |
    |104     |0       |3      |0x03800000000000001E|1    |
    |104     |0       |3      |0x03800000000000001D|1    |
    |104     |0       |3      |0x03800000000000001C|1    |
    |104     |0       |3      |0x03800000000000001B|1    |
    |104     |0       |3      |0x038000000000000019|1    |
    |104     |0       |3      |0x038000000000000016|1    |
    +--------+--------+-------+--------------------+-----+
  6. Wait for a GC window(15min) and check the TopN again:
    +--------+--------+-------+--------------------+-----+
    |table_id|is_index|hist_id|value               |count|
    +--------+--------+-------+--------------------+-----+
    |104     |0       |3      |0x038000000000000028|1    |
    |104     |0       |3      |0x038000000000000023|1    |
    |104     |0       |3      |0x038000000000000021|1    |
    |104     |0       |3      |0x03800000000000001F|1    |
    |104     |0       |3      |0x03800000000000001E|1    |
    |104     |0       |3      |0x03800000000000001D|1    |
    |104     |0       |3      |0x03800000000000001C|1    |
    |104     |0       |3      |0x03800000000000001B|1    |
    |104     |0       |3      |0x038000000000000019|1    |
    |104     |0       |3      |0x038000000000000016|1    |
    +--------+--------+-------+--------------------+-----+