pingcap / tidb

TiDB - the open-source, cloud-native, distributed SQL database designed for modern applications.
https://pingcap.com
Apache License 2.0
37.26k stars 5.85k forks source link

planner: predicates cannot be pushed into sub-queries for HashJoin #36551

Open qw4990 opened 2 years ago

qw4990 commented 2 years ago

Enhancement

mysql> CREATE TABLE t1 (
    -> risk_id varchar(32) NOT NULL,
    -> tran_id int NOT NULL,
    -> KEY (`risk_id`)
    -> );

mysql> CREATE TABLE t2 (
    -> risk_id varchar(32) COLLATE utf8mb4_general_ci NOT NULL,
    -> -- risk_id varchar(32) NOT NULL,
    -> tran_id int NOT NULL,
    -> KEY (`risk_id`)
    -> );

mysql> explain select /*+ hash_join(t1) */ t1.risk_id 
    -> from t1 
    -> where t1.risk_id = 'xxxx' 
    -> and exists (select 1 from t2 where t2.risk_id=t1.risk_id);
+-----------------------------+----------+-----------+----------------------------------+---------------------------------------------------------+
| id                          | estRows  | task      | access object                    | operator info                                           |
+-----------------------------+----------+-----------+----------------------------------+---------------------------------------------------------+
| HashJoin_18                 | 8.00     | root      |                                  | semi join, equal:[eq(test.t1.risk_id, test.t2.risk_id)] |
| ├─IndexReader_24(Build)     | 10000.00 | root      |                                  | index:IndexFullScan_23                                  |
| │ └─IndexFullScan_23        | 10000.00 | cop[tikv] | table:t2, index:risk_id(risk_id) | keep order:false, stats:pseudo                          |
| └─IndexReader_20(Probe)     | 10.00    | root      |                                  | index:IndexRangeScan_19                                 |
|   └─IndexRangeScan_19       | 10.00    | cop[tikv] | table:t1, index:risk_id(risk_id) | range:["xxxx","xxxx"], keep order:false, stats:pseudo   |
+-----------------------------+----------+-----------+----------------------------------+---------------------------------------------------------+

The predicate t1.risk_id = 'xxxx' should be propagated to t2 and the optimizer should generate a IndexRangeScan for t2. If remove COLLATE utf8mb4_general_ci, this predicate can be pushed down.

mysql> CREATE TABLE t2 (
    -> -- risk_id varchar(32) COLLATE utf8mb4_general_ci NOT NULL,
    -> risk_id varchar(32) NOT NULL,
    -> tran_id int NOT NULL,
    -> KEY (`risk_id`)
    -> );
Query OK, 0 rows affected (0.01 sec)

mysql> explain select /*+ hash_join(t1) */ t1.risk_id 
    -> from t1 
    -> where t1.risk_id = 'xxxx' 
    -> and exists (select 1 from t2 where t2.risk_id=t1.risk_id);
+-----------------------------+---------+-----------+----------------------------------+---------------------------------------------------------+
| id                          | estRows | task      | access object                    | operator info                                           |
+-----------------------------+---------+-----------+----------------------------------+---------------------------------------------------------+
| HashJoin_21                 | 8.00    | root      |                                  | semi join, equal:[eq(test.t1.risk_id, test.t2.risk_id)] |
| ├─IndexReader_25(Build)     | 10.00   | root      |                                  | index:IndexRangeScan_24                                 |
| │ └─IndexRangeScan_24       | 10.00   | cop[tikv] | table:t2, index:risk_id(risk_id) | range:["xxxx","xxxx"], keep order:false, stats:pseudo   |
| └─IndexReader_23(Probe)     | 10.00   | root      |                                  | index:IndexRangeScan_22                                 |
|   └─IndexRangeScan_22       | 10.00   | cop[tikv] | table:t1, index:risk_id(risk_id) | range:["xxxx","xxxx"], keep order:false, stats:pseudo   |
+-----------------------------+---------+-----------+----------------------------------+---------------------------------------------------------+
qw4990 commented 2 years ago

PTAL @fixdb @winoros @AilinKid

RagingSpud commented 2 years ago

It seems like that tikv does not support this kind of collation

MDK8888 commented 1 year ago

Hey, I would love to work on this!

xuyukeviki commented 1 year ago

/assign

isabella0428 commented 1 year ago

After examining the code, I think the problem here is that we currently only support constant propagating for columns with same collation.

See codes below(expression/constant_propagation.go: L238-259) `func (s *propConstSolver) propagateColumnEQ() {

for i := range s.conditions {
    if fun, ok := s.conditions[i].(*ScalarFunction); ok && fun.FuncName.L == ast.EQ {
        ......................
        // TODO: Enable hybrid types in ConstantPropagate.
        if lOk && rOk && lCol.GetType().GetCollate() == rCol.GetType().GetCollate() && !lCol.GetType().Hybrid() && !rCol.GetType().Hybrid() {
            lID := s.getColID(lCol)
            rID := s.getColID(rCol)
            s.unionSet.Union(lID, rID)
            visited[i] = true
        }
    }
}`

Only if lCol.GetType().GetCollate() == rCol.GetType().GetCollate(), we will generate the condition test.t2.risk_id = 'xxx', which will then be pushed down and finally create the IndexRangeScan for test2.

Should we compare the type without the collate()?

isabella0428 commented 1 year ago

Could you guys pls verify my thoughts? @qw4990

quocbinhngo commented 1 year ago

/assign