pingcap / tidb

TiDB is an open-source, cloud-native, distributed, MySQL-Compatible database for elastic scale and real-time analytics. Try AI-powered Chat2Query free at : https://www.pingcap.com/tidb-serverless/
https://pingcap.com
Apache License 2.0
36.4k stars 5.73k forks source link

expression: optimize function IN using hashmap #13594

Closed js00070 closed 4 years ago

js00070 commented 4 years ago

PCP #12970

What problem does this PR solve?

optimize function IN using hashmap according to #12970

What is changed and how it works?

have implemented InInt currently, will continue to implement other IN functions.

Check List

Tests

Side effects

sre-bot commented 4 years ago

Thanks for your contribution. If your PR get merged, you will be rewarded 1200 points.

js00070 commented 4 years ago

Is it ok that I conintue to implement other IN functions in this PR? Or maybe I should split those functions into different PRs? @winoros

winoros commented 4 years ago

@js00070 You can split them into multiple prs. Since they are mostly independent codes, putting them into one single pr is ok to me though. Seems that the PCP robot cannot deal with multiple pr correctly. You can put them into one pr first.

winoros commented 4 years ago

Also, you can run make test or make dev to run some checks in your local env.

codecov[bot] commented 4 years ago

Codecov Report

Merging #13594 into master will not change coverage. The diff coverage is n/a.

@@            Coverage Diff            @@
##            master    #13594   +/-   ##
=========================================
  Coverage   80.449%   80.449%           
=========================================
  Files          483       483           
  Lines       123452    123452           
=========================================
  Hits         99316     99316           
  Misses       16355     16355           
  Partials      7781      7781
js00070 commented 4 years ago

the bench result is here: for inInt

const count =  1 , threshold =  1
BenchmarkVectorizedBuiltinOtherFunc/builtinInIntSig-VecBuiltinFunc-12                     111224             10272 ns/op            1024 B/op          1 allocs/op
BenchmarkVectorizedBuiltinOtherFunc/builtinInIntSig-NonVecBuiltinFunc-12                   62086             18713 ns/op               0 B/op          0 allocs/op

const count =  1 , threshold =  200
BenchmarkVectorizedBuiltinOtherFunc/builtinInIntSig-VecBuiltinFunc-12                      52753             22356 ns/op            1024 B/op          1 allocs/op
BenchmarkVectorizedBuiltinOtherFunc/builtinInIntSig-NonVecBuiltinFunc-12                   16131             74576 ns/op               0 B/op          0 allocs/op

for inStr

count=1, threshold=100
BenchmarkVectorizedBuiltinOtherFunc/builtinInStringSig-VecBuiltinFunc-12                           15488             76004 ns/op            1024 B/op          1 allocs/op
BenchmarkVectorizedBuiltinOtherFunc/builtinInStringSig-NonVecBuiltinFunc-12                        13756             85849 ns/op               0 B/op          0 allocs/op

count=1, threshold=1
BenchmarkVectorizedBuiltinOtherFunc/builtinInStringSig-VecBuiltinFunc-12                           69493             16724 ns/op            1024 B/op          1 allocs/op
BenchmarkVectorizedBuiltinOtherFunc/builtinInStringSig-NonVecBuiltinFunc-12                        45828             25678 ns/op               0 B/op          0 allocs/op

so the threshold for inInt and inStr is 1.

for inTime

count=1, threshold=100
BenchmarkVectorizedBuiltinOtherFunc/builtinInTimeSig-VecBuiltinFunc-12                             57450             19978 ns/op            1024 B/op          1 allocs/op
BenchmarkVectorizedBuiltinOtherFunc/builtinInTimeSig-NonVecBuiltinFunc-12                          23920             49055 ns/op               0 B/op          0 allocs/op

count=1, threshold=1
BenchmarkVectorizedBuiltinOtherFunc/builtinInTimeSig-VecBuiltinFunc-12                             44167             26372 ns/op            1024 B/op          1 allocs/op
BenchmarkVectorizedBuiltinOtherFunc/builtinInTimeSig-NonVecBuiltinFunc-12                          29161             40562 ns/op               0 B/op          0 allocs/op

count=2, threshold=1
BenchmarkVectorizedBuiltinOtherFunc/builtinInTimeSig-VecBuiltinFunc-12                             44262             26342 ns/op            1024 B/op          1 allocs/op
BenchmarkVectorizedBuiltinOtherFunc/builtinInTimeSig-NonVecBuiltinFunc-12                          29365             40589 ns/op               0 B/op          0 allocs/op

count=2, threshold=100
BenchmarkVectorizedBuiltinOtherFunc/builtinInTimeSig-VecBuiltinFunc-12                             33315             35780 ns/op            1024 B/op          1 allocs/op
BenchmarkVectorizedBuiltinOtherFunc/builtinInTimeSig-NonVecBuiltinFunc-12                          15873             76252 ns/op               0 B/op          0 allocs/op

so the threshold for inTime is 2.

winoros commented 4 years ago

/run-all-tests

winoros commented 4 years ago

Seems that there're some integration tests failed. I'll look into it first.

js00070 commented 4 years ago

in Common Test / Mysql Test

[2019-11-24T14:28:57.579Z] 2019/11/24 22:28:57.487  [fatal] run test [json_gcol] err: sql:explain select f1 from t1 where json_extract(f1,"$") in ("asd","asasas","asdf");: run "explain select f1 from t1 where json_extract(f1,"$") in ("asd","asasas","asdf");" at line 76 err Error 3140: Invalid JSON text: The document root must not be followed by other values.

[2019-11-24T14:28:57.579Z] + echo 'tidb-server(PID: 148214) stopped'

Seems the sql should be as follow?

select f1 from t1 where json_extract(f1,"$") in ("\"asd\"","\"asasas\"","\"asdf\"");
js00070 commented 4 years ago

It seems there is a incompatible issue with mysql about JSON funciton. here is the phenomenon:

in MySQL:8

mysql> select json_extract("\"asd\"","$") in ("asd","abcd");
+-----------------------------------------------+
| json_extract("\"asd\"","$") in ("asd","abcd") |
+-----------------------------------------------+
|                                             1 |
+-----------------------------------------------+
1 row in set (0.00 sec)

in MySQL:5.7

mysql> select json_extract("\"asd\"","$") in ("asd","abcd");
+----------------------------------------------+
| json_extract("\"asd\"","$") in ("asd","abc") |
+----------------------------------------------+
|                                            0 |
+----------------------------------------------+
1 row in set, 1 warning (0.00 sec)

in TiDB master branch (v4.0.0-alpha-940-g92e774913, commit hash: 92e774913a2f6166b104d3d8eaed5513733f1830)

MySQL [(none)]> select json_extract("\"asd\"","$") in ("asd","abcd");
ERROR 3140 (22032): Invalid JSON text: The document root must not be followed by other values.
MySQL [(none)]>

I submit an issue here https://github.com/pingcap/tidb/issues/13710

winoros commented 4 years ago

/run-sqllogic-test

winoros commented 4 years ago

@js00070 The following is one failed case in sql logic test.

CREATE TABLE tab0(col0 INTEGER, col1 INTEGER, col2 INTEGER);
CREATE TABLE tab1(col0 INTEGER, col1 INTEGER, col2 INTEGER);
CREATE TABLE tab2(col0 INTEGER, col1 INTEGER, col2 INTEGER);
INSERT INTO tab0 VALUES(97,1,99);
INSERT INTO tab0 VALUES(15,81,47);
INSERT INTO tab0 VALUES(87,21,10);
INSERT INTO tab1 VALUES(51,14,96);
INSERT INTO tab1 VALUES(85,5,59);
INSERT INTO tab1 VALUES(91,47,68);
INSERT INTO tab2 VALUES(64,77,40);
INSERT INTO tab2 VALUES(75,67,58);
INSERT INTO tab2 VALUES(46,51,23);

expected ["-37"], but found [] - sql[ SELECT DISTINCT col1 DIV - - col1 + - 38 AS col1 FROM tab0 AS cor0 WHERE - col1 * col0 DIV - - col2 IN ( - 2, - col1 DIV + col2 ) ]

js00070 commented 4 years ago

/run-sqllogic-test

winoros commented 4 years ago

@js00070 The data for the first failed case:

CREATE TABLE tab0(col0 INTEGER, col1 INTEGER, col2 INTEGER);
CREATE TABLE tab1(col0 INTEGER, col1 INTEGER, col2 INTEGER);
CREATE TABLE tab2(col0 INTEGER, col1 INTEGER, col2 INTEGER);
INSERT INTO tab0 VALUES(97,1,99);
INSERT INTO tab0 VALUES(15,81,47);
INSERT INTO tab0 VALUES(87,21,10);
INSERT INTO tab1 VALUES(51,14,96);
INSERT INTO tab1 VALUES(85,5,59);
INSERT INTO tab1 VALUES(91,47,68);
INSERT INTO tab2 VALUES(64,77,40);
INSERT INTO tab2 VALUES(75,67,58);
INSERT INTO tab2 VALUES(46,51,23);

The data for another two failed case:

CREATE TABLE tab0(col0 INTEGER, col1 INTEGER, col2 INTEGER);
CREATE TABLE tab1(col0 INTEGER, col1 INTEGER, col2 INTEGER);
CREATE TABLE tab2(col0 INTEGER, col1 INTEGER, col2 INTEGER);
INSERT INTO tab0 VALUES(97,1,99);
INSERT INTO tab0 VALUES(15,81,47);
INSERT INTO tab0 VALUES(87,21,10);
INSERT INTO tab1 VALUES(51,14,96);
INSERT INTO tab1 VALUES(85,5,59);
INSERT INTO tab1 VALUES(91,47,68);
INSERT INTO tab2 VALUES(64,77,40);
INSERT INTO tab2 VALUES(75,67,58);
INSERT INTO tab2 VALUES(46,51,23);
js00070 commented 4 years ago

/run-sqllogic-test

js00070 commented 4 years ago

this fail case is very werid.

[error] /git/sqllogictest/test/index/random/10/slt_good_2.test:20581: expected [], but found ["0" "356" "346" "0" "720" "678" "0" "1" "84" "398" "0" "573" "235" "0" "2" "525" "890" "0" "690" "465" "0" "3" "43" "9" "0" "223" "197" "0" "4" "606" "129" "0" "989" "259" "0" "5" "912" "442" "0" "596" "110" "0" "6" "512" "576" "0" "568" "457" "0" "7" "933" "226" "0" "471" "886" "0" "8" "810" "775" "0" "36" "485" "0" "9" "3" "921" "0" "284" "468" "0"] - sql[ SELECT * FROM tab0 AS cor0 WHERE + col1 NOT IN ( - 82, - col1, 96, col1, CAST( - ( + + 30 ) AS SIGNED ) ) ]

where col1 not in (..., col1, ... ) will always be false. it seems impossible to fail on this case. Could you show me the data of this case? @winoros

winoros commented 4 years ago

@js00070 Sorry for late reply Here's the data

CREATE TABLE tab0(pk INTEGER PRIMARY KEY, col0 INTEGER, col1 FLOAT, col2 TEXT, col3 INTEGER, col4 FLOAT, col5 TEXT);
INSERT INTO tab0 VALUES(0,356,346.29,'emyjj',720,678.28,'mylwf');
INSERT INTO tab0 VALUES(1,84,398.55,'sbaji',573,235.64,'dgkbk');
INSERT INTO tab0 VALUES(2,525,890.27,'cbpha',690,465.9,'toihb');
INSERT INTO tab0 VALUES(3,43,9.3,'bqwdh',223,197.28,'vrkrw');
INSERT INTO tab0 VALUES(4,606,129.87,'axjxg',989,259.90,'ylzxx');
INSERT INTO tab0 VALUES(5,912,442.38,'uyrso',596,110.71,'gtdhg');
INSERT INTO tab0 VALUES(6,512,576.81,'cyvvl',568,457.76,'erkxs');
INSERT INTO tab0 VALUES(7,933,226.24,'fkgxz',471,886.45,'bqisj');
INSERT INTO tab0 VALUES(8,810,775.69,'sfbbw',36,485.66,'tlesg');
INSERT INTO tab0 VALUES(9,3,921.21,'wsyzc',284,468.90,'hpkbx');
CREATE TABLE tab1(pk INTEGER PRIMARY KEY, col0 INTEGER, col1 FLOAT, col2 TEXT, col3 INTEGER, col4 FLOAT, col5 TEXT);
CREATE INDEX idx_tab1_0 on tab1 (col0);
CREATE INDEX idx_tab1_1 on tab1 (col1);
CREATE INDEX idx_tab1_3 on tab1 (col3);
CREATE INDEX idx_tab1_4 on tab1 (col4);
INSERT INTO tab1 SELECT * FROM tab0;
CREATE TABLE tab2(pk INTEGER PRIMARY KEY, col0 INTEGER, col1 FLOAT, col2 TEXT, col3 INTEGER, col4 FLOAT, col5 TEXT);
CREATE INDEX idx_tab2_0 ON tab2 (col1,col0,col4);
CREATE INDEX idx_tab2_1 ON tab2 (col1,col0 DESC,col3 DESC);
CREATE INDEX idx_tab2_2 ON tab2 (col4);
CREATE UNIQUE INDEX idx_tab2_3 ON tab2 (col0);
CREATE INDEX idx_tab2_4 ON tab2 (col3);
INSERT INTO tab2 SELECT * FROM tab0
CREATE TABLE tab3(pk INTEGER PRIMARY KEY, col0 INTEGER, col1 FLOAT, col2 TEXT, col3 INTEGER, col4 FLOAT, col5 TEXT);
CREATE UNIQUE INDEX idx_tab3_0 ON tab3 (col1);
CREATE INDEX idx_tab3_2 ON tab3 (col4);
CREATE INDEX idx_tab3_5 ON tab3 (col0 DESC);
INSERT INTO tab3 SELECT * FROM tab0;
CREATE TABLE tab4(pk INTEGER PRIMARY KEY, col0 INTEGER, col1 FLOAT, col2 TEXT, col3 INTEGER, col4 FLOAT, col5 TEXT);
CREATE UNIQUE INDEX idx_tab4_0 ON tab4 (col1,col4,col3 DESC);
CREATE INDEX idx_tab4_1 ON tab4 (col1,col0);
CREATE UNIQUE INDEX idx_tab4_5 ON tab4 (col4);
INSERT INTO tab4 SELECT * FROM tab0;
js00070 commented 4 years ago

/run-sqllogic-test

js00070 commented 4 years ago

/run-all-tests

js00070 commented 4 years ago

this CI failure is related to https://github.com/pingcap/tidb/issues/13710

[fatal] run test [json_gcol] err: sql:explain select f1 from t1 where json_extract(f1,"$") in ("asd","asasas","asdf");: run "explain select f1 from t1 where json_extract(f1,"$") in ("asd","asasas","asdf");" at line 76 err Error 3140: Invalid JSON text: The document root must not be followed by other values.

I will make another PR to solve it.

winoros commented 4 years ago

Great! I'll begin to review this pr tonight.

js00070 commented 4 years ago

/run-all-tests

SunRunAway commented 4 years ago

Friendly ping @js00070

js00070 commented 4 years ago

Friendly ping @js00070

I'm a little busy this week. I will update the code this weekend.

js00070 commented 4 years ago

/run-all-tests

js00070 commented 4 years ago

/run-all-tests

js00070 commented 4 years ago

/run-all-tests

SunRunAway commented 4 years ago

Hi, please help take a look at the integration test failure.

[2019-12-15T10:33:26.909Z] 2019/12/15 18:33:26.881 [fatal] run test [json_gcol] err: sql:explain select f1 from t1 where json_extract(f1,"$") in ("asd","asasas","asdf");: run "explain select f1 from t1 where json_extract(f1,"$") in ("asd","asasas","asdf");" at line 76 err Error 3140: Invalid JSON text: The document root must not be followed by other values.

js00070 commented 4 years ago

Hi, please help take a look at the integration test failure.

[2019-12-15T10:33:26.909Z] 2019/12/15 18:33:26.881 [fatal] run test [json_gcol] err: sql:explain select f1 from t1 where json_extract(f1,"$") in ("asd","asasas","asdf");: run "explain select f1 from t1 where json_extract(f1,"$") in ("asd","asasas","asdf");" at line 76 err Error 3140: Invalid JSON text: The document root must not be followed by other values.

this CI failure is because of this issue https://github.com/pingcap/tidb/issues/13710 I will fix it in another PR

js00070 commented 4 years ago

/run-all-tests

js00070 commented 4 years ago

/run-all-tests

js00070 commented 4 years ago

/run-all-tests

js00070 commented 4 years ago

/run-all-tests

js00070 commented 4 years ago

/run-all-tests

js00070 commented 4 years ago

/run-all-tests

js00070 commented 4 years ago

PTAL @SunRunAway the workaround for https://github.com/pingcap/tidb/issues/13710 is conflict to the test case......

[2019-12-18T14:12:23.484Z] "explain select f1 from t1 where json_extract(f1,"$") in ("asd","asasas","asdf");" 

[2019-12-18T14:12:23.484Z]  around line 76, 

[2019-12-18T14:12:23.484Z] we need(354):

[2019-12-18T14:12:23.484Z] explain select f1 from t1 where json_extract(f1,"$") in ("asd","asasas","asdf");

[2019-12-18T14:12:23.484Z] id   count   task    operator info

[2019-12-18T14:12:23.484Z] Selection_5  8000.00 root    in(json_extract(json_gcol.t1.f1, "$"), cast("asd"), cast("asasas"), cast("asdf"))

[2019-12-18T14:12:23.484Z] └─TableReader_7  10000.00    root    data:TableScan_6

[2019-12-18T14:12:23.484Z]   └─TableScan_6  10000.00    cop[tikv]   table:t1, range:[-inf,+inf], keep order:false, s

[2019-12-18T14:12:23.484Z] but got(354):

[2019-12-18T14:12:23.484Z] explain select f1 from t1 where json_extract(f1,"$") in ("asd","asasas","asdf");

[2019-12-18T14:12:23.484Z] id   count   task    operator info

[2019-12-18T14:12:23.484Z] Selection_5  8000.00 root    in(cast(json_extract(json_gcol.t1.f1, "$")), "asd", "asasas", "asdf")

[2019-12-18T14:12:23.484Z] └─TableReader_7  10000.00    root    data:TableScan_6

[2019-12-18T14:12:23.484Z]   └─TableScan_6  10000.00    cop[tikv]   table:t1, range:[-inf,+inf], keep order:false, stats:pseudo
TennyZhuang commented 4 years ago

@SunRunAway It seems that the current version of tidb ignore the eps, and this PR keep the bug. I think that we can merge this PR, and file a new issue to track about real comparison compatibility with MySQL.

SunRunAway commented 4 years ago

@SunRunAway It seems that the current version of tidb ignore the eps, and this PR keep the bug. I think that we can merge this PR, and file a new issue to track about real comparison compatibility with MySQL.

Agree with you. @js00070 Could you modify you PR and remove the JSON part, and help create a new issue to track real comparison compatibility with Hash Join executor and Hash In function?

Let's make this PR be merged.

js00070 commented 4 years ago

@SunRunAway It seems that the current version of tidb ignore the eps, and this PR keep the bug. I think that we can merge this PR, and file a new issue to track about real comparison compatibility with MySQL.

Agree with you. @js00070 Could you modify you PR and remove the JSON part, and help create a new issue to track real comparison compatibility with Hash Join executor and Hash In function?

Let's make this PR be merged.

ok. And How do we handle the CI failure that we discussed last week?

SunRunAway commented 4 years ago

@SunRunAway It seems that the current version of tidb ignore the eps, and this PR keep the bug. I think that we can merge this PR, and file a new issue to track about real comparison compatibility with MySQL.

Agree with you. @js00070 Could you modify you PR and remove the JSON part, and help create a new issue to track real comparison compatibility with Hash Join executor and Hash In function? Let's make this PR be merged.

ok. And How do we handle the CI failure that we discussed last week?

We can remove the JSON implementation, so that the CI will pass. Then we handle the JSON compatibility issue in #13710.

js00070 commented 4 years ago

/run-all-tests

js00070 commented 4 years ago

/run-all-tests

sre-bot commented 4 years ago

@winoros, @SunRunAway, @TennyZhuang, @breeswish, @qw4990, @XuHuaiyu, @wshwsh12, PTAL.

sre-bot commented 4 years ago

@winoros, @SunRunAway, @TennyZhuang, @breeswish, @qw4990, @XuHuaiyu, @wshwsh12, PTAL.

sre-bot commented 4 years ago

@js00070, please update your pull request.

sre-bot commented 4 years ago

@winoros, @SunRunAway, @TennyZhuang, @breeswish, @qw4990, @XuHuaiyu, @wshwsh12, PTAL.

sre-bot commented 4 years ago

@winoros, @SunRunAway, @TennyZhuang, @breeswish, @XuHuaiyu, @qw4990, @wshwsh12, PTAL.

sre-bot commented 4 years ago

@winoros, @SunRunAway, @TennyZhuang, @breeswish, @XuHuaiyu, @qw4990, @wshwsh12, PTAL.

js00070 commented 4 years ago

/run-all-tests

js00070 commented 4 years ago

/run-all-tests