pingcap / tidb

TiDB is an open-source, cloud-native, distributed, MySQL-Compatible database for elastic scale and real-time analytics. Try AI-powered Chat2Query free at : https://www.pingcap.com/tidb-serverless/
https://pingcap.com
Apache License 2.0
36.94k stars 5.81k forks source link

JSON filtering error after horizontal connection #25813

Closed linhantao closed 2 years ago

linhantao commented 3 years ago

TIDB version : V5.1 / V5.0 error message:[Err] 3140 - Invalid JSON text: The document root must not be followed by other values.

When left join is used, JSON value filtering is performed on the right table, and an error occur, When using a right join, JSON value filtering is performed on the left table and an error occurs. for example: image

When a left join is used, JSON value filtering is performed on the left table without error.Such as: image

However, the values in the JSON filter after the concatenation are all numeric, so the query can be executed without letters for example: image image

jingshanglu commented 3 years ago

@linhantao Thanks for your feedback, can you add some information about the table structure and the steps to reproduce?

linhantao commented 3 years ago

@jingshanglu repeat the steps as follows: 1、create one table :

create table test_json (id varchar(100), 家庭住址 json,other varchar(100));

2、insert one data:

insert into test_json values('1','{\"detail\": \"\", \"list\": [{\"id\": \"41bb58cab2954a768b7a2e85e82ef369\", \"value\": \"湖北省\"}, {\"id\": \"0dbbcdb6b6524f5fb47585e49fda44be\", \"value\": \"武汉市\"}, {\"id\": \"0e6f75f3bb1e46b8a329fc9aa62187f4\", \"value\": \"东西湖区\"}]}','其他数据');

3、error sql:

select from test_json A left join (select from test_json where false) B on A.家庭住址 =B.家庭住址 where B.家庭住址 ->'$.list[0].id'='0asdsss'

image

4、 success sql:

select from test_json A left join (select from test_json where false) B on A.家庭住址 =B.家庭住址 where A.家庭住址 ->'$.list[0].id'='0asdsss'

image

select from test_json A left join (select from test_json where false) B on A.家庭住址 =B.家庭住址 where B.家庭住址 ->'$.list[0].id'='122'

image

solotzg commented 2 years ago

There is something wrong when parsing binary from json str

[2022/05/26 02:39:06.733 +00:00] [INFO] [conn.go:1149] ["command dispatched failed"] [conn=1396212641508098479] [connInfo="id:1396212641508098479, addr:172.26.0.1:60152 status:10, collation:utf8_general_ci, user:root"] [command=Query] [status="inTxn:0, autocommit:1"] [sql="select * from test_json A left join (select * from test_json where false) B on A.addr =B.addr where B.addr ->'$.list[0].id'='0asdsss'"] [txn_mode=PESSIMISTIC] [timestamp=433463688357478401] [err="[json:3140]Invalid JSON text: The document root must not be followed by other values.
github.com/pingcap/errors.AddStack
    /go/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20211224045212-9687c2b0f87c/errors.go:174
github.com/pingcap/errors.(*Error).GenWithStackByArgs
    /go/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20211224045212-9687c2b0f87c/normalize.go:164
github.com/pingcap/tidb/types/json.ParseBinaryFromString
    /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/types/json/binary.go:403
github.com/pingcap/tidb/expression.(*builtinCastStringAsJSONSig).vecEvalJSON
    /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/expression/builtin_cast_vec.go:795
github.com/pingcap/tidb/expression.(*ScalarFunction).VecEvalJSON
    /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/expression/scalar_function.go:78
github.com/pingcap/tidb/expression.(*builtinEQJSONSig).vecEvalInt
    /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/expression/builtin_compare_vec_generated.go:1124
github.com/pingcap/tidb/expression.(*ScalarFunction).VecEvalInt
    /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/expression/scalar_function.go:48
github.com/pingcap/tidb/expression.EvalExpr
    /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/expression/expression.go:566
github.com/pingcap/tidb/expression.VecEvalBool
    /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/expression/expression.go:361
github.com/pingcap/tidb/expression.vectorizedFilter
    /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/expression/chunk_executor.go:489
github.com/pingcap/tidb/expression.VectorizedFilterConsiderNull
    /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/expression/chunk_executor.go:401
github.com/pingcap/tidb/expression.VectorizedFilter
    /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/expression/chunk_executor.go:379
github.com/pingcap/tidb/executor.(*SelectionExec).Next
    /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/executor.go:1420
github.com/pingcap/tidb/executor.Next
    /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/executor.go:319
github.com/pingcap/tidb/executor.(*recordSet).Next
    /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/adapter.go:151
github.com/pingcap/tidb/server.(*tidbResultSet).Next
    /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/server/driver_tidb.go:323
github.com/pingcap/tidb/server.(*clientConn).writeChunks
    /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/server/conn.go:2212
github.com/pingcap/tidb/server.(*clientConn).writeResultset
    /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/server/conn.go:2163
github.com/pingcap/tidb/server.(*clientConn).handleStmt
    /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/server/conn.go:2044
github.com/pingcap/tidb/server.(*clientConn).handleQuery
    /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/server/conn.go:1876
github.com/pingcap/tidb/server.(*clientConn).dispatch
    /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/server/conn.go:1371
github.com/pingcap/tidb/server.(*clientConn).Run
    /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/server/conn.go:1121
github.com/pingcap/tidb/server.(*Server).onConn
    /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/server/server.go:559
runtime.goexit
    /usr/local/go/src/runtime/asm_amd64.s:1571"]
MySQL [test]> explain select * from test_json A left join (select * from test_json where false) B on A.addr =B.addr where B.addr ->'$.list[0].id'='0asdsss';
+-------------------------------+---------+-----------+---------------+-------------------------------------------------------------------------------------+
| id                            | estRows | task      | access object | operator info                                                                       |
+-------------------------------+---------+-----------+---------------+-------------------------------------------------------------------------------------+
| Selection_12                  | 0.80    | root      |               | eq(json_extract(test.test_json.addr, "$.list[0].id"), cast("0asdsss", json BINARY)) |
| └─HashJoin_13                 | 1.00    | root      |               | left outer join, equal:[eq(test.test_json.addr, test.test_json.addr)]               |
|   ├─Selection_17(Build)       | 0.00    | root      |               | not(isnull(cast(test.test_json.addr, var_string(4294967295))))                      |
|   │ └─TableDual_18            | 0.00    | root      |               | rows:0                                                                              |
|   └─TableReader_16(Probe)     | 1.00    | root      |               | data:TableFullScan_15                                                               |
|     └─TableFullScan_15        | 1.00    | cop[tikv] | table:A       | keep order:false, stats:pseudo                                                      |
+-------------------------------+---------+-----------+---------------+-------------------------------------------------------------------------------------+
6 rows in set (0.01 sec)

But for the normal one, eq(json_extract(test.test_json.addr, "$.list[0].id"), cast("0asdsss", json BINARY)) is pushed down to tikv.

MySQL [test]> explain select * from test_json A left join (select * from test_json where false) B on A.addr =B.addr where A.addr ->'$.list[0].id'='0asdsss';
+-----------------------------+---------+-----------+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+
| id                          | estRows | task      | access object | operator info                                                                                                                                       |
+-----------------------------+---------+-----------+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+
| HashJoin_12                 | 0.80    | root      |               | left outer join, equal:[eq(test.test_json.addr, test.test_json.addr)]                                                                               |
| ├─Selection_17(Build)       | 0.00    | root      |               | eq(json_extract(test.test_json.addr, "$.list[0].id"), cast("0asdsss", json BINARY)), not(isnull(cast(test.test_json.addr, var_string(4294967295)))) |
| │ └─TableDual_18            | 0.00    | root      |               | rows:0                                                                                                                                              |
| └─TableReader_16(Probe)     | 0.80    | root      |               | data:Selection_15                                                                                                                                   |
|   └─Selection_15            | 0.80    | cop[tikv] |               | eq(json_extract(test.test_json.addr, "$.list[0].id"), cast("0asdsss", json BINARY))                                                                 |
|     └─TableFullScan_14      | 1.00    | cop[tikv] | table:A       | keep order:false, stats:pseudo                                                                                                                      |
+-----------------------------+---------+-----------+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+
6 rows in set (0.00 sec)
LittleFall commented 2 years ago

Reproduce

minimal reproduce:

create table t(a json);
insert into t values('{"id": "ish"}');
select * from t t1 left join t t2 on t1.a=t2.a where t2.a='ish';
+--------------------------------+---------+-----------+---------------+--------------------------------------------------------------+
| id                             | estRows | task      | access object | operator info                                                |
+--------------------------------+---------+-----------+---------------+--------------------------------------------------------------+
| Selection_8                    | 0.80    | root      |               | eq(test.t.a, cast("ish", json BINARY)) |
| └─HashJoin_9                   | 1.00    | root      |               | left outer join, equal:[eq(test.t.a, test.t.a)]              |
|   ├─TableReader_15(Build)      | 0.80    | root      |               | data:Selection_14                                            |
|   │ └─Selection_14             | 0.80    | cop[tikv] |               | not(isnull(cast(test.t.a, var_string(4294967295))))          |
|   │   └─TableFullScan_13       | 1.00    | cop[tikv] | table:t2      | keep order:false, stats:pseudo                               |
|   └─TableReader_12(Probe)      | 1.00    | root      |               | data:TableFullScan_11                                        |
|     └─TableFullScan_11         | 1.00    | cop[tikv] | table:t1      | keep order:false, stats:pseudo                               |
+--------------------------------+---------+-----------+---------------+--------------------------------------------------------------+
7 rows in set (0.00 sec)

this issue will reproduce when :

  1. plan include pattern eq(some_json, cast("some_string", json))
  2. this pattern is executed by tidb executor (not tikv)
  3. logical optimize rule PredicatePushDown is triggered (usually adding a join will enable it)

Root Cause

actually 0asdsss or ish is not a valid JSON, but "ish" is a valid JSON, we can check it by executing SQL SELECT CAST('ish' AS json); and SELECT CAST('"ish"' AS json);.

mysql> SELECT CAST('"ish"' AS json);
+-----------------------+
| CAST('"ish"' AS json) |
+-----------------------+
| "ish"                 |
+-----------------------+
1 row in set (0.00 sec)

mysql> SELECT CAST('ish' AS json);
ERROR 3140 (22032): Invalid JSON text: The document root must not be followed by other values.

phase planBuilder:

  1. When comparing JSON with string eq(some_json, some_string), tidb will cast this string to JSON first.
  2. Tidb knows that most strings are not valid JSON, so when it detected the compare between string and json, it will remove the flag ParseToJSONFlag so executor won't really parse this string. https://github.com/pingcap/tidb/blob/master/expression/builtin_compare.go#L1741-L1751
    func (c *compareFunctionClass) generateCmpSigs(ctx sessionctx.Context, args []Expression, tp types.EvalType) (sig builtinFunc, err error) {
    if tp == types.ETJson {
        // In compare, if we cast string to JSON, we shouldn't parse it.
        for i := range args {
            DisableParseJSONFlag4Expr(args[i])
        }
    }

phase logicalOptimize

  1. unfortunately, logical optimize rule PredicatePushDown has a potential bug, which will unexpectedly reset the flag of expressions. https://github.com/pingcap/tidb/pull/35759 has more information about it.

phase executor

  1. when executor builtinCastStringAsJSONSig is working, it checks whether the ParseToJSONFlag is set. if not set, it just do a simple binary convert. if set, it will parse ish, so tidb report. https://github.com/pingcap/tidb/blob/master/expression/builtin_cast_vec.go#L787-L810

Fix

https://github.com/pingcap/tidb/pull/35759 will fix this issue.

suggest closing the cherry-picks of this pr, because it has a little effect and difficult trigger condition.