Closed sgykfjsm closed 1 month ago
Thanks for reporting this.
For the space-only comparison, this is related to how padding is handled, as explained in https://docs.pingcap.com/tidb/stable/character-set-and-collation:
The implementation of padding in TiDB is different from that in MySQL. In MySQL, padding is implemented by filling in spaces. In TiDB, padding is implemented by cutting out the spaces at the end. The two approaches are the same in most cases. The only exception is when the end of the string contains characters that are less than spaces (0x20). For example, the result of 'a' < 'a\t' in TiDB is 1, but in MySQL, 'a' < 'a\t' is equivalent to 'a ' < 'a\t', and the result is 0.
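The difference described in the quoted documentation can be probed directly. This is a minimal sketch, assuming a TiDB cluster with the new collation framework enabled and a MySQL server to compare against. The choice of utf8mb4_unicode_ci here is an assumption, and the expected values in the comments are the ones stated in the documentation above, not output captured for this thread.

```sql
-- Trailing half-width spaces are ignored under a PAD SPACE collation,
-- so both servers are expected to treat these two strings as equal.
SELECT 'a' = 'a ' COLLATE utf8mb4_unicode_ci;

-- The documented exception: the tab character (0x09) is less than the space (0x20).
-- Per the quoted docs, TiDB returns 1 here while MySQL returns 0.
SELECT 'a' < 'a\t' COLLATE utf8mb4_unicode_ci;
```

Running both statements on each server should make it easy to see whether a given mismatch comes from the padding rule or from something else, such as the LIKE behavior discussed later in this thread.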
tidb> select hex(weight_string(' ' collate utf8mb4_unicode_ci));
+----------------------------------------------------+
| hex(weight_string(' ' collate utf8mb4_unicode_ci)) |
+----------------------------------------------------+
| |
+----------------------------------------------------+
1 row in set (0.00 sec)
tidb> select hex(weight_string('　' collate utf8mb4_unicode_ci));
+------------------------------------------------------+
| hex(weight_string('　' collate utf8mb4_unicode_ci)) |
+------------------------------------------------------+
| 0209 |
+------------------------------------------------------+
1 row in set (0.00 sec)
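As you can see, the weight string of the half-width space ' ' is empty because the space is cut off as trailing padding, while the full-width space is not trimmed and keeps the weight 0209. That is why the two differ when compared on their own.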
Based on this, we get the correct result for 'a a' and 'a　a':
tidb> select 'a a' = 'a　a' collate utf8mb4_unicode_ci;
+--------------------------------------------+
| 'a a' = 'a　a' collate utf8mb4_unicode_ci |
+--------------------------------------------+
| 1 |
+--------------------------------------------+
1 row in set (0.00 sec)
However, the result is still incorrect for LIKE, and the cause is not yet known:
DROP TABLE IF EXISTS test;
CREATE TABLE test(id int auto_increment primary key, name varchar(20) collate utf8mb4_unicode_ci);
INSERT INTO test(name) VALUES ('a a'),('a　a');
-- both records should be hit (pattern with a half-width space)
SELECT * FROM test WHERE name LIKE 'a a';
-- both records should be hit (pattern with a full-width space)
SELECT * FROM test WHERE name LIKE 'a　a';
-- specifying collate doesn't change the result
SELECT * FROM test WHERE name LIKE 'a a' collate utf8mb4_unicode_ci;
Thank you for following up. Yes, not only the = comparison but also LIKE is a problem for us. I hope this can be solved somehow 🙏
This is a bug in the LIKE implementation in TiKV:
Prepare the data:
DROP TABLE IF EXISTS test;
CREATE TABLE test(id int auto_increment primary key, name varchar(20) collate utf8mb4_unicode_ci);
INSERT INTO test(name) VALUES ('a a'),('a　a');
Run with TiKV:
tidb> EXPLAIN SELECT * FROM test WHERE name LIKE 'a a' collate utf8mb4_unicode_ci;
tidb> explain select * from test where name like 'a A' collate utf8mb4_unicode_ci;
+-------------------------+----------+-----------+---------------+-------------------------------------+
| id | estRows | task | access object | operator info |
+-------------------------+----------+-----------+---------------+-------------------------------------+
| TableReader_7 | 10.00 | root | | data:Selection_6 |
| └─Selection_6 | 10.00 | cop[tikv] | | like(test.test.name, "a A", 92) |
| └─TableFullScan_5 | 10000.00 | cop[tikv] | table:test | keep order:false, stats:pseudo |
+-------------------------+----------+-----------+---------------+-------------------------------------+
tidb> SELECT * FROM test WHERE name LIKE 'a a' collate utf8mb4_unicode_ci;
+----+-------+
| id | name |
+----+-------+
| 1 | a a |
+----+-------+
Run with TiFlash:
tidb> set @@session.tidb_isolation_read_engines='tiflash';
Query OK, 0 rows affected (0.00 sec)
tidb> explain select * from test where name like 'a A' collate utf8mb4_unicode_ci;
+----------------------------+---------+--------------+---------------+----------------------------------------------------------+
| id | estRows | task | access object | operator info |
+----------------------------+---------+--------------+---------------+----------------------------------------------------------+
| TableReader_13 | 0.00 | root | | MppVersion: 2, data:ExchangeSender_12 |
| └─ExchangeSender_12 | 0.00 | mpp[tiflash] | | ExchangeType: PassThrough |
| └─Selection_11 | 0.00 | mpp[tiflash] | | like(test.test.name, "a A", 92) |
| └─TableFullScan_10 | 2.00 | mpp[tiflash] | table:test | pushed down filter:empty, keep order:false, stats:pseudo |
+----------------------------+---------+--------------+---------------+----------------------------------------------------------+
4 rows in set (0.00 sec)
tidb> select * from test where name like 'a A' collate utf8mb4_unicode_ci;
+----+-------+
| id | name |
+----+-------+
| 1 | a a |
| 2 | a　a |
+----+-------+
2 rows in set (0.10 sec)
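The TiFlash run above leaves out two steps that a reproduction would probably need: the table must have a TiFlash replica before the optimizer can read from TiFlash, and the session should be switched back before the next comparison. A minimal sketch, assuming the table lives in the test schema (as the test.test.name in the plan suggests) and that the cluster has at least one TiFlash node:

```sql
-- Assumption: the table has no TiFlash replica yet, so create one.
ALTER TABLE test SET TIFLASH REPLICA 1;

-- Wait until AVAILABLE is 1 before running the query through TiFlash.
SELECT TABLE_SCHEMA, TABLE_NAME, AVAILABLE, PROGRESS
FROM information_schema.tiflash_replica
WHERE TABLE_SCHEMA = 'test' AND TABLE_NAME = 'test';

-- After the TiFlash run, restore the default engine list for this session
-- so that later queries can be served by TiKV again.
SET @@session.tidb_isolation_read_engines = 'tikv,tiflash,tidb';
```

Opening a new session would have the same effect as the final SET, since the variable was only changed at session scope.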
Run with TiDB (by blocking pushdown of LIKE to TiKV):
tidb> INSERT INTO mysql.expr_pushdown_blacklist VALUES('LIKE','tikv','');
Query OK, 1 row affected (0.00 sec)
tidb> admin reload expr_pushdown_blacklist;
Query OK, 0 rows affected (0.01 sec)
tidb> explain select * from test where name like 'a A' collate utf8mb4_unicode_ci;
+-------------------------+---------+-----------+---------------+-------------------------------------+
| id | estRows | task | access object | operator info |
+-------------------------+---------+-----------+---------------+-------------------------------------+
| Selection_7 | 0.00 | root | | like(test.test.name, "a A", 92) |
| └─TableReader_6 | 2.00 | root | | data:TableFullScan_5 |
| └─TableFullScan_5 | 2.00 | cop[tikv] | table:test | keep order:false, stats:pseudo |
+-------------------------+---------+-----------+---------------+-------------------------------------+
3 rows in set, 1 warning (0.00 sec)
tidb> select * from test where name like 'a A' collate utf8mb4_unicode_ci;
+----+-------+
| id | name |
+----+-------+
| 1 | a a |
| 2 | a　a |
+----+-------+
2 rows in set (0.00 sec)
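The blacklist entry used for this comparison stays in effect until it is removed, so LIKE would keep being evaluated in TiDB instead of being pushed down. A sketch of the cleanup, assuming the entry was inserted exactly as shown above (name 'LIKE', store type 'tikv'):

```sql
-- Remove the temporary blacklist entry added for the TiDB-only run.
DELETE FROM mysql.expr_pushdown_blacklist
WHERE name = 'LIKE' AND store_type = 'tikv';

-- Reload the blacklist so the removal takes effect.
ADMIN RELOAD expr_pushdown_blacklist;
```

Until a TiKV version containing the fix is deployed, keeping the entry in place may serve as a workaround, at the cost of losing pushdown for LIKE.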
Ref https://github.com/tikv/tikv/issues/17332
It has been fixed.
Bug Report
Please answer these questions before submitting your issue. Thanks!
1. Minimal reproduce step (Required)
In TiDB, utf8mb4_unicode_ci works differently from MySQL. For example, when you compare a half-width space (" ") with a full-width space ("　"), the result is false in TiDB, but true in MySQL.
That's not expected.
2. What did you expect to see? (Required)
TiDB's utf8mb4_unicode_ci should work the same as MySQL's.
3. What did you see instead (Required)
4. What is your TiDB version? (Required)