Different result for utf8mb4_unicode_ci between MySQL v8.0.33 and TiDB v8.1.0

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

In TiDB, utf8mb4_unicode_ci works differently from MySQL. For example, when you compare a half-width space (" ") character with a full-width character ("　"), the result is false.

-- TiDB
tidb:4000 > select version();
+--------------------+
| version()          |
+--------------------+
| 8.0.11-TiDB-v8.1.0 |
+--------------------+
1 row in set (0.01 sec)

-- compared with a half-width space with a full-width space
tidb:4000 > select ' ' collate utf8mb4_unicode_ci = '　' collate utf8mb4_unicode_ci from dual;
+-------------------------------------------------------------------+
| ' ' collate utf8mb4_unicode_ci = '　' collate utf8mb4_unicode_ci  |
+-------------------------------------------------------------------+
|                                                                 0 |
+-------------------------------------------------------------------+
1 row in set (0.01 sec)

But, in MySQL, the result is true.

mysql> select version();
--------------
select version()
--------------

+-----------+
| version() |
+-----------+
| 8.0.33    |
+-----------+
1 row in set (0.00 sec)

mysql> source data/check.sql
--------------
select ' ' collate utf8mb4_unicode_ci = '　' collate utf8mb4_unicode_ci from dual
--------------

+-------------------------------------------------------------------+
| ' ' collate utf8mb4_unicode_ci = '　' collate utf8mb4_unicode_ci  |
+-------------------------------------------------------------------+
|                                                                 1 |
+-------------------------------------------------------------------+
1 row in set (0.00 sec)

That's not expected.

2. What did you expect to see? (Required)

TiDB's utf8mb4_unicode_ci should work as same as MySQL.

tidb:4000 > select ' ' collate utf8mb4_unicode_ci = '　' collate utf8mb4_unicode_ci from dual;
+-------------------------------------------------------------------+
| ' ' collate utf8mb4_unicode_ci = '　' collate utf8mb4_unicode_ci  |
+-------------------------------------------------------------------+
|                                                                 1 |
+-------------------------------------------------------------------+

3. What did you see instead (Required)

tidb:4000 > select ' ' collate utf8mb4_unicode_ci = '　' collate utf8mb4_unicode_ci from dual;
+-------------------------------------------------------------------+
| ' ' collate utf8mb4_unicode_ci = '　' collate utf8mb4_unicode_ci  |
+-------------------------------------------------------------------+
|                                                                 0 |
+-------------------------------------------------------------------+

4. What is your TiDB version? (Required)

select version();
+--------------------+
| version()          |
+--------------------+
| 8.0.11-TiDB-v8.1.0 |
+--------------------+
1 row in set (0.01 sec)

Thanks for reporting this.

For space-only comparison, it is related to 'PADDING' things, as explained in https://docs.pingcap.com/tidb/stable/character-set-and-collation:

The implementation of padding in TiDB is different from that in MySQL. In MySQL, padding is implemented by filling in spaces. In TiDB, padding is implemented by cutting out the spaces at the end. The two approaches are the same in most cases. The only exception is when the end of the string contains characters that are less than spaces (0x20). For example, the result of 'a' < 'a\t' in TiDB is 1, but in MySQL, 'a' < 'a\t' is equivalent to 'a ' < 'a\t', and the result is 0.

tidb> select hex(weight_string(' ' collate utf8mb4_unicode_ci));
+----------------------------------------------------+
| hex(weight_string(' ' collate utf8mb4_unicode_ci)) |
+----------------------------------------------------+
|                                                    |
+----------------------------------------------------+
1 row in set (0.00 sec)

tidb> select hex(weight_string('　'　 collate utf8mb4_unicode_ci));
+------------------------------------------------------+
| hex(weight_string('　' collate utf8mb4_unicode_ci))  |
+------------------------------------------------------+
| 0209                                                 |
+------------------------------------------------------+
1 row in set (0.00 sec)

As you see, the weight of ' ' is none because the space is 'cutting out'-ed.

Based on this, we have the correct result for 'a a' and 'a　a':

tidb> select 'a a' = 'a　a' collate utf8mb4_unicode_ci;
+--------------------------------------------+
| 'a a' = 'a　a' collate utf8mb4_unicode_ci  |
+--------------------------------------------+
|                                          1 |
+--------------------------------------------+
1 row in set (0.00 sec)

However, the result is still incorrect for LIKE, the cause is unknown yet:

tidb> select 'a a' = 'a　a' collate utf8mb4_unicode_ci; 
+--------------------------------------------+
| 'a a' = 'a　a' collate utf8mb4_unicode_ci  |
+--------------------------------------------+
|                                          1 |
+--------------------------------------------+
1 row in set (0.00 sec)

DROP TABLE IF EXISTS test;
CREATE TABLE test(id int auto_increment primary key, name varchar(20) collate utf8mb4_unicode_ci);

INSERT INTO test(name) VALUES ('a　a'),('a a');

-- both records should be hit
SELECT * FROM test WHERE name LIKE 'a　a';
-- both records should be hit
SELECT * FROM test WHERE name LIKE 'a a';
-- specifying collate doesn't change the result
SELECT * FROM test WHERE name LIKE 'a　a' collate utf8mb4_unicode_ci;

Thank you for following up. Yes, not only = comparison, but also LIKE is the problem for us. I wish this will be solved somehow🙏

It is a bug of LIKE implementation in TiKV:

Prepare the data:

DROP TABLE IF EXISTS test;
CREATE TABLE test(id int auto_increment primary key, name varchar(20) collate utf8mb4_unicode_ci);

INSERT INTO test(name) VALUES ('a　a'),('a a');

Run with TiKV:

tidb> EXPLAIN SELECT * FROM test WHERE name LIKE 'a　a' collate utf8mb4_unicode_ci;
tidb> explain select * from test where name like 'a　A' collate utf8mb4_unicode_ci;
+-------------------------+----------+-----------+---------------+-------------------------------------+
| id                      | estRows  | task      | access object | operator info                       |
+-------------------------+----------+-----------+---------------+-------------------------------------+
| TableReader_7           | 10.00    | root      |               | data:Selection_6                    |
| └─Selection_6           | 10.00    | cop[tikv] |               | like(test.test.name, "a　A", 92)  |
|   └─TableFullScan_5     | 10000.00 | cop[tikv] | table:test  | keep order:false, stats:pseudo      |
+-------------------------+----------+-----------+---------------+-------------------------------------+

tidb> SELECT * FROM test WHERE name LIKE 'a　a' collate utf8mb4_unicode_ci;
+----+-------+
| id | name  |
+----+-------+
|  1 | a　a  |
+----+-------+

Run with TiFlash:

tidb> set @@session.tidb_isolation_read_engines='tiflash';
Query OK, 0 rows affected (0.00 sec)

tidb> explain select * from test where name like 'a　A' collate utf8mb4_unicode_ci;
+----------------------------+---------+--------------+---------------+----------------------------------------------------------+
| id                         | estRows | task         | access object | operator info                                            |
+----------------------------+---------+--------------+---------------+----------------------------------------------------------+
| TableReader_13             | 0.00    | root         |               | MppVersion: 2, data:ExchangeSender_12                    |
| └─ExchangeSender_12        | 0.00    | mpp[tiflash] |               | ExchangeType: PassThrough                                |
|   └─Selection_11           | 0.00    | mpp[tiflash] |               | like(test.test.name, "a　A", 92)                       |
|     └─TableFullScan_10     | 2.00    | mpp[tiflash] | table:test  | pushed down filter:empty, keep order:false, stats:pseudo |
+----------------------------+---------+--------------+---------------+----------------------------------------------------------+
4 rows in set (0.00 sec)

tidb> select * from test where name like 'a　A' collate utf8mb4_unicode_ci;
+----+-------+
| id | name  |
+----+-------+
|  1 | a　a  |
|  2 | a a   |
+----+-------+
2 rows in set (0.10 sec)

Run with TiDB(by blocking pushdown LIKE to TiKV):

tidb> INSERT INTO mysql.expr_pushdown_blacklist VALUES('LIKE','tikv','');
Query OK, 1 row affected (0.00 sec)

tidb> admin reload expr_pushdown_blacklist;
Query OK, 0 rows affected (0.01 sec)

tidb> explain select * from test where name like 'a　A' collate utf8mb4_unicode_ci;
+-------------------------+---------+-----------+---------------+-------------------------------------+
| id                      | estRows | task      | access object | operator info                       |
+-------------------------+---------+-----------+---------------+-------------------------------------+
| Selection_7             | 0.00    | root      |               | like(test.test.name, "a　A", 92)  |
| └─TableReader_6         | 2.00    | root      |               | data:TableFullScan_5                |
|   └─TableFullScan_5     | 2.00    | cop[tikv] | table:test  | keep order:false, stats:pseudo      |
+-------------------------+---------+-----------+---------------+-------------------------------------+
3 rows in set, 1 warning (0.00 sec)

tidb> select * from test where name like 'a　A' collate utf8mb4_unicode_ci;
+----+-------+
| id | name  |
+----+-------+
|  1 | a　a  |
|  2 | a a   |
+----+-------+
2 rows in set (0.00 sec)

pingcap / tidb