pingcap / tidb

TiDB - the open-source, cloud-native, distributed SQL database designed for modern applications.
https://pingcap.com
Apache License 2.0
37.35k stars 5.85k forks source link

some string functions has compatibility issues with mysql #23984

Open xiongjiwei opened 3 years ago

xiongjiwei commented 3 years ago

in tidb, a lot of builtin string functions use string or rune in golang directly, such as reverse function: https://github.com/pingcap/tidb/blob/c709217ff49a7bf13a785274abc9b6235bd4611b/expression/builtin_string.go#L797-L798

it works very well if the string is encoding with a valid utf8 sequence. However, it cannot be promised, we can use concat function to builds an invalid utf8 sequence:

mysql> select concat('a' collate utf8mb4_unicode_ci, 0x80);
+----------------------------------------------+
| concat('a' collate utf8mb4_unicode_ci, 0x80) |
+----------------------------------------------+
| a�                                            |
+----------------------------------------------+
1 row in set (0.00 sec)

then it will have the problem:

-- MySQL
mysql> select hex(reverse(concat('a' collate utf8mb4_unicode_ci, 0x80)));
+------------------------------------------------------------+
| hex(reverse(concat('a' collate utf8mb4_unicode_ci, 0x80))) |
+------------------------------------------------------------+
| 8061                                                       |
+------------------------------------------------------------+
1 row in set (0.00 sec)
mysql> select substring(concat('a' collate utf8mb4_bin, 0x80), 2) in (0x80);
+---------------------------------------------------------------+
| substring(concat('a' collate utf8mb4_bin, 0x80), 2) in (0x80) |
+---------------------------------------------------------------+
|                                                             1 |
+---------------------------------------------------------------+
1 row in set (0.01 sec)
mysql> select concat('a' collate utf8mb4_bin, 0x80) like concat('_', 0x90);
+--------------------------------------------------------------+
| concat('a' collate utf8mb4_bin, 0x80) like concat('_', 0x90) |
+--------------------------------------------------------------+
|                                                            0 |
+--------------------------------------------------------------+
1 row in set (0.00 sec)

-- tidb
mysql> select hex(reverse(concat('a' collate utf8mb4_unicode_ci, 0x80)));
+------------------------------------------------------------+
| hex(reverse(concat('a' collate utf8mb4_unicode_ci, 0x80))) |
+------------------------------------------------------------+
| EFBFBD61                                                   |
+------------------------------------------------------------+
1 row in set (0.01 sec)
mysql> select substring(concat('a' collate utf8mb4_bin, 0x80), 2) in (0xEFBFBD);
+-------------------------------------------------------------------+
| substring(concat('a' collate utf8mb4_bin, 0x80), 2) in (0xEFBFBD) |
+-------------------------------------------------------------------+
|                                                                 1 |
+-------------------------------------------------------------------+
1 row in set (0.00 sec)
mysql> select concat('a' collate utf8mb4_bin, 0x80) like concat('_', 0x90);
+--------------------------------------------------------------+
| concat('a' collate utf8mb4_bin, 0x80) like concat('_', 0x90) |
+--------------------------------------------------------------+
|                                                            1 |
+--------------------------------------------------------------+
1 row in set (0.00 sec)

it will be a problem in some corner cases. Almost all the functions that accept string type parameters will be affected

How we handle this problem will affect how we implement charset support in tidb

xhebox commented 3 years ago

Just want to mention that, our metadata has similar problems, check #23865.

xiongjiwei commented 3 years ago

it happens on MySQL version 8.0.22. It is a bug that MySQL has been fixed in 8.0.24