Open xiongjiwei opened 3 years ago
in tidb, a lot of builtin string functions use string or rune in golang directly, such as reverse function: https://github.com/pingcap/tidb/blob/c709217ff49a7bf13a785274abc9b6235bd4611b/expression/builtin_string.go#L797-L798
string
rune
reverse
it works very well if the string is encoding with a valid utf8 sequence. However, it cannot be promised, we can use concat function to builds an invalid utf8 sequence:
concat
mysql> select concat('a' collate utf8mb4_unicode_ci, 0x80); +----------------------------------------------+ | concat('a' collate utf8mb4_unicode_ci, 0x80) | +----------------------------------------------+ | a� | +----------------------------------------------+ 1 row in set (0.00 sec)
then it will have the problem:
-- MySQL mysql> select hex(reverse(concat('a' collate utf8mb4_unicode_ci, 0x80))); +------------------------------------------------------------+ | hex(reverse(concat('a' collate utf8mb4_unicode_ci, 0x80))) | +------------------------------------------------------------+ | 8061 | +------------------------------------------------------------+ 1 row in set (0.00 sec) mysql> select substring(concat('a' collate utf8mb4_bin, 0x80), 2) in (0x80); +---------------------------------------------------------------+ | substring(concat('a' collate utf8mb4_bin, 0x80), 2) in (0x80) | +---------------------------------------------------------------+ | 1 | +---------------------------------------------------------------+ 1 row in set (0.01 sec) mysql> select concat('a' collate utf8mb4_bin, 0x80) like concat('_', 0x90); +--------------------------------------------------------------+ | concat('a' collate utf8mb4_bin, 0x80) like concat('_', 0x90) | +--------------------------------------------------------------+ | 0 | +--------------------------------------------------------------+ 1 row in set (0.00 sec) -- tidb mysql> select hex(reverse(concat('a' collate utf8mb4_unicode_ci, 0x80))); +------------------------------------------------------------+ | hex(reverse(concat('a' collate utf8mb4_unicode_ci, 0x80))) | +------------------------------------------------------------+ | EFBFBD61 | +------------------------------------------------------------+ 1 row in set (0.01 sec) mysql> select substring(concat('a' collate utf8mb4_bin, 0x80), 2) in (0xEFBFBD); +-------------------------------------------------------------------+ | substring(concat('a' collate utf8mb4_bin, 0x80), 2) in (0xEFBFBD) | +-------------------------------------------------------------------+ | 1 | +-------------------------------------------------------------------+ 1 row in set (0.00 sec) mysql> select concat('a' collate utf8mb4_bin, 0x80) like concat('_', 0x90); +--------------------------------------------------------------+ | concat('a' collate utf8mb4_bin, 0x80) like concat('_', 0x90) | +--------------------------------------------------------------+ | 1 | +--------------------------------------------------------------+ 1 row in set (0.00 sec)
it will be a problem in some corner cases. Almost all the functions that accept string type parameters will be affected
How we handle this problem will affect how we implement charset support in tidb
Just want to mention that, our metadata has similar problems, check #23865.
it happens on MySQL version 8.0.22. It is a bug that MySQL has been fixed in 8.0.24
in tidb, a lot of builtin string functions use
string
orrune
in golang directly, such asreverse
function: https://github.com/pingcap/tidb/blob/c709217ff49a7bf13a785274abc9b6235bd4611b/expression/builtin_string.go#L797-L798it works very well if the string is encoding with a valid utf8 sequence. However, it cannot be promised, we can use
concat
function to builds an invalid utf8 sequence:then it will have the problem:
it will be a problem in some corner cases. Almost all the functions that accept
string
type parameters will be affectedHow we handle this problem will affect how we implement charset support in tidb