pingcap / tidb

TiDB - the open-source, cloud-native, distributed SQL database designed for modern applications.
https://pingcap.com
Apache License 2.0
37.41k stars 5.85k forks source link

Default collation changes in case sensitivity between TiDB and MySQL #29350

Open dveeden opened 3 years ago

dveeden commented 3 years ago

Bug Report

1. Minimal reproduce step (Required)

TiDB:

TiDB> SELECT VARIABLE_VALUE FROM mysql.tidb WHERE VARIABLE_NAME='new_collation_enabled';
+----------------+
| VARIABLE_VALUE |
+----------------+
| True           |
+----------------+
1 row in set (0.0014 sec)

TiDB> SHOW COLLATION WHERE Charset='utf8mb4' AND `Default`='Yes';
+-------------+---------+----+---------+----------+---------+
| Collation   | Charset | Id | Default | Compiled | Sortlen |
+-------------+---------+----+---------+----------+---------+
| utf8mb4_bin | utf8mb4 | 46 | Yes     | Yes      |       1 |
+-------------+---------+----+---------+----------+---------+
1 row in set (0.0004 sec)

TiDB> SHOW VARIABLES LIKE 'collation\_%';SHOW VARIABLES LIKE 'character\_set\_%';
+----------------------+-------------+
| Variable_name        | Value       |
+----------------------+-------------+
| collation_connection | utf8mb4_bin |
| collation_database   | utf8mb4_bin |
| collation_server     | utf8mb4_bin |
+----------------------+-------------+
3 rows in set (0.0058 sec)
+--------------------------+---------+
| Variable_name            | Value   |
+--------------------------+---------+
| character_set_client     | utf8mb4 |
| character_set_connection | utf8mb4 |
| character_set_database   | utf8mb4 |
| character_set_filesystem | binary  |
| character_set_results    | utf8mb4 |
| character_set_server     | utf8mb4 |
| character_set_system     | utf8    |
+--------------------------+---------+
7 rows in set (0.0029 sec)

TiDB> SELECT 'A' = 'a';
+-----------+
| 'A' = 'a' |
+-----------+
|         0 |
+-----------+
1 row in set (0.0006 sec)

TiDB> SELECT tidb_version()\G
*************************** 1. row ***************************
tidb_version(): Release Version: v5.2.2
Edition: Community
Git Commit Hash: da1c21fd45a4ea5900ac16d2f4a248143f378d18
Git Branch: heads/refs/tags/v5.2.2
UTC Build Time: 2021-10-20 06:08:33
GoVersion: go1.16.4
Race Enabled: false
TiKV Min Version: v3.0.0-60965b006877ca7234adaced7890d7b029ed1306
Check Table Before Drop: false
1 row in set (0.0005 sec)

TiDB> 

MySQL 8.0

mysql> SHOW COLLATION WHERE Charset='utf8mb4' AND `Default`='Yes';
+--------------------+---------+-----+---------+----------+---------+---------------+
| Collation          | Charset | Id  | Default | Compiled | Sortlen | Pad_attribute |
+--------------------+---------+-----+---------+----------+---------+---------------+
| utf8mb4_0900_ai_ci | utf8mb4 | 255 | Yes     | Yes      |       0 | NO PAD        |
+--------------------+---------+-----+---------+----------+---------+---------------+
1 row in set (0.0015 sec)

mysql> SHOW VARIABLES LIKE 'collation\_%';SHOW VARIABLES LIKE 'character\_set\_%';
+----------------------+--------------------+
| Variable_name        | Value              |
+----------------------+--------------------+
| collation_connection | utf8mb4_0900_ai_ci |
| collation_database   | utf8mb4_0900_ai_ci |
| collation_server     | utf8mb4_0900_ai_ci |
+----------------------+--------------------+
3 rows in set (0.0023 sec)
+--------------------------+---------+
| Variable_name            | Value   |
+--------------------------+---------+
| character_set_client     | utf8mb4 |
| character_set_connection | utf8mb4 |
| character_set_database   | utf8mb4 |
| character_set_filesystem | binary  |
| character_set_results    | utf8mb4 |
| character_set_server     | utf8mb4 |
| character_set_system     | utf8mb3 |
+--------------------------+---------+
7 rows in set (0.0016 sec)

mysql> SELECT 'A' = 'a';
+-----------+
| 'A' = 'a' |
+-----------+
|         1 |
+-----------+
1 row in set (0.0004 sec)

mysql> SELECT VERSION();
+-----------+
| VERSION() |
+-----------+
| 8.0.27    |
+-----------+
1 row in set (0.0003 sec)

MySQL 5.7:

mysql> SHOW COLLATION WHERE Charset='utf8mb4' AND `Default`='Yes';
+--------------------+---------+----+---------+----------+---------+
| Collation          | Charset | Id | Default | Compiled | Sortlen |
+--------------------+---------+----+---------+----------+---------+
| utf8mb4_general_ci | utf8mb4 | 45 | Yes     | Yes      |       1 |
+--------------------+---------+----+---------+----------+---------+
1 row in set (0.0007 sec)

mysql> SHOW VARIABLES LIKE 'collation\_%';SHOW VARIABLES LIKE 'character\_set\_%';
+----------------------+--------------------+
| Variable_name        | Value              |
+----------------------+--------------------+
| collation_connection | utf8mb4_general_ci |
| collation_database   | latin1_swedish_ci  |
| collation_server     | latin1_swedish_ci  |
+----------------------+--------------------+
3 rows in set (0.0029 sec)
+--------------------------+---------+
| Variable_name            | Value   |
+--------------------------+---------+
| character_set_client     | utf8mb4 |
| character_set_connection | utf8mb4 |
| character_set_database   | latin1  |
| character_set_filesystem | binary  |
| character_set_results    | utf8mb4 |
| character_set_server     | latin1  |
| character_set_system     | utf8    |
+--------------------------+---------+
7 rows in set (0.0013 sec)

mysql> SELECT 'A' = 'a';
+-----------+
| 'A' = 'a' |
+-----------+
|         1 |
+-----------+
1 row in set (0.0004 sec)

mysql> SELECT VERSION();
+-----------+
| VERSION() |
+-----------+
| 5.7.31    |
+-----------+
1 row in set (0.0004 sec)

2. What did you expect to see? (Required)

Similar behavior between TiDB and MySQL.

3. What did you see instead (Required)

The default collation behaves case sensitive (binary) for TiDB.

Workaround

Using SET NAMES and/or charset/collation settings in the connection properties when connecting.

TiDB> SET NAMES utf8mb4 COLLATE utf8mb4_general_ci;
Query OK, 0 rows affected (0.0004 sec)

TiDB> SELECT 'A' = 'a';
+-----------+
| 'A' = 'a' |
+-----------+
|         1 |
+-----------+
1 row in set (0.0003 sec)
dveeden commented 3 years ago

/type compatibility

dveeden commented 3 years ago

Different client versions behave differently. This might be because some of them might issue a SET NAMES utf8mb4.

$ for m in ./opt/mysql/*/bin/mysql; do $m --version; $m -u root -h 127.0.0.1 -P 4000 -BNe "SELECT @@session.collation_connection"; done
./opt/mysql/5.1.73/bin/mysql  Ver 14.14 Distrib 5.1.73, for unknown-linux-gnu (x86_64) using readline 5.1
utf8mb4_bin
./opt/mysql/5.7.31/bin/mysql  Ver 14.14 Distrib 5.7.31, for linux-glibc2.12 (x86_64) using  EditLine wrapper
utf8_general_ci
./opt/mysql/8.0.22/bin/mysql  Ver 8.0.22 for Linux on x86_64 (MySQL Community Server - GPL)
utf8mb4_bin
./opt/mysql/8.0.26/bin/mysql  Ver 8.0.26 for Linux on x86_64 (MySQL Community Server - GPL)
utf8mb4_bin
./opt/mysql/8.0.27/bin/mysql  Ver 8.0.27 for Linux on x86_64 (MySQL Community Server - GPL)
utf8mb4_bin
wjhuang2016 commented 3 years ago

I think it's documented in https://docs.pingcap.com/tidb/stable/mysql-compatibility#default-differences ?

dveeden commented 3 years ago

It indeed is documented, but as this is causing problems for applications it would be good if we can change this to be more compatible.

wjhuang2016 commented 3 years ago

But it will break backward compatibility, I think you can talk to our PM. It's not a technical problem, it's a product problem.

bb7133 commented 3 years ago

IMHO in order to archive 'default case-insensitive collation', we need to set new collation enabled by default firstly. @wjhuang2016 Are you interested in this work?