pingcap / tidb

TiDB is an open-source, cloud-native, distributed, MySQL-Compatible database for elastic scale and real-time analytics. Try AI-powered Chat2Query free at : https://www.pingcap.com/tidb-serverless/
https://pingcap.com
Apache License 2.0
36.86k stars 5.8k forks source link

The result of Upper() and Lower() are different with MySQL. #32513

Open Defined2014 opened 2 years ago

Defined2014 commented 2 years ago

Bug Report

1. Minimal reproduce step (Required)

create table t (a char(10) collate utf8mb4_bin);
insert into t values ('ꮑ');
select upper(a) from t;

2. What did you expect to see? (Required)

mysql> select upper(a) from t;
+----------+
| upper(a) |
+----------+
| ꮑ        |
+----------+
1 row in set (0.00 sec)

3. What did you see instead (Required)

mysql> select upper(a) from t;
+----------+
| upper(a) |
+----------+
| Ꮑ        |
+----------+
1 row in set (0.00 sec)

4. What is your TiDB version? (Required)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tidb_version()                                                                                                                                                                                                                                                                                                            |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Release Version: v5.5.0-alpha-381-g2ede6a35b
Edition: Community
Git Commit Hash: 2ede6a35b83832b149ba5ce026680a18e9518ee7
Git Branch: master
UTC Build Time: 2022-02-21 10:28:59
GoVersion: go1.17.2
Race Enabled: false
TiKV Min Version: v3.0.0-60965b006877ca7234adaced7890d7b029ed1306
Check Table Before Drop: false |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
Defined2014 commented 2 years ago

/severity minor

Defined2014 commented 2 years ago

The results of upper() and lower() in MySQL are related to collation.

mysql> create table t (a char(10) collate utf8mb4_bin, b char(10) collate utf8mb4_0900_ai_ci);
Query OK, 0 rows affected (0.01 sec)

mysql> insert into t values ('ꮑ','ꮑ');
Query OK, 1 row affected (0.00 sec)

mysql> select upper(a), upper(b) from t;
+----------+----------+
| upper(a) | upper(b) |
+----------+----------+
| ꮑ        | Ꮑ        |
+----------+----------+
1 row in set (0.00 sec)

I wrote a tiny program to find all inconsistent characters.

package main

import (
    "context"
    "database/sql"
    "fmt"
    _ "github.com/go-sql-driver/mysql"
    "unicode"
)

func main() {
    ctx := context.Background()
    tiDb, err := sql.Open("mysql", "root:@tcp(127.0.0.1:4000)/test")
    if err != nil {
        panic(err.Error())
    }
    tidbC, err := tiDb.Conn(ctx)
    if err != nil {
        panic(err.Error())
    }

    mySQL, err := sql.Open("mysql", "root:@tcp(127.0.0.1:3306)/test")
    if err != nil {
        panic(err.Error())
    }
    mysqlC, err := mySQL.Conn(ctx)
    if err != nil {
        panic(err.Error())
    }

    // _, _ = mysqlC.ExecContext(ctx, "set names utf8mb4")
    // _, _ = tidbC.ExecContext(ctx, "set names utf8mb4")

    sum := 0
    for i:= unicode.MaxASCII + 1; i <= 125251; i++ {
        sql := fmt.Sprintf("select hex(upper('%s'));", string(rune(i)))
        tiRs, err := tidbC.QueryContext(ctx, sql)
        if err != nil {
            panic(err.Error())
        }
        var tiS, mySqlS string
        for tiRs.Next() {
            if err := tiRs.Scan(&tiS); err != nil {
                panic(err.Error())
            }
        }
        tiRs.Close()

        mysqlRs, err := mysqlC.QueryContext(ctx, sql)
        if err != nil {
            panic(err.Error())
        }
        for mysqlRs.Next() {
            if err := mysqlRs.Scan(&mySqlS); err != nil {
                panic(err.Error())
            }
        }
        mysqlRs.Close()

        if mySqlS != tiS {
            a := []byte(string(i))
            v := 0
            for j:=0; j < len(a); j ++ {
                v *= 256
                v += int(a[j])
            }
            vS := fmt.Sprintf("%X", v)
            if vS != mySqlS {
                fmt.Printf("rune: %c, unicode: %d, origin charset: 0x%X, mysql upper: 0x%s, tidb upper: 0x%s\n", i, i, v, mySqlS, tiS)
            } else {
                fmt.Printf("rune: %c, unicode: %d, mysql upper: 0x%s, tidb upper: 0x%s\n", i, i, mySqlS, tiS)
            }
            sum ++
        }
    }

    fmt.Printf("total: %d\n", sum)
}
YangKeao commented 1 year ago

The different collations can use different version of unicodes (and UCA), which include different ranges of characters. TiDB uses the golang library, whose unicode version is 15.0.0 (for go 1.21.0). It's significantly greater than any existing MySQL collation (utf8mb4_general_ci uses 4.0.0, and utf8mb4_0900_ai_ci uses 9.0.0).

A more serious problem is that the result can be incompatible between TiDB, TiKV and TiFlash.

Component Unicode Version
TiDB 15.0 (Golang 1.21.0)
TiKV 14.0 (nightly-2022-11-15)
TiFlash 7.0 (pingcap/poco@170872), which uses PCRE 8.4.0

Rust declares itself to have upgraded to unicode 15.0, but I found the case map isn't upgraded. Maybe there is no difference between 14.0 and 15.0 on cases (but I'm not 100% sure).

The TiFlash definitely can have different behavior. I'll try to construct some examples to show the difference.

YangKeao commented 1 year ago

We could solve it by using the case map from MySQL (or generate from UCD by ourselves), but need some tiny refractor to use different case map for different collations.

windtalker commented 10 months ago

The different collations can use different version of unicodes (and UCA), which include different ranges of characters. TiDB uses the golang library, whose unicode version is 15.0.0 (for go 1.21.0). It's significantly greater than any existing MySQL collation (utf8mb4_general_ci uses 4.0.0, and utf8mb4_0900_ai_ci uses 9.0.0).

A more serious problem is that the result can be incompatible between TiDB, TiKV and TiFlash.

Component Unicode Version TiDB 15.0 (Golang 1.21.0) TiKV 14.0 (nightly-2022-11-15) TiFlash 7.0 (pingcap/poco@170872), which uses PCRE 8.4.0 Rust declares itself to have upgraded to unicode 15.0, but I found the case map isn't upgraded. Maybe there is no difference between 14.0 and 15.0 on cases (but I'm not 100% sure).

The TiFlash definitely can have different behavior. I'll try to construct some examples to show the difference.

TiFlash does not use poco, instead, TiFlash translate the go lib function into c++: https://github.com/pingcap/tiflash/blob/e597b78a6f5595ba50ffe839daca798ced1d1bc4/dbms/src/Functions/CharUtil.h#L370