microsoft / go-mssqldb

Microsoft SQL server driver written in go language
BSD 3-Clause "New" or "Revised" License
279 stars 62 forks source link

decodeChar CharsetToUTF8 return error data for SQL_Latin1_General_CP1_CI_AS chinese code #213

Open travelliu opened 1 month ago

travelliu commented 1 month ago

Describe the bug insert into some chinese code return data error

To Reproduce

SELECT
SERVERPROPERTY('ProductVersion ') AS ProductVersion
,SERVERPROPERTY('ProductLevel') AS ProductLevel
,SERVERPROPERTY('ResourceVersion') AS ResourceVersion
,SERVERPROPERTY('ResourceLastUpdateDateTime') AS ResourceLastUpdateDateTime
,SERVERPROPERTY('Collation') AS Collation
,@@version as Version
,@@LANGUAGE AS Language
,collation_name from sys.databases where name = db_name()
ProductVersion|ProductLevel|ResourceVersion|ResourceLastUpdateDateTime|Collation                   |Version                                                                                                                                                                                              |Language  |collation_name              |
--------------+------------+---------------+--------------------------+----------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+----------------------------+
16.0.4075.1   |RTM         |16.00.4075     |2023-08-23 14:45:27.7     |SQL_Latin1_General_CP1_CI_AS|Microsoft SQL Server 2022 (RTM-CU8) (KB5029666) - 16.0.4075.1 (X64) ¶ Aug 23 2023 14:04:50 ¶ Copyright (C) 2022 Microsoft Corporation¶ Developer Edition (64-bit) on Linux (Ubuntu 20.04.6 LTS) <X64>|us_english|SQL_Latin1_General_CP1_CI_AS|
            s := "福建九州通譚嵄医疗器械有限公司泉州分公司"
            gbkEnc, err := ianaindex.MIB.Encoding("GBK")
            if err != nil {
                return
            }
            tmp, err := ioutil.ReadAll(
                transform.NewReader(bytes.NewReader([]byte(s)), gbkEnc.NewEncoder()),
            )
            fmt.Println("GBK  ",hex.EncodeToString(tmp))
            iso88591, err := ianaindex.MIB.Encoding("ISO-8859-1")
            if err != nil {
                return
            }
            tmp, err = ioutil.ReadAll(
                transform.NewReader(bytes.NewReader(tmp), iso88591.NewDecoder()),
            )
            fmt.Println("ISO-8859-1  ",hex.EncodeToString(tmp))

            _,err = m.DB.Exec("drop table table dbo.yp_supplyer")
            _,err = m.DB.Exec("create table dbo.yp_supplyer(supplyer_code bigint,supplyer_name char(100))")
            _,err = m.DB.Exec("insert into dbo.yp_supplyer(supplyer_code,supplyer_name ) values(7,@p1)",string(tmp))

            rows, err := m.DB.Query(
                "SELECT supplyer_code,supplyer_name from dbo.yp_supplyer",
            )
            if err != nil {
                t.Error(err)
                return
            }
            defer rows.Close()
            for rows.Next() {
                var (
                    ColTime string
                    supplyer_name []byte
                )
                if err = rows.Scan(
                    &ColTime, &supplyer_name,
                ); err != nil {
                    t.Error(err)
                    return
                }
                fmt.Println("R1 ISO-8859-1  ",hex.EncodeToString(supplyer_name))
                tmp, err = ioutil.ReadAll(
                    transform.NewReader(bytes.NewReader(supplyer_name), iso88591.NewEncoder()),
                )
                fmt.Println("R2 ISO-8859-1  ",hex.EncodeToString(tmp))
            }

GBK   b8a3bda8bec5d6ddcda8d7548db1d2bdc1c6c6f7d0b5d3d0cfdeb9abcbbec8aad6ddb7d6b9abcbbe
ISO-8859-1   c2b8c2a3c2bdc2a8c2bec385c396c39dc38dc2a8c39754c28dc2b1c392c2bdc381c386c386c3b7c390c2b5c393c390c38fc39ec2b9c2abc38bc2bec388c2aac396c39dc2b7c396c2b9c2abc38bc2be
a1 372020202020
a1 b8a3bda8bec5d6ddcda8d7548db1d2bdc1c6c6f7d0b5d3d0cfdeb9abcbbec8aad6ddb7d6b9abcbbe20202020202020202020 --> database return client gbk code
R1 ISO-8859-1   c2b8c2a3c2bdc2a8c2bec385c396c39dc38dc2a8c39754efbfbdc2b1c392c2bdc381c386c386c3b7c390c2b5c393c390c38fc39ec2b9c2abc38bc2bec388c2aac396c39dc2b7c396c2b9c2abc38bc2be20202020202020202020 -- CharsetToUTF8  return 
R2 ISO-8859-1   b8a3bda8bec5d6ddcda8d754

Expected behavior A clear and concise description of what you expected to happen.

Further technical details

SQL Server version: (e.g. SQL Server 2017) Operating system: (e.g. Windows 2019, Ubuntu 18.04, macOS 10.13, Docker container) Table schema

Additional context Add any other context about the problem here.

shueybubbles commented 1 month ago

thx for opening an issue! @travelliu does the insert statement put the right string into the database column, and it's just being scanned incorrectly into the []byte slice?

shueybubbles commented 1 month ago

The driver sends over all Go strings as UTF16, so the engine is going to convert them to whatever the collation of your column is. I think you need to specify a Chinese collation like Chinese_Simplified_Pinyin_100_CI_AI on the column itself, then you can send over regular Go strings with no need for explicit GBK or ISO encoding.

 connString := fmt.Sprintf("server=%s;user id=%s;password=%s;port=%d", server, user, password, port)

     conn, err := sql.Open("mssql", connString)

     _,err = conn.Exec("drop table if exists dbo.yp_supplier")

     _,err = conn.Exec("create table dbo.yp_supplier(supplier_code bigint, supplier_name char(100) collate Chinese_Simplified_Pinyin_100_CI_AI)")

     _,err = conn.Exec("insert into dbo.yp_supplier(supplier_code, supplier_name ) values(7, '福建九州通譚嵄医疗器械有限公司泉州分公司')")

     rows, err := conn.Query(

           "SELECT supplier_name from dbo.yp_supplier",

     )

     defer rows.Close()

     for rows.Next() {

           var supplier_name string

           if err = rows.Scan(

                &supplier_name,

           ); err != nil {

                log.Fatal(err)

                return

           }

     }

a12ec322-77f4-43e2-bee4-a48872bed4b9

travelliu commented 1 month ago

@shueybubbles It's a historical problem, and now it's about migrating the data. But find out the wrong data. The main problem is CharsetToUTF8. Is it possible to provide a parameter that does not automatically convert to UTF8, and use the transcoding process by yourself after getting the data, or provide a function to register the transcoding method?

like pyodbc https://github.com/mkleehammer/pyodbc/wiki/Unicode#microsoft-sql-server image

shueybubbles commented 1 month ago

i think there are separate issues here. For one, the collation of the column in the database should match the data. Otherwise, performance is going to be poor and queries from any driver risk getting back bad data. Two, if you just want to read the raw bytes from the column you can cast it to a binary type in the query, like

select  convert(varbinary(max),'abc') as binarychar

binarychar
0x616263

Then you can decode the bytes on the client however you want.

Lastly, go-mssqldb does have a VarChar type you can use to insert data. It just converts the parameter value to a []byte and passes it along to the server.