pingcap / parser

A MySQL Compatible SQL Parser
Apache License 2.0

support parsing SQL with encodings other than utf8 #1312

Closed tangenta closed 3 years ago

tangenta commented 3 years ago

What problem does this PR solve?

Related to https://github.com/pingcap/tidb/issues/26812

What is changed and how it works?

Add field encoding.Decoder to the parser config.

Check List

Tests

Code changes

Side effects

NA

Related changes

NA

xiongjiwei commented 3 years ago

https://github.com/pingcap/parser/blob/562fed23b4fb6fffe012f8c3d46d8adf5c3ac744/lexer.go#L900-L903

peek is widely used in the lexer, for example in incAsLongAs. If the SQL is GBK-encoded, could decoding it as UTF-8 produce an error?

tangenta commented 3 years ago

@xiongjiwei If a UTF-8 decoding error occurs, the bytes are interpreted as a four-byte integer. We can still decode them as GBK later because no information is lost.

OK.

kennytm commented 3 years ago

Please include unit tests for:

1. This should pass:

-- in GBK encoding
select '芢' from `玚`;

Equivalently, as a Go string:

// charset=gbk
"select '\xc6\x5c' from `\xab\x60`;"

$ echo $'select "\xc6\x5c" from `\xab\x60`;' | mysql -u root test --default-character-set=gbk
芢
芢

2. This should fail:

// charset=utf8mb4
"select _gbk'\xc6\x5c' from dual;"

$ echo $'select _gbk"\xc6\x5c" from dual;' | mysql -u root test --default-character-set=utf8mb4
ERROR 1064 (42000) at line 1: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '"?\" from dual' at line 1
tangenta commented 3 years ago

// utf8SQL is the benchmark statement as UTF-8 source text.
var utf8SQL = "create table 测试表 (测试列 varchar(255) default 'GBK测试用例测试用例测试用例测试用例测试用例测试用例测试用例测试用例');"

// gbkSQL holds the same statement re-encoded as GBK; init fills it in.
var gbkSQL string

func init() {
    // The second return values are discarded for brevity.
    encoding, _ := charset.Lookup("gbk")
    gbkSQL, _ = encoding.NewEncoder().String(utf8SQL)
}

// BenchmarkParserParseGBK parses the GBK-encoded statement with charset "gbk".
func BenchmarkParserParseGBK(b *testing.B) {
    p := New()
    for i := 0; i < b.N; i++ {
        _, _, _ = p.Parse(gbkSQL, "gbk", "")
    }
}

// BenchmarkParserParseUTF8 parses the UTF-8 statement with the default charset.
func BenchmarkParserParseUTF8(b *testing.B) {
    p := New()
    for i := 0; i < b.N; i++ {
        _, _, _ = p.Parse(utf8SQL, "", "")
    }
}

goos: linux
goarch: amd64
pkg: github.com/pingcap/parser
cpu: Intel(R) Core(TM) i7-10710U CPU @ 1.10GHz
BenchmarkParserParseGBK-12         50840             23212 ns/op
BenchmarkParserParseUTF8-12        81193             14864 ns/op
PASS
ok      github.com/pingcap/parser       2.799s

tangenta commented 3 years ago

/hold

tangenta commented 3 years ago

/unhold

zimulala commented 3 years ago

@xiongjiwei Do you need another look?

zimulala commented 3 years ago

/merge

ti-chi-bot commented 3 years ago

This pull request has been accepted and is ready to merge.

Commit hash: e62156e1891d07af55645cdf93bf5ee3dcb34334

kennytm commented 3 years ago

Why is the CircleCI requirement still there?

zimulala commented 3 years ago

/merge

ti-chi-bot commented 3 years ago

This pull request has been accepted and is ready to merge.

Commit hash: fe4f5320db14dd6601296561f9dd5e7b20eb8c37