golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

x/text: incorrectly decodes all GBK/GB18030 double-byte PUA characters to U+FFFD #62063

Open kennytm opened 1 year ago

kennytm commented 1 year ago

What version of Go are you using (go version)?

$ go version
go version go1.21.0 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

Playground.

What did you do?

https://go.dev/play/p/v_4hT9WSD7_y

package main

import (
    "fmt"

    "golang.org/x/text/encoding/simplifiedchinese"
)

func main() {
    // test decoding of GB18030 PUA characters to UTF-8
    s1, err := simplifiedchinese.GB18030.NewDecoder().Bytes([]byte("\xaa\xa1###\xaf\xfe###\xf8\xa1###\xfe\xfe###\xa1\x40###\xa7\xa0"))
    fmt.Printf("s1/ %x / %v\n", s1, err)
    // test decoding of GBK PUA characters to UTF-8
    s2, err := simplifiedchinese.GBK.NewDecoder().Bytes([]byte("\xaa\xa1###\xaf\xfe###\xf8\xa1###\xfe\xfe###\xa1\x40###\xa7\xa0"))
    fmt.Printf("s2/ %x / %v\n", s2, err)
    // test encoding of UTF-8 PUA characters to GB18030
    s3, err := simplifiedchinese.GB18030.NewEncoder().Bytes([]byte("\ue000###\ue233###\ue234###\ue4c5###\ue4c6###\ue765"))
    fmt.Printf("s3/ %x / %v\n", s3, err)
    // test encoding of UTF-8 PUA characters to GBK
    s4, err := simplifiedchinese.GBK.NewEncoder().Bytes([]byte("\ue000###\ue233###\ue234###\ue4c5###\ue4c6###\ue765"))
    fmt.Printf("s4/ %x / %v\n", s4, err)
}

What did you expect to see?

s1/ ee8080232323ee88b3232323ee88b4232323ee9385232323ee9386232323ee9da5 / <nil>
s2/ ee8080232323ee88b3232323ee88b4232323ee9385232323ee9386232323ee9da5 / <nil>
s3/ aaa1232323affe232323f8a1232323fefe232323a140232323a7a0 / <nil>
s4/ aaa1232323affe232323f8a1232323fefe232323a140232323a7a0 / <nil>

What did you see instead?

s1/ efbfbd232323efbfbd232323efbfbd232323efbfbd232323efbfbd232323efbfbd / <nil>
s2/ efbfbd232323efbfbd232323efbfbd232323efbfbd232323efbfbd232323efbfbd / <nil>
s3/ 833898372323238338d1302323238338d13123232383399438232323833994392323238339d830 / <nil>
s4/  / encoding: rune not supported by encoding.

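According to GB18030-2022[^1] §7.2 "双字节部分的码位分配" (code point allocation of the double-byte part) and §7.3 "四字节部分的码位分配" (code point allocation of the four-byte part), there are four Private Use Area (PUA) ranges, the first three of which are the same as in GBK: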

  1. [\xAA-\xAF][\xA1-\xFE] (564 code points)
  2. [\xF8-\xFE][\xA1-\xFE] (658 code points)
  3. [\xA1-\xA7][\x40-\x7E\x80-\xA0] (672 code points)
  4. [\xFD-\xFE][\x30-\x39][\x81-\xFE][\x30-\x39] (25200 code points)
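
The code point counts above can be checked directly from the byte intervals. The following is only a small sanity-check sketch; the helper name `puaRangeSizes` is made up for illustration and is not part of x/text.

```go
package main

import "fmt"

// puaRangeSizes (hypothetical helper) returns the number of code points in
// each of the four ranges listed above, computed from the byte intervals alone.
func puaRangeSizes() [4]int {
	trailHigh := 0xFE - 0xA1 + 1                      // trail bytes A1..FE: 94 values
	trailLow := (0x7E - 0x40 + 1) + (0xA0 - 0x80 + 1) // 40..7E and 80..A0: 63 + 33 = 96 values
	digits := 0x39 - 0x30 + 1                         // second and fourth bytes 30..39: 10 values
	return [4]int{
		(0xAF - 0xAA + 1) * trailHigh,                         // range 1
		(0xFE - 0xF8 + 1) * trailHigh,                         // range 2
		(0xA7 - 0xA1 + 1) * trailLow,                          // range 3
		(0xFE - 0xFD + 1) * digits * (0xFE - 0x81 + 1) * digits, // range 4
	}
}

func main() {
	fmt.Println(puaRangeSizes()) // prints [564 658 672 25200]
}
```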

The first three ranges[^2] are explicitly mapped to the Unicode PUA in Appendix A (pp. 83–90), which is normative.

  1. AAA1 maps to U+E000, continuing sequentially through AFFE, which maps to U+E233
  2. F8A1 maps to U+E234, continuing sequentially through FEFE, which maps to U+E4C5
  3. A140 maps to U+E4C6, continuing sequentially through A7A0, which maps to U+E765
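
To make the expected mapping concrete, here is a minimal sketch of the double-byte decoding rule, assuming the purely sequential allocation described in the list above. The function name `puaRune` is hypothetical and this is not how x/text is structured internally; it only illustrates the arithmetic.

```go
package main

import "fmt"

// puaRune (hypothetical helper) maps a two-byte GBK/GB18030 sequence in one of
// the three double-byte user-defined ranges to a Unicode PUA code point,
// assuming strictly sequential allocation within each range.
// It returns false for sequences outside those ranges.
func puaRune(b0, b1 byte) (rune, bool) {
	switch {
	case 0xAA <= b0 && b0 <= 0xAF && 0xA1 <= b1 && b1 <= 0xFE:
		// Range 1: 94 trail bytes per lead byte, starting at U+E000.
		return 0xE000 + rune(b0-0xAA)*94 + rune(b1-0xA1), true
	case 0xF8 <= b0 && b0 <= 0xFE && 0xA1 <= b1 && b1 <= 0xFE:
		// Range 2: 94 trail bytes per lead byte, starting at U+E234.
		return 0xE234 + rune(b0-0xF8)*94 + rune(b1-0xA1), true
	case 0xA1 <= b0 && b0 <= 0xA7 && 0x40 <= b1 && b1 <= 0xA0 && b1 != 0x7F:
		// Range 3: 96 trail bytes per lead byte (0x7F is not a valid
		// trail byte), starting at U+E4C6.
		i := rune(b1 - 0x40)
		if b1 > 0x7F {
			i-- // skip over the unused trail byte 0x7F
		}
		return 0xE4C6 + rune(b0-0xA1)*96 + i, true
	}
	return 0, false
}

func main() {
	// The six boundary sequences used in the reproduction program.
	pairs := [][2]byte{
		{0xAA, 0xA1}, {0xAF, 0xFE},
		{0xF8, 0xA1}, {0xFE, 0xFE},
		{0xA1, 0x40}, {0xA7, 0xA0},
	}
	for _, p := range pairs {
		r, ok := puaRune(p[0], p[1])
		fmt.Printf("%02X%02X -> U+%04X (in PUA range: %v)\n", p[0], p[1], r, ok)
	}
}
```

Run on those six inputs, the sketch reproduces the boundary values listed above (U+E000, U+E233, U+E234, U+E4C5, U+E4C6, U+E765).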

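Instead, as the output above shows, the current implementation of x/text decodes every one of these double-byte sequences to U+FFFD, encodes the corresponding PUA code points as four-byte sequences in GB18030, and fails to encode them at all in GBK.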

I'd also like to note that Python 3.11 produces the correct mapping with its GB18030 codec:

>>> b"\xaa\xa1###\xaf\xfe###\xf8\xa1###\xfe\xfe###\xa1\x40###\xa7\xa0".decode('gb18030')
'\ue000###\ue233###\ue234###\ue4c5###\ue4c6###\ue765'
>>> _.encode('gb18030')
b'\xaa\xa1###\xaf\xfe###\xf8\xa1###\xfe\xfe###\xa1@###\xa7\xa0'

[^1]: The simplified Chinese version of the standard is freely available on https://archive.org/details/GB18030-2022

[^2]: GB18030 did not specify a mapping for the quad-byte PUA range FD308130–FE39FE39. According to https://icu-project.org/docs/papers/unicode-gb18030-faq.html, “Normally, they need to be treated as unassigned codes.”

kennytm commented 1 year ago

cc #61165, #41990

dmitshur commented 1 year ago

CC @mpvl.