s1/ efbfbd232323efbfbd232323efbfbd232323efbfbd232323efbfbd232323efbfbd / <nil>
s2/ efbfbd232323efbfbd232323efbfbd232323efbfbd232323efbfbd232323efbfbd / <nil>
s3/ 833898372323238338d1302323238338d13123232383399438232323833994392323238339d830 / <nil>
s4/ / encoding: rune not supported by encoding.
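According to GB18030-2022[^1] §7.2 "双字节部分的码位分配" (code point allocation of the double-byte part) and §7.3 "四字节部分的码位分配" (code point allocation of the four-byte part), there are 4 Private User Area ranges (the first three are the same as in the GBK encoding).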
There are explicit mappings of the first 3 ranges[^2] to the Unicode PUA range, specified in Appendix A (pp. 83–90), which is normative:
- `AAA1` maps to U+E000, continuing sequentially up to `AFFE`, which maps to U+E233
- `F8A1` maps to U+E234, continuing sequentially up to `FEFE`, which maps to U+E4C5
- `A140` maps to U+E4C6, continuing sequentially up to `A7A0`, which maps to U+E765
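The sequential allocation described above could be expressed as follows. This is a minimal sketch of the expected decoding, not the library's implementation: the function name `decodePUA` is hypothetical, and the per-row layout (94 trail bytes per lead byte in the first two ranges, 96 in the third with 0x7F skipped) is an assumption inferred from the range endpoints quoted in this report.

```go
package main

import "fmt"

// decodePUA maps a two-byte GB18030 sequence in one of the three double-byte
// user-defined areas to its Unicode PUA code point. ok is false when the
// bytes fall outside those areas. (Hypothetical helper for illustration.)
func decodePUA(lead, trail byte) (r rune, ok bool) {
	switch {
	case lead >= 0xAA && lead <= 0xAF && trail >= 0xA1 && trail <= 0xFE:
		// AAA1..AFFE, 94 trail bytes per lead byte, starting at U+E000.
		return 0xE000 + rune(lead-0xAA)*94 + rune(trail-0xA1), true
	case lead >= 0xF8 && lead <= 0xFE && trail >= 0xA1 && trail <= 0xFE:
		// F8A1..FEFE, 94 trail bytes per lead byte, starting at U+E234.
		return 0xE234 + rune(lead-0xF8)*94 + rune(trail-0xA1), true
	case lead >= 0xA1 && lead <= 0xA7 && trail >= 0x40 && trail <= 0xA0 && trail != 0x7F:
		// A140..A7A0, 96 trail bytes per lead byte, starting at U+E4C6.
		idx := rune(trail - 0x40)
		if trail > 0x7F {
			idx-- // 0x7F is not a valid trail byte, so close the gap
		}
		return 0xE4C6 + rune(lead-0xA1)*96 + idx, true
	}
	return 0, false
}

func main() {
	// The six range endpoints named in the report.
	pairs := [][2]byte{
		{0xAA, 0xA1}, {0xAF, 0xFE},
		{0xF8, 0xA1}, {0xFE, 0xFE},
		{0xA1, 0x40}, {0xA7, 0xA0},
	}
	for _, p := range pairs {
		r, ok := decodePUA(p[0], p[1])
		fmt.Printf("%02X%02X -> U+%04X (ok=%v)\n", p[0], p[1], r, ok)
	}
}
```

Under these assumptions the endpoints come out as U+E000, U+E233, U+E234, U+E4C5, U+E4C6, and U+E765, matching the mappings listed in the report.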
Instead, the current implementation of `x/text`:
- wrongly decodes all 3 of these double-byte PUA ranges to U+FFFD (the "s1" and "s2" tests above)
- wrongly encodes U+E000 through U+E765 to the quad-byte range for U+F014 to U+F779 (83389837–8339d830), which does not round-trip (the "s3" and "s4" tests above).
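For the encoding direction, the expected behavior is that every code point in U+E000 through U+E765 gets its own double-byte sequence rather than a quad-byte one. The sketch below illustrates that expectation under the same sequential-allocation reading of the standard; `encodePUA` is a hypothetical helper, not an existing API, and the row sizes (94 and 96 trail bytes) are assumptions derived from the ranges quoted in this report.

```go
package main

import "fmt"

// encodePUA returns the double-byte GB18030 sequence for a Unicode code
// point in U+E000..U+E765, following the sequential allocation quoted in
// this report. ok is false for code points outside that span.
// (Hypothetical helper for illustration.)
func encodePUA(r rune) (b [2]byte, ok bool) {
	switch {
	case r >= 0xE000 && r <= 0xE233: // AAA1..AFFE, 94 trail bytes per row
		off := int(r - 0xE000)
		return [2]byte{byte(0xAA + off/94), byte(0xA1 + off%94)}, true
	case r >= 0xE234 && r <= 0xE4C5: // F8A1..FEFE, 94 trail bytes per row
		off := int(r - 0xE234)
		return [2]byte{byte(0xF8 + off/94), byte(0xA1 + off%94)}, true
	case r >= 0xE4C6 && r <= 0xE765: // A140..A7A0, 96 trail bytes per row
		off := int(r - 0xE4C6)
		trail := 0x40 + off%96
		if trail >= 0x7F {
			trail++ // 0x7F is not a valid trail byte, so step over it
		}
		return [2]byte{byte(0xA1 + off/96), byte(trail)}, true
	}
	return b, false
}

func main() {
	// Every code point in the span should get a distinct two-byte sequence,
	// which is a precondition for round-tripping.
	seen := make(map[[2]byte]bool)
	for r := rune(0xE000); r <= 0xE765; r++ {
		b, _ := encodePUA(r)
		seen[b] = true
	}
	fmt.Println("distinct sequences:", len(seen))

	for _, r := range []rune{0xE000, 0xE233, 0xE234, 0xE4C5, 0xE4C6, 0xE765} {
		b, _ := encodePUA(r)
		fmt.Printf("U+%04X -> %X\n", r, b[:])
	}
}
```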
I'd also like to note that Python 3.11 produces the correct mapping with its GB18030 codec.
What version of Go are you using (`go version`)?
Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (`go env`)?
Playground.
What did you do?
https://go.dev/play/p/v_4hT9WSD7_y
What did you expect to see?
What did you see instead?
The four Private User Area ranges defined in GB18030-2022 §7.2 and §7.3 are:
- `[\xAA-\xAF][\xA1-\xFE]` (564 code points)
- `[\xF8-\xFE][\xA1-\xFE]` (658 code points)
- `[\xA1-\xA7][\x40-\x7E\x80-\xA0]` (672 code points)
- `[\xFD-\xFE][\x30-\x39][\x81-\xFE][\x30-\x39]` (25200 code points)
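The code point counts for the four ranges follow directly from multiplying the number of allowed values at each byte position. The following Go sketch reproduces that arithmetic; `byteSpan`, `count`, and `rangeSizes` are illustrative names and not part of any library.

```go
package main

import "fmt"

// byteSpan is an inclusive range of byte values.
type byteSpan struct{ lo, hi byte }

// count returns how many byte values are covered by the given spans.
// The spans are assumed not to overlap.
func count(spans ...byteSpan) int {
	n := 0
	for _, s := range spans {
		n += int(s.hi) - int(s.lo) + 1
	}
	return n
}

// rangeSizes returns the number of code points in each of the four
// Private User Area ranges, in the order they are listed in the report.
func rangeSizes() [4]int {
	return [4]int{
		count(byteSpan{0xAA, 0xAF}) * count(byteSpan{0xA1, 0xFE}),
		count(byteSpan{0xF8, 0xFE}) * count(byteSpan{0xA1, 0xFE}),
		count(byteSpan{0xA1, 0xA7}) * count(byteSpan{0x40, 0x7E}, byteSpan{0x80, 0xA0}),
		count(byteSpan{0xFD, 0xFE}) * count(byteSpan{0x30, 0x39}) *
			count(byteSpan{0x81, 0xFE}) * count(byteSpan{0x30, 0x39}),
	}
}

func main() {
	sizes := rangeSizes()
	fmt.Println(sizes) // [564 658 672 25200]

	// The three double-byte ranges together cover U+E000..U+E765.
	fmt.Println(sizes[0]+sizes[1]+sizes[2], 0xE765-0xE000+1) // 1894 1894
}
```

The three double-byte ranges total 1894 code points, which is exactly the size of the span U+E000 to U+E765.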
[^1]: The simplified Chinese version of the standard is freely available at https://archive.org/details/GB18030-2022
[^2]: GB18030 does not specify a mapping for the quad-byte PUA range FD308130–FE39FE39. According to https://icu-project.org/docs/papers/unicode-gb18030-faq.html, “Normally, they need to be treated as unassigned codes.”