golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

x/text/encoding/simplifiedchinese: missing decoding data #61165

Open folivoramao opened 1 year ago

folivoramao commented 1 year ago

What version of Go are you using (go version)?

$ go version
go version go1.20.2 darwin/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

$ go env
GO111MODULE="on"
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/mjc/Library/Caches/go-build"
GOENV="/Users/mjc/Library/Application Support/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/mjc/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/mjc/go"
GOPRIVATE=""
GOPROXY="https://goproxy.cn,direct"
GOROOT="/usr/local/Cellar/go/1.20.2/libexec"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/Cellar/go/1.20.2/libexec/pkg/tool/darwin_amd64"
GOVCS=""
GOVERSION="go1.20.2"
GCCGO="gccgo"
GOAMD64="v1"
AR="ar"
CC="cc"
CXX="c++"
CGO_ENABLED="1"
GOMOD="/dev/null"
GOWORK=""
CGO_CFLAGS="-O2 -g"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-O2 -g"
CGO_FFLAGS="-O2 -g"
CGO_LDFLAGS="-O2 -g"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -arch x86_64 -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/hd/v9qhg5rj04z7bp6wb5kpdc_m0000gn/T/go-build3435336719=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

I ran into a problem with character set conversion: decoding a GB18030-encoded character to UTF-8 with the simplifiedchinese package fails, but the same conversion succeeds with the mahonia package. Code link: https://go.dev/play/p/NhBp0JQ2RUp

package main

import (
    "encoding/hex"
    "fmt"

    "github.com/axgle/mahonia"
    "golang.org/x/text/encoding/simplifiedchinese"
)

func main() {
    s := `FDD2`
    hd, _ := hex.DecodeString(s)
    r, _ := simplifiedchinese.GB18030.NewDecoder().Bytes(hd)
    he := hex.EncodeToString([]byte(r))
    fmt.Println(he) // efbfbd

    r2 := mahonia.NewDecoder("GB18030").ConvertString(string(hd))
    he2 := hex.EncodeToString([]byte(r2))
    fmt.Println(he2) // ee90bb
}

What did you expect to see?

ee90bb

What did you see instead?

efbfbd

robpike commented 1 year ago

Not sure what's wrong, as I am not familiar with the encoding, but I can point out a couple of details. First, you're getting the replacement character U+FFFD, which means there is something wrong with that character according to x/text. That is interesting. You can see this by printing things differently, and you can also simplify your example significantly since fmt.Printf can do all the hex/string work for you:

https://go.dev/play/p/kDgB3ybMa8c

Finally, you should always check your errors, especially when debugging, although that didn't help here.
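
To illustrate the point about printing things differently, here is a minimal standard-library sketch. It is not the contents of the playground link above. The helper name describe is hypothetical. It takes the two UTF-8 byte sequences from the report and prints the code point each one encodes, using fmt verbs instead of encoding/hex.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe is a hypothetical helper: it decodes the first UTF-8 sequence
// in b and formats the raw bytes, the code point, and the encoded length.
func describe(b []byte) string {
	r, size := utf8.DecodeRune(b)
	// "% x" prints the bytes as space-separated hex; "%U" prints U+XXXX.
	return fmt.Sprintf("% x -> %U (%d bytes)", b, r, size)
}

func main() {
	got := []byte{0xef, 0xbf, 0xbd}  // output seen with x/text in the report
	want := []byte{0xee, 0x90, 0xbb} // output seen with mahonia in the report

	fmt.Println(describe(got))  // ef bf bd -> U+FFFD (3 bytes)
	fmt.Println(describe(want)) // ee 90 bb -> U+E43B (3 bytes)
}
```

A size of 3 for U+FFFD shows that the bytes are a valid encoding of the replacement character itself, so the substitution happened in the decoder and not when printing.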

seankhliao commented 1 year ago

It would appear that the decode table is simply missing data: the given test case maps to index 23705 in the table. https://go.googlesource.com/text/+/refs/heads/master/encoding/simplifiedchinese/tables.go#22009

WHATWG seems to have changed the URLs for their table data, so I'm not sure what a new table would be generated from (presumably one of these: https://encoding.spec.whatwg.org/#indexes )
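
The number 23705 can be reproduced by hand. The sketch below assumes the two-byte pointer formula from the WHATWG Encoding Standard's gb18030 decoder, (lead - 0x81) * 190 + (trail - offset), where offset is 0x40 for trail bytes below 0x7F and 0x41 otherwise. The function name gb18030Pointer is hypothetical and this is not the x/text implementation, only an illustration of where the index comes from.

```go
package main

import "fmt"

// gb18030Pointer is a hypothetical helper that computes the index pointer
// for a two-byte GB18030 sequence, following the formula assumed above.
// ok is false when either byte is outside the valid two-byte ranges.
func gb18030Pointer(lead, trail byte) (pointer int, ok bool) {
	if lead < 0x81 || lead > 0xFE {
		return 0, false
	}
	inLow := trail >= 0x40 && trail <= 0x7E
	inHigh := trail >= 0x80 && trail <= 0xFE
	if !inLow && !inHigh {
		return 0, false
	}
	// Trail bytes skip 0x7F, so the upper range is shifted by one.
	offset := 0x41
	if trail < 0x7F {
		offset = 0x40
	}
	return (int(lead)-0x81)*190 + (int(trail) - offset), true
}

func main() {
	// The bytes from the report: FD D2.
	p, ok := gb18030Pointer(0xFD, 0xD2)
	fmt.Println(p, ok) // 23705 true
}
```

For FD D2 this works out to 124 * 190 + 145 = 23705, matching the index quoted above.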

bcmills commented 1 year ago

(CC @mpvl per https://dev.golang.org/owners)