golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

x/text/encoding/traditionalchinese: Garbled text found in encoding output file with traditional chinese #43581

Open huyungtang opened 3 years ago

huyungtang commented 3 years ago

What version of Go are you using (go version)?

go version go1.15.6 darwin/amd64

Does this issue reproduce with the latest release?

1.15.6 is the latest stable release

What operating system and processor architecture are you using (go env)?

This has nothing to do with the environment

What did you do?

I used golang.org/x/text/encoding/traditionalchinese to encode text and wrote the Chinese output to a file.
Opening the output file in Visual Studio Code with the encoding "Traditional Chinese (Big5) cp950" shows garbled text. Reopening it with "Traditional Chinese (Big5-HKSCS) big5hkscs" shows the text correctly.

I found duplicate records in the index that "tables.go" is generated from:

===== http://encoding.spec.whatwg.org/index-big5.txt =====
 8007  0x5A77  婷 (<CJK Ideograph>)  <-- Big5
19240  0x5A77  婷 (<CJK Ideograph>)  <-- Big5HKSCS

 8616  0x745C  瑜 (<CJK Ideograph>)  <-- Big5
19672  0x745C  瑜 (<CJK Ideograph>)  <-- Big5HKSCS
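To make the duplicates concrete, here is a minimal sketch that turns the four pointers quoted above into byte pairs. It assumes the lead/trail arithmetic of the Big5 encoder in the WHATWG Encoding Standard; `big5Bytes` is a hypothetical helper written for this illustration, not a function from golang.org/x/text.

```go
package main

import "fmt"

// big5Bytes converts a pointer from index-big5.txt into a two-byte sequence.
// Assumption: the arithmetic follows the WHATWG Encoding Standard's Big5
// encoder (157 trail positions per lead byte, lead bytes starting at 0x81).
func big5Bytes(pointer int) (lead, trail byte) {
	lead = byte(pointer/157 + 0x81)
	t := pointer % 157
	offset := 0x62
	if t < 0x3F {
		offset = 0x40
	}
	return lead, byte(t + offset)
}

func main() {
	// Pointers copied from the index excerpt quoted in this issue.
	for _, p := range []int{8007, 19240, 8616, 19672} {
		lead, trail := big5Bytes(p)
		fmt.Printf("pointer %5d -> 0x%02X 0x%02X\n", p, lead, trail)
	}
}
```

Under that assumption, pointers 8007 and 8616 map to 0xB4 0x40 and 0xB7 0xEC, while 19240 and 19672 map to 0xFB 0xB8 and 0xFE 0x6F. The same code point therefore has two candidate byte sequences, and an encoder that emits the second one would produce bytes that a plain Big5 (cp950) reader may not interpret as the same character, which is consistent with the symptom reported here.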

Could you please separate the "traditionalchinese" encoding into two different encodings, "Big5" and "Big5-HKSCS"?

mengzhuo commented 3 years ago

CC @mpvl

a00012025 commented 3 days ago

Hi @mengzhuo and the Go team,

I’m currently experiencing the same issue regarding garbled text when encoding Traditional Chinese characters using golang.org/x/text/encoding/traditionalchinese. Specifically, characters like “包” are not being encoded correctly, resulting in unexpected characters such as “?” in the output.

Is there an ongoing effort to separate the encodings into Big5 and Big5-HKSCS as initially suggested? Additionally, are there any workarounds or recommended practices in the meantime to ensure accurate encoding of Traditional Chinese characters?

Thank you for your time and assistance.

huyungtang commented 3 days ago

Hi @a00012025

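Would you like to try my modified fork, https://github.com/huyungtang/text? Point golang.org/x/text at that repository's path and it can be used as is. The PR was submitted a long time ago, and this is how I have been working until it goes somewhere.

Earlier I modified encoding/traditionalchinese/maketables.go to split Big5 into Big5 and Big5HK. The original Big5 is renamed Big5HK, and a separate Big5 is generated for Taiwan Traditional Chinese. When generating the Taiwan table, I only skip the characters that are duplicated in the Hong Kong table; nothing else is changed.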
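The effect of skipping duplicates can be sketched as follows. This is an illustration only, not the actual change in maketables.go: `entry`, `buildEncodeTable`, and the `keepFirst` flag are hypothetical names, and the four rows are the ones quoted earlier in this issue.

```go
package main

import "fmt"

// entry is one row of an index file: a pointer and the code point it maps to.
type entry struct {
	pointer int
	r       rune
}

// buildEncodeTable returns a code-point-to-pointer map (hypothetical helper).
// Entries are expected in ascending pointer order. With keepFirst set, a code
// point that appears more than once keeps its first pointer and later
// duplicates are skipped; otherwise the last occurrence overwrites the first.
func buildEncodeTable(entries []entry, keepFirst bool) map[rune]int {
	table := make(map[rune]int)
	for _, e := range entries {
		if _, seen := table[e.r]; seen && keepFirst {
			continue // skip the duplicate row
		}
		table[e.r] = e.pointer
	}
	return table
}

func main() {
	entries := []entry{
		{8007, '婷'}, {8616, '瑜'}, // rows marked Big5 in the issue
		{19240, '婷'}, {19672, '瑜'}, // rows marked Big5HKSCS in the issue
	}
	first := buildEncodeTable(entries, true)
	last := buildEncodeTable(entries, false)
	for _, r := range []rune{'婷', '瑜'} {
		fmt.Printf("%c: keep-first pointer %d, keep-last pointer %d\n", r, first[r], last[r])
	}
}
```

With the duplicates skipped, both characters keep the pointers marked Big5 in the issue (8007 and 8616); without skipping, the rows marked Big5HKSCS win (19240 and 19672).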

a00012025 commented 3 days ago

@huyungtang Thank you very much 🙏 I'll give it a try!

mengzhuo commented 3 days ago

FYI, CL text/397534 requires some work before it can be merged.