golang / go

proposal: x/text/encoding: handling encoding errors by replacing visually similar unicode characters in ShiftJIS encoding #69934

Open yuki2006 opened 1 month ago

yuki2006 commented 1 month ago

Proposal Details

Summary

When encoding Unicode strings to Shift JIS in Go, some characters fail with an encoding error even though they look nearly identical to characters that Shift JIS can represent. This is confusing, because text that appears encodable is rejected. This proposal suggests a normalization step that replaces these characters with their Shift JIS-compatible equivalents before encoding. The transformation is one-way and the original characters cannot be restored, which is acceptable for our use case.

Background

Shift JIS is a character encoding for the Japanese language but does not support all Unicode characters. Some visually similar characters have different code points and cannot be encoded in Shift JIS, causing encoding errors and confusion.

Examples:

The Unicode character "〜" (U+301C, wave dash) looks similar to "～" (U+FF5E, fullwidth tilde).
The Unicode character "−" (U+2212, minus sign) resembles the standard hyphen "-" (U+002D, hyphen-minus).

These visually similar characters are often used interchangeably in text, but the first character of each pair causes an encoding error when converting to Shift JIS. In our application, it is acceptable that the transformation is not reversible; we prioritize successful encoding over the ability to revert to the original characters.

Proposal

Introduce a normalization function that replaces Unicode characters that cannot be encoded in Shift JIS with visually similar characters that can. This function can be integrated into the encoding process or provided as a utility in the golang.org/x/text/encoding/japanese package.

https://go.dev/play/p/OtEWoZmxDzb

package main

import (
    "fmt"

    "golang.org/x/text/encoding/japanese"
    "golang.org/x/text/transform"
)

func main() {
    replacements := map[string]string{
        "〜": "～", // U+301C (Wave Dash) → U+FF5E (Fullwidth Tilde)
        "−": "-", // U+2212 (Minus Sign) → U+002D (Hyphen-Minus)
        "—": "-", // U+2014 (Em Dash) → U+002D (Hyphen-Minus)
        "•": "*", // U+2022 (Bullet) → U+002A (Asterisk)
    }

    encoder := japanese.ShiftJIS.NewEncoder()

    for orig, replacement := range replacements {
        // Check if the original character can be encoded
        _, _, errOrig := transform.String(encoder, orig)
        // Check if the replacement character can be encoded
        _, _, errReplacement := transform.String(encoder, replacement)

        if errOrig == nil {
            fmt.Printf("Mapping may be unnecessary: Original character %q can be encoded.\n", orig)
        } else {
            fmt.Printf("Mapping necessary: Original character %q cannot be encoded: %v\n", orig, errOrig)
        }
        if errReplacement != nil {
            fmt.Printf("Warning: Replacement character %q cannot be encoded: %v\n", replacement, errReplacement)
        }
    }
}

Output

Mapping necessary: Original character "•" cannot be encoded: encoding: rune not supported by encoding.
Mapping necessary: Original character "〜" cannot be encoded: encoding: rune not supported by encoding.
Mapping necessary: Original character "−" cannot be encoded: encoding: rune not supported by encoding.
Mapping necessary: Original character "—" cannot be encoded: encoding: rune not supported by encoding.
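
The program above only reports which characters fail; it does not perform the replacement. A minimal sketch of the normalization step itself, using only the standard library, might look like the following. The names shiftJISFallbacks and normalizeForShiftJIS are hypothetical, and the table is just the four pairs from the example, not an official mapping.

```go
package main

import (
	"fmt"
	"strings"
)

// shiftJISFallbacks is a hypothetical replacement table built from the four
// pairs in the example program. It is an assumption for illustration, not an
// official or complete mapping.
var shiftJISFallbacks = strings.NewReplacer(
	"\u301C", "\uFF5E", // Wave Dash -> Fullwidth Tilde
	"\u2212", "-", // Minus Sign -> Hyphen-Minus
	"\u2014", "-", // Em Dash -> Hyphen-Minus
	"\u2022", "*", // Bullet -> Asterisk
)

// normalizeForShiftJIS applies the one-way replacements. The result would
// then be passed to the Shift JIS encoder; the original characters cannot be
// recovered afterwards.
func normalizeForShiftJIS(s string) string {
	return shiftJISFallbacks.Replace(s)
}

func main() {
	fmt.Printf("%q\n", normalizeForShiftJIS("10\u221220 \u2022 a\u301Cb"))
}
```

Characters that are not in the table pass through unchanged, so the encoder can still reject the normalized string.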

ianlancetaylor commented 1 month ago

CC @mpvl

robpike commented 1 month ago

If this is a wise approach, and it well may be, there should already be an official defining table for how to handle the translation. Go's implementation should not be the one to codify it.

yuki2006 commented 1 month ago

Thank you for your comment. Indeed, it might not be appropriate for the Go standard library to create a definition table.

In that case, would it be possible to identify which character (and at which position) failed to encode, and furthermore, allow us to specify a fallback when the conversion fails? (It might be convenient if we could specify a callback function, for example.)

https://go.dev/play/p/Jg6oE7cko4i

Postscript: It seems we can identify the location by using the return value n from transform.String.
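
To make the callback idea concrete, here is a rough sketch of the shape such a fallback hook could take. It is not an existing golang.org/x/text API: encodeWithFallback, the supported predicate, and the fallback callback are all hypothetical, and the real encoder is replaced by a predicate so that the example runs on the standard library alone.

```go
package main

import (
	"fmt"
	"strings"
)

// encodeWithFallback walks s rune by rune. Runes accepted by supported are
// copied through unchanged; every other rune is passed to fallback together
// with its byte offset in s, and the string fallback returns is written in
// its place. Both callbacks are hypothetical stand-ins for the encoder.
func encodeWithFallback(s string, supported func(rune) bool, fallback func(pos int, r rune) string) string {
	var b strings.Builder
	for pos, r := range s {
		if supported(r) {
			b.WriteRune(r)
			continue
		}
		b.WriteString(fallback(pos, r))
	}
	return b.String()
}

func main() {
	// Stand-in predicate: treat only ASCII as encodable. A real caller
	// would ask the target encoding instead.
	asciiOnly := func(r rune) bool { return r < 0x80 }

	// The callback reports where the failure occurred and picks a substitute.
	report := func(pos int, r rune) string {
		fmt.Printf("unsupported %U at byte offset %d\n", r, pos)
		return "?"
	}

	fmt.Println(encodeWithFallback("a\u2212b", asciiOnly, report))
}
```

For comparison, golang.org/x/text/encoding already offers the ReplaceUnsupported and HTMLEscapeUnsupported wrappers, but those substitute a fixed replacement rather than calling user code.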

mattn commented 1 month ago

Japanese versions of Windows still treat file names as Shift_JIS in some processes. This is not limited to Japan; the same applies in China and Korea, where double-byte character sets are used. Go, which uses UTF-8 as its internal encoding, has almost no problem when it obtains file names through the Windows wide-character API, but when Go drives a command line that refers to specific file names, it must handle Shift_JIS. In such cases, we want to use fallback characters to replace characters that have no Shift_JIS representation.