Open yuki2006 opened 1 month ago
CC @mpvl
If this is a wise approach, and it well may be, there should already be an official defining table for how to handle the translation. Go's implementation should not be the one to codify it.
Thank you for your comment. Indeed, it might not be appropriate for the Go standard library to create a definition table.
In that case, would it be possible to identify which character (and at which position) failed to encode, and furthermore, allow us to specify a fallback when the conversion fails? (It might be convenient if we could specify a callback function, for example.)
https://go.dev/play/p/Jg6oE7cko4i
Postscript: It seems we can identify the location by using the return value n from transform.String.
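To make the callback idea above concrete, here is a minimal sketch using only the standard library. It is not an existing API in golang.org/x/text: `ReplaceUnencodable`, `FallbackFunc`, and the `canEncode` predicate are hypothetical names, and `canEncode` merely stands in for whatever check the real Shift JIS encoder would perform.

```go
package main

import (
	"fmt"
	"strings"
)

// FallbackFunc is a hypothetical callback type. It receives the byte offset
// and the rune that could not be encoded, and returns the replacement text.
type FallbackFunc func(pos int, r rune) string

// ReplaceUnencodable walks s rune by rune. Runes accepted by canEncode are
// copied unchanged; every other rune is replaced by the callback's result.
// canEncode is an assumed stand-in for the real encoder's capability check.
func ReplaceUnencodable(s string, canEncode func(rune) bool, fallback FallbackFunc) string {
	var b strings.Builder
	for pos, r := range s {
		if canEncode(r) {
			b.WriteRune(r)
			continue
		}
		b.WriteString(fallback(pos, r))
	}
	return b.String()
}

func main() {
	// Illustration only: treat U+301C and U+2212 as the unencodable runes.
	canEncode := func(r rune) bool { return r != 0x301C && r != 0x2212 }

	out := ReplaceUnencodable("a〜b", canEncode, func(pos int, r rune) string {
		fmt.Printf("cannot encode %U at byte offset %d\n", r, pos)
		return "?"
	})
	fmt.Println(out)
}
```

Passing the position to the callback lets the caller log or reject specific characters instead of silently substituting them.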
Japanese versions of Windows still treat file names as Shift_JIS in some processes. This is not limited to Japan; the same applies in China and Korea, where double-byte character sets are also used. Go, which uses UTF-8 as its internal encoding, has almost no problems when it uses the Windows wide-character API to handle file names, but when Go passes specific file names on the command line, it must handle Shift_JIS. In such cases, we want to use fallback characters to replace characters that exist in Unicode but not in Shift_JIS.
Proposal Details
Summary
When encoding Unicode strings to Shift JIS in Go, certain visually similar characters cannot be directly represented in Shift JIS, leading to encoding errors. This causes confusion because the characters appear similar but result in errors during encoding. This proposal suggests introducing a normalization step that replaces these problematic characters with their Shift JIS-compatible equivalents before encoding. We accept that this transformation is one-way and that the original characters cannot be restored, which is acceptable for our use case.
Background
Shift JIS is a character encoding for the Japanese language but does not support all Unicode characters. Some visually similar characters have different code points and cannot be encoded in Shift JIS, causing encoding errors and confusion.
Examples:
The Unicode character "〜" (U+301C) looks similar to "～" (U+FF5E).
The Unicode character "−" (U+2212) resembles the standard hyphen "-" (U+002D).
These visually similar characters are often used interchangeably in text but may cause encoding errors when converting to Shift JIS. In our application, it is acceptable that the transformation is not reversible; we prioritize successful encoding over the ability to revert to the original characters.
Proposal
Introduce a normalization function that replaces visually similar Unicode characters, which cannot be encoded in Shift JIS, with their equivalent characters that can be encoded. This function can be integrated into the encoding process or provided as a utility in the golang.org/x/text/encoding/japanese package.
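As a rough illustration of such a normalization step, the sketch below uses only the standard library. The name `NormalizeForShiftJIS` and the two-entry table are assumptions made for this example: the table lists only the pairs named in the examples above and is not an official mapping. The result would then be passed to the Shift JIS encoder, for example `japanese.ShiftJIS.NewEncoder()`.

```go
package main

import (
	"fmt"
	"strings"
)

// shiftJISFallback replaces characters that the proposal names as
// problematic with their visually similar counterparts. The table is
// illustrative and deliberately incomplete; the mapping is one-way.
var shiftJISFallback = strings.NewReplacer(
	"\u301C", "\uFF5E", // WAVE DASH -> FULLWIDTH TILDE
	"\u2212", "-", // MINUS SIGN -> HYPHEN-MINUS
)

// NormalizeForShiftJIS is a hypothetical helper that applies the fallback
// table. Call it before handing the string to the Shift JIS encoder.
func NormalizeForShiftJIS(s string) string {
	return shiftJISFallback.Replace(s)
}

func main() {
	in := "10\u221220 \u301C 30"
	fmt.Printf("%+q\n", in)
	fmt.Printf("%+q\n", NormalizeForShiftJIS(in))
}
```

Keeping the table as data rather than logic means callers could swap in their own mapping, which fits the concern that the standard library should not be the one to define the table.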
https://go.dev/play/p/OtEWoZmxDzb
Output