Open litherum opened 2 years ago
@jfkthame @drott @r12a
I just verified that this:
```c
uint8_t input[] = {0x5C};
uint32_t output[1];
UErrorCode errorCode = U_ZERO_ERROR;
int32_t length = ucnv_convert("UTF32_PlatformEndian", "Shift_JIS",
                              (char*)output, sizeof(output),
                              (const char*)input, sizeof(input), &errorCode);
assert(U_SUCCESS(errorCode));
assert(length == sizeof(output));
printf("U+%04" PRIX32 "\n", output[0]);
```
prints U+005C
I just added an i18n-jlreq label, so folks in the Japanese Language Enablement task force will be notified about this thread, and they may be able to provide on-the-ground advice.
Do I understand correctly, that WebKit performs the character replacement as well if the text encoding is a Unicode encoding (say UTF-8, UTF-16) and only the font name is one of the MS Japanese ones?
@litherum do you have any data on how often you encounter the legacy encodings compared to Unicode ones?
What I find unfortunate is that fonts also show a '¥' for '\' U+005C when the text encoding is Unicode - which is not helping to phase out this font-based workaround longer term.
As a proposal, would it make sense to add two stylistic sets to the listed fonts for two modes, i.e. 1) one where they show U+005C as a yen sign and 2) one where they show U+005C as a backslash? Then perhaps the "don't convert" stylistic set (2) could be activated if the text encoding is Unicode-based?
CC @peterconstable
Shouldn't this issue be filed against https://github.com/whatwg/encoding ? (But also I am pretty sure the Encoding standard requires not doing the character substitution that WebKit does at the text decoding level, and has WPT test cases that verify this requirement).
@drott
> Do I understand correctly, that WebKit performs the character replacement as well if the text encoding is a Unicode encoding (say UTF-8, UTF-16) and only the font name is one of the MS Japanese ones?
Correct. We do the replacement if the encoding is one of the ones listed above, or if the requested font name is one of the above. If content is UTF-8 and has `font-family: MS PGothic`, we'll do the replacement.
> do you have any data on how often you encounter the legacy encodings compared to Unicode ones?
I don't. (Are you considering Shift JIS 'legacy'? It's still fairly popular - Wikipedia says "Shift JIS is the second-most popular character encoding for Japanese websites, used by 5.6% of sites in the .jp domain.")
> would it make sense to add two stylistic sets to the listed fonts
It looks like Meiryo maps U+005C to yen but has a seemingly-unused lookup to substitute it to backslash; MS Mincho maps U+005C to yen but has no such lookup.
@PeterConstable What do you think about @drott's proposal above?
After a quick search, at least it shouldn't apply to EUC-JP? https://bugs.webkit.org/show_bug.cgi?id=24906 https://bugs.chromium.org/p/chromium/issues/detail?id=9696
Edit: Are Korean encodings also affected because of their won sign, and should klreq be asked too?
This is a well-known annoyance here in Japan.
The origin of the problem is that in the Japanese version of ISO 646, the backslash and the tilde are replaced by the Yen sign (for the currency) and the overbar (macron, https://en.wikipedia.org/wiki/Macron_(diacritic)). Most variants of ISO 646 died out with the introduction of ISO 8859; in Europe, the costs of changing to a cross-country encoding were easy to justify. In Japan, the Yen sign (and the macron) remained, and got integrated into multibyte encodings, in particular Shift_JIS.

When ASCII and later Unicode became popular, it was clear to the people involved that there was a problem. But a solution wasn't, and isn't, easy. The macron was intended for writing overbars on long vowels, but this practice isn't really widely used, and the macron and the tilde are similar enough, so for some time people just didn't look too closely; by now I think most if not all fonts are using a tilde. So that part of the problem is solved.

But the Yen sign is much, much harder. Financial documents are very important. They can't contain a backslash. So the IT industry, with all its backslashes used for escaping and such, usually takes second place and lets the financial industry dominate. Every person in Japan who has dealt with scripts or programs on computers knows how to read a Yen sign as a backslash, or a backslash as a Yen sign (they might think that the Yen sign is the 'real thing'). Some Japanese programming books (not all), in particular those for beginners and those centered on Windows, also print backslashes as Yen signs.

The solution would need two steps: 1) change all the backslashes that are intended as Yen signs to the 'real' Unicode Yen sign (U+00A5) or its full-width version (U+FFE5); 2) change the fonts to show the real backslash. Unfortunately, step 2) is a precondition of step 1), because it's otherwise not possible to distinguish U+005C and the 'real' Yen signs. So we are in a deadlock.
About 10 years ago, I came up with a 'solution', which I presented as part of a talk at the Internationalization and Unicode Conference, with some people from Microsoft in attendance. The proposal was to start tweaking the Yen sign glyph at U+005C in the relevant fonts to slowly but steadily lose the upper right arm and the crossbars, and at the same time tilt the lower half stem, so that after a few iterations we would end up at a backslash. The idea was to unobtrusively let it show when a backslash was (mis)used for a Yen sign, but still, at least at the start, have everyday people read the sign as a Yen sign in their financial stuff. Alas, even this solution didn't get implemented :-).

Overall, the problem is somewhat similar to the Y2K problem: a lot of software and data needs to be changed to solve the problem. The main difference is that there's no deadline. That means nobody is in a rush :-(.
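The ISO 646 JP mapping described above can be sketched as a tiny lookup. This is an illustration of the standard JIS X 0201 7-bit layout, which is identical to ASCII except at two code points; the function name is my own:

```c
#include <assert.h>

/* Sketch: JIS X 0201 Roman set vs. ASCII. Identical except for
 * two code points, as described above. */
unsigned int jis_x0201_to_unicode(unsigned char byte) {
    switch (byte) {
        case 0x5C: return 0x00A5;  /* yen sign instead of backslash */
        case 0x7E: return 0x203E;  /* overline instead of tilde */
        default:   return byte;    /* same as ASCII for 0x00-0x7F */
    }
}
```

This is why a byte stream written under JIS X 0201 assumptions renders 0x5C as a yen sign, while the same bytes interpreted as ASCII render a backslash.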
Also, isn't there the exact same problem with the U+20A9 Korean won sign? It too is used for paths on Windows where you'd otherwise expect a backslash (or a yen sign).
Yes, in many Korean fonts, the glyph mapped from U+005C is a won sign.
@duerst I can't speak for Microsoft's font teams on your proposed 'solution', but my guess is that it wouldn't fly with customers in Japan and so wouldn't be viable.
The real issue is existing content that uses U+005C instead of U+00A5 (or U+20A9), or any implementations that might be generating new occurrences of U+005C as the currency sign. I'd suspect that current versions of the most commonly-used fonts (certainly, current versions shipped by Microsoft) do have U+00A5 (or U+20A9) mapped to an appropriate glyph. If input methods and currency formatting APIs were outputting U+00A5 (U+20A9) and not U+005C, then that would at least prevent new occurrences. Then the key problem for users becomes one of searching and editing: if a Japanese user searches for U+00A5, that should match occurrences of U+005C, or vice versa. (Similarly for Korean.)
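The search-matching idea above could be implemented by folding the affected code points into one equivalence class before comparison. A minimal sketch; the equivalence class and all names here are my own illustration, not an existing API:

```c
#include <stdint.h>

/* Fold U+005C, U+00A5 (yen), and U+20A9 (won) to one representative
 * so that a search for any of them matches the others.
 * Illustrative only; names and the class itself are assumptions. */
static uint32_t fold_currency_backslash(uint32_t c) {
    return (c == 0x00A5 || c == 0x20A9) ? 0x005C : c;
}

int chars_match(uint32_t a, uint32_t b) {
    return fold_currency_backslash(a) == fold_currency_backslash(b);
}
```

A real implementation would plug such a fold into the editor's or browser's find-in-page comparison loop, alongside the usual case folding.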
Looking at the current Japanese IME in Windows, when set into Japanese mode, pressing the \ key presents three candidate characters, in this order: U+00A5, U+FFE5, and U+005C.
Using the soft keyboard for Japanese, though, I've only found ways of entering U+FFE5 or U+005C (according to the Japanese/Roman mode).
> Looking at the current Japanese IME in Windows, when set into Japanese mode, pressing the \ key presents three candidate characters, in this order: U+00A5, U+FFE5, and U+005C.
In my opinion, Microsoft could [and should] at least label the U+005C-displaying-as-Yen symbol in the candidate list as [環境依存] ("environment-dependent"), indicating users should not anticipate U+005C always being displayed as a Yen symbol.
By the way, is this also affecting Android apps running in Windows 11?
> @duerst I can't speak for Microsoft's font teams on your proposed 'solution', but my guess is that it wouldn't fly with customers in Japan and so wouldn't be viable.
I know. I probably should have added a :-) to it. It just shows how much of an uphill battle this is. With time, things hopefully will get better, but only with a lot of time.
It's been half a year since the last comment on this issue. It would be really nice if we could get a resolution here. Agenda+ for visibility.
Sorry that it took a long time to respond. I am from the Japanese Language Enablement task force that @r12a mentioned (but I am not speaking for the group) and a former Apple engineer in the OS internationalization group.
As you know, this is a legacy issue. It would not get any better without breaking something at least once. Here’s what I think:
Fonts
I would remove all tricks for fonts. If a web page is in Unicode and uses U+005C, it is a reverse solidus, period. Removing the tricks would break the look of some Windows-centric web pages in a way many people can recognize what’s happening. I would expect some complaints from users, but it is a bug in the web page, and the behaviour is consistent with other browsers on the platform.
Encoding
Theoretically, 0x5C in Shift-JIS is a yen sign and converting it to Unicode should give you U+00A5. As @litherum mentioned, 0x5C has a semantic meaning in some contexts, regardless of the encoding. Unfortunately, if the text is in Shift-JIS, it is the yen sign that carries the semantic meaning. I believe this is why encoding converters often (always?) convert 0x5C in Shift-JIS to U+005C in Unicode, while theoretically, it should be U+00A5. Breaking the look is better than breaking the function.
If WebKit has patches to change U+005C to U+00A5 (or is it just changing the look of U+005C as if it were U+00A5 without changing the backing store? in either case), it is effectively fixing the issue with the encoding converter. Showing a yen sign for Shift-JIS 0x5C is correct behaviour. If my understanding is accurate, there would be no reason to revert it back. For the longer term when we have a smaller number of JIS-based pages, I hope we can fix the encoding converters. It is a harder and riskier change, but it makes things simpler.
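That converter behaviour is what the WHATWG Encoding Standard specifies for Shift_JIS: the single-byte branch passes low bytes through unchanged, so 0x5C decodes to U+005C, not U+00A5. A simplified sketch of just that branch (two-byte sequences are out of scope here):

```c
#include <stdint.h>

/* Simplified single-byte branch of a Shift_JIS decoder per the WHATWG
 * Encoding Standard: bytes 0x00-0x80 map to the identical code point
 * (so 0x5C stays U+005C, not U+00A5), and 0xA1-0xDF map to halfwidth
 * katakana. Lead bytes of two-byte sequences and errors both return -1
 * in this sketch. */
int32_t shift_jis_decode_single(uint8_t byte) {
    if (byte <= 0x80) return byte;  /* ASCII range plus 0x80 */
    if (byte >= 0xA1 && byte <= 0xDF) return 0xFF61 + (byte - 0xA1);
    return -1;  /* two-byte lead or error: not handled here */
}
```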
WebKit once attempted converting U+005C to U+00A5, which resulted in a number of unexpected behaviors. See https://bugs.webkit.org/show_bug.cgi?id=24906
> WebKit once attempted converting U+005C to U+00A5, which resulted in a number of unexpected behaviors. See https://bugs.webkit.org/show_bug.cgi?id=24906
Is it specific to EUC-JP in which 0x5C is a reverse solidus?
> WebKit once attempted converting U+005C to U+00A5, which resulted in a number of unexpected behaviors. See https://bugs.webkit.org/show_bug.cgi?id=24906
>
> Is it specific to EUC-JP in which 0x5C is a reverse solidus?
According to http://miau.s9.xrea.com/blog/index.php?itemid=990 , back then the Chromium code applied this conversion to EUC-JP together with "Shift_JIS_X0213-2000", but since "Shift_JIS_X0213-2000" does not appear as an option when choosing an encoding, only EUC-JP was actually affected by the conversion code.
Also, as described in blog posts like https://naruse.hateblo.jp/entry/20100327/1269684858 , a specific problem with converting 0x5C into U+00A5 is that Japanese webpages may contain code that uses 0x5C and have it displayed as a yen sign; if a non-Japanese person copies that code from a browser that converted every 0x5C to U+00A5, the copied code won't work.
If 0x5C is a reverse solidus with EUC-JP, then it should be converted to U+005C and nothing else.
In the case of Shift-JIS, from what @litherum described it appears that Shift-JIS 0x5C is converted to U+005C, and then WebKit renders it as if it were U+00A5, leaving the backing store U+005C as is. Is my understanding correct? If so, I think this is a good compromise, like I mentioned.
> If 0x5C is a reverse solidus with EUC-JP, then it should be converted to U+005C and nothing else.
>
> In the case of Shift-JIS, from what @litherum described it appears that Shift-JIS 0x5C is converted to U+005C, and then WebKit renders it as if it were U+00A5, leaving the backing store U+005C as is. Is my understanding correct? If so, I think this is a good compromise, like I mentioned.
Ah, if you mean what the spec of EUC-JP says: there is apparently no single standard, but one of the more referenced documents, the 1993 日本語環境実装規約 (Japanese environment implementation conventions) from the UI-OSF Japanese Localization Group, says it is necessary to support either "ISO 646 IRV (ASCII)" OR "JIS X 0201 Romaji 7-bit symbols". At that code point, ASCII has a reverse solidus, while JIS X 0201 has a yen symbol.
It seems that we have two ways forward towards interoperability here:
Question to the CSSWG is which way we want to move forward.
The CSS Working Group just discussed "Backslash and Yen".
For future reference, this was bug #305827, in which Blink removed the previous mechanism, which was not text-transform based.
@emilio and @drott did you ever complete your investigation? Would be good to have an update here.
@annevk, I see from the minutes action items:
ACTION drott to investigate for Chrome
ACTION emilio to investigate for Firefox
But I have trouble extracting exactly what were supposed to investigate. I dug out the removed heuristic: https://chromium.googlesource.com/chromium/src/+/2b4002d072669ff145b0afc1a47272503e3d7879%5E%21/#:~:text=%2DFontTranscoder%3A%3AConverterType-,FontTranscoder%3A%3AconverterType,-(const%20FontDescription%26%20fontDescription
And as much as I understand from this 10-year-old change: we did seem to modify an internal buffer, differently from what we do for text-transform, but added some extra logic for clipboard handling so that we would not copy out modified strings, but the original strings.
The logic we used was: do this for a list of specified fonts, and if no primary font is specified at all.
@annevk - were you curious about anything else?
Looking at the code, I am not very comfortable adding something like this back in - especially the part about keeping a modified representation of the source text internally and having to special case clipboard operations.
In WebKit we have a bunch of code to replace the U+005C REVERSE SOLIDUS (commonly known as "backslash") character with U+00A5 YEN SIGN, using the same mechanism as text-transform uses. We do this in certain conditions.

We appear to be the only browser on Mac that does this. On Windows, browsers don't appear to do this, because those fonts have glyphs for the U+005C character that look like the yen sign. It appears to be implemented in the fonts themselves, so the browsers don't appear to do anything on Windows.
Some background reading:
There appear to be two similar yet distinct problems:
These two problems sort of cancel each other out - if a text decoder turns the yen sign into the wrong in-memory character, but then a font is used which renders that character like a yen sign, the user gets what they expect. (Copy/paste doesn't work, a11y doesn't work, and find-in-page (probably) doesn't work, though.)
The reason WebKit has this special handling is because those fonts listed above don't exist on the Mac. If an author is writing their page on Windows, and they are typing their source code and want to type the yen sign, they might type the backslash character, and the page would appear to work on Windows. Then, when someone else visits the page on a Mac, font fallback occurs, we use a different font to render the content, and their character shows up as a backslash, which isn't what they wanted. So, we "fixed this" by just magically turning the backslash into a yen sign in-memory, because we're trying to be helpful.
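The replacement described above amounts to a rendering-time substitution gated on the font. A minimal sketch, assuming a placeholder font list; none of these names are WebKit's actual code:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical check: is this one of the Japanese fonts whose U+005C
 * glyph traditionally looks like a yen sign? Placeholder list. */
static int is_yen_substituting_font(const char *family) {
    static const char *families[] = {"MS PGothic", "MS Gothic",
                                     "MS PMincho", "MS Mincho"};
    for (size_t i = 0; i < sizeof(families) / sizeof(families[0]); i++)
        if (strcmp(family, families[i]) == 0) return 1;
    return 0;
}

/* Swap U+005C for U+00A5 only in the buffer handed to the glyph layer;
 * the backing store (DOM text, clipboard) keeps the original U+005C. */
void substitute_yen_for_rendering(uint16_t *text, size_t len,
                                  const char *family) {
    if (!is_yen_substituting_font(family)) return;
    for (size_t i = 0; i < len; i++)
        if (text[i] == 0x005C) text[i] = 0x00A5;
}
```

The key design point, as discussed elsewhere in this thread, is that the substitution happens on a rendering-only copy, which is why clipboard and search then need no special casing (unlike the removed Blink heuristic, which modified an internal buffer and had to special-case copy-out).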
We (the web platform) should determine what to do here. WebKit appears to be the only browser which tries to be helpful like this. We added this handling before 2007, way before the Blink fork, and Blink no longer seems to have this behavior, so presumably they intentionally deleted it.
Should other browsers try to be helpful like WebKit? Should WebKit try to stop being helpful? Should we ask Microsoft to change these glyphs in their fonts? Maybe the problem has mostly alleviated itself in the at-least 14 years since it was investigated last, and WebKit can just delete its special handling?
(This issue might belong better in different standards groups, but I don't know which ones, so I'm starting it here and I can migrate it as necessary.)