Backslash & Yen sign behavior

litherum commented 2 years ago

In WebKit we have a bunch of code to replace the U+005C REVERSE SOLIDUS (commonly knows as "backslash") character with U+00A5 YEN SIGN, using the same mechanism as text-transform uses. We do this in certain conditions:

The font name is one of:
- MS PGothic
- MS PMincho
- MS Gothic
- MS Mincho
- Meiryo
OR the encoding is one of:
- x-mac-japanese
- ISO-2022-JP
- EUC-JP
- Shift_JIS
- Shift_JIS_X0213-2000

We appear to be the only browser on Mac that does this. On Windows, browsers don't appear to do this, because those fonts have glyphs for the U+005C character that look like the yen sign. It appears to be implemented in the fonts themselves, so the browsers don't appear to do anything on Windows.

Some background reading:

There appear to be two similar yet distinct problems:

Shift JIS says 0x005C is how the yen sign is represented, but for some reason text decoders don't turn that into U+00A5 in-memory. Fixing this might be tricky, because the backslash has semantic meaning within Javascript sources.
Some fonts on Windows seem to think that U+005C visually looks like a yen sign

These two problems sort of cancel each other out - if a text decoder turns the yen sign into the wrong in-memory character, but then a font is used which renders that character like a yen sign, the user gets what they expect. (Copy/paste doesn't work, a11y doesn't work, and find-in-page (probably) doesn't work, though.)

The reason WebKit has this special handling is because those fonts listed above don't exist on the Mac. If an author is writing their page on Windows, and they are typing their source code and want to type the yen sign, they might type the backslash character, and page would appear to work on Windows. Then, when someone else visits the page on a Mac, font fallback occurs, we use a different font to render the content, and their character shows up as a backslash, which isn't what they wanted. So, we "fixed this" by just magically turning the backslash into a yen sign in-memory, because we're trying to be helpful.

We (the web platform) should determine what to do here. WebKit appears to be the only browser which tries to be helpful like this. We added this handling before 2007, way before the Blink fork, and Blink no longer seems to have this behavior, so presumably they intentionally deleted it.

Should other browsers try to be helpful like WebKit? Should WebKit try to stop being helpful? Should we ask Microsoft to change these glyphs in their fonts? Maybe the problem has mostly alleviated itself in the at-least 14 years since it was investigated last, and WebKit can just delete its special handling?

(This issue might belong better in different standards groups, but I don't know which ones, so I'm starting it here and I can migrate it as necessary.)

litherum commented 2 years ago

@jfkthame @drott @r12a

litherum commented 2 years ago

I just verified that this:

uint8_t input[] = {0x5C};
uint32_t output[1];
int32_t length = ucnv_convert("UTF32_PlatformEndian", "Shift_JIS", (char*)output, sizeof(output), (const char*)input, sizeof(input), &errorCode);
assert(U_SUCCESS(errorCode));
assert(length == sizeof(output));
printf("U+%04" PRIX32 "\n", output[0]);

prints U+005C

r12a commented 2 years ago

I just added an i18n-jlreq label, so folks in the Japanese Language Enablement task force will be notified about this thread, and they may be able to provide on-the-ground advice.

drott commented 2 years ago

Do I understand correctly, that WebKit performs the character replacement as well if the text encoding is a Unicode encoding (say UTF-8, UTF-16) and only the font name is one of the MS Japanese ones?

@litherum do you have any data on how often you encounter the legacy encodings compared to Unicode ones?

What I find unfortunate is that fonts also show a '¥' for '\' U+005C when the text encoding is Unicode - which is not helping to phase out this font-based workaround longer term.

As a proposal, would it make sense to add two stylistic sets to the listed fonts for two modes, i.e. 1) one where they show U+005C as yen sign and 2) one where they show U+005C as backslash? Then perhaps the "dont't convert" stylistic set (2) could be activated if the text encoding is Unicode-based?

CC @peterconstable

othermaciej commented 2 years ago

Shouldn't this issue be filed against https://github.com/whatwg/encoding ? (But also I am pretty sure the Encoding standard requires not doing the character substitution that WebKit does at the text decoding level, and has WPT test cases that verify this requirement).

litherum commented 2 years ago

@drott

Do I understand correctly, that WebKit performs the character replacement as well if the text encoding is a Unicode encoding (say UTF-8, UTF-16) and only the font name is one of the MS Japanese ones?

Correct. We do the replacement if the encoding is one of the ones listed above, or if the requested font name is one of the above. If content is UTF-8 and has font-family: MS PGothic, we'll do the replacement.

do you have any data on how often you encounter the legacy encodings compared to Unicode ones?

I don't. (Are you considering Shift JIS 'legacy'? It's still fairly popular - Wikipedia says "Shift JIS is the second-most popular character encoding for Japanese websites, used by 5.6% of sites in the .jp domain.")

would it make sense to add two stylistic sets to the listed fonts

It looks like Meiryo maps U+005C to yen but has a seemingly-unused lookup to substitute it to backslash; MS Mincho maps U+005C to yen but has no such lookup.

@PeterConstable What do you think about @drott's proposal above?

c933103 commented 2 years ago

After a quick search, at least it shouldn't apply to EUC-JP? https://bugs.webkit.org/show_bug.cgi?id=24906 https://bugs.chromium.org/p/chromium/issues/detail?id=9696

Edit: Are Korean encoding also affected because of their Won sign, and should klreq be asked too?

duerst commented 2 years ago

This is a well-known annoyance here in Japan.

The origin of the problem is that in the Japanese version of ISO 646, the backslash and the tilde are replaced by the Yen sign (for the currency) and the overbar (macron, https://en.wikipedia.org/wiki/Macron_(diacritic)). Most variants of ISO 646 died out with the introduction of ISO 8859; in Europe, the costs of changing to a cross-country encoding was easy to justify. In Japan, the Yen sign (and the macron) remained, and got integrated into multibyte encodings, in particular Shift_JIS. When ASCII and later Unicode become popular, it was clear to the people involved that there was a problem. But a solution wasn't, and isn't easy. The macron was intended for writing overbars on long vowels, but this practice isn't really widely used, and the macron and the tilde are similar enough, and so for some time, people just didn't look too closely, and now I think most if not all fonts are using a tilde. So that part of the problem is solved. But the Yen sign is much, much harder. Financial documents are very important. They can't contain a backslash. So the IT industry, with all their backslashes used for escaping and such, usually goes second place and lets the financial industry dominate. Every person in Japan who has dealt with scripts or programs on computers knows how to read a Yen sign as a backslash, or a backslash as a Yen sign (they might think that the Yen sign is the 'real thing'). Some Japanese programming books (not all), in particular those for beginners, and those centered on Windows, also print backslashes as Yen signs. The solution would need two steps: 1) Change all the backslashes that are intended as Yen signs to the 'real' Unicode Yen sign (U+00a5) or its full-width version (U+ffe5). 2) Change the fonts to show the real backslash. Unfortunately, step 2) is a precondition of step 1), because it's otherwise not possible to distinguish U+005C and the 'real' Yen signs. So we are in a deadlock. About 10 years ago, I came up with a 'solution', which I presented as part of a talk at the Internationalization and Unicode Conference, with some people from Microsoft in attendance. The proposal was to start tweaking the Yen sign glyph at U+005C in the relevant fonts to slowly but steadily lose the upper right arm and the crossbars and at the same time tilt the lower half stem so that after a few iterations, we would end up at a backslash. The idea was to unobtrusively let it show when a backslash was (mis)used for a Yen sign, but still at least at the start have everyday people read the sign as a Yen sign with their financial stuff. Alas, even this solution didn't get implemented :-). Overall, the problem is somewhat similar to the Y2K problem: A lot of software and data needs to be changed to solve the problem. The main difference is that there's no deadline. That means nobody is in a rush :-(.

frivoal commented 2 years ago

Also, isn't there the exact same problem with the U+20A9 korean won sign ? It too is used for paths on windows where you'd otherwise expect an backslash (or a yen sign).

PeterConstable commented 2 years ago

Yes, in many Korean fonts, the glyph mapped from U+005C is a won sign.

PeterConstable commented 2 years ago

@duerst I can't speak for Microsoft's font teams on your proposed 'solution', but my guess is that it wouldn't fly with customers in Japan and so wouldn't be viable.

The real issue is existing content that uses U+005C instead of U+00A5 (or U+20A9), or any implementations that might be generating new occurrences of U+005C as the currency sign. I'd suspect that current versions of the most commonly-used fonts (certainly, current versions shipped by Microsoft) do have U+00A5 (or U+20A9) mapped to an appropriate glyph. If input methods and currency formatting APIs were outputting U+00A5 (U+20A9) and not U+005C, then that would at least prevent new occurrences. Then the key problem for users becomes one of searching and editing: if a Japanese user searches for U+00A5, that should match occurrences of U+005C, or vice versa. (Similarly for Korean.)

Looking at the current Japanese IME in Windows, when set into Japanese mode, pressing the \ key presents three candidate characters, in this order: U+00A5, U+FFE5, and U+005C.

Using the soft keyboard for Japanese, though, I've only found ways of entering U+FFE5 or U+005C (according to the Japanese/Roman mode).

c933103 commented 2 years ago

Looking at the current Japanese IME in Windows, when set into Japanese mode, pressing the \ key presents three candidate characters, in this order: U+00A5, U+FFE5, and U+005C.

In my opinion, Microsoft could [and should] at least label the U+005C-displaying-as-Yen symbol in the candidate list as [環境依存], indicating users should not anticipate U+005C always being displayed as Yen symbol.

By the way, is this also affecting Android apps running in Windows 11?

duerst commented 2 years ago

@duerst I can't speak for Microsoft's font teams on your proposed 'solution', but my guess is that it wouldn't fly with customers in Japan and so wouldn't be viable.

I know. I probably should have added a :-) to it. It just shows how much of an uphill battle this is. With time, things hopefully will get better, but only with a lot of time.

litherum commented 2 years ago

It's been half a year since the last comment on this issue. It would be really nice if we could get a resolution here. Agenda+ for visibility.

kidayasuo commented 2 years ago

Sorry that it took a long time to respond. I am from the Japanese Language Enablement task force that @r12a mentioned (but I am not speaking for the group) and a former Apple engineer in the OS internationalization group.

As you know, this is a legacy issue. It would not get any better without breaking something at least once. Here’s what I think:

Fonts

I would remove all tricks for fonts. If a web page is in Unicode and uses U+005C, it is a reverse solidus, period. Removing the tricks would break the look of some Windows-centric web pages in a way many people can recognize what’s happening. I would expect some complaints from users, but it is a bug in the web page, and the behaviour is consistent with other browsers on the platform.

Encoding

Theoretically, 0x5C in Shift-JIS is a yen sign and converting it to Unicode should give you U+00A5. As @litherum mentioned, 0x5C has a semantic meaning in some contexts, regardless of the encoding. Unfortunately, if the text is in Shift-JIS, it is the yen sign that carries the semantic meaning. I believe this is why encoding converters often (always?) convert 0x5C in Shift-JIS to U+005C in Unicode, while theoretically, it should be U+00A5. Breaking the look is better than breaking the function.

If WebKit has patches to change U+005C to U+00A5 (or is it just changing the look of U+005C as if it were U+00A5 without changing the backing store? in either case), it is effectively fixing the issue with the encoding converter. Showing a yen sign for Shift-JIS 0x5C is correct behaviour. If my understanding is accurate, there would be no reason to revert it back. For the longer term when we have a smaller number of JIS-based pages, I hope we can fix the encoding converters. It is a harder and riskier change, but it makes things simpler.

c933103 commented 2 years ago

Sorry that it took a long time to respond. I am from the Japanese Language Enablement task force that @r12a mentioned (but I am not speaking for the group) and a former Apple engineer in the OS internationalization group.

As you know, this is a legacy issue. It would not get any better without breaking something at least once. Here’s what I think:

Fonts

I would remove all tricks for fonts. If a web page is in Unicode and uses U+005C, it is a reverse solidus, period. Removing the tricks would break the look of some Windows-centric web pages in a way many people can recognize what’s happening. I would expect some complaints from users, but it is a bug in the web page, and the behaviour is consistent with other browsers on the platform.

Encoding

Theoretically, 0x5C in Shift-JIS is a yen sign and converting it to Unicode should give you U+00A5. As @litherum mentioned, 0x5C has a semantic meaning in some contexts, regardless of the encoding. Unfortunately, if the text is in Shift-JIS, it is the yen sign that carries the semantic meaning. I believe this is why encoding converters often (always?) convert 0x5C in Shift-JIS to U+005C in Unicode, while theoretically, it should be U+00A5. Breaking the look is better than breaking the function.

If WebKit has patches to change U+005C to U+00A5 (or is it just changing the look of U+005C as if it were U+00A5 without changing the backing store? in either case), it is effectively fixing the issue with the encoding converter. Showing a yen sign for Shift-JIS 0x5C is correct behaviour. If my understanding is accurate, there would be no reason to revert it back. For the longer term when we have a smaller number of JIS-based pages, I hope we can fix the encoding converters. It is a harder and riskier change, but it makes things simpler.

Webkit once attempted converting U+005C to U+00A5 and resulted in a number of unexpected behaviors. See https://bugs.webkit.org/show_bug.cgi?id=24906

kidayasuo commented 2 years ago

Webkit once attempted converting U+005C to U+00A5 and resulted in a number of unexpected behaviors. See >https://bugs.webkit.org/show_bug.cgi?id=24906

Is it specific to EUC-JP in which 0x5C is a reverse solidus?

c933103 commented 2 years ago

Webkit once attempted converting U+005C to U+00A5 and resulted in a number of unexpected behaviors. See >https://bugs.webkit.org/show_bug.cgi?id=24906

Is it specific to EUC-JP in which 0x5C is a reverse solidus?

According to http://miau.s9.xrea.com/blog/index.php?itemid=990 , back then the chromium code applied this conversion to EUC-JP together with "Shift_JIS_X0213-2000" but since "Shift_JIS_X0213-2000" would not appear as an option when choosing encoding so only EUC-JP is actually affected by the conversion code.

Also, as described in blogs like https://naruse.hateblo.jp/entry/20100327/1269684858 , a specific problem with converting 0x5C into U+00A5 is that Japanese webpages might write codes that involve 0x5C and have it displayed in Yen sign, but if any non-Japanese person try to copy the code in a browser that converted all of them to U+00A5 then those codes being copied wouldn't work.

kidayasuo commented 2 years ago

If 0x5C is a reverse solidus with EUC-JP, then it should be converted to U+005C and nothing else.

In case of Shift-JIS, from what @litherum described it appears that Shift-JIS 0x0C is converted to U+005C, and then the webkit renders it as if it were U+00A5, leaving the backing store U+005C as is. Is my understanding correct? If so I think this is a good compromise like I mentioned.

c933103 commented 2 years ago

If 0x5C is a reverse solidus with EUC-JP, then it should be converted to U+005C and nothing else.

In case of Shift-JIS, from what @litherum described it appears that Shift-JIS 0x0C is converted to U+005C, and then the webkit renders it as if it were U+00A5, leaving the backing store U+005C as is. Is my understanding correct? If so I think this is a good compromise like I mentioned.

Ah if you mean what the spec of EUC JP said, then there is apparently no single standard, but one of the more referenced document from UI-OSF 日本語環境実装規約 from UI-OSF Japanese Localization Group in 1993 say it is necessary to support either "ISO 646 IRV (ASCII)" OR "JIS X 0201 Romaji 7 bit symbol". At the code point, ASCII is reverse solidus, while JIS X 0201 is Yen symbol.

fantasai commented 2 years ago

It seems that we have two ways forward towards interoperability here:

Option 1: Remove this special behavior from WebKit, and just let the font do what it does on non-Windows platforms. This will result in pages rendering very differently on Windows vs other OS.
Option 2: Encode WebKit heuristics in all browsers, or some improvement over them (e.g. handling Korean); and also possibly add some CSS property and/or font-face descriptor to control the behavior.

Question to the CSSWG is which way we want to move forward.

css-meeting-bot commented 2 years ago

The CSS Working Group just discussed Backslash and Yen.

The full IRC log of that discussion

<TabAtkins> Topic: Backslash and Yen
<TabAtkins> github: https://github.com/w3c/csswg-drafts/issues/6848
<TabAtkins> fantasai: WebKit has a bunch of code to handle this problem - the backslash is often rendered as a Yen sign, especially on Windows systems, for historical reasons
<addison> (or a won sign or a yuan sign; special handling is on non-Windows OS)
<TabAtkins> fantasai: The special handling in WK is they check if they have one fo the "known problematic fonts" or if the encoding is one of the "known problematic encodings", they'll render the backslash as Yen, kinda like a text-transform.
<TabAtkins> fantasai: So for the WG, do we want a heuristic for this like the WK stuff, or do we want to ask them to ignore it?
<TabAtkins> fantasai: Currently text rendered on Windows doesn't look like other OSes, because the Yen characters appear on Windows and not elsewhere.
<TabAtkins> addison: One issue is that it's an expectation of some speakes that the path separate is actually a Yen
<TabAtkins> addison: So actually showing backslash might not be expected by the user.
<florian> q+
<TabAtkins> florian: bc this is a strong expectation in japan/korea, at least, that the backslash doesn't look liekt he backslash, authors will continue author this way, and if it works inconsistently based on the system that's not good. I think we should try for interop.
<TabAtkins> florian: Reason it works on Windows is the fonts; if you don't ahve the right font it won't happen
<addison> s/liekt/like/
<emilio> q+
<TabAtkins> florian: So the font heuristics shoudl probably be standardized
<TabAtkins> florian: On top of that, I think authors should be able to control this more directly.
<drott> q+
<TabAtkins> florian: An @font-face descriptor that lets them say what this font is doing wrt backslash or yen
<TabAtkins> florian: Otherwise pages authored on windows will look different on Mac or Android, etc.
<TabAtkins> florian: So we should aim for interop on at least the heuristics.
<hober> q+
<TabAtkins> florian: And likely go byeond to let authors specify.
<astearns> ack emilio
<TabAtkins> emilio: The compat issue on Mac comes from authros not specifying a font-name, or specifying a font that doesn't exist?
<TabAtkins> addison: The last one.
<TabAtkins> emilio: In taht case, can that font be shipped on Mac?
<addison> I believe locales for JP, KR, TW, and CN regions do this
<TabAtkins> florian: And Android, and Linux...
<TabAtkins> emilio: These kinds of heuristics don't seem terribly appealing to implement... I'd rather not have them, especially if it works on windows due to fonts
<David-Clarke> s/authros/authors/
<addison> This is rooted in DOS code page behavior
<TabAtkins> myles: there's no way we can ship this font on iOS or Mac, that options is off the table
<TabAtkins> emilio: can you explain why?
<florian> q?
<astearns> ack drott
<TabAtkins> hober: fonts are expensive; this would have to be a business arrangement, and the spec probably shoudlnt' rely on that existing for all OSes
<TabAtkins> drott: Can we discuss what th eheuristic would look like? Is it like "this window font as the primary in the font stack"?
<fantasai> WebKit heuristic is described here: https://github.com/w3c/csswg-drafts/issues/6848#issue-1067848361
<florian> q+
<TabAtkins> drott: And second to emilio, I don't think I'd like to carry this legacy issue further by putting it into @font-face
<emilio> +1
<TabAtkins> drott: So probably fint with a heuristic but not futher
<TabAtkins> fantasai: Heuristic is "problematic font? problematic encoding? yes to either, render backslash as yen"
<TabAtkins> iank_: What does "render" mean?
<TabAtkins> fantasai: Do a text-transform, basically.
<TabAtkins> myles: And this is a whole-element thing, looking at primary font, not a glyph-by-glyph sub.
<TabAtkins> drott: So primary font of the element, ok.
<David-Clarke> q+
<TabAtkins> fantasai: So just checking, you're looking at computed 'font-family' and looking at the first font in the list?
<TabAtkins> myles: Probably, yes, I'll need to check.
<TabAtkins> myles: Also this code is older than the Blink fork. But this behavior doesn't happen in Mac Chrome, so presuambly it was deleted?
<TabAtkins> iank_: I'd have to dig into the history, yeah
<JakeA> q+
<TabAtkins> myles: grep for "yen sign" in the webkit codebase to find it in ours
<astearns> ack hober
<David-Clarke> q-
<emilio> https://searchfox.org/wubkat/rev/0c40ba62b482511fe03646f1d4982efd727475dd/Source/WebCore/rendering/RenderText.cpp#1345-1346
<emilio> iank_: ^
<TabAtkins> hober: Sometimes we need heuristics for compat reasons, but that makes me nervous
<TabAtkins> hober: So one, heuristic should be as simple as possible.
<TabAtkins> hober: And two, spec is as a floor; this is the minimum for compat, but you might do better, and hopefully bring it to the group to specify.
<astearns> ack florian
<andreubotella> q+
<TabAtkins> florian: I'd like to push back against this as legacy. It's a compat problem, but it's not rooted in the past; it's the current expectation for Korea and Japan users
<TabAtkins> florian: I don't think a German user wants their path filled with Yen signs, so it's not spreading.
<drott> q+
<TabAtkins> florian: The primary offender is Windows system fonts, but Japanese fonts are often *written* this way.
<emilio> https://searchfox.org/wubkat/rev/0c40ba62b482511fe03646f1d4982efd727475dd/Source/WebCore/platform/graphics/FontCache.cpp#457 is the font-family check
<TabAtkins> florian: So if your primary font is one of these manually-authored fonts, but it didn't download, you'll have this problem again.
<dholbert> q+
<TabAtkins> florian: So that's why my suggestion for @font-face exists, so authors can tell us what their font expects and we can do the right thing just like if it was a windows system font.
<florian> q?
<TabAtkins> myles: Right now there are zero browsers that do this on a font-by-font basis; putting it in @font-face would mean we have to invent a new mechanism
<astearns> zakim, close queue
<Zakim> ok, astearns, the speaker queue is closed
<drott> q-
<JakeA> q-
<TabAtkins> florian: I don't see why that would be required - the descriptor would just tell you whether the font is one that needs the transform
<TabAtkins> (I assume this would just cause the font to trip the "problematic font?" part of the heuristic.)
<astearns> ack andreubotella
<TabAtkins> andreubotella: What happens on Windows if you copy text with this behavior?
<TabAtkins> addison: They'll copy a backspace, that's the acutal character
<TabAtkins> andreubotella: Is that what authors expect?
<TabAtkins> addison: They'll see it as a Yen when they paste, as well. The only difference will be if they put it in a message and send it elsewhere.
<TabAtkins> JakeA: Note that text-transfo9rm doesn't usually do that; it typically changes what you copy
<TabAtkins> fantasai: For some values; for other values (like fullsize-kana) you do *not* copy the transformed version of the text.
<astearns> ack dholbert
<addison> s/They'll/Windows users will/
<iank_> Our bug relating to this: https://bugs.chromium.org/p/chromium/issues/detail?id=305827
<fantasai> And I would argue that text-transform should never copy out the transformed text
<TabAtkins> dholbert: Does the text-transform handle the scenario where the font has a backslash but not a yen character?
<TabAtkins> myles: We don't handle that; rather, I don't know what happens, but we don't seem to do anything special.
<drott> q+
<addison> Not just yen sign... sometimes the won sign (in Korea)
<TabAtkins> dholbert: So possibly rendering tofu rather than slash
<drott> q?
<drott> q-
<TabAtkins> myles: Probably just doing a fallback, like text-transform:uppercase on a font with only lowercase will trigger fallback
<emilio> Gecko also implemented this hack and removed it: https://bugzilla.mozilla.org/show_bug.cgi?id=926580
<TabAtkins> drott: I foudn the Chrome bug to remove the hack
<addison> s/foudn/found/
<TabAtkins> drott: There was some question of whether content was being created expecting this, and if we could phase it out
<David-Clarke> All the fonts and encodings shown in the example are Japanese, but this affects Korean?
<TabAtkins> astearns: Seems the answer is yes, and no.
<TabAtkins> florian: When Windows reaches 0 market share...
<TabAtkins> myles: I asked another group earlier this week - in Japanese keyboards, is there separate backslash and Yen keys? If not, what keycode comes out?
<TabAtkins> florian: Think the answer is "depends on the keyboard"
<TabAtkins> astearns: Probalby enough discussion - is there appetite from gecko and blink to specify a simple heuristic and implement it?
<TabAtkins> iank_: I think the problem is... i'm looking at the patch that removed this, and it's a lot bigger than a text-transform.
<TabAtkins> iank_: Appears to be special code...
<TabAtkins> astearns: Re: tess's point, the rule should be simple and you should implement it as you want.
<TabAtkins> chris: the backing store is a backslash
<TabAtkins> fantasai: I'd said before it depends on the transform
<TabAtkins> myles: from a user perspective it looks like a Yen, so they'd expect a Yen
<addison> (if you paste into a cmd shell, you want the path separator to be a path separotr)
<TabAtkins> florian: Unsure that's true, to the user it's a path separator, and that's a specific codepoint
<TabAtkins> emilio: I'd want to check with jfkthame and the Japanese folks
<TabAtkins> emilio: But given we removed this a while ago, I don't think so. We could reconsider, I guess.
<TabAtkins> drott: The previous patch didn't seem to be text-transform based
<TabAtkins> ACTION drott to investigate for Chrome
<TabAtkins> ACTION emilio to investigate for Firefox

drott commented 2 years ago

For future reference, this was the bug #305827 in which Blink removed the previous mechanism, which was not text-transform based.

annevk commented 1 year ago

@emilio and @drott did you ever complete your investigation? Would be good to have an update here.

drott commented 11 months ago

@annevk, I see from the minutes action items:

ACTION drott to investigate for Chrome ACTION emilio to investigate for Firefox

But I have trouble extracting exactly what were supposed to investigate. I dug out the removed heuristic: https://chromium.googlesource.com/chromium/src/+/2b4002d072669ff145b0afc1a47272503e3d7879%5E%21/#:~:text=%2DFontTranscoder%3A%3AConverterType-,FontTranscoder%3A%3AconverterType,-(const%20FontDescription%26%20fontDescription

And as much as I understand from this 10 year old change: We did seem to modify an internal buffer, different from what we do for text-transform, but did some extra logic for clipboard handling so that we would not copy out modified strings, but original strings.

The logic we used was: do this for a list of specified fonts, and if no primary font is specified at all.

@annevk - were you curious about anything else?

Looking at the code, I am not very comfortable adding something like this back in - especially the part about keeping a modified representation of the source text internally and having to special case clipboard operations.

w3c / csswg-drafts