Open spencer246 opened 4 months ago
I agree that the specification does not cover this case of canonical equivalents between compatibility characters and their equivalent codepoints or codepoint sequences. In particular, the compatibility characters are likely to be outside the effective character map.
We should define this case in the spec.
@jfkthame @drott thoughts?
We discussed this at the JLReq TF meeting on 2024-9-25.
Ideographic characters in the compatibility area are typically used to precisely spell proper nouns, such as 高田 vs. 髙田. This is similar to the spelling variation between ‘Smith’ and ‘Smithe’ in English family names. We now have IVS, a better mechanism to express such variations. However, not only do we need to continue supporting existing data, but it is also likely that compatibility ideographs will continue to be used.
For this reason, the JLReq TF believes that compatibility ideographs should be treated similarly to variation sequences in order to preserve the intended variations.
It seems that the CSS Fonts Module does not fully specify how browsers should select an appropriate font for a grapheme if (1) the grapheme consists of a single Unicode codepoint X, (2) X canonically decomposes into codepoint Y, and (3) the font can render only Y but not X.
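For reference, condition (2) can be checked with Python's standard unicodedata module. This is only a minimal sketch: the helper name canonical_singleton is hypothetical, and U+F992 is the example codepoint used later in this issue.

```python
import unicodedata
from typing import Optional


def canonical_singleton(ch: str) -> Optional[str]:
    """Return Y if the single codepoint `ch` canonically decomposes into
    exactly one codepoint Y, otherwise None. (Hypothetical helper.)"""
    mapping = unicodedata.decomposition(ch)
    # An empty string means no decomposition; a leading "<tag>" marks a
    # compatibility (non-canonical) decomposition.
    if not mapping or mapping.startswith("<"):
        return None
    parts = mapping.split()
    if len(parts) != 1:
        return None
    return chr(int(parts[0], 16))


x = "\uF992"  # CJK Compatibility Ideograph
y = canonical_singleton(x)
print(f"U+{ord(x):04X} -> U+{ord(y):04X}")  # U+F992 -> U+6F23

# Singleton decompositions are never recomposed, so both NFC and NFD
# replace X with Y.
print(unicodedata.normalize("NFC", x) == y)  # True
print(unicodedata.normalize("NFD", x) == y)  # True
```

Because both normalization forms map X to Y, any normalization applied before font matching makes X indistinguishable from Y.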
Note that the condition that a grapheme consists of a single codepoint is important here, because Section 5.3 of the spec mandates that if a grapheme is a multiple-codepoint sequence whose NFC normalization is Y, browsers must check whether the font can render Y before they move on to the next font in the font-family list. However, it remains unclear whether the rule in Section 5.3 should also apply when a codepoint does not belong to a multi-codepoint grapheme cluster or a Unicode variation sequence. In fact, Chrome and Firefox do not agree on this issue; the two browsers render the following simple HTML+CSS snippet differently.
https://codepen.io/spencer246/pen/bGPdqdQ
The above page tries to render U+F992, a CJK Compatibility character which canonically decomposes into U+6F23, using Noto Sans TC. There are a lot of fonts that cover U+6F23 but not U+F992, and Noto Sans TC is one such font.
In the above figure, the first glyph is U+F992 and the second is U+6F23.
On Firefox, since Noto Sans TC cannot render U+F992, the engine renders it with the next font in the font stack (text-security-circle), which renders any codepoint as a small circle.
On Chrome, however, when the engine notices that Noto Sans TC cannot render U+F992, it checks whether the font can render the canonically equivalent codepoint U+6F23, and thus U+F992 is rendered as a CJK Ideograph rather than a small circle.
2-1. If it is, the spec should be explicit about its behavior as to how a font is selected for canonically decomposable Unicode characters.
2-2. If it is not, please consider specifying a desired behavior. In my opinion, Firefox-like behavior is desirable, to match the variation sequence case:
To be consistent with the above, a canonically decomposable character (e.g. a CJK Compatibility Ideograph) should be matched against all fonts in the font-family list before NFC or NFD is applied to it.
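To make the difference concrete, here is a rough Python sketch of the two selection strategies described in this issue. It is not taken from either browser's source. The function names, the font coverage predicates, and the second pass in select_original_first (what happens once no font covers the original codepoint) are all assumptions made for illustration.

```python
import unicodedata
from typing import Callable, List, Optional, Tuple

# A font is modelled as (name, predicate saying whether it covers a codepoint).
Font = Tuple[str, Callable[[int], bool]]


def _equivalent(ch: str) -> Optional[str]:
    """Single-codepoint canonical equivalent of `ch`, if it differs from `ch`."""
    nfc = unicodedata.normalize("NFC", ch)
    return nfc if len(nfc) == 1 and nfc != ch else None


def select_per_font_equivalent(ch: str, fonts: List[Font]) -> Optional[Tuple[str, str]]:
    """Chrome-like behavior as observed above: try the canonical equivalent
    in each font before moving on to the next font."""
    eq = _equivalent(ch)
    for name, covers in fonts:
        if covers(ord(ch)):
            return name, ch
        if eq is not None and covers(ord(eq)):
            return name, eq
    return None


def select_original_first(ch: str, fonts: List[Font]) -> Optional[Tuple[str, str]]:
    """Proposed (Firefox-like) behavior: match the original codepoint against
    every font first. The second pass is an assumed fallback."""
    for name, covers in fonts:
        if covers(ord(ch)):
            return name, ch
    eq = _equivalent(ch)
    if eq is not None:
        for name, covers in fonts:
            if covers(ord(eq)):
                return name, eq
    return None


# Hypothetical coverage mirroring the test page: Noto Sans TC covers U+6F23
# but not U+F992; text-security-circle covers every codepoint.
stack: List[Font] = [
    ("Noto Sans TC", lambda cp: cp == 0x6F23),
    ("text-security-circle", lambda cp: True),
]

print(select_per_font_equivalent("\uF992", stack)[0])  # Noto Sans TC
print(select_original_first("\uF992", stack)[0])       # text-security-circle
```

With the per-font strategy the compatibility ideograph is silently replaced by its unified counterpart, whereas the original-first strategy keeps U+F992 distinct for as long as any font in the list can render it.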