[css-text][text-spacing] Extra spacing between ideographs and non-fullwidth punctuation/symbols

xfq commented 1 year ago

In the process of trying Chromium's implementation of text-autospace, some interested Chinese developers found an issue: there is no extra spacing between ideographs and non-fullwidth punctuation/symbols. In many cases, this results in unbalanced spacing around embedded non-CJK text in CJK languages.

Examples:

input[type="text"]选择器将选择所有type属性为text的input元素。

在HTML中按语言修改样式的最佳方法是使用CSS的:lang()选择器。

C#是微软公司发布的一种由C和C++衍生出来的面向对象的编程语言，运行于.NET Framework和.NET Core之上。

使用!important是一个坏习惯，应该尽量避免。

我们可以用@font-face来指定自定义字体。

可选链?.是一种访问嵌套对象属性的安全的方式。即使中间的属性不存在，也不会出现错误。

符号^和符号$在正则表达式中具有特殊的含义。

正则表达式中的\b表示词边界。

42%代表百分之四十二，1‰代表千分之一，|a|=2代表a的实际值是±2

{1,2}是{1,2,3}的子集。

在三亚15℃太冷了！ (U+2103 instead of U+00B0 + U+0043)

However, not all non-fullwidth punctuation/symbols require extra spacing. For example, footnote marks like *, †, ‡, and ◊ should not have the extra spacing.

Should we add a new value ideograph-symbol (the name and specific design can be discussed later) to cover this situation? This value may not cover all situations, but it can cover some common ones. For uncommon cases, it would be nice to have a mechanism for author customization.

frivoal commented 1 year ago

Hmm, interesting. Many of the use cases you showed above look like things that belong in a <code> element. For those, I'd suggest taking advantage of this spec requirement:

At element boundaries, the amount of extra spacing introduced between characters is determined by and rendered within the innermost element that contains the boundary.

and doing something like this:

code {
    text-autospace: no-autospace;
    padding: 0 0.125em;
}

But not all fit in that pattern.

在三亚15℃太冷了！ (U+2103 instead of U+00B0 + U+0043)

This suggests that:

maybe we should operate on the NFD form, or
maybe we should include the letter-like symbols in the definition of non-ideographic letters.
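
A quick way to probe the first option, sketched with Python's standard `unicodedata` module (the variable names are illustrative only): U+2103 appears to carry a compatibility decomposition rather than a canonical one, so NFD would leave it untouched and only NFKD would expose the U+00B0 + U+0043 sequence.

```python
import unicodedata

celsius = "\u2103"  # ℃ DEGREE CELSIUS

# NFD applies canonical decompositions only; NFKD also applies compatibility ones.
nfd = unicodedata.normalize("NFD", celsius)
nfkd = unicodedata.normalize("NFKD", celsius)

print([f"U+{ord(c):04X}" for c in nfd])
print([f"U+{ord(c):04X}" for c in nfkd])
print(unicodedata.category(celsius))  # general category of ℃ itself
```

If that holds, the normalization option would need to be based on NFKD (or NFKC) rather than NFD to catch this case.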

For the rest:

C# C++ .NET Framework 42% 1‰

Should we add a new value ideograph-symbol

Maybe? That could be a solution.

Is this a case of symbols that must always be autospaced (when autospacing is on)? If so, we should probably just do it.

Does it depend on something which the author is aware of, but that the user agent cannot easily infer? If so, a new value ideograph-symbol is probably the solution.

Does it depend on whether they're next to a string of non-ideographic letters/numbers? If so, it might suggest we need to treat them as some kind of ambiguous/neutral group that gets grouped together with a string of non-ideographic letters/numbers if one is there, but doesn't introduce spacing by itself when found without non-ideographic letters/numbers.

For example, if 永 represents ideographs, a represents non-ideographic letters, + represents neutrals (like #, +, %, ., etc), and _ represents autospacing:

永a永 would result in 永_a_永
永+永 would result in 永+永
永+a永 would result in 永_+a_永
永a+永 would result in 永_a+_永
永+a+永 would result in 永_+a+_永
永+永a+永 would result in 永+永_a+_永
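
The grouping rule above can be sketched as a toy model. This is only an illustration under the notation of the examples: 永 is the sole "ideograph", ASCII letters and digits stand in for non-ideographic letters/numbers, everything else is a neutral, and the helper names (`is_ideograph`, `is_letter`, `autospace`) are hypothetical, not from any spec or implementation.

```python
from itertools import groupby

IDEOGRAPH = "永"  # stand-in for any ideograph, as in the notation above


def is_ideograph(ch: str) -> bool:
    return ch == IDEOGRAPH


def is_letter(ch: str) -> bool:
    # Toy definition: ASCII letters and digits play the role of "a".
    return ch.isascii() and ch.isalnum()


def autospace(text: str) -> str:
    """Mark autospacing with "_" using the neutral-grouping idea."""
    # Maximal runs alternate between ideographs and non-ideographs.
    runs = [
        (is_ideo, "".join(chars))
        for is_ideo, chars in groupby(text, key=is_ideograph)
    ]
    out = []
    for i, (is_ideo, run) in enumerate(runs):
        if is_ideo or not any(is_letter(c) for c in run):
            # Ideographs, and runs made only of neutrals, get no spacing.
            out.append(run)
            continue
        # A run containing a letter/number is spaced as a whole,
        # with its neutrals kept inside the spaced group.
        before = "_" if i > 0 else ""
        after = "_" if i < len(runs) - 1 else ""
        out.append(before + run + after)
    return "".join(out)


for sample in ["永a永", "永+永", "永+a永", "永a+永", "永+a+永", "永+永a+永"]:
    print(sample, "->", autospace(sample))
```

Note that the decision for a neutral depends on the whole non-ideographic run it sits in, not just on its immediate neighbour.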

Also, regardless of how we handle that category, as you mentioned that not all symbols would fit into that category, I am a little unsure about how we'd go about maintaining the list of those that do and those that don't.

kojiishi commented 1 year ago

Note this was raised to JLTF a while ago but it didn't get much attention there. I'll ping again.

xfq commented 1 year ago

Hmm, interesting. Many of the use cases you showed above look like things that belong in a <code> element. For those, I'd suggest taking advantage of this spec requirement:

At element boundaries, the amount of extra spacing introduced between characters is determined by and rendered within the innermost element that contains the boundary.

and doing something like this:
code {
    text-autospace: no-autospace;
    padding: 0 0.125em;
}

Why is it rendered within the innermost element that contains the boundary (i.e. padding) instead of margin? If there is a background color in the code element, I think what I would expect to see is that the background in the extra spacing is not filled with background color.

But not all fit in that pattern.

在三亚15℃太冷了！ (U+2103 instead of U+00B0 + U+0043)

This suggests that:

maybe we should operate on the NFD form, or

Maybe. I don't have a counterexample now.

maybe we should include the letter-like symbols in the definition of non-ideographic letters.

Maybe, but I'm not quite sure about code points like U+2122 (Trade Mark Sign). I personally don't think the extra spacing is needed for it, but I would like to discuss it with the clreq group.

For the rest:

C# C++ .NET Framework 42% 1‰

Should we add a new value ideograph-symbol

Maybe? That could be a solution.

Is this a case of symbols that must always be autospaced (when autospacing is on)? If so, we should probably just do it.

Does it depend on something which the author is aware of, but that the user agent cannot easily infer? If so, a new value ideograph-symbol is probably the solution.

Does it depend on whether they're next to a string of non-ideographic letters/numbers? If so, it might suggest we need to treat them as some kind of ambiguous/neutral group that gets grouped together with a string of non-ideographic letters/numbers if one is there, but doesn't introduce spacing by itself when found without non-ideographic letters/numbers.

For example, if 永 represents ideographs, a represents non-ideographic letters, + represents neutrals (like #, +, %, ., etc), and _ represents autospacing:

永a永 would result in 永_a_永

永+永 would result in 永+永

永+a永 would result in 永_+a_永

永a+永 would result in 永_a+_永

永+a+永 would result in 永_+a+_永

永+永a+永 would result in 永+永_a+_永

Also, regardless of how we handle that category, as you mentioned that not all symbols would fit into that category, I am a little unsure about how we'd go about maintaining the list of those that do and those that don't.

I agree that sometimes there is ambiguity, and I'll discuss it with the clreq group.

frivoal commented 1 year ago

Why is it rendered within the innermost element that contains the boundary (i.e. padding) instead of margin?

No particular reason, authors could do what they prefer. I guess my choice here was influenced by the default GitHub style which includes some inline padding in <code> elements.

kojiishi commented 1 year ago

/cc @clqsin45 @nt1m @vitorroriz

Clqsin45 commented 1 year ago

Is it possible to somewhat involve UNICODE TEXT SEGMENTATION? I think many of the examples indicate that user perceptions can be different from predefined symbols in the real world, and it is hard to figure out a perfect solution, as it is natural language which can never have a 100% correct algorithm.

Fortunately they are usually consecutive, so I guess SEGMENTATION, or the grouping logic mentioned by https://github.com/w3c/csswg-drafts/issues/9479#issuecomment-1774309893 , should improve the situation.

xfq commented 1 year ago

Could you provide an example of how to use UAX #29 for this use case? Are you referring to the word break algorithm or something else?

yisibl commented 1 year ago

Is this a case of symbols that must always be autospaced (when autospacing is on)?

@frivoal Yes!

Considering that Chrome is in the process of implementing text-autospace, and in order to provide better default typography before it ships, I suggest that the specification, at the current level, only consider adding ideograph-symbol. By default, this value would only cover symbols that are common in natural language:

Temperature symbols: ℃（U+2103）, ℉（U+2109）, °
Math symbols: %, ‰, ‱ (U+2031), +, -(U+002D), −(U+2212), ±, ∓
Currency symbols
Letterlike Symbols: It may be necessary to pick only some of these symbols.

It looks like Apple's OS takes a similar approach, for example:

中心城区在-15至-20℃之间。（From here）性能提升25%以上。 C#是一种由C和C++衍生出来的面向对象的Smashing语言，运行于.NET Framework和.NET Core之上。

@fantasai Do you know the exact rules for adding space in iOS?

In the absence of a suitable algorithm, in the future it might be worth considering using a @counter-style-like syntax to customize the rules.

@kojiishi If the specification defines a rule for this, would you prioritize implementing it?

kojiishi commented 1 year ago

Including symbols makes sense to me, but we probably don't want to include all gc=S*, do we? We'll need to review which ones to include and which ones not to. While doing that, we'll need to make sure it doesn't insert spacing where we don't expect it.
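
As a rough illustration of why the general category alone may not be enough, the sketch below uses Python's standard `unicodedata` module to print the categories of a few characters mentioned in this thread. The two lists and the `categories` helper are illustrative groupings, not a proposed classification.

```python
import unicodedata

# Characters for which spacing was requested earlier in the thread.
wants_spacing = ["%", "\u2030", "+", "\u2212", "\u00b1", "\u2103"]  # % ‰ + − ± ℃
# Footnote marks that should not get extra spacing.
no_spacing = ["*", "\u2020", "\u2021"]  # * † ‡


def categories(chars):
    """Map each character to its Unicode general category."""
    return {c: unicodedata.category(c) for c in chars}


print(categories(wants_spacing))
print(categories(no_spacing))
```

Here % and ‰ fall in Po, the same category as the footnote marks, so gc=S* would miss them while a rule based on gc=P* would also catch *, † and ‡.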

Seeing multiple pieces of feedback on the character classes coming up, I'm leaning towards moving this definition to Unicode, as I commented on PR#9503. Doing so should make discussing with Unicode experts easier, and maintaining the list should be easier too.

Regarding the syntax, as several issues are coming up and there is some uncertainty, I think it's better to step back rather than add more. One idea is to include them in both sets without adding a new value. Another idea is to defer detailed classifications of letters and numerals to future versions and start with normal only (IIUC that's what iOS/macOS does.) There may be more ways, but stepping back will allow us to think about designs more after implementations ship and we hear feedback from web authors.

/cc @nt1m @vitorroriz @Clqsin45 @kidayasuo

kidayasuo commented 1 year ago

I surely believe I am missing some important points, but what is the cause of this oddity?

some interested Chinese developers found an issue: there is no extra spacing between ideographs and non-fullwidth punctuation/symbols

With the text-autospace: normal property, I thought a small space would be generated between 'ideographs' and characters that are not. This two-state machine should prevent the imbalance that was mentioned. I apologize for the interruption, but I would greatly appreciate it if you could clarify where my misunderstanding lies.

xfq commented 1 year ago

@kidayasuo The current default behaviour is ideograph-alpha ideograph-numeric, meaning there is only extra spacing between ideographs and non-ideographic letters/numerals, but there's no extra spacing between ideographs and non-fullwidth punctuation/symbols.

For example, there's no extra spacing for the colon (:), parentheses, "hash sign" (#) and plus signs (+) and the ideograph next to them in the picture below:
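
The same effect can be reproduced in a toy model of the pairwise default. This is a sketch under simplifying assumptions: ideographs are limited to the CJK Unified Ideographs block, "letters/numerals" to ASCII alphanumerics, `_` marks the extra spacing, and the function names are hypothetical.

```python
def is_ideograph(ch: str) -> bool:
    # Toy check: the CJK Unified Ideographs block only.
    return "\u4e00" <= ch <= "\u9fff"


def is_alnum(ch: str) -> bool:
    # Toy check: ASCII letters and digits only.
    return ch.isascii() and ch.isalnum()


def pairwise_autospace(text: str) -> str:
    """Insert "_" only between an ideograph and an adjacent ASCII letter/digit."""
    out = []
    for i, ch in enumerate(text):
        if i > 0:
            prev = text[i - 1]
            if (is_ideograph(prev) and is_alnum(ch)) or (
                is_alnum(prev) and is_ideograph(ch)
            ):
                out.append("_")
        out.append(ch)
    return "".join(out)


print(pairwise_autospace("用C写"))
print(pairwise_autospace("用C#写"))
print(pairwise_autospace("使用!important是"))
```

In the second and third inputs the embedded text gets spacing on one side only, which is the imbalance reported at the top of this issue.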

yisibl commented 1 year ago

Another idea is to defer detailed classifications of letters and numerals to future versions and start with normal only (IIUC that's what iOS/macOS does.)

@kojiishi Apple's normal adds more symbols, such as space after % in iOS screenshots. This requires them to share the exact rules.

kidayasuo commented 1 year ago

@xfq Thank you. Got it. Do you know why ideograph-alpha and ideograph-numeric were created when "non-ideograph" might be all you need? I can't think of scenarios where you create a space only with letters, or only with numbers. They surely do create unbalanced spacing because there are words that start with one kind and end with a different kind, as we are seeing in the examples.

If they are truly useful and needed despite added complexities, I agree ideograph-symbol, or actually ideograph-everything-else would be necessary. And a definition of non-ideograph that covers all characters that are not ideographs would also be super useful.

kojiishi commented 1 year ago

Another idea is to defer detailed classifications of letters and numerals to future versions and start with normal only (IIUC that's what iOS/macOS does.)

@kojiishi Apple's normal adds more symbols, such as space after % in iOS screenshots. This requires them to share the exact rules.

Right, thanks. Yes, I mean if Apple can disclose it. Sorry if my comment didn't read that way.

xfq commented 1 year ago

@xfq Thank you. Got it. Do you know why ideograph-alpha and ideograph-numeric were created when "non-ideograph" might be all you need? I can't think of scenarios where you create a space only with letters, or only with numbers. They surely do create unbalanced spacing because there are words that start with one kind and end with a different kind, as we are seeing in the examples.

If they are truly useful and needed despite added complexities, I agree ideograph-symbol, or actually ideograph-everything-else would be necessary. And a definition of non-ideograph that covers all characters that are not ideographs would also be super useful.

I agree that adding extra spacing only between ideographs and non-ideographic letters, or only between ideographs and non-ideographic numerals is not useful. However, there are some characters that should not have extra spacing between ideographs and them, such as:

some Chinese/Japanese punctuation, like 。、，：；！？「」（）《》——……
footnote marks like *, †, ‡, and ◊
emoji
whitespace characters

There are also some characters that I'm not sure, such as Taixuanjing symbols (like U+1D300), mahjong tiles (like U+1F000), Xiangqi symbols (like U+1FA60), copyright/copyleft signs, and so on.

kidayasuo commented 1 year ago

However, there are some characters that should not have extra spacing between ideographs and them, such as:

I agree. So, it seems we need 'neutral'? Do we need right/left directionality? Such neutrals create unbalanced spacing if/when they are used as a part of a word or a phrase. So, we might want to limit them to some small number. If the amount of space is small like 1/8 of a fullwidth like Apple does, we might be able to say ok to create a space for some edge cases.

xfq commented 10 months ago

Based on our discussion in yesterday's clreq teleconference, we think it would be useful to make this behaviour language-dependent because of the difference in conventions between Chinese and Japanese.

For example, in Japanese, it's normal to have extra spacing before "12" but not after "%" in the phrase "永永永12%永永永". However, in Chinese there's extra spacing after "%".

kojiishi commented 10 months ago

@xfq Thanks for the info. I haven't checked with JLREQ folks, but I don't think this is language dependent. If the text is "永永永12%永永永" then I believe Japanese expects spacing after "%" too.

The complexity of handling punctuation and symbols is that it depends on the context, but supporting longer context slows down the layout engine quite severely.

Imagine "永永永12%永永永" and "永永永X%永永永" with the CSS text-autospace: ideograph-numeric. Ideally, I hope you agree, we want the spacing after "%" for the first case but not for the second. Doing this requires more context than the two adjacent characters, and this could be longer, such as "永永永minimum-maximum%永永永". They could also appear alone, such as when "how many % is this?" ("何%ですか?" in Japanese).
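
A minimal sketch of that context dependence, assuming a hypothetical rule where a symbol inherits the class of the nearest preceding letter or digit. The function name, the `spaced_classes` parameter, and the "any non-ASCII character is an ideograph" shortcut are all assumptions made for illustration.

```python
def spacing_after_symbol(text: str, i: int, spaced_classes=frozenset({"numeric"})) -> bool:
    """Decide whether to add spacing after the symbol at text[i].

    Assumes text[i] is a symbol that is followed by an ideograph.
    """
    j = i - 1
    while j >= 0:
        ch = text[j]
        if not ch.isascii():
            # Toy shortcut: treat any non-ASCII character as an ideograph.
            # The symbol stands alone, so no spacing is added.
            return False
        if ch.isdigit():
            return "numeric" in spaced_classes
        if ch.isalpha():
            return "alpha" in spaced_classes
        j -= 1  # another symbol: keep looking further back
    return False


for sample in ["永永永12%永永永", "永永永X%永永永", "何%ですか?"]:
    print(sample, spacing_after_symbol(sample, sample.index("%")))
```

The decision at the boundary between "%" and the following ideograph depends on `text[j]`, a character outside that adjacent pair, and the loop has no fixed bound on how far back it may need to look.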

It should be a bit simpler if CSS doesn't distinguish ideograph-numeric and ideograph-alpha, but even if we unite them, there will always be edge cases, just as the UAX#9 Bidi Algorithm isn't always perfect.

The discussion should move to Unicode once the proposal is accepted, and I hope we can find a good balance of desired results, complexity, and performance there.

xfq commented 10 months ago

@xfq Thanks for the info. I haven't checked with JLREQ folks, but I don't think this is language dependent. If the text is "永永永12%永永永" then I believe Japanese expects spacing after "%" too.

I got this information from https://github.com/w3c/jlreq/issues/387 :

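Bin-sensei: On the problem of unbalanced spacing: in the sentence 「これは12%です」 ("This is 12%"), adding space before the 12 but not after the % is normal in Japanese. So imbalance isn't necessarily bad in itself, is it? (Bin-sensei)

Although I'm not sure which behaviour is more common / expected.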

The complexity of handling punctuation and symbols is that it depends on the context, but supporting longer context slows down the layout engine quite severely.

Imagine "永永永12%永永永" and "永永永X%永永永" with the CSS text-autospace: ideograph-numeric. Ideally, I hope you agree, we want the spacing after "%" for the first case but not for the second. Doing this requires more context than the two adjacent characters, and this could be longer, such as "永永永minimum-maximum%永永永". They could also appear alone, such as when "how many % is this?" ("何%ですか?" in Japanese).

It should be a bit simpler if CSS doesn't distinguish ideograph-numeric and ideograph-alpha, but even if we unite them, there will always be edge cases, just as the UAX#9 Bidi Algorithm isn't always perfect.

Indeed.

The discussion should move to Unicode once the proposal is accepted, and I hope we can find a good balance of desired results, complexity, and performance there.

If this is language-dependent, it may be difficult to solve the problem at the Unicode level only. Also, if the rule is defined in a Unicode character property, it's very difficult to change.

IIRC it's on the agenda of UTC 178 this week, so let's see what the Unicode experts think about it.

kojiishi commented 9 months ago

@xfq Thanks for the info. I haven't checked with JLREQ folks, but I don't think this is language dependent. If the text is "永永永12%永永永" then I believe Japanese expects spacing after "%" too.

I got this information from w3c/jlreq#387 :

Bin-sensei: On the problem of unbalanced spacing: in the sentence 「これは12%です」 ("This is 12%"), adding space before the 12 but not after the % is normal in Japanese. So imbalance isn't necessarily bad in itself, is it? (Bin-sensei)

Although I'm not sure which behaviour is more common / expected.

Thanks for the link, I missed it. I think it's more about style, not language. Probably a difference between traditional print style and online text style.

kidayasuo commented 9 months ago

According to Bin-sensei, the spacing is intended to prevent characters from being too close together, not to highlight words like parentheses do. Such 'unbalanced' situations are actually common practice in publications.

Adding the following comment on behalf of Bin-sensei: The above comment does not, of course, preclude using a pair of spaces to highlight a word. There is nothing wrong with doing so. It just says that such usage is not common practice.

taroyamamoto-451 commented 8 months ago

I disagree with not applying auto-spacing between a Japanese character and a Western punctuation mark. I believe it's not a matter of visual "balance" but a matter of consistency. In fact, as far as I remember, Morisawa-Linotype's CORA5-E text composition language, designed for the Linotype CRT/laser typesetters used by Japanese professional typographers, allowed auto-spacing between a Japanese character and a Western punctuation mark. I don't mean you "must" always do it, but it is one of the widely accepted conventions in Japanese typography.

macnmm commented 8 months ago

Talking to Ned from Apple, he says their algorithm is quite involved, and allows for both compression and expansion of the default spacing, and some spacing will take the glyph ink into account as part of the logic. So, this leads me to believe that we need to approach this problem with a bit more rigor and nuance. For example:

Solve the Unicode SJIS unification issue with Variation Selectors
Verify the minimum spacing behavior variations and define spacing classes on the Unicode ranges + VSs
Define the spacing behavior patterns for each spacing class pair (what is the minimum, desired, maximum spacing amount; what are the compression and expansion conditions and priorities)
Advocate for fonts to standardize their glyph designs to conform to the Unicode + VSs, and their defined spacing behaviors
Advocate for layout engines to implement the spacing rules according to the spacing behavior variations

All this is to say that the proposal from Koji may not be sufficient to solve the Latin-to-J or Latin-to-CJK spacing issue, and that that issue is merely a single case of the generic spacing rules issue defined in JLReq or JIS X 4051 and so it should try for a higher bar from the beginning.

kidayasuo commented 7 months ago

@macnmm as repeated in the document, the proposal is not an effort to make a definite rule. It is intended to serve as a fallback default when no other information is specified by the higher level protocol.

By having a reliable base, customizations become much easier because your description can be only the diff from the default. It is a benefit of having a stable default.

kidayasuo commented 7 months ago

@macnmm I like the idea of using the variation selectors as a potential solution to the challenges posed by the unification of code points for characters that are used differently in Western texts and in Japanese, despite their inherent distinctions.

My understanding, however, is that since they are all proposed to be class "O" regardless of whether they are fullwidth or proportional, it is an orthogonal issue. Maybe I am missing something...

macnmm commented 7 months ago

It may be the proposal strayed from where I hoped it would land, but my hope is if you can define a VS with the missing SJIS width, spacing class, and vertical writing posture, you solve the Unicode unification issues for Japanese character behavior in line layout. So, I would say we push for this.
