w3c / csswg-drafts

CSS Working Group Editor Drafts
https://drafts.csswg.org/

[css-text-3] Discarding Line Breaks Adjacent to Ambiguous Characters #5017

Open fantasai opened 4 years ago

fantasai commented 4 years ago

The discussion in #337 has veered off in a wide variety of directions, but @hax originally filed the issue to bring up the question of "ambiguous" characters, i.e. those which are commonly used both within and outside Chinese and Japanese context:

https://drafts.csswg.org/css-text-3/#line-break-transform

Otherwise, if the East Asian Width property [UAX11] of both the character before and after the line feed is F, W, or H (not A), and neither side is Hangul, then the segment break is removed.

Under this rule, common uses of quotation marks in Chinese

简体中文的
“引号”
两边不应该有空格。

will have unexpected spaces, because quotation marks are A.

Ideally, we should consider the language information of the context. If the context is an East Asian language, A should be treated as W. Even in an unknown language context, if one side of the line feed is A and the other side is F, W, or H, the segment break should also be removed.
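To make the quoted rule and the suggested relaxation concrete, here is a minimal Python sketch. It models only the single sentence quoted above, not the full css-text-3 algorithm. The function names and the `ambiguous_defers` flag are hypothetical, and Hangul detection via the character name is an approximation.

```python
import unicodedata

WIDE = {"F", "W", "H"}  # East Asian Width values named in the quoted rule


def is_hangul(ch):
    # Approximation (assumption): treat any character whose Unicode name
    # mentions HANGUL as Hangul.
    return "HANGUL" in unicodedata.name(ch, "")


def joiner(before, after, ambiguous_defers=False):
    """Return what replaces a segment break between two characters."""
    if is_hangul(before) or is_hangul(after):
        return " "
    widths = {unicodedata.east_asian_width(before),
              unicodedata.east_asian_width(after)}
    if widths <= WIDE:
        return ""  # both sides are F, W, or H: remove the break
    if ambiguous_defers and "A" in widths and widths & WIDE:
        return ""  # suggested relaxation: one side A, the other F/W/H
    return " "


def collapse(text, ambiguous_defers=False):
    """Apply joiner() at every line feed in text."""
    lines = text.split("\n")
    result = lines[0]
    for line in lines[1:]:
        if result and line:
            result += joiner(result[-1], line[0], ambiguous_defers)
        else:
            result += " "  # empty line: fall back to a space
        result += line
    return result


sample = "简体中文的\n“引号”\n两边不应该有空格。"
print(collapse(sample))
print(collapse(sample, ambiguous_defers=True))
```

The first call keeps a space on each side of the quotation marks because U+201C and U+201D have East Asian Width A; the second call removes both breaks.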

We decided to switch to a Unicode Block listing instead of relying on the East Asian Width property (in particular due to some backwards-incompatible changes on Unicode's side). The current draft does not have a concept of ambiguous characters: all characters are strong "discard" or "don't discard", with discarding behavior requiring both sides of the line break to be "discard".

We might want to consider classifying some characters as "ambiguous", particularly symbols and maybe also the few common punctuation marks used in Chinese (double quotes, specifically). These could defer to the character on the other side, and if both are ambiguous, default to "don't discard".

Do we want to do this? If so, should it be language-dependent or universal?
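The three-way classification described above can be sketched as follows. This is an illustration only: the block ranges are a small hypothetical subset chosen for the example, not the list in the draft, and the names `BreakClass`, `classify`, and `segment_break_replacement` are made up here.

```python
from enum import Enum


class BreakClass(Enum):
    DISCARD = "discard"
    KEEP = "keep"
    AMBIGUOUS = "ambiguous"


# Hypothetical, illustrative ranges (NOT the spec's actual block list).
DISCARD_RANGES = [
    (0x3000, 0x303F),  # CJK Symbols and Punctuation
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
]
AMBIGUOUS_RANGES = [
    (0x1F200, 0x1F2FF),  # Enclosed Ideographic Supplement
    (0x1F600, 0x1F64F),  # Emoticons
]


def in_ranges(cp, ranges):
    return any(lo <= cp <= hi for lo, hi in ranges)


def classify(ch):
    cp = ord(ch)
    if in_ranges(cp, DISCARD_RANGES):
        return BreakClass.DISCARD
    if in_ranges(cp, AMBIGUOUS_RANGES):
        return BreakClass.AMBIGUOUS
    return BreakClass.KEEP


def segment_break_replacement(before, after):
    """Return "" to discard the break, " " to turn it into a space."""
    a, b = classify(before), classify(after)
    if BreakClass.KEEP in (a, b):
        return " "  # any strong "don't discard" side wins
    if a is BreakClass.AMBIGUOUS and b is BreakClass.AMBIGUOUS:
        return " "  # both ambiguous: default to "don't discard"
    return ""  # discard/discard, or ambiguous deferring to discard
```

Under these assumptions an ambiguous character never causes a discard by itself; it only avoids blocking one when the opposite side is a strong "discard" character.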

macnmm commented 4 years ago

In adapting our Japanese-specific layout to be more universal, I am finding it easier to support several conflicting conventions by thinking of them as language-specific and introducing the idea of a mode, often at the paragraph level. Examples include: Embox-based line heights and leading versus ascent-descent; space character widths between Korean words versus in English; how to treat justification between Latin and CJK under JIS X 4051 versus for English; and these ambiguous Unicode characters, where the font (default glyph) used or the paragraph convention preferred causes a conflict in desired behavior.

There are two dimensions to this conflict, and I am not convinced that we only need to look at the code points around these characters to derive the desired behavior. I suspect that in one paragraph mode there is one default for these characters in the context of their surroundings, but that in another mode the answer is different, even with the same surrounding characters. Even though shaping can be done in the same way for every language in the line, the way that line is arranged, with spacing adjustments and vertical placement for leading or gridding, could change if one decides the containing paragraph should follow Japanese conventions overall rather than some other language's conventions. I think the two need to be independent. However, I could be wrong: the need I see could be an artifact of limited technology giving rise to a Japan-specific convention for mixed text that differs from English conventions for mixed text.

kojiishi commented 4 years ago

Most fonts I know have Latin double quotes by default, so I think the space is desired. The quotes become fullwidth only when the author sets the fwid OpenType feature. Maybe Chinese fonts are different, but I'm hesitant to build rules into the spec based on what current major fonts have.

fantasai commented 4 years ago

After discussion with @kojiishi, particularly about cases where the quotation mark is used adjacent to spaces intentionally, I agree that we should not make double quotes Ambiguous here.

The question still stands about whether we should do this with Symbols generally, such as Emoji and characters in the Enclosed Ideographics block.

r12a commented 4 years ago

Which aspect of ambiguity are we talking about here?

a. a space may be added where not wanted, e.g. 支持W3C实现“\尽展
b. a space may be removed when it should stay, e.g. 空格字符“ \”不可见

Whichever applies, i suspect that my comments at https://github.com/w3c/csswg-drafts/issues/4992#issuecomment-621265490 about using common-sense or internationalised apps for line breaking may apply.

But Chinese & Japanese use quite a lot of characters that are not in the discard blocks, not just quote marks. I've actually been trying to make lists of what characters are used by what language, and fwiw currently i have the following from non-discard blocks:

Chinese
Block | Count | Characters
Basic Latin | 21 | !"#%&()*-./:;?@[\]_{}
General Punctuation | 25 | ‐‑–—―‖‘’“”†‡‥…‧‰′″‵※‾‼⁇⁈⁉
Latin-1 Supplement | 2 | §·

Japanese
Block | Count | Characters
Basic Latin | 20 | !"#%&()*-./:;?@[\]{}
General Punctuation | 20 | ‐—―‖‘’“”†‡‥…‰′″※‼⁇⁈⁉
Latin-1 Supplement | 2 | §¶

r12a commented 4 years ago

fwiw, just added the following to my lists above: ‼⁇⁈⁉ (sorry about the emojification)

kidayasuo commented 4 years ago

@fantasai I know it is a CSS novice question, but may I? Even with English, there are cases where you would not want a space inserted between two characters. Examples include between quotation marks or parentheses and their contents, around EM DASH depending on the style, before a colon, etc. Does the rule expect that authors will not fold a line at certain places, i.e. where they do not want a space inserted?

kidayasuo commented 4 years ago

A more interesting case in English is a semi-compound word connected by a hyphen, e.g. "well-known". Because the parts look like two separate words, one might insert a line break after the hyphen, exactly as with hyphenated words. If the Segment Break Transformation Rules insert a space after the hyphen, the result would differ from the naive user's expectation.

What I am trying to understand is what we can naturally expect from the segment break transformation rules.

xfq commented 4 years ago

A more interesting case in English is a semi-compound word connected by a hyphen, e.g. "well-known". Because the parts look like two separate words, one might insert a line break after the hyphen, exactly as with hyphenated words. If the Segment Break Transformation Rules insert a space after the hyphen, the result would differ from the naive user's expectation.

Yes, this is indeed possible. Here's a test (you may need to adjust the width of the test box in your browser, because the result may be different depending on your font).

We may be able to add some informative notes/examples (for line breaks at hyphens, line breaks in mixed Chinese/Japanese and Western text, etc.) to the css-text spec, and point authors and developers of authoring tools to them.

xfq commented 4 years ago

FWIW, I just tried the hard line wrap functionality in some editors (both in HTML mode and in plain text mode) and here is the result.

The text is:

This is long line to be wrapped with a hyphenated word to illustrate the word-hyphenation issue.

In Vim with textwidth=80, gq produces the following result:

This is long line to be wrapped with a hyphenated word to illustrate the
word-hyphenation issue.

That is, Vim won't break on hyphens. (Relevant code)


In Sublime Text 2 and 3 with Sublime Wrap Plus, there is a WrapPlus.break_on_hyphens setting, and its default value is false (it was true before 2018, though), so the result (with wrap_width=80) is:

This is long line to be wrapped with a hyphenated word to illustrate the
word-hyphenation issue.

That is, no break on hyphens.


In Visual Studio Code with Rewrap (with rewrap.wrappingColumn=80), it won't break on hyphens:

This is long line to be wrapped with a hyphenated word to illustrate the
word-hyphenation issue.

In Emacs with (setq-default fill-column 80), fill-paragraph won't break on hyphens either:

This is long line to be wrapped with a hyphenated word to illustrate the
word-hyphenation issue.

(Relevant code)
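For comparison, Python's standard `textwrap` module exposes the same choice through its `break_on_hyphens` option, which defaults to true. The sketch below wraps the same sample text both ways and then joins the lines with a space, as the historical segment break handling would. The helper names `hard_wrap` and `historical_collapse` are hypothetical.

```python
import textwrap

TEXT = ("This is long line to be wrapped with a hyphenated word "
        "to illustrate the word-hyphenation issue.")


def hard_wrap(text, break_on_hyphens):
    # Hard-wrap at 80 columns, optionally allowing breaks after hyphens.
    return textwrap.wrap(text, width=80, break_on_hyphens=break_on_hyphens)


def historical_collapse(lines):
    # Historical rule: every line break in the source becomes one space.
    return " ".join(lines)


for flag in (False, True):
    lines = hard_wrap(TEXT, break_on_hyphens=flag)
    print(flag, lines)
    print(historical_collapse(lines))
```

With `break_on_hyphens=False` the round trip reproduces the original text. With `break_on_hyphens=True` the first line ends in "word-", so collapsing the break into a space yields "word- hyphenation".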

asmusf commented 4 years ago

Editors oriented at programmers may possibly be more protective of hyphens because of their use in certain types of identifiers. That would be different from regular text.

xfq commented 4 years ago

Editors oriented at programmers may possibly be more protective of hyphens because of their use in certain types of identifiers. That would be different from regular text.

I think most editors just treat the hyphen as part of the word and don't explicitly protect it from line-breaking (i.e., they don't treat hyphens specially unless the user changes the config and/or provides a customized function for hard wrapping).

But regardless of what the editors do, my point (as in https://github.com/w3c/csswg-drafts/issues/5017#issuecomment-627723036 , and I believe it's also @kidayasuo's point) is that we should document where we expect the authors (and authoring tools) to break lines in the source code, maybe as informative notes with examples, in addition to the existing examples in https://drafts.csswg.org/css-text-3/#line-break-transform

fantasai commented 4 years ago

@r12a:

Which aspect of ambiguity are we talking about here?

a. a space may be added where not wanted, e.g. 支持W3C实现“\尽展
b. a space may be removed when it should stay, e.g. 空格字符“ \”不可见

This issue is about trying to address a) without introducing problems with b).

@kidayasuo: Soft wrapping lines of text for presentation handles breaks at emdash and hyphens OK, because there isn't actually any space introduced, so rewrapping at a different point doesn't have a problem. When wrapping source code, though, we are introducing characters at the break. The historical rules here are not very sophisticated: if there is a break in the source, it is turned into a space. This is OK for space-separated languages: it is easy to find a place to break at a space nearby, and we know to only break at spaces in our source code even if it is sometimes a little off from the ideal breakpoint if we were wrapping text for presentation.

CSS3 is trying to make this a bit more sophisticated, so that Chinese and Japanese can also benefit from being able to break within paragraphs in their source code. Since these languages don't use spaces at all, under the historical rules they cannot break lines in the source at all, because every break would introduce a space.

Because historically breaks became spaces, the new rules are biased to be "conservative" in that they will apply the historical behavior of turning the break into a space when it's not clear whether to discard or not. The currently proposed rule (which is similar to what Firefox currently implements) thus says, if both sides of the break are CJ, discard. If either side is not, use a space.

This issue is about fine-tuning those new rules, specifically about whether to introduce a concept of an "ambiguous" character, which defers to the opposite side whether CJ or not (and which defaults to space if the other side is also ambiguous). The primary use case for this would be Symbols.

fantasai commented 4 years ago

Agenda+ to propose deferring this issue to L4.

litherum commented 4 years ago

When I argued for using Unicode blocks in https://github.com/w3c/csswg-drafts/issues/337, it was exactly for the reason that we wouldn't have to do this sort of codepoint-by-codepoint analysis in the spec and in UAs.

I think this entire feature is not worth the complexity it seems to require.

kojiishi commented 4 years ago

+1 to @litherum but let me add another reason.

Adding more heuristic rules has both pros and cons for authors. I think this case is net-negative for authors.

As @r12a pointed out in his comment in #4992:

I'm inclined to think that we need to expect authors to make adjustments sometimes to resolve ambiguities.

For authors to make these adjustments, predictability is critical. Authors must run the rules in their heads and predict whether a space is inserted or not whenever they see new lines in their HTML files or when reviewing someone else's HTML files. Adding more heuristic rules has a negative impact on this process.

If the heuristic rules are accurate enough so that authors don't have to worry about the adjustments, adding rules is good for them, but by now I think we agree that it is not technically possible. If we expect authors to make the adjustments, keeping the rules simple is critical for them.

css-meeting-bot commented 4 years ago

The CSS Working Group just discussed [css-text-3] Discarding Line Breaks Adjacent to Ambiguous Characters, and agreed to the following:

RESOLVED: defer the behavior of Discarding Line Breaks Adjacent to Ambiguous Characters to L4 of css-text

The full IRC log of that discussion

<dael> Topic: [css-text-3] Discarding Line Breaks Adjacent to Ambiguous Characters
<dael> github: https://github.com/w3c/csswg-drafts/issues/5017
<dael> fantasai: Prop was defer to L4
<dael> Rossen_: That's an easy proposal
<dael> Rossen_: Any reason not to defer?
<dael> Rossen_: Not hearing any reasons
<dael> Rossen_: Objections to defer the behavior of Discarding Line Breaks Adjacent to Ambiguous Characters to L4
<dael> RESOLVED: defer the behavior of Discarding Line Breaks Adjacent to Ambiguous Characters to L4 of css-text
<dael> Rossen_: Issue has a lot more to discuss so I encourage reengagement