[css-text-4] Avoiding line breaks in the middle of CJK words/phrases

litherum commented 1 year ago

[This issue description focuses on Japanese, but is applicable to Chinese, Japanese, and Korean.]

Historically, in web browsers, line breaks are allowed between most Japanese characters. This is fine, but text can be somewhat easier to read when there are no line breaks in the middle of Japanese words.

Here's an example: [screenshot: appletv]

We've been working on some new line breaking behavior with the intent of keeping words together: [screenshot: appletv]

We've found that this makes it a little bit easier to read.

(One interesting side-effect is that, because there are significantly fewer line breaking opportunities throughout the text, the right edge, which used to be flush, now becomes ragged.)

We've enabled this behavior on our upcoming operating systems for UI text and some text content, and it's been working pretty well. However, there is a performance cost to this dictionary-based approach, so we can't simply turn it on for all content.
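To make the cost of a dictionary-based approach concrete, here is a minimal sketch of greedy longest-match segmentation. It is not the platform implementation discussed above; the `segment` function and the toy dictionary are assumptions made up for illustration, and real segmenters use far larger dictionaries and more sophisticated models. The point is only that every character position triggers one or more dictionary lookups, which plain character-based breaking never needs.

```python
def segment(text, dictionary):
    """Greedy longest-match segmentation against a (hypothetical) word list."""
    longest = max(map(len, dictionary), default=1)
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(longest, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                pieces.append(candidate)
                i += size
                break
    return pieces


# Toy dictionary, assumed for this example only.
TOY_DICTIONARY = {"文節", "改行", "よる", "行う"}

print(segment("文節による改行を行う", TOY_DICTIONARY))
# ['文節', 'に', 'よる', '改行', 'を', '行う']
```

A line breaker built on this would allow breaks only between the returned pieces rather than between any two characters.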

We were hoping for the working group's advice about the best way to hook this new behavior up to the web. There are a few possible options, we think:

(The null hypothesis) Don't hook it up; do nothing. Continue to have Japanese line breaking be worse on the web than it is in native content.
Hook it up to text-rendering: optimizeLegibility. There isn't really a spec for optimizeLegibility, and it isn't a very good fit anyway: this isn't really about text rendering.
Hook it up to text-wrap: pretty. This also isn't really a perfect fit, as that value is supposed to be about multiline text breaking. OTOH, any multi-line text algorithm would probably try to keep Japanese words together, so maybe this could be seen as a first baby step toward the full multi-line algorithm.
Hook it up to word-break: keep-all. This is probably the closest fit we have already existing in CSS, as keep-all is meant to keep Korean words together. However, the spec is fairly prescriptive about exactly which characters have line breaking opportunities between them, and is currently incompatible with a dictionary-based approach.
Add a new value, possibly to word-break, or possibly to line-break, to explicitly turn on this kind of line breaking. This is probably the safest approach.
Use some kind of heuristic to apply it to what we detect to be headlines or UI text. I don't think this is a good idea, because any heuristic would probably need a new CSS property anyway to disable the heuristic, and if we're going to add new API surface to the web, we might as well just add a new value that specifically turns this behavior on and off.
Hook it up for all text, unconditionally. This isn't really an option, because of performance; I'm only including it here for completeness.

Here are some more screenshots to see the differences where I hardcoded the new behavior to be enabled. Please note that these line breakers are a work in progress; we are aware of some problems, and are still improving them.

[Five pairs of before/after screenshots]

frivoal commented 1 year ago

This is very desirable indeed. Sometimes people want this, as you said, in headlines, or for legibility reasons. Sometimes people want this because it helps with some forms of dyslexia, or because it's easier on children and language learners (and in those cases, it is occasionally paired with automated spacing of words).

> Historically, in web browsers, line breaks are allowed between most Japanese characters.

Not just in browsers. That's how Japanese is set the vast majority of the time in all media, so even if there were no performance concern, we'd still not want to do this by default. But indeed, there are desirable exceptions, so this is a good problem to solve.

This problem is one of the things https://drafts.csswg.org/css-text-4/#word-boundaries is intended to enable, but I must note that this part of the spec has received a bunch of feedback, based on which a significant overhaul is needed. I've got a pending action item to try and recast it into something that hangs off word-break (as you suggested may be the easier solution). I think this is likely to work quite well, with some open questions:

For Japanese, the need is typically expressed as wanting line breaks at phrase segment boundaries (文節に / よる / 改行を / 行う), not word boundaries (文節 / に / よる / 改行 / を / 行う), though we may want both modes to let authors pick, depending on context. This could be addressed by two keywords (word-break: […] | auto-words | auto-phrase). Or maybe we don't give a choice, and that's just a quality of implementation question.
If we do have both modes, the phrase segment mode could possibly be applied to languages with spaces as well, including those in the Latin alphabet, as is sometimes used in contexts like headings ("The wizard / of Oz" vs "The / wizard / of / Oz"). Do we want to open that up too, or should this be limited to languages that don't use spaces in the first place? Does this only (selectively) suppress the same kind of wrapping opportunities as keep-all could, or can it also turn spaces into NBSPs?
> (One interesting side-effect is that, because there are significantly fewer line breaking opportunities throughout the text, the right edge, which used to be flush, now becomes ragged.)

This raises an interesting question: Authors might want to conditionally apply some other styles when this works, for example justification. If the value of word-break that turns this on does not include a language parameter (which seems desirable), @supports won't help, because the value may be supported in general without the browser knowing how to do it for a particular language. What then? (There is a related problem for hyphenation: https://github.com/w3c/csswg-drafts/issues/5530)
What about the other thing that https://drafts.csswg.org/css-text-4/#word-boundaries does (space injection at word boundaries)?
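The difference between breaking at character, word, and phrase boundaries, and the ragged right edge mentioned in the quote, can be sketched with a toy greedy line filler. The `wrap` function is hypothetical; it assumes every character has the same width, and it takes the segmentations from the example above as given rather than computing them.

```python
def wrap(units, max_width):
    """Greedily pack unbreakable units into lines of at most max_width characters."""
    lines = []
    current = ""
    for unit in units:
        # Start a new line when the next unit would overflow the current one.
        if current and len(current) + len(unit) > max_width:
            lines.append(current)
            current = ""
        current += unit  # a unit longer than max_width overflows on its own line
    if current:
        lines.append(current)
    return lines


TEXT = "文節による改行を行う"
WORDS = ["文節", "に", "よる", "改行", "を", "行う"]      # word boundaries
PHRASES = ["文節に", "よる", "改行を", "行う"]            # phrase segment boundaries

print(wrap(list(TEXT), 4))  # ['文節によ', 'る改行を', '行う']
print(wrap(WORDS, 4))       # ['文節に', 'よる改行', 'を行う']
print(wrap(PHRASES, 4))     # ['文節に', 'よる', '改行を', '行う']
```

In this toy case, character-based breaking fills every line but splits よる; word-based breaking leaves the particle を stranded at the start of the last line; phrase-based breaking avoids both at the price of shorter, uneven lines.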

frivoal commented 1 year ago

Side note: based on the examples you've given, your implementation seems to stop doing 禁則処理 ("kinsoku shori", or forbidding breaks near punctuation marks) when you turn this mode on. You can see it for example in the second example, as the third line of the headline starts with a comma. That's unlikely to be desirable. Even if it does help a bit with ragged edges, this looks very unnatural and jarring.

> Please note that these line breakers are a work in progress; we are aware of some problems, and are still improving them.

That's probably what you were referring to.
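For readers unfamiliar with 禁則処理, the line-start half of the rule can be sketched as follows. This is a simplified illustration only: the character set is a tiny, assumed subset, the function names are hypothetical, and real rules (for example those behind the CSS line-break property) also cover characters that must not end a line.

```python
# Deliberately tiny, assumed set of characters that must not start a line.
NO_LINE_START = set("、。，．）」』！？")


def break_allowed(text, index):
    """Return True if a line may break immediately before text[index]."""
    if index <= 0 or index >= len(text):
        return False
    return text[index] not in NO_LINE_START


def wrap_with_kinsoku(text, max_width):
    """Character-based greedy wrap that avoids starting a line with NO_LINE_START."""
    lines = []
    start = 0
    while len(text) - start > max_width:
        end = start + max_width
        # Back up to the nearest position where a break is allowed.
        candidate = end
        while candidate > start and not break_allowed(text, candidate):
            candidate -= 1
        if candidate > start:
            end = candidate  # otherwise keep the forced break at max_width
        lines.append(text[start:end])
        start = end
    lines.append(text[start:])
    return lines


print(wrap_with_kinsoku("文節による、改行を行う。", 5))
# ['文節によ', 'る、改行を', '行う。']
```

Without the check, the second line would be 、改行を行, starting with a comma, which is the kind of break described in the comment above.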

litherum commented 1 year ago

To be clear, the implementation I’ve been discussing is part of the platform, being applied across the OS.

> For Japanese, the need is typically expressed as wanting line breaks at phrase segment boundaries (文節に / よる / 改行を / 行う), not word boundaries (文節 / に / よる / 改行 / を / 行う), though we may want both modes to let authors pick, depending on context. This could be addressed by two keywords (word-break: […] | auto-words | auto-phrase). Or maybe we don't give a choice, and that's just a quality of implementation question.

Right, titling this issue to be about “words” was a poor choice of … words. You’re right that phrases are the right granularity here. Our current architecture is designed to handle phrases with no problem. Therefore, this shouldn’t be an author choice, and we shouldn’t be adding 2 new values.

> Authors might want to conditionally apply some other styles when this works, for example justification.

I think this is another argument that we should be adding a new value, so that @supports works.

Also, yet another argument in favor of a new value: I don’t want to hook this up to an existing property, and then discover that it created a performance regression in our multilingual benchmarks. The best way to avoid a regression is to have this be opt-in, via a new value.

kojiishi commented 1 year ago

As Florian said above, the word-boundary-detection property allows the UA to break at phrase boundaries, and fantasai seems to be against more explicit control in #6730, so we're working on implementing the word-boundary-detection property.

> I've got a pending action item to try and recast it into something that hangs off word-break (as you suggested may be the easier solution).

If you're planning to refactor, I'd appreciate it if you could raise the priority. Blink is close to shipping this property.

litherum commented 1 year ago

Oh, interesting. I didn't know about that property!! That's probably the right solution here.

w3c/csswg-drafts issue #8920