w3c / csswg-drafts

CSS Working Group Editor Drafts
https://drafts.csswg.org/

[css-text-4] Add support for content-detection, Bunsetsu- (the smallest unit of words that sounds natural) or phrases-based line breaking #6730

Closed chrishtr closed 1 year ago

chrishtr commented 3 years ago

Proposal

Add a CSS property that provides a way for developers to specify that they would like to use a phrase-based, content detection-based algorithm for line breaking. Implementations would use this CSS property to trigger use of a library that tries to determine phrase boundaries in text and break lines accordingly.

Example (hopefully I got this right, I don't speak Japanese):

A phrase often consists of multiple words. The following Japanese example consists of 6 words, but has 3 phrases.

名前 中野 です。
My name is Nakano .
Noun Particle Noun Particle Noun Auxiliary verb
Phrase 1 Phrase 2 Phrase 3

Phrase-based line breaking is often desired for headline-type text: text in a graphic display context, usually at large sizes, such as titles, headings, billboards, or advertisement graphics, especially in languages such as CJK or Thai.

In some use cases such as accessibility content for children, phrase-based line breaking is also useful in a reading context at regular body text sizes.

Design constraints I know of:

(Note: There may be a use case for a developer overriding this fallback path, e.g. by specifying “word phrase none” as the mode, meaning word line breaking is preferred, falling back to phrase, and then to character.)

(+) Compare with the word-boundary-detection property, which currently requires a language when using the auto keyword. word-boundary-detection has this restriction because it is paired tightly with keep-all, whereas the phrase-based feature is not.

Existing support for word-based line breaking does not quite meet these requirements.

astearns commented 3 years ago

I think that wrap-inside:avoid is meant to be the CSS solution for this situation, but it relies on markup to determine “phrasal” boundaries.

https://www.w3.org/TR/css-text-4/#example-avoid

If it is possible to automate finding phrase boundaries, perhaps the right layer to apply it is to semantic markup.

chrishtr commented 3 years ago

If it is possible to automate finding phrase boundaries, perhaps the right layer to apply it is to semantic markup.

Hi Alan, I'm not sure what you're suggesting here. Are you suggesting that the UA can simulate semantic markup changes in the HTML via an automated process?

astearns commented 3 years ago

No, I was thinking of an authoring step that would do that. Do you have a particular “library that tries to determine phrase boundaries in text” in mind?

chrishtr commented 3 years ago

No, I was thinking of an authoring step that would do that. Do you have a particular “library that tries to determine phrase boundaries in text” in mind?

Yes. There are some libraries under development that can do semantic detection like this.

frivoal commented 2 years ago

@chrishtr Can you have a look at https://drafts.csswg.org/css-text-4/#word-boundaries, and more specifically at https://drafts.csswg.org/css-text-4/#word-boundary-detection

Even though it uses the notion of "word" while you're speaking about phrases, it seems to me that these are very closely related, and aiming for the same (or at least overlapping) use cases.

frivoal commented 2 years ago

Oops, sorry, I just saw that you did mention it. It seems that the main restriction is that you don't want to specify the language. If that can be made to work reliably, a language-agnostic value could be added to that property.

frivoal commented 2 years ago

Anyway, to me, it seems we need more iteration on https://drafts.csswg.org/css-text-4/#word-boundaries rather than a completely separate thing, as there's significant overlap between what that's trying to achieve and what you're proposing.

chrishtr commented 2 years ago

@frivoal iteration on an existing property would be fine. As you mentioned, I think it's necessary not to specify the language (plus the additional implications I mentioned in my original comment way above), and also likely to have the fallback semantics I mentioned.

jungshik commented 2 years ago

JFYI, in the case of Korean, 'word-break: keep-all' almost works for the use cases described in the proposal because Korean does use spaces between words/phrases, unlike Chinese and Japanese. 'Almost' (not completely) because compound nouns ("concatenation" of multiple nouns without inter-word space) wouldn't have line-breaking opportunities with 'word-break: keep-all', because 'word-break: keep-all' is (almost) entirely space-based.

In addition to use cases enumerated in the proposal, web page authors may want to use this type of line-breaking for ragged paragraphs as opposed to justified.

r12a commented 2 years ago

First thoughts, a small correction and then a question or two.

A phrase often consists of multiple words. The following Japanese example consists of 6 words, but has 3 phrases.

名前 中野 です。
My name is Nakano .
Noun Particle Noun Particle Noun Auxiliary verb
Phrase 1 Phrase 2 Phrase 3

The table should read:

私 の 名前 は 中野 です。
My name topic marker Nakano is.
Noun Particle Noun Particle Proper noun Verb

Note that, linguistically, the topic particle actually describes the whole phrase '私の名前', not just 名前. So we should probably define clearly what we mean by 'phrase'.

My initial suspicion is that this is actually only relevant to Japanese, and aims to prevent particles from wrapping without the preceding word. I think that in most languages attached suffixes are not separated from the word, and spaces are used around both, as mentioned for Korean. (Mongolian has gaps between some words and suffixes, but these are created by dedicated characters such as NNBSP or MVS.)

I'm curious to understand the application for Thai, which i thought doesn't have particles of this kind, and where line break opportunities are generally indicated by heuristics that divide words, or by use of ZWSP. Do you have examples of where Thai needs help?

Do you also have examples of Chinese needing to keep together things that are associated with an adjoining word in this way?

Since this mentions non-CJK languages, is there an idea that languages that separate words with spaces will also need this option?

I find myself wondering whether the issue at hand is rather how word-boundary detection works, and whether instead we should define a property for that. Note, for example, that if you double-click on 名前は the browser usually highlights the compound noun and the particle separately. However, perhaps one could define a property that tells the browser to keep nouns and particles together as a single 'word' unit. That kind of instruction may be more widely useful than just for line-breaking, eg. it may change the 'word' selection behaviour too.
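As a minimal sketch of the "keep nouns and particles together" idea, the following groups already-tokenized words into phrase units by attaching dependent tokens to the token before them. The tag names (`noun`, `particle`, `copula`) and the function `group_into_phrases` are assumptions made for illustration; a real analyzer would supply its own tags, and this toy rule is not a definition of Bunsetsu.

```python
from typing import List, Tuple

# Tags whose tokens attach to the preceding token. The tag names are
# assumptions made for this sketch, not the output of any real analyzer.
DEPENDENT_TAGS = {"particle", "copula"}


def group_into_phrases(tokens: List[Tuple[str, str]]) -> List[str]:
    """Merge each dependent token into the phrase that precedes it."""
    phrases: List[str] = []
    for surface, tag in tokens:
        if phrases and tag in DEPENDENT_TAGS:
            # Particles and the copula never start a new phrase here.
            phrases[-1] += surface
        else:
            phrases.append(surface)
    return phrases


tokens = [
    ("私", "noun"), ("の", "particle"),
    ("名前", "noun"), ("は", "particle"),
    ("中野", "noun"), ("です", "copula"),
]
print(group_into_phrases(tokens))  # ['私の', '名前は', '中野です']
```

The six words collapse into three units, and line breaks would then be allowed only between those units.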

kojiishi commented 2 years ago

@r12a: Thanks for the feedback.

linguistically, the topic particle actually describes the whole phrase '私の名前', not just 名前. So we should probably define clearly what we mean by 'phrase'.

If we were to define it, I'd like to suggest the definition of "Bunsetsu" ("phrase" in Japanese) from Wikipedia, via Google Translate:

Bunsetsu is the smallest unit (different from words) that does not become unnatural when words are divided into small pieces

Since it's about "natural" line breaking, which is ambiguous, I think both are correct. Multiple Japanese organizations publish different guidelines, and they may produce different results for the same text. I think it is similar to how different organizations may disagree on whether a word is a noun or a compound noun.

I'm curious to understand the application for Thai...

We're still learning about them, sorry, and don't have enough details yet, but we hear that Thai and Chinese authors find the current line breaking sometimes "unnatural" and want a more natural one. If you look at the source HTML of the Apple Thai page, you will find nowrap, &nbsp;, and <wbr> used to produce "natural" line breaking, for example:

<span class="nowrap">ในคอลเลกชั่น</span> Black&nbsp;Unity&nbsp;ใหม่ <wbr><span class="nowrap">ได้รับแรงบันดาลใจ</span>

Since this mentions non-CJK languages, is there an idea that languages that separate words with spaces will also need this option?

We're still learning about them too, but here are examples from the English Apple Card page:

For&nbsp;Apple&nbsp;Card eligibility requirements
Get started<br /> with&nbsp;Apple&nbsp;Card.

It looks like the page author thinks that not breaking after "For" or "with", and before the product name, is more natural.

it may change the 'word' selection behaviour too.

Do you want to select "with Apple Card" as one word?

macnmm commented 2 years ago

For Russian would you want to keep certain prepositions like «с» (with) together with the word following using this setting?

Also +1 to @r12a ’s comment about there being different ways to break Japanese phrases than just linguistic analysis of the parts. It would seem there should be levels of grouping allowed similar to Kinsoku levels of “weak” and “strong” to cover this.

kojiishi commented 2 years ago

@macnmm:

For Russian would you want to keep certain prepositions like «с» (with) together with the word following using this setting?

Thank you for the feedback. I know nothing about Russian, but if it looks more "natural" in Russian not to break there, I think it should not break. The intention of the feature is "natural" line breaking, so I think the results are likely to vary by language, or by engine.

It would seem there should be levels of grouping allowed similar to Kinsoku levels of “weak” and “strong” to cover this.

That's an interesting idea, thank you. We can allow the phrase-based line breaking engine to use the line-break property to adjust its level, or add a separate property.

xfq commented 2 years ago

(Disclaimer: although I speak Chinese, I am not familiar with Chinese grammar or the style rules of Chinese publishers. These are just my personal experiences and thoughts.)

@r12a said:

Do you also have examples of Chinese needing to keep together things that are associated with an adjoining word in this way?

One such example would be the classifier and the preceding numeral, e.g., in 三双筷子 there should be no line break between 三 and 双. I think this is true even when the classifier is Western text (like "m" instead of 米 for metre). Although there may be spacing between Chinese and Western text, there should be no line breaks.

There are similar examples in Japanese, like シャツ三枚.

Another example is the perfective-aspect le (了), which immediately follows the verb. IMHO when used as an aspect marker, le and the preceding verb should not be broken into two lines.

Back to the original discussion, I think whether the result of line breaking is satisfactory is a subjective question, and it is difficult to have a method that works in all cases. We'd better provide a not-too-bad default value and allow developers to customize it (by switching strictness profiles/levels or changing phrase-based line breaking to word-based and modifying it by adding things like <span class="nowrap"> themselves).

r12a commented 2 years ago

In case it produces a faster response than the email i sent to the CSS WG list, let me mention here that the link to https://drafts.csswg.org/css-text-4/#word-boundary-detection has not been working for some time, and i'm unable to read the examples in the text by looking directly at the .bs file. This is making it difficult for me to formulate suggestions for this issue. Can someone reading this fix the link?

himorin commented 2 years ago

There are similar examples in Japanese, like シャツ三枚.

I'm not quite sure whether a similar processor exists, but for Japanese, MeCab or a similar processor will mark this as a counter suffix, and no phrase boundary will be placed there.

r12a commented 2 years ago

Here are some more questions that occurred to me while thinking this through.

  1. Should we have an additional parameter phrase to create the segmentation desired here, or should we aim to convince the people creating the segmentation algorithms that (at least for Japanese & Chinese) they should be segmenting by default on phrases, and therefore we'd add a word parameter to do the opposite, ie. break particles and such apart? Looking at the Chinese examples, it seems like the phrase approach is a better default. Not sure whether there are backwards compatibility issues with that.
  2. I'm finding it hard to see the Thai case as similar. My understanding is that the Thai case is to do with whether or not to aggressively/accurately break compound words. I wonder to what extent that needs linguistic understanding, so that we don't break things that really shouldn't be split (like breaking 'blackbird' in English). It may be that we could provide a preference for situations where the segmentation is down to personal preference, but the segmentation algorithms would need to allow for that choice by beefing up the sophistication of their parsing. But it doesn't seem to me to involve the same set of criteria as keeping phrasal parts together.
  3. I can see the potential for keeping prepositions with associated words in languages that have spaces between words, but i assume that that would need the user agent to start applying linguistic analysis on a language-specific basis to a large number of languages. Is that feasible?

(Btw, fwiw, no-one has mentioned it yet, and i don't remember seeing it in the css-text-4 spec, but if you want to do this kind of thing manually then U+2060 WORD JOINER is your friend. (Does the opposite of ZWSP/<wbr>.))

kojiishi commented 2 years ago

@r12a

I'm finding it hard to see the Thai case as similar.

I think the title and the original description were misleading, sorry about that. The actual intention of this issue is to support "natural" line breaking. From our point of view, handling particles as part of a phrase is an example for Japanese that explains what it wants to achieve. It may also include handling compound nouns for Japanese/Chinese, or handling "ใหม่" (new) as part of a phrase in Thai.

i assume that that would need the user agent to start applying linguistic analysis on a language-specific basis to a large number of languages. Is that feasible?

One possible way to implement this is to use ML, as done in BudouX (you can try it out with an extension). Currently it supports Japanese only, but its basic idea is applicable to any language.

if you want to do this kind of thing manually then U+2060 WORD JOINER is your friend.

Thanks, it is indeed helpful. I hope CSS can support better ways than inserting WORD JOINER at every break opportunity, but until then, we can use the workaround.
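The manual workaround could be automated along these lines. This is a minimal sketch, assuming the phrase boundaries are already known; `mark_phrases` is a hypothetical helper, not part of any library mentioned in this thread. It inserts U+2060 WORD JOINER between the characters of each phrase and U+200B ZERO WIDTH SPACE between phrases.

```python
from typing import List

WORD_JOINER = "\u2060"  # U+2060 WORD JOINER: suppresses a break at this position
ZWSP = "\u200b"         # U+200B ZERO WIDTH SPACE: allows a break at this position


def mark_phrases(phrases: List[str]) -> str:
    """Glue the characters of each phrase together and allow breaks between phrases.

    Assumption: joining per code point is good enough for this example.
    Text with combining marks would need grapheme-cluster segmentation first.
    """
    glued = [WORD_JOINER.join(phrase) for phrase in phrases]
    return ZWSP.join(glued)


marked = mark_phrases(["私の", "名前は", "中野です。"])
# Show the inserted control characters as visible escapes.
print(marked.encode("unicode_escape").decode("ascii"))
```

The cost is visible in the output: every position inside a phrase carries an extra character, which is the overhead the comment above hopes CSS can avoid.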

r12a commented 2 years ago

I think the title and the original description were misleading, sorry about that. The actual intention of this issue is to support "natural" line breaking. From our point of view, handling particles as part of a phrase is an example for Japanese that explains what it wants to achieve.

Yes, i get that, thanks. Perhaps we should change the issue title ?

It may also include handling compound nouns for Japanese/Chinese, or handling "ใหม่" (new) as part of a phrase in Thai.

Unless there are user preferences for whether or not Thai compounds are split/broken as a general rule, i worry that that is in the territory of in/correct segmentation, rather than natural segmentation, if you see what i mean.

[I added clreq and sealreq (SE Asia) labels to the issue, so that those folks will see it.]

r12a commented 2 years ago

Is the intention to treat natural line breaking as a separate set of controls from those used for kinsoku-like rules (punctuation wrapping) and the strict|normal|loose controls for controlling line-breaking around small kana?

fantasai commented 2 years ago

My personal read on this issue is that there is a lot more research and development to be done here, and that it's premature to build this into CSS. If we just added a value to "turn this on", each implementation will break substantially differently as it tries to find what "natural line breaking" is for any given language, and we'd end up locked into one particular algorithm as Web compat builds on whatever implementations came first, regardless of what is actually more "natural". And minority languages will suffer the worst mistakes.

Line-breaking has significant impact on layout, especially if we're working with higher-level, and therefore larger, constructs. Non-interop across implementations or across time can create real breakage.

Instead, I'd like to see the wrap-* properties implemented so that they can be used by server-side and JS libraries (as well as manually), which allows for a lot more experimentation. Or we could add an API that allows the page to provide a JS library to apply additional line-breaking restrictions on top of the default set, if we don't want to touch the markup. Then down the line if these libraries end up converging on a set of common behaviors, we can consider standardizing those, language by language.

Yes, this is more heavyweight on a given page than a native browser implementation. But it avoids locking us into compat restrictions that prevent any improvement in the feature once deployed. For something that depends on linguistic analysis, which is full of constantly improving heuristics, I think it's important not to get locked in. It's OK if a page chooses a library that it deems good enough. It's a problem if the browser chooses a library that, in the context of all the world's content, is not good enough.

CC @litherum

litherum commented 2 years ago

A few thoughts:

  1. I think this (abstract) feature is a good idea. Browsers can do much better in their line breaking than they do.
  2. Guarding this behind an opt-in is a good idea for performance. (Edit: Actually, depending on how this feature is scoped, it may actually make sense to experiment with enabling it by default. Benchmarks and a proof-of-concept implementation would be useful.)
  3. I'm not sure that this is implementable today. I'm not aware that either Foundation or ICU has any functionality to determine these breaking locations. Without a demonstration of how to implement this, I'd be against this proposal.
  4. As for the mechanism of exposing this in CSS, I don't think I have opinions. We already have text-wrap: pretty so this seems like it may want to be another value to the text-wrap property.
  5. I wonder whether the algorithm for this new line breaking mode would be "it's just like the greedy approach we have today, but the opportunities are in different places" or if it's more complicated like "you can break in some particular position, but there's a cost, and it's only worth it if breaking there means you can choose better positions in the rest of the paragraph"
  6. We (or Unicode) would also need to determine how this would work in all languages, not just Japanese. English has phrases, and there is an art of laying out a title (e.g. in print publications). Would it apply there?
  7. wrap-inside:avoid is kind of similar to hyphens:manual, but the real feature here would be the equivalent of hyphens:auto. That's why I don't think that wrap-inside:avoid is sufficient. You'd want a single switch to flip on this kind of line breaking, rather than having the author have to implement it themselves. And, if the author actually wants to implement it themselves, wrap-inside:avoid is there for them, and that will cause consistent renderings across browsers.

kojiishi commented 2 years ago

Thanks for the feedback again and sorry for my belated replies.

@r12a

Is the intention to treat natural line breaking as a separate set of controls from those used for kinsoku-like rules (punctuation wrapping) and the strict|normal|loose controls for controlling line-breaking around small kana?

Yes. Authors want to control the strength of Kinsoku-rules separately from this feature.

@litherum:

  1. I think this (abstract) feature is a good idea. Browsers can do much better in their line breaking than they do.

Fully agree with you.

  2. Guarding this behind an opt-in is a good idea for performance. (Edit: Actually, depending on how this feature is scoped, it may actually make sense to experiment with enabling it by default. Benchmarks and a proof-of-concept implementation would be useful.)

Not only for performance, but this should be authors' choice.

For example, the default line breaking of Japanese TeX is "balanced" with normal break opportunities (every character except where the Kinsoku rules apply.) This is because ragged right is rather a large penalty for CJK line breaking. Authors normally prefer less ragged-right lines over phrase-based break opportunities for body text, but may prefer phrase-based line breaking for display text. They may want to use "balanced" line breaking for both cases.

  3. I'm not sure that this is implementable today. I'm not aware that either Foundation or ICU has any functionality to determine these breaking locations. Without a demonstration of how to implement this, I'd be against this proposal.

ICU 71 supports Japanese phrase-based line breaking with a new value for the lw keyword. The lw keyword was chosen because the phrase-based behavior is mutually exclusive with break-all and keep-all.

The original BudouX supports Python and JavaScript only, but it was ported to Swift, Go, and Rust.

Android 13 supports wrapping text by Bunsetsu (the smallest unit of words that sounds natural) or phrases.

  5. I wonder whether the algorithm for this new line breaking mode would be "it's just like the greedy approach we have today, but the opportunities are in different places" or if it's more complicated like "you can break in some particular position, but there's a cost, and it's only worth it if breaking there means you can choose better positions in the rest of the paragraph"

Greedy vs paragraph-level-balanced is a related topic, but they should be set separately, at least for some languages such as Japanese. I'm not sure other languages, such as English, always want to turn on/off both switches together.
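To illustrate the first option ("the greedy approach we have today, but the opportunities are in different places"), here is a minimal sketch. It assumes line width can be approximated by character count, and `greedy_wrap` is a hypothetical helper rather than any engine's algorithm. The same greedy loop is run twice: once with every character as a unit, once with phrases as units. A unit longer than the line falls back to character breaking.

```python
from typing import List


def greedy_wrap(units: List[str], max_chars: int) -> List[str]:
    """Fill each line with as many whole units as fit, breaking only between units."""
    lines: List[str] = []
    current = ""
    for unit in units:
        # Start a new line when the next unit does not fit on the current one.
        if current and len(current) + len(unit) > max_chars:
            lines.append(current)
            current = ""
        # Fallback: a unit longer than a whole line is split by character.
        while len(unit) > max_chars:
            lines.append(unit[:max_chars])
            unit = unit[max_chars:]
        current += unit
    if current:
        lines.append(current)
    return lines


text = "私の名前は中野です。"
by_character = greedy_wrap(list(text), 6)
by_phrase = greedy_wrap(["私の", "名前は", "中野です。"], 6)
print(by_character)  # ['私の名前は中', '野です。']
print(by_phrase)     # ['私の名前は', '中野です。']
```

With character units the name 中野 is split across lines; with phrase units the first line is shorter but the break falls at a phrase boundary. A paragraph-level balanced breaker would be a separate choice layered on the same units.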

  6. We (or Unicode) would also need to determine how this would work in all languages, not just Japanese. English has phrases, and there is an art of laying out a title (e.g. in print publications). Would it apply there?

Excellent point, thank you for pointing this out. I believe it should apply too, as the Apple web site does and as I replied to @r12a above, but we are not sure yet how exactly it should work. We know it's applicable to Japanese. We have some good ideas for Chinese, and some rough ideas for Thai and English.

I think, at this moment, "it might apply to other languages" is fine to define a property in CSS. It's similar to how CSS defines a CJK "word" today; sometimes a compound noun is a word, sometimes it's multiple words, they vary depending on the dictionaries, era, or how authors feel more "natural".

  7. wrap-inside:avoid is kind of similar to hyphens:manual, but the real feature here would be the equivalent of hyphens:auto. That's why I don't think that wrap-inside:avoid is sufficient. You'd want a single switch to flip on this kind of line breaking, rather than having the author have to implement it themselves. And, if the author actually wants to implement it themselves, wrap-inside:avoid is there for them, and that will cause consistent renderings across browsers.

Agreed. Also, for pre-processors like BudouX, wrapping each phrase in a span is complicated work. For example:

<div>Phra<span style="border: 1px solid blue">se1 Phra</span>se2</div>

It's not easy to wrap "Phrase1" and "Phrase2" each in a span. Maybe one can adjust borders, but there are more -- background-image, filter, etc. It'd be great if it becomes easier for pre-processors.
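The difficulty can be made concrete with a small sketch. It assumes the pre-processor sees the element's text nodes in document order and has already detected the phrases; `phrases_to_nodes` is a hypothetical helper. For each phrase it reports which text nodes the phrase overlaps. Any phrase that overlaps more than one node cannot be wrapped in a single new span without splitting an existing element.

```python
from typing import List, Tuple


def phrases_to_nodes(text_nodes: List[str], phrases: List[str]) -> List[Tuple[str, List[int]]]:
    """For each phrase, list the indexes of the text nodes it overlaps."""
    # Character range [start, end) of each text node in the concatenated text.
    ranges = []
    pos = 0
    for node in text_nodes:
        ranges.append((pos, pos + len(node)))
        pos += len(node)

    text = "".join(text_nodes)
    result = []
    search_from = 0
    for phrase in phrases:
        start = text.index(phrase, search_from)
        end = start + len(phrase)
        # Two half-open ranges overlap when each starts before the other ends.
        overlapping = [i for i, (s, e) in enumerate(ranges) if s < end and start < e]
        result.append((phrase, overlapping))
        search_from = end
    return result


# Text nodes of the example above: before, inside, and after the bordered span.
nodes = ["Phra", "se1 Phra", "se2"]
mapping = phrases_to_nodes(nodes, ["Phrase1", "Phrase2"])
print(mapping)  # [('Phrase1', [0, 1]), ('Phrase2', [1, 2])]
```

Both phrases straddle the boundary of the bordered span, so a wrapper around either one would have to cut that span in two, along with its border, background, or filter.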

kojiishi commented 2 years ago

@r12a

Yes, i get that, thanks. Perhaps we should change the issue title ?

Done, happy to hear if there are any better suggestions.

Unless there are user preferences for whether or not Thai compounds are split/broken as a general rule, i worry that that is in the territory of in/correct segmentation, rather than natural segmentation, if you see what i mean.

Thai is still at an early stage, and we may conclude that it's not possible to create general rules for Thai. But we hear desires to improve line breaking for display text from multiple Thai authors, so I think it's worth investigating further. Does this match what you meant?

kojiishi commented 2 years ago

@fantasai

My personal read on this issue is that there is a lot more research and development to be done here, and that it's premature to build this into CSS. If we just added a value to "turn this on", each implementation will break substantially differently...

From our point of view, what this issue is asking for is:

  1. Trying to show examples of "word boundaries (or phrase boundaries)" defined in Word Boundaries in the current CSS Text for Japanese and a few more languages, to define them more strictly and to reduce implementation differences.
  2. Sharing feedback we've got so far in regards to the automatic word boundaries detection, and wish to fulfill their requests.

Can I ask whether you're against the current Word Boundaries in CSS Text, or whether you feel the current "word boundaries (or phrase boundaries)" defined in Word Boundaries are solid but the phrase boundaries in this issue are premature? If the latter, can I ask how you see these two differently?

kojiishi commented 2 years ago

FYI, BudouX now supports Simplified Chinese. @r12a

r12a commented 2 years ago

This from the ICU 71 release note:

ICU 71 adds phrase-based line breaking for Japanese. Existing line breaking methods follow standards and conventions for body text but do not work well for short Japanese text, such as in titles and headings. This new feature is optimized for these use cases.

litherum commented 2 years ago

🆒‼️

murata2makoto commented 2 years ago

It is extremely difficult to precisely define what Bunsetsu is in the Japanese language. Moreover, I know that a major Japanese publisher failed to apply its in-house definition consistently when preparing its textbooks. (And nobody cares.) I also think that no matter what definition we provide, different dictionaries will lead to different results.

I hope that a rough consensus of Bunsetsu is achieved and applied to every relevant feature of CSS.

murata2makoto commented 2 years ago

There are two major styles of space-separated Japanese writing. One is based on bunsetsu. The other is based on words (This is not actually correct, but I do not want to go into details here). For example, the bunsetsu-style provides

私の 名前は 中野です

while the other provides

私 の 名前 は 中野 です

One of my dyslexic colleagues found it difficult to read the former when he was an elementary school student. However, is this a common problem?

Based on a Japanese government research fund, Okumura-sensei of Osaka Medical and Pharmaceutical University and I have conducted a series of experiments for three years. More than ten students in the Learning Disability Centre of this university participated in these experiments. Up to now, we have no reasons to believe that the second style is significantly better than the first style to any of the students.

murata2makoto commented 2 years ago

Some practitioners in the Japan DAISY Consortium have tried BudouX. They welcome it wholeheartedly and are trying to use it in Japanese DAISY textbooks.

jungshik commented 2 years ago

I don't speak Chinese, but I know a tiny bit of classical Chinese (think of it as the Latin of East Asia) and can come up with a few examples.

花紅了: should not break before 了.
中国的长城: break between 的 (of) and 长城; keep 中国的 together and 长城 together; do not break before 的.
王侯将相宁有种乎: should not break before 乎; keep 王侯将相 together, or 王侯 and 将相 each together.

fantasai commented 2 years ago

Relevant Internationalization Working Group minutes

nigelmegitt commented 1 year ago

There is also a need for grammar-based line breaking in English - see for example the BBC Subtitle Guidelines' section on breaking at natural points, which requires manual line breaks to be inserted based on grammatical rules. This has been a subtitle/caption authoring practice in the UK for decades.

If there were a good way to move this to the rendering domain, that would have accessibility benefits, for example by avoiding the need for authors to insert explicit line breaks, so that the text is easy to read regardless of how many lines it flows onto. This is not the same thing as text-wrap: balance, which takes no notice of grammar and just makes the lines a similar length.

kojiishi commented 1 year ago

Closing, as everything this issue needs was resolved at https://github.com/w3c/csswg-drafts/issues/7193#issuecomment-1611772475