w3c / csswg-drafts

CSS Working Group Editor Drafts
https://drafts.csswg.org/

[css-text-3] Allow breaking anywhere when dictionary is missing for SEA scripts #4284

Closed fantasai closed 5 years ago

fantasai commented 5 years ago

For scripts that require dictionary breaking or some other morphological analysis, if the resource is missing and the UA can't break the text, it should be allowed to break anywhere instead of overflowing.

fantasai commented 5 years ago

Note: This has been a problem with Javanese.

r12a commented 5 years ago

Well, perhaps not anywhere. Certainly grapheme clusters would likely be an appropriate minimum. For Javanese and other scripts, stacked consonants should not be split, and the natural unit for line-breaking is otherwise the syllable. I'm not completely sure about the rules for splitting vowel signs written to the left of the consonant from the rest of the syllable in scripts such as Thai and Lao (where these are not combining characters).

rcampbellbassac commented 5 years ago

I think that in Lao, there should be a way to detect the syllables. ICU supports word boundary analysis/word tokenization, so this is less of an issue now, except in the few applications that don't use ICU (Firefox doesn't yet use ICU's line-breaker, though it seems to use other parts of ICU).

But, if we want to be safe, and not assume that the browser has ICU support, it would be much more desirable to break at the syllable, rather than cutting one in half. Some vowels in Lao (Thai, Khmer, Burmese) wrap around the nuclear consonant, so if you break it at the wrong place, it cuts your vowel in half between 2 lines (very difficult to read). There is a document I will link to that helps explain this...

panl10n.net Syllabification of Lao Script for Line Breaking

rcampbellbassac commented 5 years ago

[image: excerpt on Lao syllable structure]

This excerpt explains the 'format' in which a Lao syllable is constructed.

Syllable breaking is usable in Lao and many people tend to be OK with it, but ultimately it isn't optimal: as I understand it, it can make text more difficult to read than line-breaking based on word boundaries.

jmdurdin commented 5 years ago

Not sure about the panl10n.net rules, but Lao syllable breaking has been fully implemented by Lao Script for Windows since about 1993 (usually using ZWSP insertion). Excluding loan words, there are few ambiguities, and they are easily managed by not allowing a break if it would be ambiguous. For Lao, as well as keeping grapheme clusters together, a break should never be allowed after a prefix vowel or before U+0EB2 LAO VOWEL SIGN AA, or either before or after U+0EBD LAO SEMIVOWEL SIGN NYO. Thai syllable breaking is much more difficult and requires a moderately large dictionary to be effective.
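
For illustration only, the "never break here" rules in the comment above can be sketched in Python. This is a minimal sketch under stated assumptions, not the Lao Script for Windows implementation: the function name `lao_break_allowed` is hypothetical, "prefix vowel" is assumed to mean U+0EC0 through U+0EC4, and grapheme clusters are approximated by refusing to break before a combining mark. It only filters out forbidden positions; it does not find syllable boundaries.

```python
import unicodedata

# Assumption: "prefix vowel" means the Lao vowels U+0EC0..U+0EC4.
PREFIX_VOWELS = {chr(cp) for cp in range(0x0EC0, 0x0EC5)}
VOWEL_SIGN_AA = "\u0eb2"   # U+0EB2 LAO VOWEL SIGN AA
SEMIVOWEL_NYO = "\u0ebd"   # U+0EBD LAO SEMIVOWEL SIGN NYO


def lao_break_allowed(text: str, i: int) -> bool:
    """Return True if none of the listed rules forbids a break before text[i]."""
    if i <= 0 or i >= len(text):
        return False
    before, after = text[i - 1], text[i]
    # Approximation of "keep grapheme clusters together":
    # never break before a combining mark.
    if unicodedata.category(after) in ("Mn", "Mc", "Me"):
        return False
    if before in PREFIX_VOWELS:          # no break after a prefix vowel
        return False
    if after == VOWEL_SIGN_AA:           # no break before U+0EB2
        return False
    if SEMIVOWEL_NYO in (before, after): # no break before or after U+0EBD
        return False
    return True


# U+0EC0 U+0EA1 U+0EB7 U+0EAD U+0E87: prefix vowel, consonant,
# combining vowel sign, consonant, consonant.
word = "\u0ec0\u0ea1\u0eb7\u0ead\u0e87"
candidates = [i for i in range(1, len(word)) if lao_break_allowed(word, i)]
# -> [3, 4]
```

The surviving positions are only candidates: a real syllable breaker would still need the syllable structure to decide which of them are actual boundaries.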

fantasai commented 5 years ago

Just want to be clear, this issue isn't about where to break correctly. It's what to do if you don't have the ability to break correctly.

fantasai commented 5 years ago

The proposal is "if you don't know where it's allowed to break, break somewhere, anywhere, instead of overflowing the box by not breaking at all". Because "hard to read" is better than "clipped and therefore unreadable".

css-meeting-bot commented 5 years ago

The CSS Working Group just discussed Allow breaking anywhere when dictionary is missing for SEA scripts, and agreed to the following:

The full IRC log of that discussion
<dael> Topic: Allow breaking anywhere when dictionary is missing for SEA scripts
<dael> github: https://github.com/w3c/csswg-drafts/issues/4284
<dael> fantasai: Certain languages where the breakpoint is not obvious from the character code. Have to do analysis. If you do not have the dictionary or rules in the engine you don't break the text and it'll be long and overflow. I suggest saying if you don't know how to break then you should break somewhere. Doesn't matter where, but between grapheme clusters. Have to have break opportunities
<dael> myles: Did you mean must?
<dael> fantasai: Yeah
<dael> fantasai: Proposal to add that. Discussion in issue about where to break in languages, but this is about what should happen when the UA doesn't have rules.
<dael> florian: I think saying you must break somewhere and not in the middle of a grapheme cluster. If you can do some intermediate analysis with a meaningful unit of breaking, do that. But must break and not break grapheme clusters
<dael> myles: How does browser know which scripts?
<dael> fantasai: There's a classification, let me see.
<fantasai> http://unicode.org/reports/tr14/#SA
<dael> fantasai: Class SA is complex context dependent. If you're one of these scripts and don't have a resource to tell you where to break you should break somewhere
<dael> myles: As long as spec says that this is fine
<dael> fantasai: Okay
<dael> astearns: Other concerns?
<dael> fantasai: Prop: If there is a language for which you do not know the breaking rules, rather than treating it as unbreakable you treat it as breakable anywhere
<dael> astearns: And something about not breaking through grapheme cluster?
<dael> fantasai: Yes. If we copy from overflow-wrap: anywhere that comes
<dael> RESOLVED: If there is a language for which you do not know the breaking rules, rather than treating it as unbreakable you treat it as breakable anywhere, similar to overflow-wrap: anywhere
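
As a rough illustration of the resolution above, the fallback can be sketched in Python: when no break opportunities are known for a run of text, break it anywhere except inside a grapheme cluster. This is a sketch under assumptions, not what any browser does. The names `approx_clusters` and `wrap_fallback` are hypothetical, width is counted in clusters rather than measured in pixels, and clusters are approximated as a base character plus any following combining marks (a full implementation would follow the Unicode text segmentation rules, which also cover cases this approximation misses).

```python
import unicodedata


def approx_clusters(text: str) -> list[str]:
    """Approximate grapheme clusters: a base character plus trailing combining marks."""
    units: list[str] = []
    for ch in text:
        if units and unicodedata.category(ch) in ("Mn", "Mc", "Me"):
            units[-1] += ch   # attach the mark to the preceding base character
        else:
            units.append(ch)
    return units


def wrap_fallback(text: str, width: int) -> list[str]:
    """Split an otherwise unbreakable run into lines of at most `width` clusters."""
    if width < 1:
        raise ValueError("width must be at least 1")
    units = approx_clusters(text)
    return ["".join(units[i:i + width]) for i in range(0, len(units), width)]


# Five code points, four approximate clusters (U+0EB7 is a combining mark).
run = "\u0ec0\u0ea1\u0eb7\u0ead\u0e87"
lines = wrap_fallback(run, 2)
# -> ["\u0ec0\u0ea1\u0eb7", "\u0ead\u0e87"]
```

The result may split a syllable or a word, which is the "hard to read" outcome the proposal accepts in exchange for never overflowing the box.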