Closed by fantasai 5 years ago
Note: This has been a problem with Javanese.
Well, perhaps not anywhere. Grapheme clusters would likely be an appropriate minimum unit. For Javanese and other scripts, stacked consonants should not be split, and the natural unit for line-breaking is otherwise the syllable. I'm not completely sure about the rules for splitting left-side (prefixed) vowel signs from the rest of the syllable in scripts such as Thai and Lao (where these are not combining characters).
I think that in Lao, there should be a way to detect the syllables. ICU supports word boundary analysis (word tokenization), so this is less of an issue now, except in the few applications that do not use ICU (Firefox doesn't yet use ICU's line-breaker, though it seems to use other parts of ICU).
But if we want to be safe, and not assume that the browser has ICU support, it would be much more desirable to break at the syllable rather than cutting one in half. Some vowels in Lao (and in Thai, Khmer, and Burmese) wrap around the nuclear consonant, so if you break at the wrong place, the vowel is split across two lines (very difficult to read). There is a document I will link to that helps explain this...
This excerpt explains the 'format' by which a Lao syllable is constructed.
Syllable breaking is usable in Lao and many people tend to be OK with it, but from my understanding it isn't optimal: it can make text more difficult to read than line-breaking based on word boundaries.
Not sure about the panl10n.net rules, but Lao syllable breaking has been fully implemented by Lao Script for Windows since about 1993 (usually using ZWSP insertion). Excluding loan words, there are few ambiguities, and they are easily managed by not allowing a break where it would be ambiguous. For Lao, as well as keeping grapheme clusters together, a break should never be allowed after a prefix vowel, before U+0EB2 LAO VOWEL SIGN AA, or either before or after U+0EBD LAO SEMIVOWEL SIGN NYO. Thai syllable breaking is much more difficult and requires a moderately large dictionary to be effective.
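To make the three Lao restrictions above concrete, here is a minimal Python sketch. It is not the Lao Script for Windows algorithm and it does not find syllable boundaries; it only vetoes positions that the rules above rule out. The names `lao_break_forbidden` and `lao_candidate_breaks` are made up for illustration, the prefix vowels are assumed to be U+0EC0 through U+0EC4, and "keeping grapheme clusters together" is approximated by refusing to break before any combining mark.

```python
import unicodedata

# Assumption: the Lao prefix (prepended) vowels are U+0EC0..U+0EC4.
LAO_PREFIX_VOWELS = {chr(cp) for cp in range(0x0EC0, 0x0EC5)}
LAO_VOWEL_SIGN_AA = "\u0EB2"
LAO_SEMIVOWEL_SIGN_NYO = "\u0EBD"


def lao_break_forbidden(text: str, i: int) -> bool:
    """Return True if a break between text[i-1] and text[i] is ruled out."""
    if i <= 0 or i >= len(text):
        return True  # not an interior position
    before, after = text[i - 1], text[i]
    # Rough grapheme-cluster rule: never separate a combining mark from its base.
    if unicodedata.category(after).startswith("M"):
        return True
    # Never break after a prefix vowel.
    if before in LAO_PREFIX_VOWELS:
        return True
    # Never break before VOWEL SIGN AA.
    if after == LAO_VOWEL_SIGN_AA:
        return True
    # Never break before or after SEMIVOWEL SIGN NYO.
    if LAO_SEMIVOWEL_SIGN_NYO in (before, after):
        return True
    return False


def lao_candidate_breaks(text: str) -> list[int]:
    """Indices that survive the vetoes; a real breaker must still pick syllable ends."""
    return [i for i in range(1, len(text)) if not lao_break_forbidden(text, i)]
```

The surviving positions are only candidates: a syllable-aware breaker would still have to reject some of them (for example, before a syllable-final consonant).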
Just want to be clear: this issue isn't about where to break correctly. It's about what to do if you don't have the ability to break correctly.
The proposal is "if you don't know where it's allowed to break, break somewhere, anywhere, instead of overflowing the box by not breaking at all". Because "hard to read" is better than "clipped and therefore unreadable".
The CSS Working Group just discussed "Allow breaking anywhere when dictionary is missing for SEA scripts", and agreed to the following:
RESOLVED: If there is a language for which you do not know the breaking rules, rather than treating it as unbreakable, you treat it as breakable anywhere, similar to overflow:anywhere
For scripts that require dictionary breaking or some other morphological analysis, if the resource is missing and the UA can't break the text, it should be allowed to break anywhere instead of overflowing.
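The fallback behaviour can be sketched roughly as follows in Python. This is an illustration of the idea, not how any browser implements it. The names `cluster_starts` and `wrap_with_fallback` are hypothetical, line width is counted in code points rather than measured glyph advances, and grapheme clusters are approximated by keeping combining marks attached to the preceding character. Passing `known_breaks=None` stands for "the breaking resource is missing".

```python
import unicodedata


def cluster_starts(text: str) -> list[int]:
    """Approximate grapheme-cluster starts: every index that is not a combining mark."""
    return [
        i for i, ch in enumerate(text)
        if i == 0 or not unicodedata.category(ch).startswith("M")
    ]


def wrap_with_fallback(text: str, width: int, known_breaks=None) -> list[str]:
    """Greedy wrap; `width` is in code points (a simplifying assumption).

    known_breaks: indices where the language rules allow a break, or None
    when the rules (for example a dictionary) are unavailable.
    """
    if known_breaks is None:
        # Rules unknown: treat every cluster boundary as a break opportunity.
        allowed = cluster_starts(text)[1:]
    else:
        allowed = sorted(known_breaks)

    lines, start = [], 0
    while len(text) - start > width:
        fits = [b for b in allowed if start < b <= start + width]
        if fits:
            cut = fits[-1]  # last opportunity that still fits on the line
        else:
            later = [b for b in allowed if b > start + width]
            if not later:
                break  # no opportunity left: the remainder overflows
            cut = later[0]  # the line overflows up to the next opportunity
        lines.append(text[start:cut])
        start = cut
    lines.append(text[start:])
    return lines
```

With `known_breaks=None` a long run is chopped into pieces that fit, possibly in the middle of a syllable; with an empty set of known breaks the same run stays on one overflowing line, which is the outcome the resolution is meant to avoid.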