w3c / sealreq

Southeast Asian layout task force

Linebreaking CSS controls #7

Open mhosken opened 6 years ago

mhosken commented 6 years ago

For SEA scripts whose default language relies on dictionary-based word breaking (Thai, Khmer, Lao, etc.), life can be tough for minority languages that use the same script but are a different language. No browser has word-breaking dictionaries for these languages, and there is no mechanism for passing such a dictionary to a browser. So the only way for a minority language to get good line breaking is for the text to mark its break opportunities manually using ZWSP.

A ZWSP in a text indicates a line break opportunity. But in addition, it is usually necessary to turn off the line break opportunities that come from the majority-language dictionary. For this we need a CSS line-breaking control that says: "This text contains ZWSP; break on those and don't use the dictionary."

A full specification for such a setting would need to describe how a text would be broken if so marked and it contained no spaces or ZWSP, but this is no different to other languages handling very long lines with emergency breaking.
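To make the requested behaviour concrete, here is a minimal Python sketch of a wrapper that breaks only at spaces and ZWSP and never consults a dictionary. It is an illustration under stated assumptions, not a proposal for how a browser would implement it: the function names are hypothetical, width is counted in clusters rather than measured glyphs, and "cluster" is a rough stand-in (a base character plus any following characters whose general category starts with `M`) for real grapheme cluster segmentation.

```python
import re
import unicodedata

ZWSP = "\u200b"


def clusters(word: str) -> list[str]:
    """Group each base character with the marks that follow it.

    Assumption: this approximates grapheme clusters; it is not UAX #29.
    """
    out: list[str] = []
    for ch in word:
        if out and unicodedata.category(ch).startswith("M"):
            out[-1] += ch
        else:
            out.append(ch)
    return out


def wrap_explicit(text: str, width: int) -> list[str]:
    """Greedy wrap using only spaces and ZWSP as break opportunities."""
    lines: list[str] = []
    current: list[str] = []  # clusters on the line being built
    pending_space = False
    for piece in re.split("([ \u200b])", text):
        if piece == "" or piece == ZWSP:
            continue  # ZWSP is an opportunity only; it adds no width
        if piece == " ":
            pending_space = bool(current)
            continue
        word = clusters(piece)
        if current and len(current) + int(pending_space) + len(word) > width:
            lines.append("".join(current))
            current, pending_space = [], False
        if pending_space:
            current.append(" ")
        pending_space = False
        for cluster in word:
            if len(current) == width:
                # Emergency break: the word alone is longer than the line.
                lines.append("".join(current))
                current = []
            current.append(cluster)
    if current:
        lines.append("".join(current))
    return lines
```

For example, `wrap_explicit("ab\u200bcd\u200bef", 4)` gives `["abcd", "ef"]`, while an unmarked run such as `wrap_explicit("abcdefg", 3)` falls back to emergency breaks and gives `["abc", "def", "g"]`.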

r12a commented 6 years ago

This is a very useful post, @mhosken. Thanks. I hear three main recommendations arising:

  1. add a property to the CSS spec to turn off dictionary lookup for line-breaking when needed
  2. add a description of what should happen if dictionary-based line-breaking is off and no ZWSP or space is available (that may be covered by the fact that CSS uses grapheme clusters as a minimum typographic unit).
  3. look at ways to allow the easy addition of word-breaking dictionaries to a browser

I'd like to hear what others think of this, and how it applies to other languages.

andjc commented 6 years ago

I think something like this is probably necessary by default for Cham. Current Western and Eastern Cham script implementations share the same codepoints. Modern usage among Eastern Cham script users is to add a space between words. This was not the case historically, nor is it always the case with handwritten content.

Western Cham script content, by contrast, does not use spaces between words.

To add to the complexity there is more than one orthography in use for each script. And not all characters in use are encoded in Unicode.

A regex for syllable boundaries is likely to be orthography dependent.

The easiest way out of the current situation would be to give Web developers fine control over linebreaking.

r12a commented 6 years ago

@andjc CSS already applies different rules for line-breaking depending on the language specified for Japanese and Chinese. Could the Cham issues not be addressed by language-specific rules?

jmdurdin commented 6 years ago

@mhosken I agree, definitely would be helpful. Re emergency breaking for Thai, Lao, Khmer scripts, some linguistic information should be used to avoid breaking after prefix vowels, before suffix vowels, etc. as well as, of course, between base characters and diacritics.
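As a rough illustration of the constraints listed above, the following Python sketch checks whether an emergency break may fall at a given position. It is a simplified assumption, not a complete rule set: the function name is hypothetical, only the Thai and Lao prefix vowels (U+0E40..U+0E44, U+0EC0..U+0EC4) and a handful of spacing suffix vowels are listed, and the Khmer COENG check is added here as one more obvious case. Diacritics are detected by Unicode general category.

```python
import unicodedata

# Thai and Lao vowels that are written (and encoded) before their consonant.
PREFIX_VOWELS = set(range(0x0E40, 0x0E45)) | set(range(0x0EC0, 0x0EC5))

# A few spacing vowels written after the consonant (SARA A, SARA AA, SARA AM
# in Thai and their Lao counterparts). Not an exhaustive list.
SUFFIX_VOWELS = {0x0E30, 0x0E32, 0x0E33, 0x0EB0, 0x0EB2, 0x0EB3}

KHMER_COENG = 0x17D2  # joins the following consonant as a subscript


def emergency_break_allowed(text: str, i: int) -> bool:
    """Return True if a last-resort break may go between text[i-1] and text[i]."""
    if i <= 0 or i >= len(text):
        return False
    before, after = ord(text[i - 1]), ord(text[i])
    if unicodedata.category(text[i]).startswith("M"):
        return False  # never separate a base character from its diacritic
    if before in PREFIX_VOWELS:
        return False  # never break after a prefix vowel
    if after in SUFFIX_VOWELS:
        return False  # never break before a suffix vowel
    if before == KHMER_COENG:
        return False  # never split COENG from its consonant
    return True
```

A wrapper that has run out of ZWSP and space opportunities could scan backwards from the overflow point and take the first position for which this returns `True`.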

NorbertLindenberg commented 6 years ago

It sounds like part of the problem is that dictionaries for the majority language of a script are used for minority languages. Do browsers do that even if the language specified for the text doesn't match the language of the dictionary? Is that a reasonable thing to do in the absence of space or ZWSP characters?

mhosken commented 6 years ago

Yes, browsers do break assuming the national language even if the text is marked as being in a minority language. At least, this is true for those browsers that use ICU to help with their line breaking. Within ICU, if you ask for a linebreak iterator, ICU tests the script of the text and chooses a script-specific (or default) set of linebreaking rules. These rules include a specification that a dictionary breaker is needed. The dictionary breaker then uses the script to choose a language for the dictionary (the majority language), regardless of what language the text is tagged with, since at the level of a linebreak iterator the iterator doesn't know about the language the text is tagged with.
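The selection logic described above can be modelled with a toy Python sketch. This is not ICU's API or code: the table, the function names, and the language-aware variant are all hypothetical, and the only point is to show that the language tag plays no part in the first function.

```python
from typing import Optional

# Hypothetical table: one dictionary per script, keyed by ISO 15924 script
# code and named after the majority language of that script.
DICTIONARY_BY_SCRIPT = {"Thai": "th", "Laoo": "lo", "Khmr": "km", "Mymr": "my"}


def dictionary_by_script_only(script: str, language_tag: str) -> Optional[str]:
    """Model of the behaviour described above: the language tag is ignored."""
    return DICTIONARY_BY_SCRIPT.get(script)


def dictionary_language_aware(script: str, language_tag: str) -> Optional[str]:
    """One possible alternative (an assumption, not an existing feature):
    use the dictionary only when the tagged language matches it."""
    dictionary = DICTIONARY_BY_SCRIPT.get(script)
    primary = language_tag.split("-")[0].lower()
    return dictionary if primary == dictionary else None
```

With Sgaw Karen text tagged `ksw-Mymr`, the first function still returns the Burmese dictionary, whereas the second returns `None`, leaving only space, ZWSP and emergency breaks.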

NorbertLindenberg commented 6 years ago

So maybe that should be addressed in ICU? It seems the proposed flag to turn off dictionaries would be a workaround for erroneous behavior in ICU.

andjc commented 6 years ago

@r12a Eastern Cham script could be identified by the -Cham-VN subtags. The script subtag distinguishes it from Cham languages in VN written in Arabic and Latin script. Eastern Cham script uses spaces between words, an innovation apparently dating to the 19th century.

The ideal break/wrapping location is at whitespace (i.e. word boundaries). I am still seeking input on whether it is permissible to break within a word, and whether this would differ between the traditional and BBS camps.

Re Western Cham: whitespace is not used between words. It can be identified as cja-Cham-KH.

Current practice among the more IT-savvy is to use ZWSP at word boundaries for typesetting, but the presence of ZWSP cannot be assumed. As to dictionary lookup, there is more than one typographic and orthographic tradition currently active, and I am uncertain whether the same dictionary will serve all of them.

Waiting for more concrete information.

Content tagged cja-Cham-KH will include characters that are not encoded in Unicode.

In VN there is a small camp who believe that a final WA (va) should be encoded. It is present in the EFEO legacy fonts but not in other legacy fonts, and it corresponds to a traditional Western Cham character.

Similar issues arise with competing Latin orthographies in VN, and with multiple Arabic orthographies across Cambodia and VN, which include atomic characters not in Unicode.

andjc commented 6 years ago

@NorbertLindenberg Practice varies. Dictionary-based lookup is likely to be added for only some languages, and implementations are likely to assume a single language for the script.

For some scripts there will be either breaking at whitespace or breaking at grapheme cluster boundary (which can cause breaks within orthographic syllables).

When working on Sgaw Karen content on government sites we found that the content would often flow out of the content area and not wrap. In other cases it would wrap, but not necessarily ideally.

For these cases the fallback is a JavaScript regex that injects ZWSP at syllable boundaries. Syllable structure is fairly easy in Sgaw Karen. It is more challenging for Burmese/Myanmar, for example.

Breaking at syllable boundaries is far from ideal for these languages, but it is a hack that is possible from the Web developer's end using JavaScript.
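The injection fallback mentioned above can be sketched as follows, here in Python rather than JavaScript. The function name is hypothetical, and the caller supplies the syllable pattern, since a real pattern is orthography dependent. The consonant-plus-vowel ASCII pattern in the usage note is a placeholder only and says nothing about Sgaw Karen or Burmese syllable structure.

```python
import re

ZWSP = "\u200b"


def inject_zwsp(text: str, syllable: re.Pattern) -> str:
    """Insert ZWSP between directly adjacent syllable matches.

    Text that the pattern does not match (spaces, punctuation, existing
    ZWSP) is copied through unchanged, so running this twice is harmless.
    """
    out = []
    pos = 0          # end of the text copied so far
    prev_end = None  # end of the previous syllable match
    for match in syllable.finditer(text):
        if match.end() == match.start():
            continue  # ignore empty matches
        out.append(text[pos:match.start()])
        if prev_end == match.start():
            out.append(ZWSP)  # two syllables touch: add a break opportunity
        out.append(match.group())
        pos = prev_end = match.end()
    out.append(text[pos:])
    return "".join(out)
```

With the placeholder pattern `re.compile(r"[b-df-hj-np-tv-z][aeiou]")`, `inject_zwsp("kalama", pattern)` returns `"ka\u200bla\u200bma"`, and `"ka la"` comes back unchanged because the syllables are already separated by a space.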