w3c / sealreq

Southeast Asian layout task force

How is line-breaking handled on the Web for Lao? #3

Open r12a opened 6 years ago

r12a commented 6 years ago

The current understanding at W3C is that Lao behaves like Thai when lines are wrapped. See http://w3c.github.io/i18n-drafts/articles/typography/linebreak.en#sec_se_asia for a very high-level summary.

The CSS specification deals with line-breaking at https://drafts.csswg.org/css-text-3/#line-break-property. Note particularly the text about Thai that says:

As UAs can add additional distinctions between strict/normal/loose modes, these values can exhibit other differences as well. For example, a UA with sufficiently-advanced Thai language processing ability could choose to map different levels of strictness in Thai line-breaking to these keywords, e.g. disallowing breaks within compound words in strict mode (e.g. breaking ตัวอย่างการเขียนภาษาไทย as ตัวอย่าง·การเขียน·ภาษาไทย) while allowing more breaks in loose (ตัวอย่าง·การ·เขียน·ภาษา·ไทย).

The question for this issue is whether the same applies to Lao, and whether there are other features of Lao line breaking that need to be called out in the spec. For example, which of these is true?

  1. You can break text at line end for Lao between syllables without being concerned about word boundaries.
  2. There is a preference for breaking at word boundaries, but breaking at syllable boundaries is also common.
  3. Text in Lao should always break at word boundaries, between recognisable words.

We are also looking for evidence of current problems related to Lao line-breaking on the Web/in eBooks.

Advice (especially with examples) would be very much appreciated.

mhosken commented 6 years ago

Lao differs from Thai in that nearly all syllables can be argued to be words and therefore, in effect, line breaking can reduce to syllable breaking even with a dictionary. Having said that, as with all Southeast Asian scripts, Lao prefers to break at word boundaries and would only consider syllable breaking as a fallback.

Lao has another advantage over Thai in that syllable boundaries can (I need to check fully) be algorithmically derived, since there is no inherent vowel and the tone-class-changing 'h' is integrated into the characters that follow it.

r12a commented 6 years ago

Does anyone have any information about how well browsers handle line-breaking in Lao? I assume that the major browsers rely on ICU for line-breaking behaviour(?)

jmdurdin commented 6 years ago

I agree with mhosken but with the following qualifications:

  1. Some syllable boundaries within words are ambiguous (as for Thai).
  2. The increasing use of loan words often breaks the algorithms used for defining syllable boundaries. So does the still common preference for using the two-character combination for "HL" instead of the subscript L, and so do words that use high-class sonorants other than HM, HN, HL.
  3. Some writers are beginning to use optional hyphens for syllable boundaries within words, which helps readability at line breaks.

One other difference from Thai is that European style punctuation (period, comma) is much more widely used in Lao than in Thai, with the consequence that traditional spaced phrase punctuation is now often incorrectly used, with spaces sometimes inserted within words.

jmdurdin commented 6 years ago

Lao line-breaking by web browsers: I just checked Edge, IE11, Chrome and Firefox with a document configured as html lang="lo", and only Chrome wraps at word/syllable boundaries. The others only wrap at white-space. Widely used software (on Windows) in Lao PDR and elsewhere automatically inserts ZWSP in Lao text at word or syllable boundaries, and many web pages use such inserted ZWSP characters to get browsers to wrap correctly.
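
As an illustration of the ZWSP workaround described above, here is a minimal sketch. It assumes the text has already been segmented into words or syllables by some other tool; the function names and the two-word sample segmentation are hypothetical and only for illustration, not taken from any particular Lao software.

```python
ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE, an invisible break opportunity


def join_with_zwsp(segments):
    """Join pre-segmented words/syllables, marking each boundary with ZWSP."""
    # Empty segments are skipped so that no doubled ZWSP is produced.
    return ZWSP.join(s for s in segments if s)


def strip_zwsp(text):
    """Remove the inserted ZWSP characters, e.g. before searching or comparing."""
    return text.replace(ZWSP, "")


# Hand-supplied segmentation, assumed for this example only.
segments = ["ປະເທດ", "ລາວ"]
marked = join_with_zwsp(segments)
```

A browser that only wraps at white-space can then break the line at the ZWSP, while the visible text stays unchanged. Stripping the ZWSP again recovers the original string.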

r12a commented 6 years ago

@jmdurdin, interestingly, i got different results. Here's a test file i just created and ran on my Mac: https://w3c.github.io/sealreq/gap-analysis/lao-tests/lao_line_break.html

Though i can't read Lao, so i can't speak to the fine lexical detail, Firefox, Chrome & Safari all appear to wrap certain multisyllable sequences (presumably words) at the line end. For example, the text ປະເທດ wraps as a single item. None of them rely solely on white space for break opportunities.

I don't currently have a Windows machine to test on.

jmdurdin commented 6 years ago

@r12a, Yes, macOS does implement word/syllable wrap for Lao at a system level, which both Safari and Firefox evidently use. Chrome may or may not use it, as it implements wrap for Lao at the application level on Windows, since Windows 10 does not yet implement line breaking for Lao at a system level. MS Office on Windows implements it at the application level in most situations; Firefox clearly does not, at least in its standard configuration. Safari on iOS also wraps Lao more or less correctly. Line-breaking on Chrome and Safari could be improved - see attached PDF. Lao line breaking by MS Office has a more serious error in that it sometimes allows breaking after prefix vowels, which is never correct. Lao_line_breaking_issue.pdf
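
The rule that a break after a prefix vowel is never correct can be checked mechanically. The sketch below assumes break opportunities are represented as ZWSP characters in the text and treats the five Lao vowel signs U+0EC0 to U+0EC4 as the prefix vowels; the function name is hypothetical.

```python
ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE

# Lao vowel signs written before the consonant: U+0EC0..U+0EC4 (ເ ແ ໂ ໃ ໄ).
LAO_PREFIX_VOWELS = {chr(cp) for cp in range(0x0EC0, 0x0EC5)}


def breaks_after_prefix_vowel(text):
    """Return the indices of ZWSP characters that directly follow a prefix vowel."""
    return [
        i
        for i, ch in enumerate(text)
        if ch == ZWSP and i > 0 and text[i - 1] in LAO_PREFIX_VOWELS
    ]


# Break placed before the prefix vowel: acceptable under this rule.
good = "ປະ" + ZWSP + "ເທດ"
# Break placed after the prefix vowel: flagged.
bad = "ປະເ" + ZWSP + "ທດ"
```

This only tests the single rule mentioned in the comment above. It does not decide whether a break falls at a real word or syllable boundary.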

r12a commented 5 years ago

For a page that lets you experiment with line-breaking in Lao on various browsers, see https://w3c.github.io/i18n-tests/css-text/line-breaking/exp-lo-line-break-000.html and https://w3c.github.io/i18n-tests/css-text/line-breaking/exp-lo-line-break-textarea-000.html

For related experimental tests, see https://w3c.github.io/i18n-tests/results/exploring-linebreak