Proposal for CSS property to manage dictionary-based line-breaks

r12a commented 5 years ago

The default UAX 14 line breaking property for text in the following scripts is SA (Southeast Asian): Thai, Lao, Myanmar, Khmer, Tai Le, New Tai Lue, Tai Tham, Tai Viet, and Ahom. This means that characters require morphological analysis to determine break opportunities, in a way similar to a hyphenation algorithm. No break opportunities will be found otherwise. Complex context analysis, often involving dictionary lookup of some form, is required to determine non-emergency line breaks. If such analysis is not available, it is recommended to treat them as AL (Ordinary Alphabetic and Symbol Characters), which require other characters to provide break opportunities; otherwise, unless tailored rules are applied, no line breaks are allowed between pairs of them.

Unicode provides the ZWSP (Zero Width Space) as a way to manually control line-break opportunities in scripts such as these.
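
As a minimal sketch of the manual approach, assuming the author already knows where the word boundaries are, ZWSP (U+200B) can be placed between the words without changing what the reader sees. The helper names `mark_breaks` and `visible_text` are made up for this illustration.

```python
ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE


def mark_breaks(words):
    """Join pre-segmented words with ZWSP so a line may wrap between them."""
    return ZWSP.join(words)


def visible_text(text):
    """Remove ZWSP to recover the text as the reader sees it."""
    return text.replace(ZWSP, "")


# The segmentation is supplied by the author here, not computed.
words = ["กรุงเทพ", "คือ", "สวยงาม"]
marked = mark_breaks(words)

print(marked.count(ZWSP))                         # 2
print(visible_text(marked) == "".join(words))     # True
```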

By default, most browsers apply dictionary lookup to text using several of these scripts, but do so regardless of the actual language being written in that script.
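
The behaviour described above can be sketched roughly as follows. This is only an illustration, not how any browser or ICU is actually implemented: the block ranges are a coarse stand-in for real script detection, and the script-to-language table and the function names are assumptions made for the example.

```python
# Unicode block ranges for four of the scripts named above. Blocks are a
# coarse stand-in for real script detection.
SCRIPT_BLOCKS = {
    "Thai": (0x0E00, 0x0E7F),
    "Lao": (0x0E80, 0x0EFF),
    "Myanmar": (0x1000, 0x109F),
    "Khmer": (0x1780, 0x17FF),
}

# Assumption for this sketch: one "major language" dictionary per script.
MAJOR_LANGUAGE = {"Thai": "th", "Lao": "lo", "Myanmar": "my", "Khmer": "km"}


def script_of(ch):
    """Return the script name for a character, or None if it is not listed."""
    cp = ord(ch)
    for name, (first, last) in SCRIPT_BLOCKS.items():
        if first <= cp <= last:
            return name
    return None


def dictionary_language(text, declared_lang=None):
    """Pick a dictionary from the script alone.

    declared_lang is deliberately ignored, which is the problem being
    described: the declared language has no influence on the choice.
    """
    for ch in text:
        script = script_of(ch)
        if script is not None:
            return MAJOR_LANGUAGE[script]
    return None


# Thai-script text tagged as Pali ("pi") still gets the Thai dictionary.
print(dictionary_language("นโม", declared_lang="pi"))  # th
```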

A few problems arise in the current situation.

A language that is written using the Thai script, but which is not Thai, is probably using the wrong dictionary and will not break as the user expects at word boundaries. The same applies for the other scripts. This is similar to what you’d expect if you used the same hyphenation dictionary for several European languages, because they all use the Latin script.
Dictionary-based algorithms don’t always produce the break points expected by users. For example, where compound nouns are used, some algorithms may keep the constituent words together while others split them. Problems also tend to arise for new or unusual words.
For situations requiring fine-grained control, such as display text, text inside graphic elements, or balanced title text, dictionary-based line-breaking is problematic if different browsers use different dictionaries, because one cannot predict the likely placement of text across all browsers.
Even in the same page in a single browser, for text which needs fine-grained control, or to manage new or unusual words, the content author can take steps to produce the correct effect manually, but is unable to prevent interference from the dictionary-based algorithm.

To provide better control over line-breaking for scripts that don't separate words by spaces, esp. SE Asian, it is proposed to add a new property to CSS (which could be applied to the whole document or just a particular element and its descendants) with the following characteristics.

It would allow content authors to switch off the dictionary/statistical-based line-breaking algorithms for a given range of text (in favour of a ZWSP-based approach).
It would allow dictionary/statistical-based line-breaking to be explicitly applied to the selected range of text, but in that case the algorithms used would need to be associated with a given language, and unless the text is declared to be in that language, they would have no effect.
Outside the range to which the new property is applied, line breaking would continue as it does currently, ie. on detecting a run of characters in a particular script, apply a dictionary for a major language in that script.
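
The three characteristics above can be read as a small decision procedure. The sketch below is one possible reading, not a specification: the property values `"manual"` and `("auto", lang)`, the function names, and the comparison on the primary language subtag are all assumptions made for illustration.

```python
def primary_subtag(lang):
    """'th-TH' -> 'th'; None stays None."""
    return lang.split("-")[0].lower() if lang else None


def break_source(prop, declared_lang, script_default):
    """Decide where line-break opportunities come from for a run of text.

    prop: None if the hypothetical property is not applied, "manual" to
          switch the algorithms off, or ("auto", lang) to request a segmenter.
    declared_lang: language tag declared for the text, or None.
    script_default: language the browser assumes from the script alone.
    """
    if prop is None:
        # Outside the property's range: today's script-keyed behaviour.
        return f"segmenter:{script_default}"
    if prop == "manual":
        # Only ZWSP and other explicit opportunities are used.
        return "explicit-only"
    _, algorithm_lang = prop
    if primary_subtag(declared_lang) == primary_subtag(algorithm_lang):
        return f"segmenter:{primary_subtag(algorithm_lang)}"
    # Language undeclared or mismatched: the algorithm has no effect.
    return "explicit-only"


print(break_source(None, None, "th"))              # segmenter:th
print(break_source("manual", "th", "th"))          # explicit-only
print(break_source(("auto", "th"), "th-TH", "th")) # segmenter:th
print(break_source(("auto", "th"), None, "th"))    # explicit-only
```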

These proposals have a correlation to the way hyphenation works in CSS (see https://www.w3.org/TR/css-text-3/#hyphenation).

r12a commented 5 years ago

Sealreq folks, i18n WG, please comment on this proposal. Thanks.

jclark commented 5 years ago

I think a property to turn off dictionary-based line-breaking is an excellent idea.

For the second property, I agree with the first part that dictionary-based line-breaking algorithms be associated with a specific language, but for the second part I suggest something slightly different, which reflects the reality that

language tagging is not in practice common,
the vast majority of text in the Thai script is in the Thai language.

I think the rule should be that the dictionary-based line-breaking algorithm is not applied if the language of the text is declared as something other than the language of the algorithm.
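
The difference between the two rules only shows up for untagged text. A small sketch, with hypothetical function names and plain string comparison of language tags as a simplifying assumption:

```python
def strict_rule(algorithm_lang, declared_lang):
    """Original proposal: apply only if the text is declared in that language."""
    return declared_lang == algorithm_lang


def lenient_rule(algorithm_lang, declared_lang):
    """Suggested alternative: apply unless a different language is declared."""
    return declared_lang is None or declared_lang == algorithm_lang


# Thai segmenter against text declared as Thai, untagged, and Pali.
for declared in ("th", None, "pi"):
    print(declared, strict_rule("th", declared), lenient_rule("th", declared))
# th True True
# None False True
# pi False False
```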

I would also suggest that the term "dictionary-based" is not used in the name of the property, since dictionary-based algorithms are a very simplistic approach to SE Asian language segmentation and far from the current state of the art.

In addition to the problems you mention, common problems are:

proper names are not properly broken (the absence of capital letters makes recognizing proper names difficult)
a dictionary is not enough: there can be multiple possible segmentations that result in words all of which are in the dictionary; deciding between them requires semantic analysis.
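
The ambiguity in the last point can be shown with a toy example. The sketch below enumerates every split of a string into dictionary words. The four-word dictionary is made up for the illustration, and the string ตากลม is a commonly cited case that can be read as ตา + กลม (roughly "round eye") or ตาก + ลม (roughly "exposed to the wind").

```python
def segmentations(text, dictionary):
    """Return every way to split text into words that are all in dictionary."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in dictionary:
            # Keep this prefix and segment the remainder recursively.
            for rest in segmentations(text[end:], dictionary):
                results.append([word] + rest)
    return results


dictionary = {"ตา", "กลม", "ตาก", "ลม"}
for words in segmentations("ตากลม", dictionary):
    print(" | ".join(words))
# ตา | กลม
# ตาก | ลม
```

Both results consist entirely of dictionary words, so a dictionary alone cannot choose between them.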

r12a commented 5 years ago

My main worry is that we shouldn't break lots of pages which previously worked ok because an algorithm was applied by default if, for example, Thai-script characters were detected. It may be that older pages weren't properly marked up for language, but still worked for Thai text.

So i think we'd need to assume that morphological line-breaking algorithms would continue to work for pages in general as they do now. But that when a content author uses the new property, then rules are applied to require appropriate language identification and matching algorithms for the in-scope text. (Or, of course, it could simply be used to turn the algorithm off.)

An observation:

> language tagging is not in practice common

From what i see, it is far more common now than it used to be, and awareness continues to grow because more and more aspects of rendering depend on knowing the language context. As with hyphenation, or line-breaking, or case-transforms, i think that a user who wants to use the new feature in CSS will realise that it has to be accompanied by language declarations if it's going to work. So i'm hoping that things will continue to improve in that respect.

And a couple of questions:

> dictionary-based algorithms are a very simplistic approach to SE Asian language segmentation and far from the current state of the art

Do you have a pointer to something that provides details about current state of the art? And do you have a suggestion for the name of a property? (I also didn't think 'dictionary' was quite right, but haven't come up with anything yet – something like morphological-line-breaking seemed too jargonistic...)

jclark commented 5 years ago

State of the art is all based on machine learning, e.g.

https://github.com/sertiscorp/thai-word-segmentation

jclark commented 5 years ago

Maybe the name of the property could use the word "statistical"?

rober42539 commented 5 years ago

For the content I work with (primarily Lao), I depend heavily on ICU Lao line-breaking. However, the ability to disable the browser's automatic line-breaking would definitely be handy in certain circumstances (such as where I have manually provided break points as ZWSPs). So, I agree that the first proposal would be awesome.

That said, it might be difficult to ensure that I specifically request line-breaking support for particular languages in advance. One case is a multilingual content site populated from a database. There could be situations where, as a programmer, I don't know what language the authors are using or whether they are using a combination of languages. Requiring identification of the language in that case may not be possible or practical, and would likely break my current web content. Another case is Lao comments on a social site like Facebook. I doubt that the user typed in ZWSPs on their phone, and I doubt that FB tagged that particular section of text as Lao. So, if I understand the 2nd proposal correctly, I can see it causing problems.

Another approach might be to allow developers the ability to specifically tell the browser what language and kind of segmentation process they'd prefer, to be able to override the default functionality when needed.

r12a commented 5 years ago

In https://github.com/w3c/sealreq/issues/25#issuecomment-521230093, i suggest that the language matching should only occur for that range of text to which the new line-segmentation property is applied (which could, of course, be the whole page, if desired). Outside that range of text or in its absence, the current ICU-based line-breaking, based on assumed language, would still apply. I think this may address your concern, @rober42539 ?

rober42539 commented 5 years ago

@r12a, yes. That would definitely be a nice way to approach it from my perspective.

r12a commented 5 years ago

In the initial comment, i changed the following:

To address these issues, the following changes to CSS are proposed. Allow content authors to switch off the dictionary-based line-breaking for a given range of text, which could be the whole document or could be just a particular range of text. Associate dictionary-based line-breaking algorithms with a specific language, and require the language of the text to be declared in order for the algorithm to be applied.

to

To provide better control over line-breaking for scripts that don't separate words by spaces, esp. SE Asian, it is proposed to add a new property to CSS (which could be applied to the whole document or just a particular element and its descendants) with the following characteristics.

It would allow content authors to switch off the dictionary/statistical-based line-breaking algorithms for a given range of text (in favour of a ZWSP-based approach).

It would allow dictionary/statistical-based line-breaking to be explicitly applied to the selected range of text, but in that case the algorithms used would need to be associated with a given language, and unless the text is declared to be in that language, they would have no effect.

Outside the range to which the new property is applied, line breaking would continue as it does currently, ie. on detecting a run of characters in a particular script, apply a dictionary for a major language in that script.

NorbertLindenberg commented 5 years ago

Another idea that I heard, maybe from @mhosken, is to block algorithmic line breaks within a few clusters from a ZWSP. That would allow users to take control either of just words that algorithms commonly mishandle or of the entire text. Would that make sense?
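
A rough sketch of that idea, under several assumptions: distance is measured in code points rather than grapheme clusters, the radius of 3 is arbitrary, break offsets mean "a break is allowed before this index", and the function name is invented for the example. Latin placeholder text is used so the offsets are easy to follow.

```python
ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE


def filter_breaks(text, algorithmic_breaks, radius=3):
    """Drop algorithmic break offsets that fall within `radius` code points
    of any ZWSP, and add a break opportunity after each ZWSP."""
    zwsp_positions = [i for i, ch in enumerate(text) if ch == ZWSP]
    kept = [
        b for b in algorithmic_breaks
        if all(abs(b - z) > radius for z in zwsp_positions)
    ]
    manual = [z + 1 for z in zwsp_positions]  # break right after the ZWSP
    return sorted(set(kept) | set(manual))


text = "abcdefgh" + ZWSP + "ijklmnop"   # ZWSP sits at index 8
print(filter_breaks(text, [2, 6, 10, 15]))  # [2, 9, 15]
```

The algorithmic breaks at 6 and 10 are suppressed because they lie within three code points of the ZWSP, while those at 2 and 15 survive.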

r12a commented 5 years ago

I get the feeling that that approach wouldn't give you enough fine-grained control in some cases, and may also spread the line-break override to places where you don't want it. I prefer to put something in place that gives you absolute and direct control over what's happening.

NorbertLindenberg commented 5 years ago

Having ZWSP block algorithmic line breaks isn’t risk-free. However, it gives more control to the authors of text in cases where they have no control over style sheets or language tags, e.g. in social media or in many markdown-based interfaces.

r12a commented 5 years ago

By the way, the Japanese Layout TF has some similar needs. At https://github.com/w3c/jlreq/issues/17 they are discussing two related things:

How to break lines at particular points, rather than wrapping the character that exceeds the line width, eg. in titles or short pieces of text that should balance.
How to produce spaces between 'words' automatically, mostly for accessibility reasons.

I wonder, is there any similar accessibility need for SE Asian languages to separate words for certain users?

rober42539 commented 5 years ago

> I wonder, is there any similar accessibility need for SE Asian languages to separate words for certain users?

From my observation, many Lao and Khmer readers struggle to read well and at a fast pace. I suspect this is the case for Thai and Burmese as well. Add in vision or other challenges, and the problem grows in severity. So, yeah, from my perspective I think there is a need for those users. (I myself would probably prefer it, because it would strengthen my vocabulary and word recognition.)

However, the regional culture has resisted the 'just add spaces' idea, and the difficulty of using ZWSP properly in common, everyday tasks stands in the way of its widespread use. I think the idea of being able to show breaks (and ZWSPs) for accessibility purposes could be really beneficial. But here is a quote from a comment on a post of mine, written by a friend and former coworker who is currently in the US...

Very insightful! We have some Farang(s) who want to pioneer word spacing SEA languages by disregarding this rule: space = short pause.

His point is this: the space is interpreted as a comma. If you have spaces between words, people will think that it is a natural pause, or won't know where phrases end and start. Punctuation such as periods and commas is sparsely used.

So, rather than being a Zero-Width-Space, if the browser renders a 'Half-Width-Space', those who struggle with reading could benefit from both worlds - the idea being that it makes the text more readable, makes the ZWSPs easier to find and edit, and yet tries not to break the established cultural expectation that spaces are used between phrases.

ohbendy commented 5 years ago

I'm told writing primers for young Thai learners separate words out with wordspaces, but I don't have any examples to hand.

rober42539 commented 5 years ago

Here is an image of an elementary Lao writing book that does it to a degree. (The bottom line in particular.)

frivoal commented 5 years ago

Hello from the CSSWG. I was very happy to see your interest in this topic, because I've been working on something highly related.

I noticed this discussion while I was writing up the spec for a first take on a solution accepted by the csswg to address the Japanese https://github.com/w3c/jlreq/issues/17 issue mentioned above, and made a few adjustments so that it would cover what you have described in https://github.com/w3c/sealreq/issues/25#issue-480203861 as well.

Please have a look at section 2.2 of css-text-4 and its subsections. I think you'll find something very much along the lines of what you have been talking about here.

I have not yet included the ability to insert half-width spaces specifically, as discussed in comment https://github.com/w3c/sealreq/issues/25#issuecomment-524775827, in part because while it seems reasonable to me, so far it is only a single comment, and in part because I am not 100% sure which Unicode character you'd want to use for that narrow space. I'm leaning towards U+2009 THIN SPACE, but I'm curious what you think.

type of space	result
no space	กรุงเทพคือสวยงาม
U+0020 SPACE	กรุงเทพ คือ สวยงาม
U+2009 THIN SPACE	กรุงเทพ คือ สวยงาม
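
For illustration only, the effect in the last row could be approximated on text that already carries ZWSPs by swapping each one for a visible narrow space at display time. U+2009 is used here because it is the candidate mentioned above, not because the choice is settled, and the function name is hypothetical.

```python
ZWSP = "\u200b"        # U+200B ZERO WIDTH SPACE
THIN_SPACE = "\u2009"  # U+2009 THIN SPACE


def show_word_boundaries(text, spacer=THIN_SPACE):
    """Replace each ZWSP with a visible spacer, leaving other text untouched."""
    return text.replace(ZWSP, spacer)


marked = "กรุงเทพ" + ZWSP + "คือ" + ZWSP + "สวยงาม"
shown = show_word_boundaries(marked)

print(shown.count(THIN_SPACE))  # 2
print(ZWSP in shown)            # False
```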

Feedback very much appreciated, and contribution of examples highly welcome.
