w3c / sealreq

Southeast Asian layout task force

Using dictionaries can create problems for word-breaking #35

Open r12a opened 4 years ago

r12a commented 4 years ago

"ICU use word boundaries to break but it looks not nice, because it depend on the people who provide wordlist, for example the name of USA (United State of America) in Khmer it is សហរដ្ឋអាមេរិក ICU consider as one word, when it break to new line, it remain the long blank in old line. Normally, we can break it to 2 word សហរដ្ឋ = United State and អាមេរិក." (Hong)
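The effect Hong describes can be reproduced with a toy segmenter. The sketch below is a simplified greedy longest-match over a word list, not ICU's actual algorithm; the function name `segment` and both word lists are hypothetical and exist only for illustration. It shows that the break between សហរដ្ឋ and អាមេរិក disappears as soon as the word list also contains the compound.

```python
def segment(text, wordlist):
    """Greedy longest-match segmentation (a simplified stand-in, not ICU's algorithm)."""
    max_len = max((len(w) for w in wordlist), default=0)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in wordlist:
                words.append(text[i:j])
                i = j
                break
        else:
            # No dictionary entry starts here: fall back to one code point.
            words.append(text[i])
            i += 1
    return words


UNITED_STATES = "សហរដ្ឋ"
AMERICA = "អាមេរិក"
USA = UNITED_STATES + AMERICA

# Hypothetical word lists: one includes the compound, one does not.
with_compound = {UNITED_STATES, AMERICA, USA}
without_compound = {UNITED_STATES, AMERICA}

print(len(segment(USA, with_compound)))     # number of words when the compound is listed
print(len(segment(USA, without_compound)))  # number of words when it is not
```

With `with_compound` the whole name is a single unbreakable word, so a line that cannot fit it is left with a long blank; with `without_compound` there is a break opportunity between the two parts.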

"There is a change going through ICU at the moment, to how Khmer is line broken. The basis of line breaking is still dictionary based and word broken. There is no intent to support syllable breaking. The following changes are made in that change:

  1. Bad and ambiguous spellings are correctly handled
  2. Use of ZWSP and WJ are disambiguated with regard to how far they limit linebreaking. In the case of Khmer they have a range of up to 3 small clusters (base+Marks+Coengs) but may collapse to 0 for longer words." (@mhosken)

One problem with dictionary lookup is that browsers have no dictionaries for minority languages that use the Khmer script. In fact, regardless of the declared language of the text, browsers tend to apply the Khmer-language dictionary to any text written in Khmer characters.

For such languages, it would be helpful if the content author could either:

  1. disable the dictionary lookup and let the line-breaking depend on ZWSP insertion, or
  2. invoke a different dictionary – perhaps one that is provided as a browser extension.
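As a rough illustration of the first option, the sketch below wraps text using only author-inserted ZWSP (U+200B) characters as break opportunities, with no dictionary involved. The function name `wrap_at_zwsp` is hypothetical, and counting width in code points is an assumption made to keep the example short; a real layout engine measures rendered glyph widths and would not treat combining marks as taking up space.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE


def wrap_at_zwsp(text, max_width):
    """Break lines only at ZWSP. Width is counted in code points (a crude assumption)."""
    lines = []
    current = ""
    for chunk in text.split(ZWSP):
        if current and len(current) + len(chunk) > max_width:
            # The next chunk does not fit: end the line at this ZWSP.
            lines.append(current)
            current = chunk
        else:
            current += chunk
    if current:
        lines.append(current)
    return lines


# The author marks the word boundary explicitly.
marked = "សហរដ្ឋ" + ZWSP + "អាមេរិក"
for line in wrap_at_zwsp(marked, 8):
    print(line)
```

A chunk longer than `max_width` simply overflows in this sketch, since without a dictionary there is no other break opportunity inside it.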

For the Cambodian language, this is marked as advanced for now, but I am open to arguments that the difficulties produced warrant a status of basic.

For minority languages, the status is clearly going to be broken, since there's no way to override the use of the Khmer dictionary.

r12a commented 4 years ago

The first comment in this issue contains text that will automatically appear in the Khmer gap-analysis document as a subsection with the same title as this issue. Any edits made to that comment will be immediately available in the document. Proposals for changes or discussion of the content can be made in comments below this point.