w3c / csswg-drafts

CSS Working Group Editor Drafts
https://drafts.csswg.org/

[css-text] Questionable Thai words #2455

Closed r12a closed 5 years ago

r12a commented 6 years ago

https://drafts.csswg.org/css-text-3/#word-break-property

แและ·ตัวอย่าง·การเขียน·ภาษาไทย has too many แ characters at the start.

Also, are we sure about the word segmentation for these examples? I tested it by using Firefox to wrap the text, and ended up with the following words:

และ•ตัวอย่าง•การ•เขียน•ภาษา•ไทย

Literally, that means "and sample how write language thai". The current spec word divisions seem to map instead onto the word divisions of the English translation(?)
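To make the difference between the two groupings concrete, here is a minimal Python sketch that lists the break opportunities the browsers offer but the spec example (with the doubled แ removed) does not. The helper name `break_offsets` is hypothetical, and offsets are counted in code points rather than grapheme clusters, which is an assumption made for simplicity.

```python
def break_offsets(segmented: str, sep: str = "•") -> set[int]:
    """Return the code-point offsets at which `sep` marks a break opportunity."""
    offsets: set[int] = set()
    position = 0
    words = segmented.split(sep)
    for word in words[:-1]:  # there is no break opportunity after the last word
        position += len(word)
        offsets.add(position)
    return offsets


# Segmentation observed when wrapping the text in a browser.
browser = "และ•ตัวอย่าง•การ•เขียน•ภาษา•ไทย"
# Grouping used by the spec example, with the extra แ removed.
spec = "และ•ตัวอย่าง•การเขียน•ภาษาไทย"

# Both strings must describe the same underlying text.
assert browser.replace("•", "") == spec.replace("•", "")

# Offsets where the browser allows a break but the spec grouping does not.
extra = break_offsets(browser) - break_offsets(spec)
print(sorted(extra))
```

The extra offsets fall inside the two groups that the spec example keeps whole, การเขียน and ภาษาไทย.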

r12a commented 6 years ago

I'm getting the same segmentation using Chrome and Safari too. Try it (reduce the textarea width): https://r12a.github.io/pickers/thai/?text=%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%E0%B9%81%E0%B8%A5%E0%B8%B0%E0%B8%95%E0%B8%B1%E0%B8%A7%E0%B8%AD%E0%B8%A2%E0%B9%88%E0%B8%B2%E0%B8%87%E0%B8%81%E0%B8%B2%E0%B8%A3%E0%B9%80%E0%B8%82%E0%B8%B5%E0%B8%A2%E0%B8%99%E0%B8%A0%E0%B8%B2%E0%B8%A9%E0%B8%B2%E0%B9%84%E0%B8%97%E0%B8%A2
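For readers who cannot follow the link, the `text` parameter is simply the sample sentence percent-encoded as UTF-8. A small sketch using Python's standard `urllib.parse.unquote` recovers it; the leading `%20` padding from the link is omitted here, and the per-word comments are an annotation added for illustration.

```python
from urllib.parse import unquote

# Percent-encoded value of the `text` parameter, without the leading %20 padding.
encoded = (
    "%E0%B9%81%E0%B8%A5%E0%B8%B0"                                      # และ
    "%E0%B8%95%E0%B8%B1%E0%B8%A7%E0%B8%AD%E0%B8%A2%E0%B9%88%E0%B8%B2%E0%B8%87"  # ตัวอย่าง
    "%E0%B8%81%E0%B8%B2%E0%B8%A3"                                      # การ
    "%E0%B9%80%E0%B8%82%E0%B8%B5%E0%B8%A2%E0%B8%99"                    # เขียน
    "%E0%B8%A0%E0%B8%B2%E0%B8%A9%E0%B8%B2"                             # ภาษา
    "%E0%B9%84%E0%B8%97%E0%B8%A2"                                      # ไทย
)

sample = unquote(encoded)  # unquote decodes the bytes as UTF-8 by default
print(sample)
```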

brucelawson commented 6 years ago

และ•ตัวอย่าง•การ•เขียน•ภาษา•ไทย This is correct.


frivoal commented 6 years ago

@r12a @brucelawson I made a pull request based on your suggested fix. I cannot read Thai myself, so please confirm I got it right. https://github.com/w3c/csswg-drafts/pull/2457

frivoal commented 6 years ago

While we're on this, I wondered if we should make a test case out of this, to make sure that browsers correctly get the dictionary based line breaking, and the impact of word-break on it.

It looks like we already have 3 such tests: word-break-normal-th-000.html, word-break-break-all-003.html, and word-break-keep-all-003.html.

From what I can tell, they already cover what we'd want for this, and the line breaking seems to be done correctly in the first two.

However, the text in the third one seems problematic: just like in the spec sample, it uses แและ. Shouldn't that be และ instead? Also, I suppose that ภาษาไทย ภาษาไทย would be a better sample text than แและ แและแและ.

@r12a @brucelawson Can you confirm? I'll submit a Pull Request for the test as well if needed.

r12a commented 6 years ago

The reason i noticed the Thai problems in the spec is that i'm currently rewriting those tests and adding to them. :-) I'll send a PR when done.

frivoal commented 6 years ago

Cool. Let me know when you're done and I'll be happy to review. Meanwhile, could you (or @brucelawson) have a look at the spec PR https://github.com/w3c/csswg-drafts/pull/2457 and let me know if it's good to merge?

fantasai commented 6 years ago

OK, I asked a friend and she says และ•ตัวอย่าง•การเขียน•ภาษาไทย is correct.

fantasai commented 6 years ago

Now the question in my mind is, do we have different rules for typesetting vs handwriting? Or is there something else going on here?

jclark commented 6 years ago

Whether there is a word-break at a particular point is not as clear cut in Thai as in English. At some points, you can say there definitely isn't a word-break, for example, within a syllable. At other points, you can say there definitely is. But there's a large grey area, where native Thai speakers will disagree and it really depends on what you mean by "word" and what you are trying to do. The main grey area is compound words. Compound words are a lot more common in Thai than in English, say, because native Thai words (not derived from Pali, Sanskrit, Khmer etc) are mono-syllabic. In a compound word, whether you can have a word break is a matter of degree depending on the semantics: to what extent is the meaning of the compound derivable from the meaning of the components. So for example: การเขียน is a compound word meaning "writing" which is composed from "การ", which is a word that is used to create a noun phrase from a verb, and "เขียน", which means to write. There's a little bit more to the meaning of "การเขียน" than its two components, but not a lot, so breaking between การ and เขียน is not as good as breaking between การเขียน and ภาษาไทย, but is pretty OK, and it would depend on the typographic situation as to whether in practice a typographer would break there.

I just looked at a Thai weekly news magazine (Matichon weekly) on my breakfast table, which is set with quite narrow columns, and it breaks compounds quite aggressively (i.e. approximately และ•ตัวอย่าง•การ•เขียน•ภาษา•ไทย). You could even break ตัว•อย่าง but that would start to be a little strange.

Another complication is that what you want for line breaking is not necessarily what you want for selection or cursor movement by word.

frivoal commented 6 years ago

@jclark Very interesting. Do you think there is any way this could be made to be influenced by line-break:loose vs line-break:normal vs line-break:strict or is this all too subjective?

Another complication is that what you want for line breaking is not necessarily what you want for selection or cursor movement by word.

Why? Some subjective/intuitive feeling of "it ought to do that", or is there more logic to it, and is that logic we could depend on?

jclark commented 6 years ago

I think it would be very appropriate for the choice of word breaks to be influenced by line-break:loose/normal/strict.

A fundamental difficulty is that matching against a dictionary is a far from adequate approach to Thai word-breaking. The state of the art today uses machine learning and a corpus. However, I don't know of any corpus that marks up fine-grained distinctions between word boundaries. Maybe it would be possible to figure that out automatically, but that would be a research problem.
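As a rough illustration of why dictionary matching alone falls short, here is a minimal Python sketch of greedy longest-match segmentation. It is not how any browser or ICU actually segments Thai; the function name `longest_match` and both toy dictionaries are assumptions made for this example. The point it shows is that the result is fixed by whichever entries the dictionary happens to contain, with no notion of "how good" a break is.

```python
def longest_match(text: str, dictionary: set[str]) -> list[str]:
    """Greedy left-to-right segmentation: take the longest dictionary entry at each position."""
    words: list[str] = []
    max_len = max(map(len, dictionary))
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                break
        else:
            candidate = text[i]  # nothing matched: emit the character on its own
        words.append(candidate)
        i += len(candidate)
    return words


text = "และตัวอย่างการเขียนภาษาไทย"

# Toy dictionary (hypothetical) that lists only the components.
components = {"และ", "ตัวอย่าง", "การ", "เขียน", "ภาษา", "ไทย"}
# The same toy dictionary with the two compounds added.
with_compounds = components | {"การเขียน", "ภาษาไทย"}

fine = longest_match(text, components)
coarse = longest_match(text, with_compounds)
print("•".join(fine))
print("•".join(coarse))
```

With the first dictionary every compound is split; with the second, none is. Neither can express that a break inside การเขียน is merely less desirable than one after it, which is the graded distinction described earlier in the thread.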

Another big problem area is proper names. These are quite challenging because they are composed from multiple words, but shouldn't be broken, and there are no capital letters to distinguish them (instead there are words, such as the equivalent to Mr/Mrs/Miss, that are typically followed by a proper name).

The goals of word segmentation for line-breaking and for editing are a bit different. With line-breaking, you are trying to maximize the number of line-break opportunities without impairing readability. With editing, predictability is important, and you also want units that correspond as often as possible to what a user wants to edit. But really you would need to do user testing to see what people find convenient. My guess is that it would be convenient to have editing-words be longer than line-breaking-words.

jclark commented 6 years ago

Here's a paper that explains things well.

frivoal commented 6 years ago

Note: this is veering increasingly off topic from the initial issue. This discussion shouldn't be considered an attempt to reopen it as far as I am concerned; it just sprang from the initial problem and seems worth talking about.

This is very interesting, and the concepts apply to other non-space separated languages as well—at least section 2 about words, the discussion about sentences in section 3 may be more specific to Thai or at least apply to fewer languages.

For example, although the question of "what's a word" does not normally impact line breaking in Japanese, it sometimes does in titles and headings. Children's books also sometimes insert spaces between words, and then we need to worry about what's a word. Also, I know of some research showing that in Japanese there are word-based styling effects that can help with reading comprehension, both for the general population and for people with reading difficulties such as dyslexia.

And for that kind of exercise, the logic presented in this paper applies perfectly well. I wonder if this is something we could somehow build into the system, given that there are stylistic effects one may want to tie into this.

r12a commented 6 years ago

Related to this general discussion, fwiw, I've been working on some text to describe the various different approaches to line-breaking, that may become an article at some point. It includes the following table, where names represent scripts (it's just an excerpt, and doesn't include the information about archaic scripts). The term 'word' here represents a vague concept that can be one or more syllables, and of course special rules apply to pretty much all scripts affecting what can and can't start and end a line.

| | Space as word separator | Other word separator | Syllable separator | No word or syllable separator |
|---|---|---|---|---|
| Wraps words | Hangul*, Arabic, Armenian, Bengali, Cherokee, Cyrillic, Devanagari, Greek, Gujarati, Gurmukhi, Hebrew, Kannada, Latin, Malayalam, Mandaic, N’Ko, Oriya, Sinhala, Syriac, Tamil, Telugu, Thaana, Tifinagh, UCAS, Coptic?, Glagolitic, Georgian, Newa?, Mongolian?, Limbu?, Meetei Mayak?, Mro?, Ol Chiki?, Chakma?, Lepcha?, Saurashtra?, Masaram Gondi?, Tai Viet, Pau Cin Hau, Adlam, Osage?, Deseret | Ethiopic, Samaritan | | Khmer, Lao, Myanmar, Thai, Tai Le?, Tai Tham? |
| Wraps syllables | Sundanese, Buginese?, Cham, Lisu*** | | Tibetan | Balinese, Javanese, Batak |
| Wraps characters | Hangul* | | | Chinese, Japanese, Yi?, Vai |

Notice that this divides up the problem space in a slightly different way than the spec. Note, in particular, that it's not always a question of wrapping 'words' when morphological analysis is applied to determine line breaks – several scripts simply wrap at syllable boundaries, whether or not those syllables are complete words. Determining those syllable boundaries, however, may also require understanding the text (eg. unless the application understands the text to some extent, it may be difficult to tell whether a character representing a nasal sound has an inherent vowel (ie. is a syllable in itself), or is just the final consonant in a syllable.)

r12a commented 6 years ago

Btw, I believe 'syllable' means 'orthographic syllable' in at least most cases, but i need to check that. An orthographic syllable may include consonants at the end of one phonetic syllable and the beginning of the next.

fantasai commented 6 years ago

OK, I've fixed the typo in the Thai example and added the variations in handling compound words as an example of UA tailoring for line-break values strict vs. loose. The current wording on the example is:

As UAs can add additional distinctions between ''line-break/strict''/''line-break/normal''/''line-break/loose'' modes, these values can exhibit other differences as well. For example, a UA with sufficiently-advanced Thai language processing ability could choose to map different levels of strictness in Thai line-breaking to these keywords, e.g. disallowing breaks within compound words in ''line-break/strict'' mode (e.g. breaking ตัวอย่างการเขียนภาษาไทย as ตัวอย่าง·การเขียน·ภาษาไทย) while allowing more breaks in ''line-break/loose'' (ตัวอย่าง·การ·เขียน·ภาษา·ไทย).
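The mapping described in that wording can be sketched in a few lines of Python. Everything here is hypothetical: the hand-annotated `ANALYSIS` stands in for whatever language processing a UA might have, and `segments` is an illustrative name, not an API. Only `strict` and `loose` are modeled, because the example text says nothing about where `normal` should fall.

```python
# Hypothetical hand-annotated analysis: each tuple is one word, and a tuple
# with several items is a compound made of those components.
ANALYSIS = [("ตัวอย่าง",), ("การ", "เขียน"), ("ภาษา", "ไทย")]


def segments(analysis: list[tuple[str, ...]], line_break: str) -> list[str]:
    """Return the units between which a line break would be allowed."""
    if line_break == "strict":
        # No breaks inside compound words.
        return ["".join(compound) for compound in analysis]
    if line_break == "loose":
        # Breaks are also allowed between the components of a compound.
        return [part for compound in analysis for part in compound]
    raise ValueError(f"unmodeled line-break value: {line_break!r}")


print("·".join(segments(ANALYSIS, "strict")))
print("·".join(segments(ANALYSIS, "loose")))
```

In this sketch every `strict` break opportunity is also a `loose` one, so `loose` only ever adds opportunities.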

The example for word-break, which should probably reflect line-break: normal, is currently listed as “และ•ตัวอย่าง•การเขียน•ภาษาไทย”. @jclark Is this an appropriate example, or should the example reflect a different breaking for “normal” text?

@frivoal For the tests, we probably want to set them up so that they'll pass with either ตัวอย่าง·การเขียน·ภาษาไทย or ตัวอย่าง·การ·เขียน·ภาษา·ไทย, since it seems like both should be considered conforming.
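One way to express "passes with either" is to accept any set of rendered lines whose breaks all fall on the finer-grained opportunities, since the coarser grouping uses a subset of them. The Python sketch below is only an illustration of that idea, not how the web-platform-tests are written; `lines_are_acceptable` and its inputs are hypothetical.

```python
def lines_are_acceptable(lines: list[str], loose_words: list[str]) -> bool:
    """True if each line is a run of whole words and the lines cover the text exactly."""
    remaining = list(loose_words)
    for line in lines:
        consumed = ""
        # Take whole words until the line length is reached or exceeded.
        while remaining and len(consumed) < len(line):
            consumed += remaining.pop(0)
        if consumed != line:
            return False  # the line ended in the middle of a word
    return not remaining  # every word must have been placed on some line


LOOSE_WORDS = ["ตัวอย่าง", "การ", "เขียน", "ภาษา", "ไทย"]

# Breaks that keep compounds whole are accepted...
print(lines_are_acceptable(["ตัวอย่าง", "การเขียน", "ภาษาไทย"], LOOSE_WORDS))
# ...and so are breaks between the components of a compound.
print(lines_are_acceptable(["ตัวอย่าง", "การ", "เขียน", "ภาษา", "ไทย"], LOOSE_WORDS))
# A break inside ตัวอย่าง is rejected.
print(lines_are_acceptable(["ตัว", "อย่างการเขียนภาษาไทย"], LOOSE_WORDS))
```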

As usual, let me know if these edits seem appropriate or if there are additional improvements we should make to the spec.

frivoal commented 6 years ago

Thinking again, I am not quite sure what we should test here, as a smart Thai UA could distinguish between loose/normal/strict to do various levels of breaking, but we're not mandating (or even encouraging) any particular mapping. So I'll remove the "needs test" label.

However, the issue with word-break-keep-all-003.html mentioned in https://github.com/w3c/csswg-drafts/issues/2455#issuecomment-374437766 is still there, so i've opened an issue to deal with that: https://github.com/web-platform-tests/wpt/issues/13329

frivoal commented 6 years ago

Clarifying my previous comment: yes, we could still have a test that checks that any of the possible line breakings is allowed, regardless of loose/normal/strict, but that's not testing anything in particular other than Thai line breaking working in general, which word-break-normal-th-000.html already does.

r12a commented 5 years ago

I've fixed the typo in the Thai example

Example 6 at https://drafts.csswg.org/css-text-3/#valdef-word-break-keep-all ? Doesn't look different to me...

To be honest, i'd prefer a table here rather than the added complication of a single sentence where the language used for punctuation doesn't seem to be clear (presumably the context is Latin, since a proportional comma is used throughout - though that makes it seem weird that Latin text doesn't start the example. Also, for example, the comma after the Arabic appears to indicate that there's no break point between it and و whereas actually there is a whole bunch of Arabic between and the comma should be paired with ی). The table would have the property values down the side, and languages across the top. It would also prevent the double-take i experienced when i saw a period after the last Thai word.

r12a commented 5 years ago

added the variations in handling compound words as an example of UA tailoring for line-break values strict vs. loose.

I was actually expecting strict to produce the greater number of break opportunities, rather than fewer, since i intuitively associated strictness with delimiting primitive words. (?) In other words, i'd have expected to see (though i could be wrong):

For example, a UA with sufficiently-advanced Thai language processing ability could choose to map different levels of strictness in Thai line-breaking to these keywords, e.g. disallowing breaks within compound words in strict mode (e.g. breaking ตัวอย่างการเขียนภาษาไทย as ตัวอย่าง·การ·เขียน·ภาษา·ไทย) while allowing more breaks in loose (ตัวอย่าง·การเขียน·ภาษาไทย).

fantasai commented 5 years ago

@r12a

Example 6 at https://drafts.csswg.org/css-text-3/#valdef-word-break-keep-all ? Doesn't look different to me...

There are no more occurrences of แแ in the document, which is the typo you reported in the OP. If there's another typo you'll have to specify what it is...

To be honest, i'd prefer a table here rather than the added complication of a single sentence where the language used for punctuation doesn't seem to be clear

I could change it to a table, but a large part of the point here is dealing with mixed script text, so I'd rather keep it all in a single paragraph. I've removed all the punctuation, though, to avoid the problems you mentioned. Let me know if this is acceptable.

I was actually expecting strict to produce the greater number of break opportunities, rather than fewer, since i intuitively associated strictness with delimiting primitive words. (?)

The progression is more break opportunities -> fewer opportunities going from anywhere -> loose -> normal -> strict. I don't think making an exception here to make Thai go backwards is a good idea.

Ernedar commented 5 years ago

Hello, I have been on this mailing list only a short time, but I would like to ask whether adding percentages to transform:scale() is under consideration. At the moment it takes numbers such as [-1, 1], but no percentage values. Scaling by percentage would be nice to have.

For example, transform:scale(120%) would work the same as transform:scale(1.2), etc.

Thanks for your time


fantasai commented 5 years ago

Hi @Ernedar I think you're commenting on the wrong issue. Could you please file a new issue against the Transforms spec? Here's the link for filing a new issue. Please tag it against [css-transforms].

fantasai commented 5 years ago

@r12a Could you confirm whether https://github.com/w3c/csswg-drafts/issues/2455#ref-commit-a58db91 and the ensuing comment address your remaining concerns in this issue?

r12a commented 5 years ago

Yes. Thanks.

Btw, fwiw, there is a set of exploratory, interactive test pages linked from https://w3c.github.io/i18n-tests/results/exploring-linebreak, including a test for Thai.