[css-text-3] Hyphenation usages in CJK

kojiishi commented 7 years ago

Got feedback from Japanese web developers that there's no good way to minimize the expansion of justified text.

If the text is purely CJK, there won't be much expansion (mostly only at break exceptions), but when long English words appear in the CJK text, lines may expand too much. He then:

Tried hyphens: auto, but since the pages have lang=ja, hyphenation doesn't work.
Tried word-break: break-all, but Gecko and WebKit ignore CJK line breaking rules, and hit a bug in Blink.
Using full-width letters (which break like break-all) is technically possible, but he doesn't want to take that route.
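
A minimal sketch of the first two attempts, assuming a page whose root element carries lang="ja". The class names are hypothetical and only serve to keep the two attempts apart; they are not taken from the developer's actual stylesheet.

```css
/* Hypothetical class names; the document is assumed to have lang="ja". */
.justified {
  text-align: justify;
}

/* Attempt 1: automatic hyphenation. The content language is "ja",
   so English hyphenation rules are not applied to the embedded
   English words. */
.justified.attempt-hyphens {
  hyphens: auto;
}

/* Attempt 2: allow breaks inside words instead of hyphenating. */
.justified.attempt-break-all {
  word-break: break-all;
}
```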

So I'm thinking of a few possible ideas:

A hyphens-lang property to override the language only for hyphenation.
For languages that do not hyphenate, but often mix English, automatically fallback to English.

Thoughts?

/cc @upsuper @fantasai @frivoal @r12a

kojiishi commented 7 years ago

Note, I think applying English hyphenation to lang=CJK pages is helpful even after all browsers implement break-all properly. I mean, IIUC, some authors prefer hyphenation over break-all. Providing a good way to do so without asking authors to wrap all English words with <span> would be a nice feature.

frivoal commented 7 years ago

Tried word-break: break-all, but Gecko and WebKit ignore CJK line breaking rules

That seems to be a spec violation. Fixing this would solve most of the problem, except

some authors prefer hyphenation over break-all.

That's right, but not sure what to do here.

For languages that do not hyphenate, but often mix English, automatically fallback to English.

That's one possibility. Or the browser could ship with a hyphenation dictionary for Japanese which would include a mix of words from various languages, mostly English, often found in Japanese texts?

Crissov commented 7 years ago

Shouldn’t :lang(en) {hyphens: auto} just work, assuming proper markup like <span lang="en">English inside CJK</span>?

kojiishi commented 7 years ago

assuming proper markup like <span lang="en">English inside CJK</span>

That is the problem. It makes sense to mark up French words within English that way, but to me it doesn't make sense to do so for CJK. I'm still not able to explain well how French is different from CJK in this scenario. At this moment, I think writing systems with the following characteristics have the same issue:

English words are used together with native scripts so often that people don't consider them to be outside the native writing system.
The native scripts don't use hyphenations.

I'm hoping the I18N WG can help us figure out better criteria and the languages that match those criteria.

For languages that do not hyphenate, but often mix English, automatically fallback to English.

That's one possibility. Or the browser could ship with a hyphenation dictionary for Japanese which would include a mix of words from various languages, mostly English, often found in Japanese texts?

I like the idea to allow such a localized English dictionary, thank you.

So is my understanding correct that you prefer lang="ja" to hyphenate English words, possibly with a specialized dictionary, rather than having a hyphens-lang property?

r12a commented 7 years ago

The native scripts don't use hyphenations

There's still a problem where, say, Latin-script words are regularly embedded in a script that does hyphenate, such as Arabic. You need to apply the right hyphenation rules according to the language of the second script.

Here's my thought process:

For any hyphenation to take place, the browser needs a hyphenation dictionary. Hyphenation dictionaries and rules are language specific, so you need to know the language of the text you're going to hyphenate.

Much romaji text in Japanese content will be things such as acronyms, which won't hyphenate anyway, but not all. Other text may be transliterations of Japanese.

It's unlikely that romaji text in Japanese will be marked up for a given language; it will instead fall under the ja used for the document or passage as a whole. This is because often the romaji text is not really considered to be in a different language, just in a different script. Even where the words are clearly, say, English, Japanese people don't see them as a separate language in the same way as German embedded in English.

Because there's unlikely to be markup, the hyphens property can't be used, because there's no way to tell the language of the non-CJK text.

On the other hand, assuming that it's English may work most of the time. There may however also be Japan-specific terms that are not in the standard English dictionary(?). So a hyphenation algorithm that switches dictionaries as the script changes, and includes perhaps some local Latin words, might work for Japanese.

Embedding German words/phrases into Japanese content is likely to be much the same as embedding it in English – you'd expect to have to indicate that this uses German hyphenation rules rather than English by marking things up. Likewise, if your content contains text in a range of languages, it's best to mark it up. But, as mentioned earlier, typically foreign language text is not the same as romaji text in Japanese.

So we're talking about using a secondary hyphenation language where the script changes. This need not only be for Japanese, it is likely, for example, to be needed for Arabic too. There needs to be a way of knowing which language is typically being embedded in a script - it may not always default to English(?). (For example, for ar-MA it may be French(?).) It may not even be in the Latin script(?). Should one store the information about what language to assume in the browser, or allow the content author to specify it? The latter could also be useful for unusual passages where, say, all the Latin script text is in German, to save time in marking it up.

However, if you want hyphenation to occur, not only do you need to guess or indicate the language of the alternate script, but you also need to disable word-break:break-all for non-CJK runs of text.

So maybe you need (a) a new value, word-break:break-all-hyphenate, to make non-CJK text hyphenate (b) a new property, alt-script-hyphen-lang: <bcp 47 tag>
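
To make the shape of these two suggestions concrete, here is a sketch of how they might be written. This is hypothetical syntax only: neither `break-all-hyphenate` nor `alt-script-hyphen-lang` is defined in any CSS specification or implemented in any browser, and the `:lang()` selectors and the `en` / `fr` values are assumptions picked to mirror the Japanese and ar-MA examples above.

```css
/* Hypothetical syntax: neither declaration exists in a CSS
   specification or in any browser. */
:lang(ja) {
  /* (a) proposed value: let non-CJK runs hyphenate rather than
     break between arbitrary letters */
  word-break: break-all-hyphenate;
  /* (b) proposed property: BCP 47 tag of the language whose
     hyphenation rules apply to text in the alternate script */
  alt-script-hyphen-lang: en;
}

/* The ar-MA case from the comment, where French is only a guess. */
:lang(ar-MA) {
  alt-script-hyphen-lang: fr;
}
```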

frivoal commented 7 years ago

In theory, it could be any language in any language, any script in any script. But I don't think the general solution has to take care about that. For rare cases, marking up the correct language sounds fine.

(a) a new value, word-break:break-all-hyphenate

That would make sense. Or maybe as a separate value of the hyphens property.

(b) a new property, alt-script-hyphen-lang: <bcp 47 tag>

Not sure that would work so well. At least in the case of Japanese, while most words are indeed likely to be from English, which language the word is from varies on a per word basis, not on a per document (or element) basis. The same paragraph may have two brand names in English and one in French, for instance.

That's what makes me think that rather than trying to tag the language via CSS, a dictionary for hyphenation in Japanese should be made of Latin-script words.

For Arabic, the same applies, and you could have an ar-MA dictionary with more French words, to follow your example, but the same dictionary could still include some words in English as well (brand names, for instance).

Crissov commented 7 years ago

It still sounds to me like it should be solved at the Selectors level, perhaps one of these:

:lang(ja, Latn), /* not the same as ja-Latn */
:lang(ja):lang(Latn),
:lang(ja):script(Latn),
:lang(ja):not(:script(Jpan)), /* ISO 15924: Jpan = Hani + Hrkt = Hani + Hira + Kana */
:lang(ja-Jpan) ::foreign-phrase
    {hyphens: auto}

PS: I think pseudo-classes would only work if phrases in Latin script were wrapped in any kind of element, so a pseudo-element is probably the better choice.

frivoal commented 7 years ago

how do you define ::foreign-phrase?

Crissov commented 7 years ago

I’m not sure exactly, but probably every run of characters that are neither Common nor part of the current script context (i.e. Hani or Kana in the example). Maybe this should be specified directly: ::foreign-phrase(ja-Jpan) or ::foreign-phrase(Jpan).

kojiishi commented 7 years ago

I'm probably missing something, I can't identify what it is, but I don't understand how break-all-hyphenate can help, nor selectors. Sorry if I'm missing something very obvious.

But knowing that Arabic has the same issue is great, and @r12a's analysis on language and script makes sense to me. From what I understand, alt-script-hyphen-lang picks the dictionary for words whose script is not the script of the specified lang, correct? To me, that looks like it solves 99% of the cases for CJK and Arabic.

For more perfection, @frivoal's dictionary idea should help. I think anyone can create such dictionaries without needing CSS to define it, but it looks to me that in addition to it, having alt-script-hyphen-lang should help interoperable fallback behavior in most common cases.

kojiishi commented 7 years ago

@frivoal helped me to understand what break-all-hyphenate does in a comment in #791; so it is to try to hyphenate first, but break-all if there are no hyphenation points in the word?

I wasn't thinking about the case where there are no hyphenation points, but now it looks interesting. I think we can just define that behavior in #791 since we don't have use cases for opposite cases, though maybe it's not easy to reach a consensus.

So, though there's no consensus yet on how, I support both of @r12a's points.

I still have very low confidence in how well I understand @Crissov's idea -- sorry, but IIUC, are you trying to apply hyphens: auto for non-native scripts, while hyphens: manual for native scripts? It is an interesting idea if my understanding is correct, but the case I would like to solve in this issue is slightly different; it is to apply hyphens: auto for all text, but with different dictionaries depending on the script. Apologies in advance if I still don't seem to have understood.

Crissov commented 7 years ago

It seems @kojiishi understood me correctly, but I didn’t understand the actual use case.

If I understand it correctly now, Japanese hyphenation dictionaries would simply have to include imported words and phrases that retain their (Latin) script – at least if using full-width chars.

kojiishi commented 7 years ago

So the issue we're discussing is, for writing systems that have mixed scripts:

Logically speaking, using a dictionary that is a mix of multiple scripts makes sense and people seem to be fine with it. Whether it should be a physical dictionary of that kind, or whether the UA may synthesize such a mixed dictionary, looks controversial, but I think it's an implementation detail that CSS doesn't care about.
When the UA synthesizes such a mixed dictionary, @r12a pointed out that authors may want to specify the dictionary for Latin. This allows scripts such as CJK or Arabic to use English or French depending on the content.

The fallback/synthesized/specialized dictionaries shouldn't need CSS specs; the UA can just do it. The I18N WG might be able to provide good guidance in doing so, though.

To allow authors to specify the secondary/tertiary dictionaries per script, we'll need a property. This should go to L4.

For word-break:break-all-hyphenate, I think we should just change the existing behavior because the other behavior has no good use cases, though I do not understand all of what Florian wrote in #791 -- I need more time to understand.

kojiishi commented 7 years ago

at least if using full-width chars

We don't need hyphenation for full-width chars, since full-width chars are typographically similar to ideographic characters and can break anywhere. The discussion here is about a mix of ASCII Latin letters and CJK/Arabic/other letters.

frivoal commented 6 years ago

I stand by https://github.com/w3c/csswg-drafts/issues/785#issuecomment-264397300

This is solved by:

browsers fixing their implementation of break-all
including foreign-words-that-are-commonly-used-in-Japanese in the Japanese hyphenation dictionary
authors marking up rare foreign words as the correct language.
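
On the author side, the first and third points could look like the sketch below. It assumes a paragraph with a hypothetical `.body` class inside a lang="ja" page, and a rare foreign word wrapped in something like `<span lang="de">…</span>`. The second point concerns what the browser's dictionary contains and has no CSS counterpart.

```css
/* Hypothetical class name on a paragraph inside a lang="ja" page. */
.body {
  text-align: justify;
  /* hyphens is an inherited property, so a descendant such as
     <span lang="de">...</span> is hyphenated according to its own
     language, provided the browser has rules for that language. */
  hyphens: auto;
}

/* Alternative for authors who prefer breaks without hyphens.
   The modifier class is also hypothetical. */
.body.no-hyphens {
  hyphens: manual;
  word-break: break-all;
}
```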

frivoal commented 6 years ago

Agenda+ to try to resolve on the suggestion in the previous comment.

dbaron commented 6 years ago

Are there any engines that currently implement word-break: break-all in what the spec says is the correct way? Which? (I'm concerned that it's widely-enough used that if the answer is no, or even possibly yes with insufficient market share, it may not be possible to change.)

r12a commented 6 years ago

@dbaron https://w3c.github.io/i18n-tests/results/word-break#breakall

frivoal commented 6 years ago

@r12a the specific aspect being discussed here is not covered by these tests (but nice tests, otherwise:).

Here's one that does look into it: http://jsbin.com/barazageti/edit?html,css,output EDIT: different host for those who like that one better: https://software.hixie.ch/utilities/js/live-dom-viewer/?saved=5834

Firefox	Chrome	Safari	Edge
❌	✅	❌	✅

So yes, I think it should be web compatible to keep the spec as is.

PS: I should check that we have a test for that in the test suite, or add one based on the one above.

css-meeting-bot commented 6 years ago

The Working Group just discussed Hyphenation usages in CJK, and agreed to the following resolutions:

RESOLVED: No change on issues #785

The full IRC log of that discussion

<dael> Topic: Hyphenation usages in CJK
<dael> github: https://github.com/w3c/csswg-drafts/issues/785
<dael> florian: In Japanese it uses mixed writing systems. Just because there's latin characters it doesn't mean it's English. It won't get language tagged as English, but they do expect hyphenation.
<dael> florian: My take is it should just be in the hyphenation dictionary for Japanese. If people want a hyphen the hyphens property does that if it's in the dictionary. For the rare word that isn't in the dictionary it should be language tagged
<dael> myles: Is this just Japanese?
<dael> florian: If you're going to use german words, not tag them, and use them in English I'm going to assume it's common in English or the author did it wrong. English does not contain all words of all languages.
<tantek> Schadenfreude for non-English languages?
<dael> florian: It is somewhat common in Japanese to have words like that, but if they're common they should be in the dictionary. I don't want a property to hyphenate words not in the language they're in.
<dael> florian: tantek's example is good.
<dael> florian: Trick with Japanese where you can't tell it by the script in English and German, you can tell it by the script. But if a Japanese person wrote Schadenfreude you can't tell it's not English.
<dael> florian: If it's common put it in the dictionary. If it's not known it's not known.
<dael> astearns: So florian you suggest don't do anything?
<dael> florian: Other than fix bugs in some browsers. In the case where the author doesn't want hyphens, but just breaks between all characters, there's word-break:break-all; it's only going to give you places except where there should never be a line break. 2 browsers break absolutely everywhere.
<dael> fantasai: Initial complaint is solved by browsers fixing impl of break-all keyword to conform to the spec and include words commonly used in Japanese that use latin characters. For words that aren't common and won't be in Japanese dictionary authors will need markup to be appropriate.
<dael> fantasai: General conclusion is no change to the spec, but initial problem is solved by the three things florian outlines.
<fantasai> https://github.com/w3c/csswg-drafts/issues/785#issuecomment-370366049
<dael> myles: spec doesn't say which hyphen opportunities exist. I guess I agree with florian. Impl have a lot of leeway on how to hyphenate.
<dael> astearns: I'm a little concerned about resolving without koji but I'm personally convinced about no change.
<dael> astearns: Objections to resolving no change on this?
<dael> RESOLVED: No change on issues #785
<dael> hober: florian do you have links to the bugs?
<dbaron> I think the relevant Gecko bug is https://bugzilla.mozilla.org/show_bug.cgi?id=1358019 although I might be wrong
<dael> florian: I have not. I only wrote the test an hour ago. I'll make a proper test and submit bugs.
<dbaron> (and that was filed by fantasai)
<dael> florian: koji found this and he said he can't do it because it didn't work in some browsers. dbaron asked if it was some or all and I wrote this test and it's some. It was an ad hoc test so I'll write a proper one.
<TabAtkins> (just btw)

litherum commented 6 years ago

Safari uses the platform's (CoreFoundation's) hyphenation dictionaries. It's valuable for Web content's hyphenation to match the rest of the system.

frivoal commented 6 years ago

@litherum makes sense to me. The argument I used (words commonly used in Japanese using the latin alphabet should be in the Japanese hyphenation dictionary) is equally valid outside of the web. Exactly how much to include is of course up to the UA (or the OS, if that's how the UA wants to do it).

frivoal commented 6 years ago

Turned the test from https://github.com/w3c/csswg-drafts/issues/785#issuecomment-374866706 into a wpt test case https://github.com/web-platform-tests/wpt/pull/13415. Changing labels accordingly.

r12a commented 4 years ago

The i18n WG agrees to close this issue, and has closed its tracker.
