Closed kojiishi closed 6 years ago
Note, I think applying English hyphenation to lang=CJK pages is helpful even after all browsers implement break-all properly. I mean, IIUC, some authors prefer hyphenation over break-all. Providing a good way to do so without asking authors to wrap all English words with <span> is a nice feature.
Tried word-break: break-all, but Gecko and WebKit ignore CJK line breaking rules
That seems to be a spec violation. Fixing this would solve most of the problem, except
some authors prefer hyphenation over break-all.
That's right, but not sure what to do here.
For languages that do not hyphenate, but often mix English, automatically fallback to English.
That's one possibility. Or the browser could ship with a hyphenation dictionary for Japanese which would include a mix of words from various languages, mostly English, often found in Japanese texts?
Shouldn’t :lang(en) {hyphens: auto} just work, assuming proper markup like <span lang="en">English inside CJK</span>?
assuming proper markup like <span lang="en">English inside CJK</span>
That is the problem. It makes sense to mark up French words within English that way, but to me it doesn't make sense to do so for CJK. I'm still not able to explain well how French is different from CJK in this scenario. At this moment, I think writing systems with the following characteristics have the same issue:
I'm hoping I18N WG can help us to figure out better criteria and the languages that match those criteria.
For languages that do not hyphenate, but often mix English, automatically fallback to English.
That's one possibility. Or the browser could ship with a hyphenation dictionary for Japanese which would include a mix of words from various languages, mostly English, often found in Japanese texts?
I like the idea to allow such a localized English dictionary, thank you.
So is my understanding correct that you prefer lang="ja" to hyphenate English words, possibly with a specialized dictionary, rather than having a hyphens-lang property?
The native scripts don't use hyphenations
There's still a problem where, say, Latin-script words are regularly embedded in a script that does hyphenate, such as Arabic. You need to apply the right hyphenation rules according to the language of the second script.
Here's my thought process:
For any hyphenation to take place, the browser needs a hyphenation dictionary. Hyphenation dictionaries and rules are language specific, so you need to know the language of the text you're going to hyphenate.
Much romaji text in Japanese content will be things such as acronyms, which won't hyphenate anyway, but not all. Other text may be transliterations of Japanese.
It's unlikely that romaji text in Japanese will be marked up for a given language, but it will instead fall under the ja used for the document or passage as a whole. This is because often the romaji text is not really considered to be in a different language, just in a different script. Even where the words are clearly, say, English, Japanese people don't see it as a separate language in the same way as German embedded in English.
Because there's unlikely to be markup, the hyphens property can't be used, because there's no way to tell the language of the non-CJK text.
On the other hand, assuming that it's English may work most of the time. There may however also be Japan-specific terms that are not in the standard English dictionary(?). So a hyphenation algorithm that switches dictionaries as the script changes, and includes perhaps some local Latin words, might work for Japanese.
Embedding German words/phrases into Japanese content is likely to be much the same as embedding it in English – you'd expect to have to indicate that this uses German hyphenation rules rather than English by marking things up. Likewise, if your content contains text in a range of languages, it's best to mark it up. But, as mentioned earlier, typically foreign language text is not the same as romaji text in Japanese.
So we're talking about using a secondary hyphenation language where the script changes. This need not only be for Japanese, it is likely, for example, to be needed for Arabic too. There needs to be a way of knowing which language is typically being embedded in a script - it may not always default to English(?). (For example, for ar-MA it may be French(?).) It may not even be in the Latin script(?). Should one store the information about what language to assume in the browser, or allow the content author to specify it? The latter could also be useful for unusual passages where, say, all the Latin script text is in German, to save time in marking it up.
However, if you want hyphenation to occur, not only do you need to guess or indicate the language of the alternate script, but you also need to disable word-break:break-all for non-CJK runs of text.
So maybe you need
(a) a new value, word-break:break-all-hyphenate, to make non-CJK text hyphenate
(b) a new property, alt-script-hyphen-lang: <bcp 47 tag>
In theory, it could be any language in any language, any script in any script. But I don't think the general solution has to take care of that. For rare cases, marking up the correct language sounds fine.
(a) a new value, word-break:break-all-hyphenate
That would make sense. Or maybe as a separate value of the hyphens property.
(b) a new property, alt-script-hyphen-lang: <bcp 47 tag>
Not sure that would work so well. At least in the case of Japanese, while most words are indeed likely to be from English, which language the word is from varies on a per word basis, not on a per document (or element) basis. The same paragraph may have two brand names in English and one in French, for instance.
That's what makes me think that rather than trying to tag the language via CSS, a dictionary for hyphenation in Japanese should be made of Latin-script words.
For Arabic, the same applies, and you could have an ar-MA dictionary with more French words, to follow on your example, but the same dictionary could still include some words in English as well (brand names, for instance).
It still sounds to me like it should be solved at the Selectors level, perhaps one of these:
:lang(ja, Latn), /* not the same as ja-Latn */
:lang(ja):lang(Latn),
:lang(ja):script(Latn),
:lang(ja):not(:script(Jpan)), /* ISO 15924: Jpan = Hani + Hrkt = Hani + Hira + Kana */
:lang(ja-Jpan) ::foreign-phrase
{hyphens: auto}
PS: I think pseudo-classes would only work if phrases in Latin script were wrapped in any kind of element, so a pseudo-element is probably the better choice.
how do you define ::foreign-phrase?
I’m not sure exactly, but probably every run of characters that are neither Common nor part of the current script context (i.e. Hani or Kana in the example). Maybe this should be specified directly: ::foreign-phrase(ja-Jpan) or ::foreign-phrase(Jpan).
I'm probably missing something, I can't identify what it is, but I don't understand how break-all-hyphenate can help, nor selectors. Sorry if I'm missing something very obvious.
But knowing that Arabic has the same issue is great, and @r12a's analysis on language and script makes sense to me. From what I understand, alt-script-hyphen-lang picks the dictionary for words whose script is not the script of the specified lang, correct? That looks to solve 99% of the cases for CJK and Arabic to me.
For more perfection, @frivoal's dictionary idea should help. I think anyone can create such dictionaries without needing CSS to define it, but it looks to me that, in addition to it, having alt-script-hyphen-lang should help provide interoperable fallback behavior in the most common cases.
@frivoal helped me to understand what break-all-hyphenate does in a comment in #791; so it is to try to hyphenate first, but break-all if there are no hyphenation points in the word?
I wasn't thinking about the case where there are no hyphenation points, but now it looks interesting. I think we can just define that behavior in #791 since we don't have use cases for opposite cases, though maybe it's not easy to reach a consensus.
So, though there is no consensus yet on how, I support both points of @r12a.
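For reference, the "try to hyphenate first, fall back to break-all" behavior discussed here can be sketched as follows. This is only one possible reading, not spec text: the first_break function and its hyphenation_points argument are made up for illustration, widths are counted in characters rather than measured glyphs, and treating "no hyphenation point fits on the line" the same as "no hyphenation points in the word" is an assumption.

```python
def first_break(word, width, hyphenation_points):
    """Return (head, tail) for the first line of a word limited to `width` characters.

    hyphenation_points: indices i such that word[:i] + "-" is an allowed break
    (hypothetical input; a real engine would take these from a dictionary).
    """
    if len(word) <= width:
        return word, ""
    # Prefer the last hyphenation point whose head, plus the hyphen, still fits.
    usable = [i for i in hyphenation_points if 0 < i < len(word) and i + 1 <= width]
    if usable:
        i = max(usable)
        return word[:i] + "-", word[i:]
    # No usable hyphenation point: fall back to break-all and cut at the line width.
    return word[:width], word[width:]


print(first_break("hyphenation", 7, [2, 6]))  # hyphenation point at index 6 is used
print(first_break("HTTPS", 3, []))            # acronym with no points: plain cut
```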
I still have very low confidence on how well I understand @Crissov's idea -- sorry, but IIUC, are you trying to apply hyphens: auto for non-native scripts, while hyphens: manual for native scripts? It is an interesting idea if my understanding is correct, but the case I would like to solve in this issue is slightly different; it is to apply hyphens: auto for all text, but with different dictionaries depending on the script. Apologies in advance if I still don't seem to have understood.
It seems @kojiishi understood me correctly, but I didn’t understand the actual use case.
If I understand it correctly now, Japanese hyphenation dictionaries would simply have to include imported words and phrases that retain their (Latin) script – at least if using full-width chars.
So the issue we're discussing is, for writing systems that have mixed scripts:
The fallback/synthesizing/specialized dictionaries shouldn't need CSS specs, UA can just do it. I18N WG might be able to provide good guidance in doing so though.
To allow authors to specify the secondary/tertiary dictionaries per script, we'll need a property. This should go to L4.
For word-break:break-all-hyphenate, I think we should just change the existing behavior because the other behavior has no good use cases, though I do not understand all of what Florian wrote in #791 -- I need more time to understand.
at least if using full-width chars
We don't need hyphenation for full-width chars, since full-width chars are typographically similar to ideographic characters and can break anywhere. The discussion here is about a mix of ASCII Latin letters and CJK/Arabic/other letters.
I stand by https://github.com/w3c/csswg-drafts/issues/785#issuecomment-264397300
This is solved by:
Agenda+ to try to resolve on the suggestion in the previous comment.
Are there any engines that currently implement word-break: break-all in what the spec says is the correct way? Which? (I'm concerned that it's widely-enough used that if the answer is no, or even possibly yes with insufficient market share, it may not be possible to change.)
@r12a the specific aspect being discussed here is not covered by these tests (but nice tests, otherwise:).
Here's one that does look into it: http://jsbin.com/barazageti/edit?html,css,output EDIT: different host for those who like that one better: https://software.hixie.ch/utilities/js/live-dom-viewer/?saved=5834
Firefox | Chrome | Safari | Edge |
---|---|---|---|
❌ | ✅ | ❌ | ✅ |
So yes, I think it should be web compatible to keep the spec as is.
PS: I should check that we have a test for that in the test suite, or add one based on the one above.
The Working Group just discussed Hyphenation usages in CJK, and agreed to the following resolutions:
RESOLVED: No change on issues #785
Safari uses the platform's (CoreFoundation's) hyphenation dictionaries. It's valuable for Web content's hyphenation to match the rest of the system.
@litherum makes sense to me. The argument I used (words commonly used in Japanese using the Latin alphabet should be in the Japanese hyphenation dictionary) is equally valid outside of the web. Exactly how much to include is of course up to the UA (or the OS, if that's how the UA wants to do it).
Turned the test from https://github.com/w3c/csswg-drafts/issues/785#issuecomment-374866706 into a wpt test case https://github.com/web-platform-tests/wpt/pull/13415. Changing labels accordingly.
The i18n WG agrees to close this issue, and has closed its tracker.
Got feedback from Japanese web developers that there's no good way to minimize expansion of justified text.
If the text is purely CJK, there won't be much expansion, mostly only at break exceptions, but when long English words appear in the CJK text, lines may expand too much. He then:
hyphens: auto, but since pages have lang=ja, hyphenation doesn't work.
Tried word-break: break-all, but Gecko and WebKit ignore CJK line breaking rules, and hit a bug in Blink.
break-all) is technically possible but don't want to take that route.
So I'm thinking a few possible ideas:
hyphens-lang to override the languages only for hyphenations.
Thoughts?
/cc @upsuper @fantasai @frivoal @r12a