Open frivoal opened 5 years ago
Sounds like `auto` to me.
Well, we already have an "auto" value, which is actually called `normal` on this particular property. I don't think having both auto and normal would be very understandable. The new value is effectively "normal-for-most-things-but-keep-all-for-hangul", so I'm shortening that to "keep-all-hangul", keeping in its name only the way this value differs from normal.
Does it work for Hangul/Hanja mixed-content? cc @jungshik
Is there a reason to give Korean a special value here, versus just changing the behavior of "normal" to better reflect current Korean writing practice?
@kojiishi
> Does it work for Hangul/Hanja mixed-content?

Depends what you mean by "work". It would break between the Hanja, and not between the Hangul. This is not the ideal behavior, which would also keep the Hanja of a single word together, but:
`:lang(ko)` selector to apply the `keep-all` value.
@tabatkins
> Is there a reason to give Korean a special value here, versus just changing the behavior of "normal" to better reflect current Korean writing practice?

Korean writing/typography culture is undergoing a transition. The `keep-all` style is increasingly common, but not universal (yet). There are authors who do continue to expect the current `normal` behavior, and might not be too happy if it changed.
Also, it is likely (I don't have data, but it is logically probable) that there are websites out there that would break if we changed the default:
`keep-all` behavior, the modern practice is to put spaces between words. But pages containing older text may not have spaces, and a `keep-all` type of behavior on these would suppress all/most line breaking opportunities, which would be bad.
I'm curious about the reason for separating Korean from Chinese and Japanese.
In my experience, in text editors and on web pages, `break-all` is the default result for Korean.
I think this is because Hangul is a syllabic script, and also because frequently used Korean words consist of relatively few characters.
@jihyerish The reason why some people want that behavior, is that Korean (nowadays) uses spaces between words, but Chinese and Japanese don't. Breaking words the same in all 3 languages is the traditional way to do things, and should continue to exist (and to be the default). However, since Korean does have spaces, doing line breaking in Korean the same as in English is also something (some) people want.
@frivoal I don't think that 'keep-all' is increasingly common. Neither do I buy your reasoning that putting inter-word space is the cause for preferring to have 'keep-all'.
The vast majority of Korean text in books, newspapers, and magazines (when the correct typographic standard is adopted) uses 'break-all', period. Some ill-typeset documents (especially those made in the 1990s by poorly i18n'ized DTP software) may use 'keep-all', but that's an aberration !!
Modern Korean orthography always dictates the use of inter-word space (over 80 years at minimum). Yet, breaking at the syllable boundary has been the norm for paragraphs.
Let me tell you what Korean web authors did in the mid-1990s when Netscape 1.x didn't do the right thing with Korean line breaking. They wrote a script to insert `<wbr>` between every pair of syllables to let Netscape 1.x know that there IS a line breaking opportunity at each and every syllable boundary.
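A minimal sketch of the kind of `<wbr>`-inserting script described above. This is not the historical script, only an illustration; the function name `insertWbrBetweenSyllables` is hypothetical, and the sketch assumes plain text input containing precomposed Hangul syllables (U+AC00 to U+D7A3) and does no HTML escaping.

```javascript
// Hypothetical helper (not the original 1990s script): mark every boundary
// between two adjacent precomposed Hangul syllables as a break opportunity.
function insertWbrBetweenSyllables(text) {
  // The lookbehind/lookahead pair matches the empty position between
  // two syllables, so no characters are consumed or dropped.
  return text.replace(/(?<=[\uAC00-\uD7A3])(?=[\uAC00-\uD7A3])/g, "<wbr>");
}
```

Spaces and non-Hangul characters are left untouched, so existing inter-word break opportunities are unaffected.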
keep-all does have its use. keep-all is preferred for Korean when the corresponding English text does NOT want hyphenation. That is, multi-line titles (song, movie, book, article), multi-line ad copies, etc.
However, they're exceptions rather than norm.
Changing 'word-break: normal' to behave like 'word-break: keep-all' for Korean is akin to winding the clock back to 1994 (Netscape 1.x behavior).
One more reason 'keep-all' does not work for Korean is that some Koreans tend to be very fond of German style mega-compound words. So, instead of writing 'Korea University College of Natural Science Department of Physics' (한국대학교 자연 과학 대학 물리학과), they write 'KoreaUniversityCollegeOfNaturalScienceDepartmentOfPhysics' (한국대학교자연과학대학물리학과). I am not a fan of these mega-compound words at all, but a lot of Koreans do use them to my chagrin.
What would happen to those mega-compound words with 'keep-all'?
> keep-all does have its use. keep-all is preferred for Korean when the corresponding English text does NOT want hyphenation. That is, multi-line titles (song, movie, book, article), multi-line ad copies, etc.

Note also that Chinese and Japanese do NOT want line-breaking at any random character boundary in the above cases, either. They also want line-breaking at word boundaries plus alpha. 'Plus alpha' is for keeping 'particles' and 'non-content bearing words' together with their content-bearing counterparts. For instance, even though 'わさだ だいがく の がくせい' can be broken into 3 words, the 2nd word (の ; 'of') has to be kept together with the first word in 'titles', 'ad copies', etc.
わさだ だいがく
の
がくせい
Because CSS does not support this use case (it requires PoS tagging), Google has a library for this use case. See https://github.com/google/budou
Note that this is not for regular paragraphs but for multi-line titles, etc.
Another way of saying what I wrote above is that 'justified paragraph alignment' has been the norm in Korean typesetting. Justified alignment works best with 'break-all' (break at syllable boundaries). It's similar to English typesetting, where 'justified on both edges' works best with hyphenation (at syllable boundaries) enabled.
To have 'keep-all' (English equivalent of NO hyphenation) and 'justified alignment', inter-word spacing has to be adjusted (some can be rather large). In CSS, 'text-align: justify' and 'word-break: keep-all' can be used together.
There are cases where 'ragged alignment on the right' is preferred and 'keep-all' is necessary. However, they're not for regular paragraphs but for multi-line titles, ad copies, etc.
@jungshik I want to clarify one thing: I am not proposing to change the default behavior of 'word-break: normal' to behave like 'word-break: keep-all'. You are right that "break-all" is and needs to remain the default. I am proposing to add a value, not change the behavior of one.
Sorry for misunderstanding your proposal.
However, I don't see a strong need for that. 'keep-all' behavior is not preferred by the majority of Korean speakers in those cases (UGC where the language of the content is not known in advance) and most other cases (exceptions were noted above). Think about why `<wbr>` was inserted by a script to force Netscape 1.x to break on any syllable boundary. I don't think the last 20 years have seen a large shift away from that.
The CSS Working Group just discussed `Need additional value of word-break for Korean`.
The fact that this kind of line-breaking isn't implemented anywhere in any line-breaking utility library is concerning. It's difficult to believe that CSS is the first place where software engineers have ever wanted this line breaking behavior. I'd like to discuss this with the ICU maintainers to get their thoughts about this.
Adding a new Hangul-specific keyword seems like the wrong design to solve this problem because the values don't stack. It's unlikely that Korean is the only language with two `normal`-style behaviors. Instead, if we wanted to add script-specific behaviors for languages with two `normal`-style behaviors, we probably would want to do it with language-specific customizations.
Perhaps something like:
word-break: normal customization(Kore, keep-all)
which would mean "`Kore` content uses `keep-all` but everything else uses `normal`." The first argument would be an ISO 15924 script name, not a `lang` tag, because this information has to be determined from the raw characters, rather than an out-of-band annotation like `lang`.
This way, the customizations are stackable: for languages with multiple `normal`-style behaviors, an author can say "I want normal2 for Korean and normal4 for this other language" when we get around to adding support for customizing that other language.
The intent for this proposal is only to select which of the `normal`-style values should be applied for scripts which have multiple `normal`-style values. It isn't to select arbitrary line breaking behavior for arbitrary scripts. Therefore, it's important to limit the expressiveness of this proposal to just the cases that actually make sense. In order to limit its expressiveness, either browsers or the spec could list a set of scripts that are accepted here, and this set would initially just contain a single item. If we limit the expressivity in this way, ICU and other line breaking utilities can have flexibility to implement this feature in any way, and browsers don't need lots of custom line breaking code.
This proposal could also use Unicode blocks instead of Unicode scripts, though that would require something like `customization(HANGUL_JAMO, ...) customization(HANGUL_COMPATIBILITY_JAMO, ...) customization(HANGUL_SYLLABLES, ...) customization(HANGUL_JAMO_EXTENDED_A, ...) customization(HANGUL_JAMO_EXTENDED_B, ...)` in order to get all of Korean.
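To illustrate the difference between the two approaches, here is a rough sketch that classifies a character either by the five Unicode blocks named above or by the Unicode `Script=Hangul` property. The helper names `inHangulBlock` and `hasHangulScript` are hypothetical. Note that `Kore` itself is an ISO 15924 code covering Hangul plus Han, and is not available as a regular-expression script value, so the sketch only checks the Hangul part.

```javascript
// Code point ranges of the five Unicode blocks listed in the proposal.
const HANGUL_BLOCKS = [
  [0x1100, 0x11ff], // Hangul Jamo
  [0x3130, 0x318f], // Hangul Compatibility Jamo
  [0xa960, 0xa97f], // Hangul Jamo Extended-A
  [0xac00, 0xd7af], // Hangul Syllables
  [0xd7b0, 0xd7ff], // Hangul Jamo Extended-B
];

// Block-based check: true if the first code point of `ch` falls in any block.
function inHangulBlock(ch) {
  const cp = ch.codePointAt(0);
  return HANGUL_BLOCKS.some(([lo, hi]) => cp >= lo && cp <= hi);
}

// Script-based check: a single Unicode property escape covers all of the above.
function hasHangulScript(ch) {
  return /\p{Script=Hangul}/u.test(ch);
}
```

The two checks agree on ordinary Hangul text, but they are not identical: blocks include unassigned code points, while the script property covers Hangul characters that live outside these five blocks.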
+1 to Myles's suggestion of considering a more extensible syntax. Maybe worth considering whether an extensible syntax could handle different typographic patterns when it comes to breaks around punctuation, as well.
This issue is interesting because it only makes sense on content that isn't perfectly language tagged. In the general case, we can't always guarantee that all content will be language tagged perfectly. This has implications for many CSS specs.
I was under the assumption that the CSS WG has a general policy to recommend language tagging. While tagging every word looks a bit too much to me (e.g., English words appearing in Arabic/Japanese text), tagging each document should be reasonable.
> The fact that this kind of line-breaking isn't implemented anywhere in any line-breaking utility library is concerning. It's difficult to believe that CSS is the first place where software engineers have ever wanted this line breaking behavior.

I know of two places that have that.
InDesign has the two behaviors, although they're entangled with something else: depending on whether you have justification on or off, you get something like `normal` or something like `keep-all-hangul`.
It also exists in the Bloomberg Terminal. That previously ran on proprietary software, and now runs on a modified browser engine, which has this ability. They use the `keep-all-hangul` style as their default line breaking style (in at least some of the applications; I don't have access to a Bloomberg Terminal to check if it's everywhere).
> I was under an assumption that CSS WG has a general policy to recommend language tagging. While tagging every word looks a bit too much to me (e.g., English words appearing in Arabic/Japanese text), tagging each document should be reasonable.

As far as author-supplied text goes, yes. For text that comes from users of the site, I don't think it's practical. See my first comment in this issue for a number of reasons why https://github.com/w3c/csswg-drafts/issues/4285#issue-490847663
I think it's practical for the site to add a few lines of code to scan the text and emit `keep-all` if the content has Hangul code points. They would then have better control over CSS properties.
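A minimal sketch of the "few lines of code" idea, assuming a simple code point check is considered good enough. The function name `wordBreakFor` is hypothetical; it only returns the `word-break` value to apply to the whole text.

```javascript
// Hypothetical helper: choose a word-break value for a whole piece of text,
// based only on whether any Hangul code point is present.
function wordBreakFor(text) {
  return /\p{Script=Hangul}/u.test(text) ? "keep-all" : "normal";
}
```

In a browser this could be applied with something like `element.style.wordBreak = wordBreakFor(element.textContent)`. Because the decision covers the whole element, a mostly Japanese or Chinese text containing a single Hangul syllable would also get `keep-all`.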
I'm afraid we will lose the reasons to recommend language tagging, because there are cases where it is not possible to add languages without scanning the content. Why do we recommend a page to be tagged as `ja` when there are Kana characters?
@litherum Why introduce something new when we can already use wildcards in the language pseudo-class and rely on “the `:lang(C)` pseudo-class uses the UA’s knowledge of the document’s semantics to perform the comparison” for interactive content, i.e. user input? Perhaps add a note to that effect.
:lang(ko,
und-Hang, mul-Hang, "*-Hang",
und-Kore, mul-Kore, "*-Kore"
)
{word-break: keep-all;}
If the pseudo-class should only rely on author-supplied, explicit metadata, CSS could introduce a dedicated (highlighting) pseudo-element for writing systems or scripts:
::char(Hang, Kore)
{word-break: keep-all;}
> I think it's practical for the site to add a few lines of code to scan the text and emit keep-all if the content has Hangul code points. They would then have better control over CSS properties.

While this would not be hard to write if you put keep-all on the whole text, it would also not serve the need: multilingual text exists, even in user generated content, and would be broken by this rule. Let's say I want to tweet / email / blog / write into a word-365 document / ... the first sentence of the Japanese wikipedia page about Seoul (https://ja.wikipedia.org/wiki/%E3%82%BD%E3%82%A6%E3%83%AB%E7%89%B9%E5%88%A5%E5%B8%82). This is mostly Japanese text, which would be broken by applying keep-all, but it contains 4 syllables of hangul, which would trigger the kind of script you mentioned. Sure, it isn't appropriate to use this script if you know the content is going to be in Japanese, but the author of twitter / gmail / wordpress / office 365 / ... doesn't know what the content is going to be when it is content from users. That doesn't mean they can't have an opinion on how text ought to be typeset if it is or contains hangul.
On the other hand, it would be hard to write the kind of script that generates a span around the bits that are in Hangul to put keep-all on, and leaves the rest alone.
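For reference, a rough sketch of what the span-wrapping approach could look like for a static string. It is an illustration only: the helper names `escapeHtml` and `wrapHangulRuns` are hypothetical, each maximal run of Hangul is wrapped separately, and nothing here addresses content that is being edited live, which is the case debated in this thread.

```javascript
// Hypothetical helper: escape the characters that are unsafe in HTML text.
function escapeHtml(s) {
  return s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Hypothetical helper: wrap each maximal run of Hangul in a span carrying
// word-break: keep-all, and leave everything else as escaped plain text.
function wrapHangulRuns(text) {
  // split() with a capturing group keeps the matched runs at odd indices.
  return text
    .split(/(\p{Script=Hangul}+)/u)
    .map((part, i) =>
      i % 2 === 1
        ? `<span style="word-break: keep-all">${part}</span>`
        : escapeHtml(part)
    )
    .join("");
}
```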
That is exactly why I recommend scripts to do this kind of work. With a script, authors have full control over how `keep-all` should apply. They could run full linguistic analysis or ML if they want to, or just a code point check if that's good enough for them. It's flexibility and extension points CSS properties cannot provide.
I'm not happy but ok to add the value to the spec, but as @litherum said, if we were changing the policy and started saying we don't recommend language tagging for twitter/gmail/wordpress/Office/etc., I think we should apply that policy to all other CSS properties too, such as `hyphens`.
The kind of script that would reliably do what `keep-all-hangul` (or `normal customization(Kore, keep-all)`) does on arbitrary content in an editable element is ridiculously hard to write. You've got to deal with interactions with the spell checkers replacing content, with IMEs, with browser undo, with on-the-fly changes to the markup caused by browser-built-in formatting commands… and it cannot be factored out into a sharable library that works on all sites, because it would need to integrate with the various rich text editing frameworks, which all tend to treat the DOM as the V of an MVC model, and would overwrite any change made there by a third-party script. This isn't practical.
Inserting spans is a very common technique on the web. I don't understand why it's so hard, impossible, or impractical.
As I said, I'll not make a formal objection if other browsers want to implement it, but this doesn't look like a good primitive to add to the platform, and I'll be opposed to implementing it in Blink.
Inserting spans in static content is easy. Inserting spans in (rich text) content while it is being edited is not.
I was actioned to ask i18n for investigation of other languages where something like this is relevant. This is happening over there: https://github.com/w3c/i18n-discuss/issues/11
Pending the i18n investigation, I'm seeing three ways forward here:
`keep-hangul` to `word-break` to switch Hangul to behave like Latin while leaving everything else as `normal`. And follow this same break-/keep- pattern with other scriptnames if we have a few other cases to address.
`keep(<scriptname>+)` and `break(<scriptname>+)` functions to `word-break`. These would diff against the `normal` baseline behavior.
Recap: The fundamental behavior of `word-break` is to switch certain Letters from behaving like CJK (`break` behavior) to behaving like Latin (`keep` behavior) or vice versa: `break-all` applies break behavior to all letters, and `keep-all` applies keep behavior to all letters. The proposal here is, essentially, to allow subsetting the switch to Hangul.
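The recap can be sketched as a toy model of break opportunities between two adjacent letters. This is a deliberate simplification and not the Unicode line breaking algorithm: `canBreakBetween` is a hypothetical helper, `keep-hangul` is only the value proposed in this thread, and the treatment of a Hangul letter next to a non-Hangul CJK letter is an assumption of this sketch.

```javascript
const isLetter = (ch) => /\p{L}/u.test(ch);
const isHangul = (ch) => /\p{Script=Hangul}/u.test(ch);
const isCjk = (ch) =>
  /[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}\p{Script=Hangul}]/u.test(ch);

// Toy model: may a line break between adjacent letters `a` and `b`?
// Spaces, punctuation, and all other UAX #14 rules are out of scope.
function canBreakBetween(a, b, mode) {
  if (!isLetter(a) || !isLetter(b)) return false;
  switch (mode) {
    case "break-all": // "break" behavior for all letters
      return true;
    case "keep-all": // "keep" behavior for all letters
      return false;
    case "normal": // CJK letters break, Latin-like letters keep
      return isCjk(a) || isCjk(b);
    case "keep-hangul": // proposed: like normal, but Hangul pairs are kept
      // Assumption: only a pair where BOTH letters are Hangul is kept together.
      if (isHangul(a) && isHangul(b)) return false;
      return isCjk(a) || isCjk(b);
    default:
      throw new Error(`unknown mode: ${mode}`);
  }
}
```

Under this model, two Hangul syllables break under `normal` but not under `keep-hangul`, while two Han characters break under both.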
IMHO: If there aren't use cases for more than one or two values in `break()` and `keep()`, then we are better off with the one-off keywords, provided they consistently follow the behavior and syntax pattern that we would use for `break()` and `keep()` (which we should comment into the spec for future us).
InDesign conflates the hyphenation setting with the choice of breaking anywhere or on space with Korean, which we have heard feedback about from users requesting its own distinct setting. Choosing to break on space or break anywhere does relate to other user choices, such as full-justifying the text or creating legal documents where meaning is critical and line breaks can change meaning. We have also heard that the default should move away from break anywhere to breaking on space with some user control to break where they want, for what it's worth.
In regular situations, `word-break: normal` is expected to pick the right kind of word breaking for various scripts, keeping letters of a word together in languages that have word-based line breaking, while allowing wraps between the letters of a word in languages where that's the normal behavior.
However, Korean typography has been evolving, and while the `normal` value corresponds to what used to be normal (allowing wraps in the middle of words), and needs to continue to have this behavior for compat reasons, the preferred behavior is increasingly the one achieved by `keep-all`.
In a document where all parts are properly language tagged, `* { word-break: normal; } :lang(ko) { word-break: keep-all; }` achieves the desired behavior.
However, this is not quite enough to solve the problem in the case of documents with user-generated content: when a user types content in a textarea, or a contenteditable (or if user generated content is retrieved from a database), the author of the page does not generally know what the language is, and cannot tag it in the markup. The following options are available to them, none of them great:
`word-break: normal` on elements accepting user input: This will do "the right thing" for all languages, except for that style of Korean, which will break too often.
`word-break: keep-all` on elements accepting user input: this will do "the right thing" for space separated languages, including that style of Korean, but will badly break languages like Japanese or Chinese, by disabling wrapping opportunities and causing potential overflow.
`word-break: normal` on elements accepting user input, but also add a piece of JavaScript that monitors the content for changes, and switches the whole element to `word-break: keep-all` if any hangul text is detected:
`keep-all` to them as well.
`* { word-break: normal; } :lang(ko) { word-break: keep-all; }` together with a piece of JavaScript that adds the `lang=ko` attribute (and creates spans/divs as necessary to apply it) on the parts of the text input by the user that contain hangul, and lang="" (or lang=somethingelse, if the somethingelse can be detected reliably) on parts that don't:
`contenteditable` element? etc
So, to solve this, I propose that we add a `keep-all-hangul` value (or just `keep-hangul`), that behaves the same as `keep-all` for the Unicode characters that correspond to hangul, and `normal` for everything else.