[css-text] Need additional value of word-break for Korean

frivoal commented 5 years ago

In regular situations, word-break: normal is expected to pick the right kind of word breaking for various scripts, keeping letters of a word together in languages that have word-based line breaking, while allowing wraps between the letters of a word in languages where that's the normal behavior.

However, Korean typography has been evolving. The normal value corresponds to what used to be normal (allowing wraps in the middle of words), and needs to continue to have this behavior for compat reasons, but the preferred behavior is increasingly the one achieved by keep-all.

In a document where all parts are properly language tagged, * { word-break: normal; } :lang(ko) { word-break: keep-all; } achieves the desired behavior.

However, this is not quite enough to solve the problem in the case of documents with user-generated content: when a user types content in a textarea or a contenteditable (or if user-generated content is retrieved from a database), the author of the page does not generally know what the language is, and cannot tag it in the markup. The following options are available to them, none of them great:

- Use word-break: normal on elements accepting user input. This will do "the right thing" for all languages, except for that style of Korean, which will break too often.
- Use word-break: keep-all on elements accepting user input. This will do "the right thing" for space-separated languages, including that style of Korean, but will badly break languages like Japanese or Chinese, by disabling wrapping opportunities and causing potential overflow.
- Use word-break: normal on elements accepting user input, but also add a piece of JavaScript that monitors the content for changes, and switches the whole element to word-break: keep-all if any Hangul text is detected:
  - This breaks if the content input by the user contains a mixture of Korean and languages like Japanese or Chinese, as it would apply keep-all to them as well.
  - This isn't a purely declarative solution, so it fails if JavaScript is disabled.
- Use * { word-break: normal; } :lang(ko) { word-break: keep-all; } together with a piece of JavaScript that adds the lang=ko attribute (and creates spans/divs as necessary to apply it) on the parts of the text input by the user that contain Hangul, and lang="" (or lang=somethingelse, if the somethingelse can be detected reliably) on parts that don't:
  - Getting this script right is very difficult. Not merely because of how it must analyse the content and adjust the markup accordingly, but also because of how it would need to integrate with editing operations: how to make these DOM modifications inside a contenteditable in a way that is compatible with the browser's undo stack? How to make them in a way that doesn't interfere with ongoing IME operations? How to make them in a way that is compatible with the hodgepodge of markup that different browsers may generate inside a contenteditable element? Etc.
  - Getting this script to be correct AND performant is even harder. But performance is important: not all user input is tweet-sized. Think for instance of an online document editor, which may contain multiple pages of (multilingual) rich text.
  - This isn't a purely declarative solution, so it fails if JavaScript is disabled.
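For illustration, the third option (switching the whole element as soon as any Hangul is detected) could be sketched roughly as follows. The helper names are hypothetical and not from any library; the only platform feature relied on is the Unicode property escape `\p{Script=Hangul}` in regular expressions. The last example shows the drawback listed above: a mostly Japanese sentence containing a little Hangul gets keep-all for the whole element.

```javascript
// Hypothetical helpers, shown only to illustrate the "whole element" approach.
// Matches any single character whose Unicode Script property is Hangul.
const HANGUL = /\p{Script=Hangul}/u;

// True if the text contains at least one Hangul character.
function containsHangul(text) {
  return HANGUL.test(text);
}

// Picks one word-break value for the entire element, based on its text.
function wordBreakForWholeElement(text) {
  return containsHangul(text) ? "keep-all" : "normal";
}

console.log(wordBreakForWholeElement("Hello world"));          // normal
console.log(wordBreakForWholeElement("한국어 문장"));            // keep-all
// Mostly Japanese, but the few Hangul syllables flip the whole element:
console.log(wordBreakForWholeElement("ソウル特別市(서울)は"));    // keep-all
```

In a page, this would presumably be called from an input event listener that assigns the result to the element's `style.wordBreak`; that wiring is omitted here.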

So, to solve this, I propose that we add a keep-all-hangul value (or just keep-hangul) that behaves the same as keep-all for the Unicode characters that correspond to Hangul, and the same as normal for everything else.

Crissov commented 5 years ago

Sounds like auto to me.

frivoal commented 5 years ago

Well, we already have an "auto" value, which is actually called normal on this particular property. I don't think having both auto and normal would be very understandable. The new value is effectively "normal-for-most-things-but-keep-all-for-hangul", so I'm shortening that to "keep-all-hangul", keeping only the way this value is different from normal in its name.

kojiishi commented 5 years ago

Does it work for Hangul/Hanja mixed-content? cc @jungshik

tabatkins commented 5 years ago

Is there a reason to give Korean a special value here, versus just changing the behavior of "normal" to better reflect current Korean writing practice?

frivoal commented 5 years ago

@kojiishi

Does it work for Hangul/Hanja mixed-content?

Depends what you mean by work. It would break between the Hanja, and not between the Hangul. This is not the ideal behavior, which would also keep Hanja of a single word together, but:

Hangul/Hanja mixed content is increasingly rare, and even when mixed, the Hanja are few.
In cases where the language is not tagged, we cannot know (without using heuristics on the whole content) if we have a case of Korean with mixed Hangul/Hanja, or a case of multilingual text with Korean mixed with Japanese or Chinese. And if it is the latter, disallowing breaks within Hanja/Hanzi/Kanji words would be problematic, as we could end up with long strings of Japanese/Chinese text with no breaks at all.
In cases where the language is tagged, it is possible to use a lang(ko) selector to apply the keep-all value.

@tabatkins

Is there a reason to give Korean a special value here, versus just changing the behavior of "normal" to better reflect current Korean writing practice?

Korean writing/typography culture is undergoing a transition. The keep-all style is increasingly common, but not universal (yet). There are authors who do continue to expect the current normal behavior, and might not be too happy if it changed.

Also, it is likely (I don't have data, but it is logically probable) that there are websites out there that would break if we changed the default:

sites with tightly sized elements, such as menus, buttons, etc, where the content wouldn't fit if line-breaking happened differently
old/traditional text didn't even have spaces at all. Even among authors who do not prefer the keep-all behavior, the modern practice is to put spaces between words. But pages containing older text may not have spaces, and a keep-all type of behavior on these would suppress all/most line breaking opportunities, which would be bad.

jihyerish commented 5 years ago

I'm curious about the reason for separating Korean from Chinese and Japanese.

In my experience, in text editors or web pages for Korean, break-all is the default result. I think this is because Hangul is a syllabic script and frequently used Korean words consist of relatively few characters.

frivoal commented 5 years ago

@jihyerish The reason why some people want that behavior is that Korean (nowadays) uses spaces between words, but Chinese and Japanese don't. Breaking words the same way in all 3 languages is the traditional way to do things, and should continue to exist (and to be the default). However, since Korean does have spaces, doing line breaking in Korean the same as in English is also something (some) people want.

jungshik commented 5 years ago

@frivoal I don't think that 'keep-all' is increasingly common. Neither do I buy your reasoning that putting inter-word space is the cause for preferring to have 'keep-all'.

The vast majority of Korean text in books, newspapers, and magazines (when the correct typographic standard is adopted) uses 'break-all', period. Some ill-typeset documents (especially those made in the 1990s by poorly i18n'ized DTP software) may use 'keep-all', but that's an aberration!!

Modern Korean orthography always dictates the use of inter-word space (over 80 years at minimum). Yet, breaking at the syllable boundary has been the norm for paragraphs.

Let me tell you what Korean web authors did in the mid-1990s when Netscape 1.x didn't do the right thing with Korean line breaking. They wrote a script to insert <wbr> between every pair of syllables to let Netscape 1.x know that there IS a line breaking opportunity at each and every syllable boundary.
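A modern reconstruction of that kind of script might look like the sketch below. It is not the script those authors used; the function name is hypothetical, and it assumes only precomposed Hangul syllables (U+AC00 to U+D7A3) matter and that the input is plain text that needs no HTML escaping.

```javascript
// Hypothetical reconstruction: inserts <wbr> between every pair of adjacent
// precomposed Hangul syllables (U+AC00..U+D7A3). Spaces and other characters
// are left untouched.
function insertWbrBetweenSyllables(text) {
  // Match a syllable that is immediately followed by another syllable,
  // and append <wbr> after it. The lookahead does not consume the next one.
  return text.replace(/([\uAC00-\uD7A3])(?=[\uAC00-\uD7A3])/g, "$1<wbr>");
}

console.log(insertWbrBetweenSyllables("한국어"));    // 한<wbr>국<wbr>어
console.log(insertWbrBetweenSyllables("한국 대학")); // 한<wbr>국 대<wbr>학
```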

keep-all does have its use. keep-all is preferred for Korean when the corresponding English text does NOT want hyphenation. That is, multi-line titles (song, movie, book, article), multi-line ad copies, etc.

However, they're exceptions rather than norm.

Changing 'word-break: normal' to behave like 'word-break: keep-all' for Korean is akin to winding the clock back to 1994 (Netscape 1.x behavior).

jungshik commented 5 years ago

One more reason 'keep-all' does not work for Korean is that some Koreans tend to be very fond of German style mega-compound words. So, instead of writing 'Korea University College of Natural Science Department of Physics' (한국대학교 자연 과학 대학 물리학과), they write 'KoreaUniversityCollegeOfNaturalScienceDepartmentOfPhysics' (한국대학교자연과학대학물리학과). I am not a fan of these mega-compound words at all, but a lot of Koreans do use them to my chagrin.

What would happen to those mega-compound words with 'keep-all'?

jungshik commented 5 years ago

keep-all does have its use. keep-all is preferred for Korean when the corresponding English text does NOT want hyphenation. That is, multi-line titles (song, movie, book, article), multi-line ad copies, etc.

Note also that Chinese and Japanese do NOT want line-breaking at any random character boundaries either, in the above cases. They also want line-breaking at word boundaries plus alpha. 'Plus alpha' is for keeping 'particles' and 'non-content bearing words' together with their content-bearing counterparts. For instance, even though 'わさだだいがくのがくせい' can be broken into 3 words, the 2nd word (の ; 'of') has to be kept together with the first word in 'titles', 'ad copies', etc.

わさだだいがくの
がくせい

Because CSS does not support this use case (it requires PoS tagging), Google has a library for this use case. See https://github.com/google/budou

Note that this is not for regular paragraphs but for multi-line titles, etc.

jungshik commented 5 years ago

Another way of saying what I wrote above is that 'justified paragraph alignment' has been the norm in Korean typesetting. Justified alignment works best with 'break-all' (break at syllable boundaries). It's similar to how English typesetting 'justified on both edges' works best with hyphenation (at syllable boundaries) enabled.

To have 'keep-all' (English equivalent of NO hyphenation) and 'justified alignment', inter-word spacing has to be adjusted (some can be rather large). In CSS, 'text-align: justify' and 'word-break: keep-all' can be used together.

There are cases where 'ragged alignment on the right' is preferred and 'keep-all' is necessary. However, they're not for regular paragraphs but for multi-line titles, ad copies, etc.

frivoal commented 5 years ago

@jungshik I want to clarify one thing: I am not proposing to change the default behavior of 'word-break: normal' to behave like 'word-break: keep-all'. You are right that "break-all" is and needs to remain the default. I am proposing to add a value, not change the behavior of one.

jungshik commented 5 years ago

Sorry for misunderstanding your proposal.

However, I don't see a strong need for that. 'keep-all' behavior is not preferred by the majority of Korean speakers in those cases (UGC where the language of the content is not known in advance) and most other cases (exceptions were noted above). Think about why <wbr> was inserted by a script to force Netscape 1.x to break on any syllable boundaries. I don't think the last 20 years have seen a large shift away from that.

css-meeting-bot commented 4 years ago

The CSS Working Group just discussed Need additional value of word-break for Korean.

The full IRC log of that discussion

<dael> Topic: Need additional value of word-break for Korean
<dael> github: https://github.com/w3c/csswg-drafts/issues/4285
<dael> florian: Reminding people: Korean traditionally written like Japanese without spaces. Now use spaces, but line-breaking has not changed where you can break like Japanese
<dael> florian: Some typographers agree in many contexts it's nice to line-break Korean like English. not everyone agrees with that. Discussion in GH shows that.
<dael> florian: We need another value because the existing 'keep-all' only works if you can lang-tag. Do we care about allowing this behavior for Korean that can't be lang tagged? I think we do.
<dael> florian: If you're writing Korean in a text editor or from a database where you don't have language tags it's tricky to tag on the fly. Amount of magic you have to do is really obnoxious.
<dael> florian: Either we say when editing this behavior is impossible or we say for the Korean alphabet you get the normal or we add keep-all-hangul
<dael> myles: putting hangul in the value doesn't make sense when you use lang
<dael> florian: But you can't put lang on contenteditable section because you don't know what will go in there. If they do a mix of languages you can't language tag. Adding spans on the fly depending on what user types is performance-wise terrible.
<dael> myles: Seems wrong level of abstraction. Wish it could be generalized. Worried eventually have 100 different lang specific values.
<dael> florian: Possibly. It's really that there are two normal behaviors so normal can't do the right thing. We need two normals. I don't think there are that many languages that need two normals
<dael> AmeliaBR: That's my concern too. Before settling on language specific keyword do more research to see if more languages have this issue. How much input have you had from general i18n experts beyond Korean use case
<dael> florian: Have not heard of any language. People who would probably know have been involved
<dael> fantasai: I think if there were other languages they would need a keyword. It's separating Hangul from Chinese and Japanese. Most other writing systems don't mix in the same way and not that many that break everywhere like this. I'm not aware of any others that alternate in the same way as Korean
<fantasai> s/keyword/separate keyword/
<tantek> q+ to suggest raising this to i18n here to get broader input from experts in more languages: https://github.com/w3c/i18n-discuss/issues
<dael> jensimmons: I like what florian is proposing. I understand concern on break from purity, but I feel like one thing web didn't do well was support international languages. This is a way web can keep up with evolving graphic design changes. Feels like a way to make sure web supports a culture and its ability to evolve instead of saying it's complicated and we don't know where it's going to go
<dael> chris: Good idea as long as clearly defined what this value does when don't meet Korean text. I think it's a thing we need. If web started in Korea we would have had this from the start
<Rossen_> q?
<dael> florian: If you're not in Hangul you do the same as keep-all
<Rossen_> ack tantek
<Zakim> tantek, you wanted to suggest raising this to i18n here to get broader input from experts in more languages: https://github.com/w3c/i18n-discuss/issues
<dael> tantek: I want to re-raise something from AmeliaBR. AmeliaBR asked how much input we had from general i18n experts. I want to raise that and propose before we resolve we file an issue on i18n discuss to get input from broader experts.
<dael> tantek: florian saying you haven't heard of other languages isn't quite sufficient
<dael> florian: Reaching out to i18n, yes. But to have a modern language that has the exact same behavior so we can name the keyword something else we need a language that by default breaks between every letter and wants to move away from that and there aren't that many.
<myles> q+
<dael> tantek: I'm saying it shouldn't be dependent on just your expertise. You may be correct, but worth getting that group to take a look.
<dael> myles: Wanted to ask if any thought given on how to impl? Like are there line breaking libraries that impl this behavior?
<dael> florian: This would need to be impl in ICU. ICU seems amenable to this but if we expand ICU would have to expand as well.
<dael> Rossen_: I hear requests to get more from i18n. florian are you okay to do that this week? To get traction or a checkmark to say it's good?
<dael> florian: Yes, I can look into this

litherum commented 4 years ago

Implementation Concern

The fact that this kind of line-breaking isn't implemented anywhere in any line-breaking utility library is concerning. It's difficult to believe that CSS is the first place where software engineers have ever wanted this line breaking behavior. I'd like to discuss this with the ICU maintainers to get their thoughts about this.

Proposal

Adding a new Hangul-specific keyword seems like the wrong design to solve this problem because the values don't stack. It's unlikely that Korean is the only language with two normal-style behaviors. Instead, if we wanted to add script-specific behaviors for languages with two normal-style behaviors, we probably would want to do it with language-specific customizations.

Perhaps something like:

word-break: normal customization(Kore, keep-all)

which would mean "Kore content uses keep-all but everything else uses normal." The first argument would be an ISO 15924 script name, not a lang tag, because this information has to be determined from the raw characters, rather than an out-of-band annotation like lang.

This way, the customizations are stackable: for languages with multiple normal-style behaviors, an author can say "I want normal2 for Korean and normal4 for this other language" when we get around to adding support for customizing that other language.

Limiting Expressiveness

The intent for this proposal is only to select which of the normal-style values should be applied for scripts which have multiple normal-style values. It isn't to select arbitrary line breaking behavior for arbitrary scripts. Therefore, it's important to limit the expressiveness of this proposal to just the cases that actually make sense. In order to limit its expressiveness, either browsers or the spec could list a set of scripts that are accepted here, and this set would initially just contain a single item. If we limit the expressivity in this way, ICU and other line breaking utilities can have flexibility to implement this feature in any way, and browsers don't need lots of custom line breaking code.

Alternative Considered

This proposal could also use Unicode blocks instead of Unicode scripts, though that would require something like customization(HANGUL_JAMO, ...) customization(HANGUL_COMPATIBILITY_JAMO, ...) customization(HANGUL_SYLLABLES, ...) customization(HANGUL_JAMO_EXTENDED_A, ...) customization(HANGUL_JAMO_EXTENDED_B, ...) in order to get all of Korean.
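To make the block-versus-script distinction concrete, here is a small sketch that classifies a character both ways. The block ranges are the standard Unicode ranges for the five blocks named above; the function names are hypothetical. One observation, offered as an aside rather than as part of the proposal: halfwidth Hangul letters have Script=Hangul but live in the Halfwidth and Fullwidth Forms block, so a block list like the one above would not cover them.

```javascript
// The five Hangul blocks named above, with their Unicode code point ranges.
const HANGUL_BLOCKS = [
  ["Hangul Jamo", 0x1100, 0x11ff],
  ["Hangul Compatibility Jamo", 0x3130, 0x318f],
  ["Hangul Jamo Extended-A", 0xa960, 0xa97f],
  ["Hangul Syllables", 0xac00, 0xd7af],
  ["Hangul Jamo Extended-B", 0xd7b0, 0xd7ff],
];

// Block-based check: returns the block name, or null if none of the five match.
function hangulBlockOf(ch) {
  const cp = ch.codePointAt(0);
  const hit = HANGUL_BLOCKS.find(([, lo, hi]) => cp >= lo && cp <= hi);
  return hit ? hit[0] : null;
}

// Script-based check, using the Unicode Script property.
function isHangulScript(ch) {
  return /^\p{Script=Hangul}$/u.test(ch);
}

console.log(hangulBlockOf("한"), isHangulScript("한"));           // Hangul Syllables true
console.log(hangulBlockOf("\u3131"), isHangulScript("\u3131"));   // Hangul Compatibility Jamo true
// U+FFA1 HALFWIDTH HANGUL LETTER KIYEOK: Hangul by script, outside the five blocks.
console.log(hangulBlockOf("\uFFA1"), isHangulScript("\uFFA1"));   // null true
```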

AmeliaBR commented 4 years ago

+1 to Myles's suggestion of considering a more extensible syntax. Maybe worth considering whether an extensible syntax could handle different typographic patterns when it comes to breaks around punctuation, as well.

litherum commented 4 years ago

This issue is interesting because it only makes sense on content that isn't perfectly language tagged. In the general case, we can't always guarantee that all content will be language tagged perfectly. This has implications for many CSS specs.

kojiishi commented 4 years ago

I was under an assumption that CSS WG has a general policy to recommend language tagging. While tagging every word looks a bit too much to me (e.g., English words appearing in Arabic/Japanese text), tagging each document should be reasonable.

frivoal commented 4 years ago

The fact that this kind of line-breaking isn't implemented anywhere in any line-breaking utility library is concerning. It's difficult to believe that CSS is the first place where software engineers have ever wanted this line breaking behavior.

I know of two places that have that.

InDesign has the two behaviors, although they're entangled with something else: depending on whether you justify or not, you get something like normal or something like keep-all-hangul.

It also exists in the Bloomberg Terminal. That previously ran on proprietary software, and now runs on a modified browser engine, which has this ability. They use the keep-all-hangul style as their default line breaking style (in at least some of the applications. I don't have access to a Bloomberg Terminal to check if it's everywhere).

frivoal commented 4 years ago

I was under an assumption that CSS WG has a general policy to recommend language tagging. While tagging every word looks a bit too much to me (e.g., English words appearing in Arabic/Japanese text), tagging each document should be reasonable.

As far as author-supplied text goes, yes. For text that comes from users of the site, I don't think it's practical. See my first comment in this issue for a number of reasons why: https://github.com/w3c/csswg-drafts/issues/4285#issue-490847663

kojiishi commented 4 years ago

I think it's practical for the site to add a few lines of code to scan the text and emit keep-all if the content has Hangul code points. They would then have better control over CSS properties.

I'm afraid we will lose the reasons to recommend language tagging, because there are cases where it is not possible to add languages without scanning the content. Why do we recommend a page to be tagged as ja when there are Kana characters?

Crissov commented 4 years ago

@litherum Why introduce something new when we can already use wildcards in the language pseudo-class and rely on “the :lang(C) pseudo-class uses the UA’s knowledge of the document’s semantics to perform the comparison” for interactive content, i. e. user input? Perhaps add a note to that effect.

:lang(ko, 
  und-Hang, mul-Hang, "*-Hang", 
  und-Kore, mul-Kore, "*-Kore"
) 
{word-break: keep-all;}

If the pseudo-class should only rely on author-supplied, explicit metadata, CSS could introduce a dedicated (highlighting) pseudo-element for writing systems or scripts:

::char(Hang, Kore) 
{word-break: keep-all;}

frivoal commented 4 years ago

I think it's practical for the site to add a few lines of code to scan the text and emit keep-all if the content has Hangul code points. They would then have better control over CSS properties.

While this would not be hard to write if you put keep-all on the whole text, it would also not serve the need: multilingual text exists, even in user generated content, and would be broken by this rule. Let's say I want to tweet / email / blog / write into a word-365 document / ... the first sentence of the Japanese wikipedia page about Seoul (https://ja.wikipedia.org/wiki/%E3%82%BD%E3%82%A6%E3%83%AB%E7%89%B9%E5%88%A5%E5%B8%82). This is mostly Japanese text, which would be broken by applying keep-all, but it contains 4 syllables of Hangul, which would trigger the kind of script you mentioned. Sure, it isn't appropriate to use this script if you know that the content is going to be in Japanese, but the author of the twitter / gmail / wordpress / office 365 / ... doesn't know what the content is going to be when it is content from users. That doesn't mean they can't have an opinion of how text ought to be typeset if it is or contains Hangul.

On the other hand, it would be hard to write the kind of script that generates a span around the bits that are in Hangul to put keep-all on, and leaves the rest alone.

kojiishi commented 4 years ago

That is exactly why I recommend scripts to do this kind of work. With script, authors have full control for how keep-all should apply. They could run full linguistic analysis or ML if they want to, or just code point check if that's good enough for them. It's flexibility and extension points CSS properties cannot provide.

I'm not happy but ok to add the value to the spec, but as @litherum said, if we were changing the policy and start saying we don't recommend language tagging for twitter/gmail/wordpress/Office/etc., I think we should apply that policy to all other CSS properties too, such as hyphens.

frivoal commented 4 years ago

The kind of script that would reliably do what keep-all-hangul (or normal customization(Kore, keep-all)) does on arbitrary content in an editable element is ridiculously hard to do. You've got to deal with interactions with the spell checkers replacing content, with IMEs, with browser undo, with on-the-fly changes to the markup caused by browser-built-in formatting commands… and it cannot be factored out into a sharable library that works on all sites, because it would need to integrate with the various rich text editing frameworks, which all tend to treat the DOM as the V of an MVC model, and would overwrite any change made there by a third-party script. This isn't practical.

kojiishi commented 4 years ago

Inserting spans is a very common technique on the web. I don't understand why it's so hard, impossible, or impractical.

As I said, I'll not make a formal objection if other browsers want to implement, but this doesn't look like a good primitive to add to the platform, and I'll be opposed to implementing it in Blink.

frivoal commented 4 years ago

Inserting spans in static content is easy. Inserting spans in (rich text) content while it is being edited is not.
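As a rough illustration of the "easy" static case only, the sketch below wraps each run of Hangul in a span carrying lang="ko", so that a :lang(ko) rule could apply keep-all to it. The function names are hypothetical, the input is assumed to be plain text (not existing markup), and none of the editing-time concerns raised in this thread (IME, undo, editor frameworks) are addressed.

```javascript
// Minimal escaping for text placed into HTML element content.
function escapeHtml(s) {
  return s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Wraps each maximal run of Hangul characters in <span lang="ko">…</span>.
// Everything else (including spaces between Korean words) is left unwrapped.
function wrapHangulRuns(text) {
  return text
    .split(/(\p{Script=Hangul}+)/u)   // capture group keeps the Hangul runs
    .filter((run) => run !== "")      // drop empty pieces at the edges
    .map((run) =>
      /^\p{Script=Hangul}/u.test(run)
        ? `<span lang="ko">${escapeHtml(run)}</span>`
        : escapeHtml(run)
    )
    .join("");
}

console.log(wrapHangulRuns("ソウル(서울)は"));
// ソウル(<span lang="ko">서울</span>)は
```

Because spaces end a run, each Korean word gets its own span; that should be harmless for this purpose, since a space is a break opportunity under keep-all anyway.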

frivoal commented 4 years ago

I was actioned to ask i18n for investigation of other languages where something like this is relevant. This is happening over there: https://github.com/w3c/i18n-discuss/issues/11

fantasai commented 1 year ago

Pending the i18n investigation, I'm seeing three ways forward here:

Option 1: Do nothing. This doesn't satisfy the use case.
Option 2: Add keep-hangul to word-break to switch Hangul to behave like Latin while leaving everything else as normal. And follow this same break-/keep- pattern with other scriptnames if we have a few other cases to address.
Option 3: Add generic keep(<scriptname>+) and break(<scriptname>+) functions to word-break. These would diff against the normal baseline behavior.

Recap: The fundamental behavior of word-break is to switch certain Letters from behaving like CJK (break behavior) to behaving like Latin (keep behavior) or vice versa: break-all applies break behavior to all letters, and keep-all applies keep behavior to all letters. The proposal here is, essentially, to allow subsetting the switch to Hangul.
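The recap can be restated as a toy model. This is a deliberate simplification, not the spec's algorithm: it sorts letters into only three classes (Hangul, other CJK, everything else treated as Latin-like), ignores scripts with other breaking rules, and treats keep-hangul as the proposed, not yet specified, value. All names are hypothetical.

```javascript
// Toy classification: Hangul, other CJK (Han, Hiragana, Katakana), or other.
function letterClass(ch) {
  if (/\p{Script=Hangul}/u.test(ch)) return "hangul";
  if (/[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}]/u.test(ch)) return "cjk";
  return "other"; // treated as Latin-like in this simplified model
}

// Returns "break" (CJK-like: may wrap between letters) or "keep"
// (Latin-like: letters of a word stay together) for a letter under a value.
function letterBehavior(wordBreak, ch) {
  const cls = letterClass(ch);
  switch (wordBreak) {
    case "break-all":   return "break";
    case "keep-all":    return "keep";
    case "normal":      return cls === "other" ? "keep" : "break";
    case "keep-hangul": return cls === "cjk" ? "break" : "keep"; // proposed value
    default: throw new RangeError(`unknown word-break value: ${wordBreak}`);
  }
}

for (const value of ["normal", "keep-all", "break-all", "keep-hangul"]) {
  console.log(value, letterBehavior(value, "a"), letterBehavior(value, "한"), letterBehavior(value, "漢"));
}
```

In this model, keep-hangul differs from normal only in the Hangul column, which is the "subsetting the switch to Hangul" described above.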

IMHO: If there aren't use cases for more than one or two values in break() and keep(), then we are better off with the one-off keywords, provided they consistently follow the behavior and syntax pattern that we would use for break() and keep() (which we should comment into the spec for future us).

nmccully commented 1 year ago

InDesign conflates the hyphenation setting with the choice of breaking anywhere or on space with Korean, which we have heard feedback about from users requesting its own distinct setting. Choosing to break on space or break anywhere does relate to other user choices, such as full-justifying the text or creating legal documents where meaning is critical and line breaks can change meaning. We have also heard that the default should move away from break anywhere to breaking on space with some user control to break where they want, for what it's worth.
