w3c / i18n-discuss

A place to hold discussions on i18n topics, and to put documents that summarise, support or initiate those discussions.

Languages / writing systems with 2 line breaking conventions in common use? #11

Open frivoal opened 4 years ago

frivoal commented 4 years ago

Are there writing-systems other than Korean/Hangul that meet the following criteria:

Context: the CSS-WG is planning to introduce a new value for the word-break property that behaves like normal except for Hangul, where it would have behavior (ii) (the same as keep-all). If this is only useful for Korean, then the name of the value can be specific to Korean (e.g. keep-all-hangul). If some other language would want to use it, then the value should be given a more generic name, and the behavior adjusted to handle that other language as well.

The reason keep-all is insufficient to serve this need is that not all content can be language tagged (for instance, user-generated content in an editable text field isn't). keep-all is neither appropriate as a default for all languages, nor is it appropriate for all content that contains any amount of Korean: multilingual content exists, and keep-all would not be appropriate for Korean mixed with Japanese (for instance). So we need a second value that's like normal, but with behavior (ii) instead of (i) for Hangul.
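To make the difference between the three modes concrete, here is a rough Python sketch of a toy model. It is not how any browser implements word-break, and it is not UAX14: the function names, the character ranges, and the exact behavior of the keep-all-hangul mode (which is only a proposed name in this thread) are all assumptions for illustration. It ignores punctuation, small kana, and every other rule that a real line breaker has to handle.

```python
# Toy model only: the mode name "keep-all-hangul" and its behavior are
# assumptions based on the proposal in this thread, not a shipped CSS value.

MODES = ("normal", "keep-all", "keep-all-hangul")


def is_hangul(ch: str) -> bool:
    """Precomposed Hangul syllables only (U+AC00..U+D7A3)."""
    return 0xAC00 <= ord(ch) <= 0xD7A3


def is_other_cjk(ch: str) -> bool:
    """Very rough stand-in for non-Hangul CJK: unified ideographs and kana."""
    cp = ord(ch)
    return 0x4E00 <= cp <= 0x9FFF or 0x3040 <= cp <= 0x30FF


def break_opportunities(text: str, mode: str) -> list[int]:
    """Return the indices i where a line break is allowed before text[i]."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    result = []
    for i in range(1, len(text)):
        prev, cur = text[i - 1], text[i]
        if cur == " ":
            continue  # never break before a space
        if prev == " ":
            result.append(i)  # always allowed after a space
            continue
        if mode == "keep-all":
            continue  # no breaks between letters at all
        if mode == "normal":
            # Hangul behaves like other CJK: breaks allowed around it.
            breakable = (is_hangul(prev) or is_other_cjk(prev)
                         or is_hangul(cur) or is_other_cjk(cur))
        else:
            # keep-all-hangul (assumed): Hangul is treated like ordinary
            # letters, while other CJK keeps its normal behavior.
            breakable = is_other_cjk(prev) or is_other_cjk(cur)
        if breakable:
            result.append(i)
    return result
```

With mixed Korean and Japanese text such as "한국어 日本語", this model allows breaks inside both words under normal, only at the space under keep-all, and at the space plus inside the Japanese word under the assumed keep-all-hangul behavior. That last combination is what neither existing value can express for untagged mixed content.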

frivoal commented 4 years ago

Additionally, if there are languages with two line breaking behaviors in common use, where the default (as in, the behavior of word-break: normal) is the other way around and which would benefit from being able to opt into a normal-with-break-all-for-a-certain-script, that too would be useful to know.

r12a commented 4 years ago

Hmm. Not sure.

http://w3c.github.io/elreq/#ethiopic_line_breaking and http://w3c.github.io/elreq/#ethiopic_hyphenation indicate that languages using the Ethiopic script break character by character, regardless of whether space or the word-separator are used between words. However, major browsers actually break on word boundaries (space or word-sep), and i'm not sure whether that might be establishing a new expectation. @dyacob any thoughts on that?

frivoal commented 4 years ago

As far as I can tell, browsers do that because Unicode tells them to: https://www.unicode.org/Public/UCD/latest/ucd/LineBreak.txt classifies Ethiopic syllables as AL, which by UAX14 prohibits breaks between pairs of such letters.
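As a rough illustration of that lookup, the Python sketch below parses lines in the `range;class` format used by LineBreak.txt and checks whether two characters are both AL. The two sample lines are written from memory in the file's format and should be verified against the real file; the function names are hypothetical, and the single rule shown (no break between two AL characters) is only a small slice of UAX14.

```python
# Sample data in the LineBreak.txt format. These two lines are assumptions
# for illustration; check them against the published file before relying
# on them.
SAMPLE = """\
1200..1248;AL # ETHIOPIC SYLLABLE HA..ETHIOPIC SYLLABLE QWA
1361;BA # ETHIOPIC WORDSPACE
"""


def parse_line_break_data(text: str) -> list[tuple[int, int, str]]:
    """Parse 'XXXX;CLASS' or 'XXXX..YYYY;CLASS' lines into (lo, hi, class)."""
    ranges = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        fields = [part.strip() for part in line.split(";")]
        cp_field, cls = fields[0], fields[1]
        start, _, end = cp_field.partition("..")
        ranges.append((int(start, 16), int(end or start, 16), cls))
    return ranges


def line_break_class(ch: str, ranges, default: str = "XX") -> str:
    """Look up the class of a single character; 'XX' means not in our data."""
    cp = ord(ch)
    for lo, hi, cls in ranges:
        if lo <= cp <= hi:
            return cls
    return default


def both_alphabetic(a: str, b: str, ranges) -> bool:
    """True when both characters are AL, the case where no break is allowed."""
    return (line_break_class(a, ranges) == "AL"
            and line_break_class(b, ranges) == "AL")


RANGES = parse_line_break_data(SAMPLE)
```

Under this sketch, two adjacent Ethiopic syllables such as U+1200 and U+1208 both resolve to AL, so a UAX14-based line breaker would not break between them, while the Ethiopic wordspace U+1361 falls in a different class.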

But given the explanation in elreq, that actually makes sense: when Ethiopic was primarily written with word separators, using a break-all style of line breaking was fine, but with the advent of spaces between words, breaking lines anywhere becomes somewhat ambiguous.

So, what elreq currently describes seems to be the historic reality that breaking between all letters was the common practice. What it doesn't say is whether there's a continued desire for this behavior.

r12a commented 8 months ago

@dyacob is it reasonable to assert that, although it is mostly used for historic text, some modern content authors of text using Ethiopic orthographies still sometimes want the line to break before the last character that fits, rather than wrapping whole words? This is so that Florian can decide whether to name his word-break property value with a generic or a Korean-specific name.

Do you know of other orthographies that behave like Korean?

Personally, i think a generic name would be best because even if modern content authors generally don't expect the text to break like Korean, people writing expository texts about archaic scripts will probably also need this.(?)

dyacob commented 8 months ago

@r12a I think that is very reasonable to say, particularly for content authors targeting web media. In print media, the desire for inner-word breaking is greater. I would imagine that other scripts that historically used a printed wordspace would behave like Ethiopic with respect to breaking.

I don't know of other scripts that behave like Korean ("unbreakable" if I'm understanding it correctly).