w3c / csswg-drafts

CSS Working Group Editor Drafts
https://drafts.csswg.org/

[css-text] Better describe the likely outcomes of hyphenation (editorial) #5973

Closed by r12a 1 year ago

r12a commented 3 years ago

8 Breaking Within Words https://drafts.csswg.org/css-text-4/#hyphenation https://drafts.csswg.org/css-text-3/#hyphenation

I think it would be worthwhile to add a note which explains that hyphenation should produce a number of effects, depending on the language in question, and give examples, in order to remind implementers to implement a solution that is open to cultural adaptation. These examples include:

  1. In some cases the spelling of a word needs to be changed around hyphenation, for example in Dutch cafeetje → café-tje and skiërs → ski-ers, and in Hungarian Összeg → Ösz-szeg.
  2. The symbol used to indicate that a word was broken at the end of a line is not always one that looks like a hyphen. Cree uses ᐀ [U+1400 CANADIAN SYLLABICS HYPHEN], Armenian uses ֊ [U+058A ARMENIAN HYPHEN], Balinese uses ᭠ [U+1B60 BALINESE PAMENENG], etc. Some languages, such as Malayalam or Tamil, may produce no visual marker.
  3. The location of the mark is not always at the end of the line – some languages put it at the start of the following line, such as Traditional Mongolian.
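
As a rough illustration of the three effects listed above, here is a minimal Python sketch. `HyphenationStyle`, `STYLES`, `SPELLING_CHANGES` and `break_word` are hypothetical names invented for this illustration; they are not part of CSS or of any browser. The table entries only mirror the examples in this thread, the language tags are assumed, and the spelling-change lookup ignores the break position for simplicity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HyphenationStyle:
    """Hypothetical per-language settings (not a CSS or browser API)."""
    marker: str = "\u2010"              # U+2010 HYPHEN; "" means no visible marker
    marker_at_line_start: bool = False  # True: the marker opens the following line


# Illustrative entries only, based on the examples in this thread.
STYLES = {
    "hy": HyphenationStyle(marker="\u058A"),  # U+058A ARMENIAN HYPHEN
    "cr": HyphenationStyle(marker="\u1400"),  # U+1400 CANADIAN SYLLABICS HYPHEN
    "ml": HyphenationStyle(marker=""),        # Malayalam: no visual marker
    # U+1806 MONGOLIAN TODO SOFT HYPHEN, shown at the start of the second line
    "mn-Mong": HyphenationStyle(marker="\u1806", marker_at_line_start=True),
}

# (language, unbroken word) -> (end of first line, start of second line).
# Simplification: one known break per word, so the index is not consulted.
SPELLING_CHANGES = {
    ("nl", "cafeetje"): ("café", "tje"),
    ("nl", "skiërs"): ("ski", "ers"),
    ("hu", "Összeg"): ("Ösz", "szeg"),
}


def break_word(word, index, lang):
    """Return (first line text, second line text) for a break in `word`."""
    # Languages without an entry fall back to a plain U+2010 at the line end.
    style = STYLES.get(lang, HyphenationStyle())
    # Use the altered spelling when one is listed, else split at `index`.
    first, second = SPELLING_CHANGES.get(
        (lang, word), (word[:index], word[index:])
    )
    if style.marker_at_line_start:
        return first, style.marker + second
    return first + style.marker, second
```

For example, `break_word("cafeetje", 5, "nl")` yields the two parts `café‐` and `tje`, while a Malayalam word is split with no marker on either side.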

It should also be made clear that such effects are triggered not only by browser code applying algorithms, but by &shy; or wbr (see https://github.com/w3c/csswg-drafts/issues/5972) when they fall within a range to which the hyphens property has been applied (with relevant values), and that &shy; should only produce a glyph at the end of a line that looks like a hyphen if that is appropriate for the language in question.

faceless2 commented 3 years ago

Polish is another such language, I believe? As I understand it, the OpenOffice hyphenation rules for Polish apply a hyphen to both the end of the first line and the start of the next.

As you're looking at this, we noticed that hyphenate-character allows you to override the defaults, but it doesn't allow you to specify whether your overridden character is at the end of the first line or the start of the second (or both). Two easy ways to specify this would be to give hyphenate-character an optional second argument, e.g. hyphenate-character: "" "-", or a single string with a newline to separate the two values, e.g. hyphenate-character: "-\A-". How necessary this is, I don't know.
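
To make the second suggestion concrete, here is a small Python sketch of how a newline-separated value might be interpreted. This syntax is only the proposal made in this comment, not something hyphenate-character supports, and `split_markers` is a hypothetical helper name. It assumes the CSS escape `\A` has already been decoded to a newline (U+000A).

```python
def split_markers(value):
    """Split a decoded string into (line-end marker, line-start marker).

    Assumes the hypothetical newline-separated form proposed above.
    A value without a newline is treated as a line-end marker only.
    """
    end_marker, _newline, start_marker = value.partition("\n")
    return end_marker, start_marker
```

Under this reading, `"-\n-"` gives a hyphen on both lines (the Polish case mentioned above), and `"\n-"` gives a marker only at the start of the second line.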

r12a commented 3 years ago

Here's another example of the visual marker appearing at the beginning of a line. Unicode Standard, v11, p536:

> In writing Mongolian and Todo, U+1806 MONGOLIAN TODO SOFT HYPHEN is used at the beginning of the second line to indicate resumption of a broken word. It functions like U+2010 HYPHEN, except that U+1806 appears at the beginning of a line rather than at the end.

frivoal commented 3 years ago

Do you have a suggestion for a short example that we could include, for instance right after the first paragraph of 5.4? If we do add something, it would be good to keep it short, just to illustrate the point that hyphenation can be different / more complicated than what is typical of English. But I wouldn't attempt to list too many cases. As much as I find that sort of thing interesting, css-text cannot scale to describing all the peculiarities of all the world's languages :)

frivoal commented 3 years ago

(Given that the spec's language generically allows and expects "the right thing" for all languages, automated tests in wpt might be a more effective place to highlight the specifics of various languages)

xfq commented 3 years ago

The line breaking / hyphenation of pinyin can be an example too, but it may be less common than the above examples (and may be more suitable for css-ruby?).

(Related clreq issue: https://github.com/w3c/clreq/issues/351)

fantasai commented 1 year ago

@r12a @xfq We've added a short table of examples illustrating the spelling changes (which are normatively noted in the paragraph above) here: https://drafts.csswg.org/css-text-3/#hyphenation

If you have other examples you want to add, we can do that; but please remember we're not trying to make the spec examples exhaustive. :) It might be useful to compile your more exhaustive notes into the Typography index, though, and we can link there if you want.

We also clarified the spec to say that hyphenation character changes must, and spelling changes should, apply. (The SHOULD is because, if the spelling differs between hyphenated and unhyphenated forms, depending on where the author ended up inserting the &shy;, the UA might not be able to match up the author's chosen hyphenation opportunity against its hyphenation dictionary.)
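
The matching problem behind that SHOULD can be sketched roughly as follows, in Python. This is an illustration under stated assumptions, not how any UA is known to work: `HYPHENATED_FORMS` is a hypothetical dictionary, its break offsets are invented for the sketch, and `split_at_soft_hyphen` is a hypothetical helper.

```python
SOFT_HYPHEN = "\u00AD"

# Hypothetical dictionary: (unhyphenated word, break offset) -> hyphenated parts.
# The offsets are an assumption of this sketch, not taken from a real dictionary.
HYPHENATED_FORMS = {
    ("cafeetje", 5): ("café", "tje"),
    ("skiërs", 3): ("ski", "ers"),
}


def split_at_soft_hyphen(text):
    """Split `text` at its first soft hyphen, changing spelling if matched.

    Returns None when the author supplied no soft hyphen.
    """
    before, sep, after = text.partition(SOFT_HYPHEN)
    if not sep:
        return None
    # Try to match the author's break against the dictionary entry.
    matched = HYPHENATED_FORMS.get((before + after, len(before)))
    if matched is not None:
        return matched       # spelling change applied
    return before, after     # best effort: break exactly as written
```

With these assumed entries, `"cafee\u00ADtje"` matches and becomes `café` + `tje`; `"café\u00ADtje"` (the author already wrote the hyphenated spelling) and `"cafe\u00ADetje"` (a break the dictionary does not list) do not match, so the text is split as written.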

We did not make any changes for WBR, see @frivoal's comments in https://github.com/w3c/csswg-drafts/issues/5972#issuecomment-826582035 and https://github.com/whatwg/html/issues/6326#issuecomment-826595860 . Note that if HTML does introduce a way to mark up explicit hyphenation opportunities in the future, the spec is written to be generic to such mechanisms already.

Agenda+ for CSSWG review.

css-meeting-bot commented 1 year ago

The CSS Working Group just discussed Better describe the likely outcomes of hyphenation (editorial), and agreed to the following:

The full IRC log of that discussion:
<emeyer> Topic: Better describe the likely outcomes of hyphenation (editorial)
<emeyer> github: https://github.com/w3c/csswg-drafts/issues/5973
<fantasai> -> https://github.com/w3c/csswg-drafts/issues/5973#issuecomment-1366321015
<emeyer> fantasai: We added examples
<emeyer> …and made a normative change to say a hyphenation character property applies to soft hyphenation opportunities
<emeyer> …if there’s supposed to be a spelling change in a hyphenated word, you should apply that
<emeyer> …we want the UA to make a best effort but it may or may not match up
<fantasai> normative changes -> https://github.com/w3c/csswg-drafts/commit/03935eae48ead18beed74ff665f8724c532b49a9
<emeyer> florian: Examples were added to show these sorts of situations
<emeyer> Rossen_: Anything else?
<emeyer> (silence)
<emeyer> Any objection to these changes, or do we need more time?
<emeyer> florian: We did get a heart emoji on the issue, so there’s that
<emeyer> RESOLVED: Accept changes

r12a commented 1 year ago

> @r12a @xfq We've added a short table of examples illustrating the spelling changes (which are normatively noted in the paragraph above) here: https://drafts.csswg.org/css-text-3/#hyphenation

[1] The Uighur example is missing the 'hyphen'. It should be a short baseline extension, separated from the last letter by a small space. Here's an example. It's not entirely clear how the line should be produced. Some say that the font should automatically drop and lengthen a normal hyphen, but others say you should use ـ U+0640 ARABIC TATWEEL. In the meantime, perhaps an SVG image would be better here.

[2] Although the introductory text mentions that other symbols may be used, rather than a hyphen, the list of examples doesn't back that up convincingly - it only shows hyphens. I can provide one extra example for you, but how would you like it? I can provide text, but others may not be able to see the text, or i could provide an SVG image which could be displayed at approximately normal text size.

fantasai commented 1 year ago

@r12a The backing store is actually using U+0640 but it looks like it needs some kind of thin space to create the visual separation. What do you recommend here?

r12a commented 1 year ago

How about using these images:

[Images: damydi_full, damydi_initial, damydi-final]

If that works for you, i can provide another set to show another non-hyphen based hyphenation in a different script.

r12a commented 1 year ago

PS: If you like, you can also use those images plus the following 2 for Example 18, which looks a little ragged as a bitmap.

[Images: damydi_initial_wrong, damydi_final_wrong]

frivoal commented 1 year ago

@r12a I've updated the spec to use the images you provided.

As for this:

> Although the introductory text mentions that other symbols may be used, rather than a hyphen, the list of examples doesn't back that up convincingly - it only shows hyphens. I can provide one extra example for you, but how would you like it? I can provide text, but others may not be able to see the text, or i could provide an SVG image which could be displayed at approximately normal text size.

Should we consider that the Uyghur example is using a U+0640 ARABIC TATWEEL and call it done, or do you want to supply some alternative example? If you want to offer something else, SVG is indeed good, as that provides reliable rendering.

r12a commented 1 year ago

I don't think we need to worry (for this context) about which character is used if we use the images.

The answer to the question about which character should be used – for implementers of Uighur hyphenation – is not clear, afaict even among the Unicode folks, and needs further discussion. My personal preference is to use tatweel, fwiw.

frivoal commented 1 year ago

At this point, I am not sure what the request is on the spec. Do we consider the examples already present good enough to show some diversity, or not?

r12a commented 1 year ago

@frivoal i think we're almost done, but here are some final suggestions:

  1. Per the discussion in this comment, I would like to add one more line to the Hyphenation Across Languages examples, to show a hyphenation mark in LTR text that is clearly not a hyphen. I suggest using the svg images for Plains Cree at the bottom of this comment field.
  2. I don't think it helps much to have to expand example 16 to view it – the table isn't huge, so i'd remove the details tag.
  3. The first para in 5.4 says "and visually indicating the split (usually by inserting a hyphen, U+2010)" (my emphasis). My view is that, even though we call this 'hyphenation', that is a western bias, and 'hyphenation' actually relates to splitting of words to fit on a line, and doesn't necessarily indicate that there needs to be a visual indicator. For example, Malayalam needs hyphenation rules but uses no visual indicator – see an example. I think the text could say "usually indicating the split", or we could drop that phrase.

SVG images for Cree example:

[Images: kasitaniwaninik (unbroken); kasitani- waninik (hyphenated)]

frivoal commented 1 year ago

Done (https://github.com/w3c/csswg-drafts/commit/af3f01ae51186efac25ca428d63b73c02ff080b1). Also added a test in WPT for Cree (https://github.com/web-platform-tests/wpt/pull/42523). Thanks for supplying this example.