w3c / jlreq

Text Layout Requirements for Japanese
https://w3c.github.io/jlreq/

Proposal from Eric Muller: re: expanding JLReq character class to Unicode #242

Open kidayasuo opened 3 years ago

kidayasuo commented 3 years ago

Eric Muller posted a proposal on the admin list. I am copying it here to track discussions.

A sub-issue of #240.

2020/10/19 8:23, Eric Muller emuller@amazon.com wrote:

Here is our perspective as implementers. It is a bit raw (sorry, we noticed the announcement a bit late), don't hesitate to reach out for clarification.

Eric.


Character classes serve two purposes: linebreak opportunities and spacing around characters.

Linebreak opportunities are adequately handled by Unicode currently, at most needing some adjustment in UAX14 or in the CLDR language tailorings. Therefore that use is not discussed here.


A possible spacing model is that there is glue (variable space) on each side of each grapheme cluster occurrence. This glue is characterized by its natural width (JLREQ appendix B) and can be deformed (either compressed - JLREQ appendix D - or expanded - JLREQ appendix E) to achieve justification.

While each glue occurrence could be specified explicitly via markup, it can be determined most of the time from its context, using classes: for a left glue, by the class of what's on the left of the grapheme cluster occurrence and by the class of the grapheme cluster occurrence itself; and similarly for a right glue, by the class of the grapheme cluster occurrence and by the class of what's on the right of the grapheme cluster occurrence.
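As a minimal sketch of this lookup (not the proposal's actual tables), the two glues of a grapheme cluster occurrence can be read from tables keyed by a pair of classes. The class names (`line_start`, `line_end`, `ideographic`, `opening_bracket`, `closing_bracket`), the `Glue` fields, and most widths are illustrative assumptions; only the 0 and 0.5em values at line start and line end echo the default setting mentioned later in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Glue:
    natural: float    # natural width in em (the role of JLREQ appendix B)
    shrunk: float     # smallest width when compressed (appendix D)
    stretched: float  # largest width when expanded (appendix E)


NO_GLUE = Glue(0.0, 0.0, 0.0)

# Hypothetical entries; keys are (class of what is on the left, class of the cluster).
LEFT_GLUE = {
    ("ideographic", "opening_bracket"): Glue(0.5, 0.0, 0.5),
    ("line_start", "opening_bracket"): Glue(0.0, 0.0, 0.0),
}

# Hypothetical entries; keys are (class of the cluster, class of what is on the right).
RIGHT_GLUE = {
    ("closing_bracket", "ideographic"): Glue(0.5, 0.0, 0.5),
    ("closing_bracket", "line_end"): Glue(0.5, 0.0, 0.5),
}


def left_glue(left_neighbor_class, own_class):
    """Glue on the left side of an occurrence, chosen from the class pair."""
    return LEFT_GLUE.get((left_neighbor_class, own_class), NO_GLUE)


def right_glue(own_class, right_neighbor_class):
    """Glue on the right side of an occurrence, chosen from the class pair."""
    return RIGHT_GLUE.get((own_class, right_neighbor_class), NO_GLUE)
```

Explicit markup, where available, would simply bypass these tables for a given occurrence.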

What's on the left (or right) of a grapheme cluster occurrence may be another grapheme cluster occurrence, in which case the class of "what's on the left" is the class of that other grapheme cluster occurrence. But it can also be that there is no other grapheme cluster occurrence on the left, or there is some intervening graphical element, thus leading to classes:

The class of a grapheme cluster occurrence could also be specified explicitly by markup, but it can often be determined from the characters composing the grapheme cluster occurrence (at which point, it is the same for all occurrences of a given grapheme cluster). That can in turn be determined from classes assigned to the characters in the grapheme cluster. Generally, the base character of a grapheme cluster determines the class of the grapheme cluster, but there are cases where the other characters "dominate" the determination: for example, <U+00A0 NO-BREAK SPACE> may be in a class, and <U+00A0 U+0301 COMBINING ACUTE> may be in a different class.

Finally, we arrive at the classes of characters. Below is a proposed assignment for the whole Unicode repertoire. This classification mostly aligns with that of JLREQ, with a few differences:

The proposed assignment also mentions the UAX50 vertical orientation property, as it is closely aligned and informs the spacing class assignment.


Ambiguous characters

While most characters are unambiguously in a class, regardless of their context, a few characters common in Japanese typography are inherently ambiguous:

U+2018 ‘ LEFT SINGLE QUOTATION MARK
U+201C “ LEFT DOUBLE QUOTATION MARK
U+00AB « LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
U+2019 ’ RIGHT SINGLE QUOTATION MARK
U+201D ” RIGHT DOUBLE QUOTATION MARK
U+00BB » RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
U+2010 ‐ HYPHEN
U+2013 – EN DASH
U+203C ‼ DOUBLE EXCLAMATION MARK
U+2047 ⁇ DOUBLE QUESTION MARK
U+2048 ⁈ QUESTION EXCLAMATION MARK
U+2049 ⁉ EXCLAMATION QUESTION MARK
U+00B7 · MIDDLE DOT
U+2022 • BULLET
U+2014 — EM DASH
U+2026 … HORIZONTAL ELLIPSIS
U+2025 ‥ TWO DOT LEADER

A possibility is to resolve those based on the locale, or their resolved script (itself determined by looking at the script of the adjacent character).

The locale method has the downside that authors are not always tagging their text appropriately (either not at all, or not carefully on punctuation).

The script method has the advantage of not requiring the author's help, and that computation is already necessary in OpenType layout engines.
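A rough sketch of the script method, under stated assumptions: Python's standard library does not expose the Unicode Script property, so `rough_script` below is a crude stand-in based on character names, and the rule that opening marks look at the following character while other marks look at the preceding one is an illustrative choice, not something the proposal specifies.

```python
import unicodedata

# Opening marks from the list above; they are resolved from the text that follows.
AMBIGUOUS_OPENING = {"\u2018", "\u201c", "\u00ab"}


def rough_script(ch):
    """Crude stand-in for the Unicode Script property, based on character names."""
    name = unicodedata.name(ch, "")
    if name.startswith(("CJK", "HIRAGANA", "KATAKANA")):
        return "japanese"
    if name.startswith("LATIN"):
        return "latin"
    return None  # punctuation, spaces, and anything else this sketch does not cover


def resolved_script(text, i):
    """Script of the nearest character with a definite script, preferred side first."""
    step = 1 if text[i] in AMBIGUOUS_OPENING else -1
    for direction in (step, -step):
        j = i + direction
        while 0 <= j < len(text):
            script = rough_script(text[j])
            if script is not None:
                return script
            j += direction
    return None
```

For example, in `"日本語\u201cEnglish\u201d日本語"` both quotation marks resolve to `"latin"`, whatever language tags the author did or did not apply.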


Inseparables

Currently, all inseparables are lumped in a single class, and a footnote explains that the behavior inseparable/inseparable applies only to two occurrences of the same inseparable. It would be better to have separate classes for inseparables. Not only does that avoid a footnote, but it also means that one can specify different glues for e.g. ideographic/inseparable_emDash and ideographic/inseparable_twoDotLeader, or specify different glues for inseparable_emDash/inseparable_twoDotLeader and inseparable_emDash/inseparable_ellipsis.


Logical vs visual order:

It should be made clear that the practical definition of glues is in the visual space: that's why we used the terms "left" and "right".


Classes as a Unicode property

From a practical point of view, I believe that the spacing class should be part of the Unicode Character Database, as a property, just like the vertical orientation property. The main reason is that this is the most reliable way to get something well defined (in the sense of having a definition, not necessarily in the sense of having correct values), and in sync with the Unicode repertoire. It is a relatively easy task for Unicode, as has been demonstrated with the vertical orientation property. (In fact, the very first draft of what became UAX50 included the spacing class.)

It is worth noting that such a Unicode property is only a starting point. As noted earlier, markup should always be available to influence the determination of the glue. Thus there is no need for such a Unicode property to be perfect; it does however need to be easily accessible and fairly stable.

======== Classes and glue settings

The classes are only one part of the final visual appearance: the glue settings also come into play, so it is worth discussing those a bit, as they may influence the design of the classes.


Glue settings and justification

When justifying text (a common case for body text), implementations may have to expand a glue to an arbitrary width. Consider for example a two-character paragraph with text-align-last: justify: the glue has to be (linewidth - 2em). While large glues are sometimes the result of pathological conditions, they can also be explicitly intended, such as in jidori processing. Thus it is desirable to allow pretty much all glues to grow indefinitely.
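The arithmetic of that example can be written out as a tiny sketch. The function name and the 20em line width are hypothetical; the assumption is that every grapheme cluster is 1em wide, as for full-width characters.

```python
def required_glue_em(line_width_em, cluster_count, cluster_width_em=1.0):
    """Total glue (in em) that justification must distribute on one line."""
    return line_width_em - cluster_count * cluster_width_em


# A two-character paragraph on a (hypothetical) 20em line leaves 18em of glue.
two_char_glue = required_glue_em(20.0, 2)
```

Nothing in this computation bounds the result, which is why a fixed maximum on any single glue cannot hold in general.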


Glue settings are mostly for body text

JLREQ currently describes three glue settings (default, JIS, and book, in tables 3-5 of appendix D; they differ only on the behavior when compressing lines, but in principle different settings could also differ on natural width or when expanding lines). It seems that those settings are mostly concerned with body text, and are not appropriate for, e.g., titles. For example, the default method specifies 0 glue between paragraph (line) start and an opening bracket, and 0.5em between a closing bracket and a paragraph (line) end; for a title starting and ending with brackets, which happens to be set on two lines (centered and not justified), this asymmetry can be jarring.

It would be worth stating that the settings apply to body text and mentioning when they are not appropriate, or even better, including settings for other situations. The most important situations that come to mind: titles, and ruby base/ruby text.


Interchange of glue settings

The discussion so far has been about determining the classes from characters, leaving room for document styling systems (e.g. CSS) to let authors explicitly specify classes of occurrences. The classification is of course only one part of the final result, the other being the glues that result from those classes (i.e. JLREQ appendices B, D, E). It would be useful to encourage document styling systems to allow the specification of the glues as well, in the documents, either in the form of selecting from a predetermined set of settings, or by completely specifying the settings (maybe as a delta on top of the predetermined settings).


Spacing classes and the CSS text-indent property.

With the model presented above, the CSS text-indent property is essentially an unconditional, invariable glue to the left of the first grapheme in a paragraph. In practice, it is useful in Japanese typography to make that glue at least conditional: e.g. 1em before an ideograph, and 0.5em before an opening bracket. I think the best way forward is to recommend that for paragraphs using the spacing model discussed here, that glue be controlled by the spacing model (i.e. the mojikumi tables) and that text-indent be set to 0.
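A small sketch of such a conditional paragraph-start glue follows. The two widths are the ones given above; the class names and the 1em fallback for other classes are assumptions made for illustration.

```python
# Indent in em, keyed by the class of the first grapheme cluster of the paragraph.
PARAGRAPH_START_GLUE_EM = {
    "ideographic": 1.0,
    "opening_bracket": 0.5,
}


def paragraph_start_glue_em(first_class, default_em=1.0):
    """Glue to the left of the first cluster; default_em is an assumed fallback."""
    return PARAGRAPH_START_GLUE_EM.get(first_class, default_em)
```

With text-indent set to 0, this single entry point would be the only source of paragraph indentation.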

========

Columns:

===========

kidayasuo commented 3 years ago

Eric, thank you very much for your proposal.

I have several comments / questions regarding your proposal. I would greatly appreciate it if you could clarify.

grapheme cluster vs grapheme cluster occurrence

What is the difference between A and an occurrence of A in this context? (Maybe this is simply a novice English question.)

it distinguishes the class used in horizontal and in vertical texts

What is the benefit of having different classes between horizontal and vertical? I would appreciate it if you could elaborate here a bit.

The locale method has the downside that authors are not always tagging their text appropriately (either not at all, or not carefully on punctuation).

I could not figure out what “not carefully on punctuation” means…

but it also means that one can specify different glues for e.g. ideographic/inseparable_emDash and ideographic/inseparable_twoDotLeader, or specify different glues for inseparable_emDash/inseparable_twoDotLeader and inseparable_emDash/inseparable_ellipsis.

I can see why separating each inseparable makes sense, because that way you can construct a state machine. Regarding your second reasoning, i.e. ability to define different glues, do you have concrete examples of why such ability is a good thing?

As noted earlier, markup should always be available to influence the determination of the glue. Thus there is no need for such a Unicode property to be perfect; it does however need to be easily accessible and fairly stable.

I am not sure if I understood the discussion here. There will be systems / applications where such markup is not possible (due to limitation of the underlying engine, or limitation of the UI), and if so, existence of a markup could not be the reason why the property does not need to be perfect. I am not necessarily saying it needs to be perfect, as it can’t be.

The classes are only one part of the final visual appearance: the glue settings also come into play, so it is worth discussing those a bit, as they may influence the design of the classes.

Do you have examples of the glue design giving influence on the design of the character classes?

I think the best way forward is to recommend that for paragraphs using the spacing model discussed here, that glue be controlled by the spacing model (i.e. the mojikumi tables) and that text-indent be set to 0.

Do you suggest that there will be a part of text where this model is applied and another part of text where it is not applied, instead of defining everything in one spacing model? If you have multiple models, especially in one document, it would become harder to, for example, set a uniform style, e.g. indentation.

Columns: I have not yet carefully looked at the character class assignments you proposed but how did you handle JLReq classes that are dependent on the layout, such as ruby base (cl-20, 21, 22, 23, 24, 25, 28, 29, 30)?

0x000021 | R | H | westernChar
0x000080 | R | H | V | unknown

Is it correct to interpret that code points between 0x000021 and 0x000080 follow properties specified by 0x000021?

Lastly it would be great if you could provide a cross reference between JLReq classes and classes in your proposal.

Thank you!

emuller-amazon commented 3 years ago

grapheme cluster vs grapheme cluster occurrence

What is the difference between A and an occurrence of A in this context? (Maybe this is simply a novice English question.)

In the string "moto", we have the graphemes "m", "o" and "t". The grapheme "o" occurs twice. In this particular example, there is no reason to treat the two occurrences differently, but it may be useful in other cases. For example, a U+2019 could occur (along with a U+2018) to quote some Japanese text, in which case that occurrence could be cl-02 closing bracket, and there could be another occurrence, in the same line, where it is an apostrophe in an English word (e.g. in "don’t"); that other occurrence could be a cl-27 western character.

What is the benefit of having different classes between horizontal and vertical? I would appreciate it if you could elaborate here a bit.

Take the case of U+00C6 Æ. It is my understanding that when in horizontal text (or set sideways in vertical text), the most likely desired behavior is cl-27 western char, whereas when it is upright in vertical text, the most likely desired behavior is cl-19 ideographic character.

I could not figure out what “not carefully on punctuation” means…

Consider some text: [japanese]“[english]”[japanese]. If we use the locale to determine the class of the occurrences of “ and ”, then we are making a distinction between:

<span xml:lang='ja'>[japanese]<span xml:lang='en'>“[english]”</span>[japanese]</span>

and

<span xml:lang='ja'>[japanese]“<span xml:lang='en'>[english]</span>”[japanese]</span>

i.e. on whether the quotes are inside or outside the English span. I don't think we can rely on authors to master the difference between the two, especially if they use a wysiwyg editor, where the difference is difficult to "see".

I can see why separating each inseparable makes sense, because that way you can construct a state machine.

Yes. In JLREQ, right now, this is handled by a footnote. But every footnote means an "exception" in the code.

Regarding your second reasoning, i.e. ability to define different glues, do you have concrete examples of why such ability is a good thing?

More flexibility for the book designer can only be a good thing. As a concrete example, consider the pair <ideograph> <emdash>. The glue between those two can grow for justification purposes. It seems reasonable to have the glue between <vertical kana repeat mark lower half> and <emdash> grow in the same way; but because they are currently both inseparable, you can't express that (or you have to use a footnote...).

As noted earlier, markup should always be available to influence the determination of the glue. Thus there is no need for such a Unicode property to be perfect; it does however need to be easily accessible and fairly stable.

I am not sure if I understood the discussion here. There will be systems / applications where such markup is not possible (due to limitation of the underlying engine, or limitation of the UI), and if so, existence of a markup could not be the reason why the property does not need to be perfect. I am not necessarily saying it needs to be perfect, as it can’t be.

I definitely have a bias for environments with markup (whether edited directly - e.g. HTML in a text editor - or indirectly - e.g. InDesign, with UI to effectively edit the underlying markup). But I also think it is the dominant use case (even here in github I have some markup). And I am not sure one can go very far without markup, especially in Japanese typography.

I do however think that both plain text and markup need something accessible and stable more than something perfect.

Do you have examples of the glue design giving influence on the design of the character classes?

Keeping in mind that I excluded line breaking, the classes serve only to determine what glue to use. So if you completely ignore the glue, there is no need to have classes. A bit more concrete: if Hiragana and Katakana always have the same glue behavior, then there is no need to have two different classes. When defining JLREQ, the classes are a consequence of the glues (of course, when implementing JLREQ, the glues are a consequence of the classes).

Do you suggest that there will be a part of text where this model is applied and another part of text where it is not applied, instead of defining everything in one spacing model? If you have multiple models, especially in one document, it would become harder to for example set an uniform style, e.g. indentation.

The use case I have in mind is mixed-language documents, e.g. Japanese books with extensive (paragraph-level) quotes in English, or dual-language books (e.g. left page in Japanese, right page in English).

Nobody is forced to use different mojikumi settings (or even Japanese vs. non Japanese spacing) in a document, but some people need to.

I have not yet carefully looked at the character class assignments you proposed but how did you handle JLReq classes that are dependent on the layout, such as ruby base (cl-20, 21, 22, 23, 24, 25, 28, 29, 30)?

In general, the context will first determine the class of an occurrence. When the context does not provide the answer, then we go to the mapping character → class. So when we have ...X<ruby>...</ruby>..., at the position between X and <ruby>, the class on the left comes from the class of X, and the class on the right comes from <ruby>, i.e. this is a cl-27 / cl-22. In this particular case, the markup had another purpose, but one could have markup where the only effect would be to set the classes.

To be more complete, because we are talking presentation, CSS rather than HTML would be the proper place, and the default HTML stylesheet would have something like ruby { jlreq-class: cl22 }. (well; it's a bit more complicated, but you get the idea).
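The "context first, then character" order can be sketched as follows. This is only an illustration: `CHAR_CLASS` is a hypothetical two-entry excerpt of the character → class mapping, and `context_class` stands for whatever the markup or stylesheet supplies for an occurrence.

```python
# Hypothetical excerpt of the character -> class mapping (JLREQ class ids).
CHAR_CLASS = {
    "漢": "cl-19",  # ideographic character
    "a": "cl-27",   # western character
}


def occurrence_class(ch, context_class=None):
    """Class of one occurrence: the context wins, the character mapping is the fallback."""
    if context_class is not None:
        return context_class
    return CHAR_CLASS.get(ch, "unknown")


# X followed by a ruby element: the left class comes from the character,
# the right class comes from the markup.
pair = (occurrence_class("a"), occurrence_class("漢", context_class="cl-22"))
```

Here `pair` is the cl-27 / cl-22 combination described above.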

0x000021 | R | H | westernChar
0x000080 | R | H | V | unknown

Is it correct to interpret that code points between 0x000021 and 0x000080 follow properties specified by 0x000021?

Yes.
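Read that way, the table is a list of range starts, and a lookup finds the last entry at or below the code point. A minimal sketch, using only the two rows quoted in this thread and keeping only the class column (the function name and the `None` result below the first row are assumptions):

```python
from bisect import bisect_right

# Each row starts a range that runs up to, but not including, the next row's start.
RANGE_STARTS = [0x000021, 0x000080]
RANGE_CLASSES = ["westernChar", "unknown"]


def spacing_class(code_point):
    """Class of the last row whose start is <= code_point, or None before the first row."""
    index = bisect_right(RANGE_STARTS, code_point) - 1
    if index < 0:
        return None
    return RANGE_CLASSES[index]
```

So `spacing_class(ord("A"))` and `spacing_class(0x7F)` both give `"westernChar"`, while `spacing_class(0x80)` gives `"unknown"`.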

Eric.

PS: thanks for helping with the translation

(Notes from @xfq: I made some edits to this comment to make the examples clearer.)

kidayasuo commented 3 years ago

Thank you Eric for the clarifications. They cleared up most of my questions.

Regarding the necessity of markup vs spacing by Unicode property (which you explained does not need to be perfect), I always have a desire to bring up the quality of "everyday text", such as email / github (for some people :) / notes, etc., where people generate and read text every day and almost never use markup. In that sense I am hoping that the layout will be pretty good without any markup.

Regarding the mixed language case, do you think it is possible to cover spacing of all languages, including English, in this single model? It would make things much simpler.

kidayasuo commented 3 years ago

oh, @emuller-amazon san one more. A few people asked if the "glue" in your proposal and "glue" concept used in TeX (or variations of it) are the same thing. If there are differences a comparison might help people understand your proposal better. thanks

emuller-amazon commented 3 years ago

Re: glue.

Yes, the concept is similar to TeX, but needs some refinement for Japanese, to account for JLREQ's priority between the various pairs of classes. That is, we need some way to model the fact that the glues in, e.g., cl-04/cl-21 expand before the glues in cl-19/cl-19 (i.e. the color coding in appendices D & E).

What I ended up with is that the width of a glue is a piecewise linear, monotonic, increasing, continuous function of some parameter t. This parameter t corresponds to how much the glue is compressed (t < 0) or expanded (t > 0). At t=0, the width is given by appendix B. I arbitrarily decided that the first step of expansion is between t=0 and t=1, the second step (red in App. E) is between t=1 and t=2, the third step (blue in App. E) is between t=2 and t=3, etc. Similarly for the compression and App. D.

The width of a letter is just a constant function.

Then you can sum up all the glues and letters on a line, by doing the point-wise addition of those functions, and you have the width L(t) of that line. It really captures all the (scalar) width the line could take, depending on how much you compress or expand it.

The addition of piecewise linear, monotonic, ... functions is also a piecewise linear, monotonic, ... function. To justify the line to a target width W, you just need to figure out the value of t_justified such that L(t_justified) = W, and that's easy because of the nature of the functions. Then, for each glue in the line, its width is w(t_justified).
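A minimal sketch of this model, under stated assumptions: each width function is a sorted list of (t, width) breakpoints, held constant outside its first and last breakpoint, and L(t) = W is solved by plain bisection instead of walking the breakpoints exactly. The names, the search interval [-3, 3], and the sample widths are all illustrative.

```python
def width_at(breakpoints, t):
    """Evaluate a piecewise linear width function given as sorted (t, width) pairs."""
    if t <= breakpoints[0][0]:
        return breakpoints[0][1]
    for (t0, w0), (t1, w1) in zip(breakpoints, breakpoints[1:]):
        if t <= t1:
            return w0 + (w1 - w0) * (t - t0) / (t1 - t0)
    return breakpoints[-1][1]


def line_width(items, t):
    """L(t): point-wise sum of the width functions of all letters and glues."""
    return sum(width_at(breakpoints, t) for breakpoints in items)


def solve_t(items, target, lo=-3.0, hi=3.0, iterations=60):
    """Smallest t in [lo, hi] with L(t) >= target, or None if target is out of reach."""
    if not line_width(items, lo) <= target <= line_width(items, hi):
        return None
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if line_width(items, mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


LETTER = [(0.0, 1.0)]                                # constant 1em
PUNCT_GLUE = [(-1.0, 0.0), (0.0, 0.5)]               # compresses in the first step
IDEO_GLUE = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.25)]    # expands only in the second step

line = [LETTER] * 4 + [PUNCT_GLUE] + [IDEO_GLUE] * 3  # natural width 4.5em at t=0
```

With these sample values, `solve_t(line, 4.25)` is about -0.5 (only the punctuation glue shrinks), and `solve_t(line, 5.0)` is about 1.67 (nothing moves during the first expansion step, then the three ideograph glues share the extra 0.5em). Each glue's final width is `width_at(glue, t_justified)`.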

My copy of JIS X 4051 is at the office, but IIRC, it has an appendix that has a whole bunch of equations, that correspond to appendix B, D, E of JLREQ. Those equations are a bit convoluted, but what I described is just a restatement of those equations.

There is a little complication because some of the glues are not continuous: the glue between a punctuation mark and line end can only be 0 or 1/2em, nothing in between. But it's only at the end of lines, so I special cased it. I think it should be possible to deal with non-continuous functions, but I have not done that yet.

All that also works well with ruby (including jukugo, and all the various overhang conditions), and I think it even works well with multiple ruby lines but I have not tried yet. Exercise left to the reader. Hint: model ruby as a graph where nodes are positions on the main line and on the ruby line(s), and arcs represent constraints of the form "this position must be to the left of that position". Compute the width of the graph much like the width of the line above, and use a point-wise max() for those nodes that have multiple incoming arcs. Solve as before. Use that t_justified for the longest path(s) in the graph, and justify/align independently the pieces that are not on the longest path (typically, ruby texts or ruby bases).

emuller-amazon commented 3 years ago

Re: plain text vs. markup

Of course, I would want something that works for plain text as well. Even with markup, there is a strong desire to minimize the quantity of markup that is needed, so there is no opposition on the goal between markup and plain text.

But we also need to realize that all the "virtual" classes of JLREQ (cl-18 math operator, etc; more or less, the classes that have to be deduced from the context) will be the result of either markup or heuristics, and for plain text we only have heuristics. What I am not excited about is spending too much time building complicated (and therefore delicate) heuristics (but somebody else can), and I would not be happy if the support of those heuristics makes the markup case more complicated. That could happen if the author of markup ends up asking himself all the time "does the heuristic work in this case, or do I need to add markup".

emuller-amazon commented 3 years ago

Re: mixed languages.

I don't think we have an effective technology for Arabic justification (yes, OpenType has the JSTF table, but I don't think it is viable). So we need to set that aside.

But other than that, all the other justification systems I know of are subsets of the Japanese case: the glue stuff I described earlier can be used to deal with space and letterspacing of Latin, which is pretty much what's used everywhere outside of Japanese/Chinese. Just rename cl-27 to "non-japanese characters" (of course, something better, but you get the idea).