Proposal from Eric Muller: re: expanding JLReq character class to Unicode

kidayasuo commented 3 years ago

Eric Muller posted a proposal on the admin list. I am copying it here to track discussions.

a sub issue of: #240

2020/10/19 8:23、Eric Muller emuller@amazon.com wrote:

Here is our perspective as implementers. It is a bit raw (sorry, we noticed the announcement a bit late), don't hesitate to reach out for clarification.

Eric.

Character classes serve two purposes: linebreak opportunities and spacing around characters.

Linebreak opportunities are adequately handled by Unicode currently, at most needing some adjustment in UAX14 or in the CLDR language tailorings. Therefore that use is not discussed here.

A possible spacing model is that there is glue (variable space) on each side of each grapheme cluster occurrence. This glue is characterized by its natural width (JLREQ appendix B) and can be deformed (either compressed - JLREQ appendix D - or expanded - JLREQ appendix E) to achieve justification.

While each glue occurrence could be specified explicitly via markup, it can be determined most of the time from its context, using classes: for a left glue, by the class of what's on the left of the grapheme cluster occurrence and by the class of the grapheme cluster occurrence itself; and similarly for a right glue, by the class of the grapheme cluster occurrence and by the class of what's on the right of the grapheme cluster occurrence.

What's on the left (or right) of a grapheme cluster occurrence may be another grapheme cluster occurrence, in which case the class of "what's on the left" is the class of that other grapheme cluster occurrence. But it can also be that there is no other grapheme cluster occurrence on the left, or there is some intervening graphical element, thus leading to classes:

the beginning (or end) of a paragraph
the beginning (or end) of a line
a different bidi level (the purpose of this class is to avoid involving the bidi reordering when measuring lines)
the inside of a box with non-zero margin, border or padding
the outside of such a box
an inline object (e.g. image)
a TCY element
the outside or inside of a warichu element
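As a rough illustration of the pairing described above, the following Python sketch walks one line in visual order and yields, at every boundary, the (left class, right class) pair that would select the glues. The edge class names `lineStart` and `lineEnd` and the `toy_classify` helper are hypothetical placeholders, not names from the proposal.

```python
from typing import Callable, Iterator, List, Tuple

# Hypothetical names for the context classes at the two ends of a line.
LINE_START = "lineStart"
LINE_END = "lineEnd"


def class_pairs(clusters: List[str],
                classify: Callable[[str], str]) -> Iterator[Tuple[str, str]]:
    """Yield the (left class, right class) pair at every boundary of a line.

    `clusters` holds the grapheme clusters of one line in visual order.
    Where a cluster has no neighbour, the line edge supplies the class.
    """
    classes = [LINE_START] + [classify(c) for c in clusters] + [LINE_END]
    for left, right in zip(classes, classes[1:]):
        yield (left, right)


def toy_classify(cluster: str) -> str:
    # Demo-only classifier: two corner brackets, everything else ideographic.
    if cluster == "「":
        return "openingBracket_corner"
    if cluster == "」":
        return "closingBracket_corner"
    return "ideographic"


pairs = list(class_pairs(["「", "漢", "」"], toy_classify))
```

Each pair would then be looked up in a glue table; the other context classes in the list (bidi level change, box edges, inline objects, and so on) would slot in the same way as the two line edges do here.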

The class of a grapheme cluster occurrence could also be specified explicitly by markup, but it can often be determined from the characters composing the grapheme cluster occurrence (at which point, it is the same for all occurrences of a given grapheme cluster). That can in turn be determined from classes assigned to the characters in the grapheme cluster. Generally, the base character of a grapheme cluster determines the class of the grapheme cluster, but there are cases where the other characters "dominate" the determination: for example, <U+00A0 NO-BREAK SPACE> may be in a class, and <U+00A0 U+0301 COMBINING ACUTE> may be in a different class.
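A minimal sketch of that derivation, assuming the base character normally decides and that a no-break space carrying a combining mark is the one "dominated" case. Which class the marked space ends up in is my assumption (the text above only says it may differ), and `toy_char_class` is a hypothetical stand-in for the real per-character mapping.

```python
import unicodedata
from typing import Callable

NBSP = "\u00A0"


def cluster_class(cluster: str, char_class: Callable[[str], str]) -> str:
    """Derive the class of a grapheme cluster from its characters."""
    base = cluster[0]
    # unicodedata.combining() returns a non-zero combining class for marks.
    has_mark = any(unicodedata.combining(ch) for ch in cluster[1:])
    if base == NBSP and has_mark:
        # Assumption: the space is only a carrier for the mark, so the
        # cluster behaves like a letter rather than like a space.
        return "westernChar"
    return char_class(base)


def toy_char_class(ch: str) -> str:
    # Demo-only mapping covering just the characters used below.
    return "justifyingSpace" if ch == NBSP else "westernChar"


plain_space = cluster_class("\u00A0", toy_char_class)
marked_space = cluster_class("\u00A0\u0301", toy_char_class)
```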

Finally, we arrive at the classes of characters. Below is a proposed assignment for the whole Unicode repertoire. This classification mostly aligns with that of JLREQ, with a few differences:

for unassigned code points (in the Unicode sense), the class is a prediction based on the likely future allocation of those code points
JLREQ simply ignores the existence of the full width characters at U+FFxx. This leads to a number of "ambiguous" characters, such as U+0041 LATIN CAPITAL LETTER A, where JLREQ says both "an occurrence of U+0041 could be in the Western class" (A.27) and "an occurrence of U+0041 could be in the Ideographic class" (A.19). In practice, authors routinely use U+0041 and U+FF21 precisely to disambiguate the class to use.
it distinguishes the class used in horizontal and in vertical texts
it distinguishes the inseparables (see below)
it uses the InDesign refinement of the opening and closing classes (square, rounded, other)

The proposed assignment also mentions the UAX50 vertical orientation property, as it is closely aligned with, and informs, the spacing class assignment.

Ambiguous characters

While most characters are unambiguously in a class, regardless of their context, a few characters common in Japanese typography are inherently ambiguous:

U+2018 ‘ LEFT SINGLE QUOTATION MARK
U+201C “ LEFT DOUBLE QUOTATION MARK
U+00AB « LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
U+2019 ’ RIGHT SINGLE QUOTATION MARK
U+201D ” RIGHT DOUBLE QUOTATION MARK
U+00BB » RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
U+2010 ‐ HYPHEN
U+2013 – EN DASH
U+203C ‼ DOUBLE EXCLAMATION MARK
U+2047 ⁇ DOUBLE QUESTION MARK
U+2048 ⁈ QUESTION EXCLAMATION MARK
U+2049 ⁉ EXCLAMATION QUESTION MARK
U+00B7 · MIDDLE DOT
U+2022 • BULLET
U+2014 — EM DASH
U+2026 … HORIZONTAL ELLIPSIS
U+2025 ‥ TWO DOT LEADER

A possibility is to resolve those based on the locale, or their resolved script (itself determined by looking at the script of the adjacent character).

The locale method has the downside that authors are not always tagging their text appropriately (either not at all, or not carefully on punctuation).

The script method has the advantage of not requiring the author's help, and that computation is already necessary in OpenType layout engines.
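To make the script method concrete, here is a small Python sketch for one ambiguous character, U+2019, following rule A from the table ("if the resolved script is not Hans, Hant, Jpan -> westernChar"). It is only an illustration: `rough_script` is a crude code point range test I made up for the demo (Python's standard library does not expose the Unicode Script property), and looking only at the preceding character is an assumed policy.

```python
CJK_SCRIPTS = {"Hans", "Hant", "Jpan"}


def rough_script(ch: str) -> str:
    """Very rough script guess; a real engine would use the Script property."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x30FF or 0x4E00 <= cp <= 0x9FFF:
        return "Jpan"   # kana and the main CJK ideograph block
    if ch.isalpha():
        return "Latn"   # crude: any other letter is treated as Latin
    return "Zyyy"       # Common


def resolve_u2019(text: str, i: int) -> str:
    """Resolve the class of the U+2019 at index i from the character before it."""
    if text[i] != "\u2019":
        raise ValueError("index does not point at U+2019")
    script = rough_script(text[i - 1]) if i > 0 else "Zyyy"
    # Rule A: anything not resolved to Hans, Hant or Jpan is westernChar.
    return "closingBracket_other" if script in CJK_SCRIPTS else "westernChar"


apostrophe = resolve_u2019("don\u2019t", 3)
closing_quote = resolve_u2019("\u2018日本語\u2019", 4)
```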

Inseparables

Currently, all inseparables are lumped in a single class, and a footnote explains that the behavior inseparable/inseparable applies only to two occurrences of the same inseparable. It would be better to have separate classes for inseparables. Not only does that avoid a footnote, but it also means that one can specify different glues for e.g. ideographic/inseparable_emDash and ideographic/inseparable_twoDotLeader, or specify different glues for inseparable_emDash/inseparable_twoDotLeader and inseparable_emDash/inseparable_ellipsis.

Logical vs visual order:

It should be made clear that the practical definition of glues is in the visual space: that's why we used the terms "left" and "right".

Classes as a Unicode property

From a practical point of view, I believe that the spacing class should be part of the Unicode Character Database, as a property, just like the vertical orientation property. The main reason is that this is the most reliable way to get something well defined (in the sense of having a definition, not necessarily in the sense of having correct values), and in sync with the Unicode repertoire. It is a relatively easy task for Unicode, as has been demonstrated with the vertical orientation property. (In fact, the very first draft of what became UAX50 included the spacing class.)

It is worth noting that such a Unicode property is only a starting point. As noted earlier, markup should always be available to influence the determination of the glue. Thus there is no need for such a Unicode property to be perfect; it does however need to be easily accessible and fairly stable.

======== Classes and glue settings

The classes are only one part of the final visual appearance: the glue settings also come into play, so it is worth discussing those a bit, as they may influence the design of the classes.

Glue settings and justification

When justifying text (a common case for body text), implementations may have to expand a glue to an arbitrary width. Consider for example a two-character paragraph with text-align-last: justify: the glue has to be (linewidth - 2em). While large glues are sometimes the result of pathological conditions, they can also be explicitly intended, such as in jidori processing. Thus it is desirable to allow pretty much all glues to grow indefinitely.
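The arithmetic in that example can be written out as a tiny Python sketch. The function name and the 20em line width are arbitrary choices for the illustration.

```python
from typing import List


def stretch_needed_em(line_width_em: float, cluster_widths_em: List[float]) -> float:
    """Total glue (in em) needed to make the clusters fill the line exactly."""
    return line_width_em - sum(cluster_widths_em)


# Two full-width (1em) characters justified on a 20em line: linewidth - 2em.
two_char_glue = stretch_needed_em(20.0, [1.0, 1.0])
```

With only one boundary between the two characters, all of that width has to come from the glue at that single position, which is why a cap on how far a glue may grow would get in the way.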

Glue settings are mostly for body text

JLREQ currently describes three glue settings (default, JIS, and book, in tables 3-5 of appendix D; they differ only on the behavior when compressing lines, but in principle different settings could also differ on natural width or when expanding lines). It seems that those settings are mostly concerned with body text, and are not appropriate for, e.g., titles. For example, the default method specifies 0 glue between paragraph (line) start and an opening bracket, and 0.5em between a closing bracket and a paragraph (line) end; for a title starting and ending with brackets, which happens to be set on two lines (centered and not justified), this asymmetry can be jarring.

It would be worth having a discussion stating that the settings apply to body text and mentioning when they are not appropriate, or even better, including settings for other situations. The most important situations that come to mind: titles, and ruby base/ruby text.

Interchange of glue settings

The discussion so far has been about determining the classes from characters, leaving room for document styling systems (e.g. CSS) to let authors explicitly specify classes of occurrences. The classification is of course only one part of the final result, the other being the glues that result from those classes (i.e. JLREQ appendices B, D, E). It would be useful to encourage document styling systems to allow the specification of the glues as well, in the documents, either in the form of selecting from a predetermined set of settings, or by completely specifying the settings (maybe as a delta on top of the predetermined settings).

Spacing classes and the CSS text-indent property.

With the model presented above, the CSS text-indent property is essentially an unconditional, invariable glue to the left of the first grapheme in a paragraph. In practice, it is useful in Japanese typography to make that glue at least conditional: e.g. 1em before an ideograph, and 0.5em before an opening bracket. I think the best way forward is to recommend that for paragraphs using the spacing model discussed here, that glue be controlled by the spacing model (i.e. the mojikumi tables) and that text-indent be set to 0.
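A sketch of what "conditional" means here, in Python. The two values (1em before an ideograph, 0.5em before an opening bracket) are the ones given above; treating every other class like an ideograph is my assumption, and the function name is hypothetical.

```python
def paragraph_start_glue_em(first_class: str) -> float:
    """Glue (in em) between the paragraph start and the first cluster.

    Replaces a fixed text-indent with a value chosen from the class of
    the first grapheme cluster in the paragraph.
    """
    if first_class.startswith("openingBracket"):
        return 0.5
    # Assumption: classes not mentioned in the text get the ideograph value.
    return 1.0


indent_before_ideograph = paragraph_start_glue_em("ideographic")
indent_before_bracket = paragraph_start_glue_em("openingBracket_corner")
```

In a full implementation this would just be one row of the glue table (the row whose left class is "beginning of a paragraph"), with text-indent left at 0.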

========

Columns:

code point
UAX50 vertical orientation
H: the class for horizontal text is in column 5; blank: the class for horizontal text is ideographic
V: the class for vertical text is in column 5; blank: the class for vertical text is ideographic
class

A: if the resolved script is not Hans, Hant, Jpan -> westernChar

0x000000 | R  | H | V | unknown
0x000009 | R  | H | V | tab
0x00000A | R  | H | V | lineEdge
0x00000B | R  | H | V | unknown
0x00000D | R  | H | V | lineEdge
0x00000E | R  | H | V | unknown
0x000020 | R  | H     | justifyingSpace
0x000021 | R  | H     | westernChar
0x000080 | R  | H | V | unknown
0x000085 | R  | H | V | lineEdge
0x000086 | R  | H | V | unknown
0x0000A0 | R  | H     | justifyingSpace
0x0000A1 | R  | H     | westernChar
0x0000A7 | U  | H     | westernChar
0x0000A8 | R  | H     | westernChar
0x0000A9 | U  | H     | westernChar
0x0000AA | R  | H     | westernChar
0x0000AB | R  | H | V | openingBracket_other
0x0000AC | R  | H     | westernChar
0x0000AD | R  | H | V | unknown
0x0000AE | U  | H     | westernChar
0x0000AF | R  | H     | westernChar
0x0000B0 | R  | H     | postfixedAbbrev
0x0000B1 | U  | H     | westernChar
0x0000B2 | R  | H     | westernChar
0x0000BB | R  | H | V | closingBracket_other
0x0000BC | U  | H     | westernChar
0x0000BF | R  | H     | westernChar
0x0000D7 | U  | H     | westernChar
0x0000D8 | R  | H     | westernChar
0x0000F7 | U  | H     | westernChar
0x0000F8 | R  | H     | westernChar
0x0002EA | U  | H | V | ideographic
0x0002EC | R  | H     | westernChar
0x001100 | U  | H | V | ideographic
0x001200 | R  | H     | westernChar
0x001401 | U  | H     | westernChar
0x001680 | R  | H     | westernChar
0x0018B0 | U  | H     | westernChar
0x001900 | R  | H     | westernChar
0x00200B | R  | H | V | transparent
0x00200D | R  | H | V | unknown
0x002010 | R  | H | V | hyphen_middlePunctuation
0x002014 | R  | H     | inseparable_emDash
0x002016 | U  | H     | westernChar
0x002017 | R  | H     | westernChar
0x002018 | R  | H | V | openingBracket_other
0x002019 | R  | H | V | closingBracket_other          | A
0x00201A | R  | H     | westernChar
0x00201C | R  | H | V | openingBracket_other
0x00201D | R  | H | V | closingBracket_other
0x00201E | R  | H     | westernChar
0x002020 | U  | H     | westernChar
0x002022 | R  | H     | westernChar
0x002025 | R  | H     | inseparable_twoDotLeader
0x002026 | R  | H     | inseparable_ellipsis
0x002027 | R  | H     | westernChar
0x002028 | R  | H | V | lineEdge
0x00202A | R  | H | V | unknown
0x00202F | R  | H     | westernChar
0x002030 | U  | H | V | postfixedAbbrev
0x002032 | R  | H | V | postfixedAbbrev
0x002034 | R  | H     | westernChar
0x00203B | U  | H | V | ideographic
0x00203C | U  | H | V | dividingPunctuation
0x00203D | R  | H     | westernChar
0x002042 | U  | H     | westernChar
0x002043 | R  | H     | westernChar
0x002047 | U  | H | V | dividingPunctuation
0x00204A | R  | H     | westernChar
0x002051 | U  | H     | westernChar
0x002052 | R  | H     | westernChar
0x00205F | R  | H     | westernChar
0x002060 | R  | H | V | unknown
0x002065 | U  | H | V | ideographic
0x002066 | R  | H | V | unknown
0x002070 | R  | H     | westernChar
0x0020AC | R  | H | V | prefixedAbbrev
0x0020AD | R  | H     | westernChar
0x0020DD | U  | H     | westernChar
0x0020E1 | R  | H     | westernChar
0x0020E2 | U  | H     | westernChar
0x0020E5 | R  | H     | westernChar
0x002100 | U  | H | V | ideographic
0x002102 | R  | H     | westernChar
0x002103 | U  | H | V | postfixedAbbrev
0x002104 | U  | H | V | ideographic
0x002109 | U  | H | V | postfixedAbbrev
0x00210A | R  | H     | westernChar
0x00210F | U  | H | V | ideographic
0x002110 | R  | H     | westernChar
0x002113 | U  | H | V | postfixedAbbrev
0x002114 | U  | H | V | ideographic
0x002115 | R  | H     | westernChar
0x002116 | U  | H | V | prefixedAbbrev
0x002117 | U  | H | V | ideographic
0x002118 | R  | H     | westernChar
0x00211E | U  | H | V | ideographic
0x002124 | R  | H     | westernChar
0x002125 | U  | H | V | ideographic
0x002126 | R  | H     | westernChar
0x002127 | U  | H | V | ideographic
0x002128 | R  | H     | westernChar
0x002129 | U  | H | V | ideographic
0x00212A | R  | H     | westernChar
0x00212E | U  | H | V | ideographic
0x00212F | R  | H     | westernChar
0x002135 | U  | H | V | ideographic
0x002140 | R  | H     | westernChar
0x002145 | U  | H | V | ideographic
0x00214B | R  | H     | westernChar
0x00214C | U  | H | V | ideographic
0x00214E | R  | H     | westernChar
0x00214F | U  | H | V | ideographic
0x00218A | R  | H     | westernChar
0x00218C | U  | H | V | ideographic
0x002190 | R  | H | V | ideographic
0x00221E | U  | H | V | ideographic
0x00221F | R  | H | V | ideographic
0x002234 | U  | H | V | ideographic
0x002236 | R  | H | V | ideographic
0x002300 | U  | H | V | ideographic
0x002308 | R  | H | V | ideographic
0x00230C | U  | H | V | ideographic
0x002320 | R  | H | V | ideographic
0x002324 | U  | H | V | ideographic
0x002329 | Tr | H | V | openingBracket_other
0x00232A | Tr | H | V | closingBracket_other
0x00232B | U  | H | V | ideographic
0x00232C | R  | H | V | ideographic
0x00237D | U  | H | V | ideographic
0x00239B | R  | H | V | ideographic
0x0023BE | U  | H | V | ideographic
0x0023CE | R  | H | V | ideographic
0x0023CF | U  | H | V | ideographic
0x0023D0 | R  | H | V | ideographic
0x0023D1 | U  | H | V | ideographic
0x0023DC | R  | H | V | ideographic
0x0023E2 | U  | H | V | ideographic
0x002423 | R  | H     | westernChar
0x002424 | U  | H | V | ideographic
0x002500 | R  | H     | inseparable_emDash
0x002580 | R  | H     | westernChar
0x0025A0 | U  | H | V | ideographic
0x00261A | R  | H | V | ideographic
0x002620 | U  | H | V | ideographic
0x002768 | R  | H     | westernChar
0x002776 | U  | H | V | ideographic
0x002794 | R  | H | V | ideographic
0x002800 | R  | H     | westernChar
0x002900 | R  | H | V | ideographic
0x002B12 | U  | H | V | ideographic
0x002B30 | R  | H | V | ideographic
0x002B50 | U  | H | V | ideographic
0x002B5A | R  | H | V | ideographic
0x002BB8 | U  | H | V | ideographic
0x002BD2 | R  | H | V | ideographic
0x002BD3 | U  | H | V | ideographic
0x002BEC | R  | H | V | ideographic
0x002BF0 | U  | H | V | ideographic
0x002C00 | R  | H     | westernChar
0x002E80 | U  | H | V | ideographic
0x003000 | U  | H | V | fullSpace
0x003001 | Tu | H | V | comma_ideo
0x003002 | Tu | H | V | fullStop_ideo
0x003003 | U  | H | V | ideographic
0x003005 | U  | H | V | iterationMark
0x003006 | U  | H | V | ideographic
0x003008 | Tr | H | V | openingBracket_other
0x003009 | Tr | H | V | closingBracket_other
0x00300A | Tr | H | V | openingBracket_other
0x00300B | Tr | H | V | closingBracket_other
0x00300C | Tr | H | V | openingBracket_corner
0x00300D | Tr | H | V | closingBracket_corner
0x00300E | Tr | H | V | openingBracket_corner
0x00300F | Tr | H | V | closingBracket_corner
0x003010 | Tr | H | V | openingBracket_other
0x003011 | Tr | H | V | closingBracket_other
0x003012 | U  | H | V | ideographic
0x003014 | Tr | H | V | openingBracket_other
0x003015 | Tr | H | V | closingBracket_other
0x003016 | Tr | H | V | openingBracket_other
0x003017 | Tr | H | V | closingBracket_other
0x003018 | Tr | H | V | openingBracket_other
0x003019 | Tr | H | V | closingBracket_other
0x00301A | Tr | H | V | openingBracket_corner
0x00301B | Tr | H | V | closingBracket_corner
0x00301C | Tr | H | V | hyphen_other
0x00301D | Tr | H | V | openingBracket_other
0x00301E | Tr | H | V | closingBracket_other
0x003020 | U  | H | V | ideographic
0x003030 | Tr | H | V | ideographic
0x003031 | U  | H | V | ideographic
0x003033 | U  | H | V | inseparable_repeatUpper
0x003034 | U  | H | V | inseparable_repeatVoiceUpper
0x003035 | U  | H | V | inseparable_repeatLower
0x003036 | U  | H | V | ideographic
0x00303B | U  | H | V | iterationMark
0x00303C | U  | H | V | ideographic
0x003040 | U  | H | V | hiragana
0x003041 | Tu | H | V | smallKana
0x003042 | U  | H | V | hiragana
0x003043 | Tu | H | V | smallKana
0x003044 | U  | H | V | hiragana
0x003045 | Tu | H | V | smallKana
0x003046 | U  | H | V | hiragana
0x003047 | Tu | H | V | smallKana
0x003048 | U  | H | V | hiragana
0x003049 | Tu | H | V | smallKana
0x00304A | U  | H | V | hiragana
0x003063 | Tu | H | V | smallKana
0x003064 | U  | H | V | hiragana
0x003083 | Tu | H | V | smallKana
0x003084 | U  | H | V | hiragana
0x003085 | Tu | H | V | smallKana
0x003086 | U  | H | V | hiragana
0x003087 | Tu | H | V | smallKana
0x003088 | U  | H | V | hiragana
0x00308E | Tu | H | V | smallKana
0x00308F | U  | H | V | hiragana
0x003095 | Tu | H | V | smallKana
0x003097 | U  | H | V | hiragana
0x00309B | Tu | H | V | hiragana
0x00309D | U  | H | V | iterationMark
0x00309F | U  | H | V | hiragana
0x0030A0 | Tr | H | V | hyphen_katakana
0x0030A1 | Tu | H | V | smallKana
0x0030A2 | U  | H | V | katakana
0x0030A3 | Tu | H | V | smallKana
0x0030A4 | U  | H | V | katakana
0x0030A5 | Tu | H | V | smallKana
0x0030A6 | U  | H | V | katakana
0x0030A7 | Tu | H | V | smallKana
0x0030A8 | U  | H | V | katakana
0x0030A9 | Tu | H | V | smallKana
0x0030AA | U  | H | V | katakana
0x0030C3 | Tu | H | V | smallKana
0x0030C4 | U  | H | V | katakana
0x0030E3 | Tu | H | V | smallKana
0x0030E4 | U  | H | V | katakana
0x0030E5 | Tu | H | V | smallKana
0x0030E6 | U  | H | V | katakana
0x0030E7 | Tu | H | V | smallKana
0x0030E8 | U  | H | V | katakana
0x0030EE | Tu | H | V | smallKana
0x0030EF | U  | H | V | katakana
0x0030F5 | Tu | H | V | smallKana
0x0030F7 | U  | H | V | katakana
0x0030FB | U  | H | V | middleDot_middlePunctuation
0x0030FC | Tr | H | V | prolongedSoundMark
0x0030FD | U  | H | V | iterationMark
0x0030FF | U  | H | V | katakana
0x003100 | U  | H | V | ideographic
0x003127 | Tu | H | V | ideographic
0x003128 | U  | H | V | ideographic
0x0031F0 | Tu | H | V | smallKana
0x003200 | U  | H | V | ideographic
0x003300 | Tu | H | V | ideographic
0x003303 | Tu | H | V | postfixedAbbrev
0x003304 | Tu | H | V | ideographic
0x00330D | Tu | H | V | postfixedAbbrev
0x00330E | Tu | H | V | ideographic
0x003314 | Tu | H | V | postfixedAbbrev
0x003315 | Tu | H | V | ideographic
0x003318 | Tu | H | V | postfixedAbbrev
0x003319 | Tu | H | V | ideographic
0x003322 | Tu | H | V | postfixedAbbrev
0x003324 | Tu | H | V | ideographic
0x003326 | Tu | H | V | postfixedAbbrev
0x003328 | Tu | H | V | ideographic
0x00332B | Tu | H | V | postfixedAbbrev
0x00332C | Tu | H | V | ideographic
0x003336 | Tu | H | V | postfixedAbbrev
0x003337 | Tu | H | V | ideographic
0x00333B | Tu | H | V | postfixedAbbrev
0x00333C | Tu | H | V | ideographic
0x003349 | Tu | H | V | postfixedAbbrev
0x00334B | Tu | H | V | ideographic
0x00334D | Tu | H | V | postfixedAbbrev
0x00334E | Tu | H | V | ideographic
0x003351 | Tu | H | V | postfixedAbbrev
0x003352 | Tu | H | V | ideographic
0x003357 | Tu | H | V | postfixedAbbrev
0x003358 | U  | H | V | ideographic
0x003371 | U  | H | V | postfixedAbbrev
0x00337B | Tu | H | V | ideographic
0x003380 | U  | H | V | postfixedAbbrev
0x0033E0 | U  | H | V | ideographic
0x00A4D0 | R  | H     | westernChar
0x00A960 | U  | H | V | ideographic
0x00A980 | R  | H     | westernChar
0x00AC00 | U  | H | V | ideographic
0x00D800 | R  | H     | westernChar
0x00E000 | U  | H | V | ideographic
0x00FB00 | R  | H     | westernChar
0x00FE10 | U  | H | V | ideographic
0x00FE17 | U  | H | V | openingBracket_other
0x00FE18 | U  | H | V | closingBracket_other
0x00FE19 | U  | H | V | ideographic
0x00FE20 | R  | H     | westernChar
0x00FE30 | U  | H | V | inseparable_twoDotLeaderV
0x00FE31 | U  | H | V | inseparable_emDashV
0x00FE32 | U  | H | V | hyphen_middlePunctuation
0x00FE33 | U  | H | V | ideographic
0x00FE35 | U  | H | V | openingBracket_round
0x00FE36 | U  | H | V | closingBracket_round
0x00FE37 | U  | H | V | openingBracket_other
0x00FE38 | U  | H | V | closingBracket_other
0x00FE39 | U  | H | V | openingBracket_other
0x00FE3A | U  | H | V | closingBracket_other
0x00FE3B | U  | H | V | openingBracket_other
0x00FE3C | U  | H | V | closingBracket_other
0x00FE3D | U  | H | V | openingBracket_other
0x00FE3E | U  | H | V | closingBracket_other
0x00FE3F | U  | H | V | openingBracket_other
0x00FE40 | U  | H | V | closingBracket_other
0x00FE41 | U  | H | V | openingBracket_corner
0x00FE42 | U  | H | V | closingBracket_corner
0x00FE43 | U  | H | V | openingBracket_corner
0x00FE44 | U  | H | V | closingBracket_corner
0x00FE45 | U  | H | V | ideographic
0x00FE47 | U  | H | V | openingBracket_other
0x00FE48 | U  | H | V | closingBracket_other
0x00FE49 | R  | H     | westernChar
0x00FE50 | Tu | H | V | ideographic
0x00FE53 | U  | H | V | ideographic
0x00FE58 | R  | H | V | ideographic
0x00FE59 | Tr | H | V | ideographic
0x00FE5F | U  | H | V | ideographic
0x00FE63 | R  | H | V | ideographic
0x00FE67 | U  | H | V | ideographic
0x00FE70 | R  | H     | westernChar
0x00FEFF | R  | H | V | unknown
0x00FF00 | R  | H     | westernChar
0x00FF01 | Tu | H | V | dividingPunctuation
0x00FF02 | U  | H | V | ideographic
0x00FF03 | U  | H | V | prefixedAbbrev
0x00FF05 | U  | H | V | postfixedAbbrev
0x00FF06 | U  | H | V | ideographic
0x00FF08 | Tr | H | V | openingBracket_round
0x00FF09 | Tr | H | V | closingBracket_round
0x00FF0A | U  | H | V | ideographic
0x00FF0C | Tu | H | V | comma_western
0x00FF0D | R  | H | V | ideographic
0x00FF0E | Tu | H | V | fullStop_western
0x00FF0F | U  | H | V | ideographic
0x00FF1A | Tr | H | V | middleDot_colon
0x00FF1C | R  | H | V | ideographic
0x00FF1F | Tu | H | V | dividingPunctuation
0x00FF20 | U  | H | V | ideographic
0x00FF3B | Tr | H | V | openingBracket_other
0x00FF3C | U  | H | V | ideographic
0x00FF3D | Tr | H | V | closingBracket_other
0x00FF3E | U  | H | V | ideographic
0x00FF3F | Tr | H | V | ideographic
0x00FF40 | U  | H | V | ideographic
0x00FF5B | Tr | H | V | openingBracket_other
0x00FF5C | Tr | H | V | ideographic
0x00FF5D | Tr | H | V | closingBracket_other
0x00FF5E | Tr | H | V | ideographic
0x00FF5F | Tr | H | V | openingBracket_round
0x00FF60 | Tr | H | V | closingBracket_round
0x00FF61 | R  | H     | westernChar
0x00FFE0 | U  | H | V | postfixedAbbrev
0x00FFE1 | U  | H | V | prefixedAbbrev
0x00FFE2 | U  | H | V | ideographic
0x00FFE3 | Tr | H | V | ideographic
0x00FFE4 | U  | H | V | ideographic
0x00FFE5 | U  | H | V | prefixedAbbrev
0x00FFE6 | U  | H | V | ideographic
0x00FFE8 | R  | H     | westernChar
0x00FFF0 | U  | H | V | ideographic
0x00FFF9 | R  | H | V | transparent
0x00FFFC | U  | H | V | inlineObject
0x00FFFD | U  | H | V | ideographic
0x00FFFE | R  | H | V | unknown
0x010000 | R  | H     | westernChar
0x010980 | U  | H     | westernChar
0x0109A0 | R  | H     | westernChar
0x011580 | U  | H     | westernChar
0x011600 | R  | H     | westernChar
0x011A00 | U  | H | V | ideographic
0x011AB0 | R  | H     | westernChar
0x013000 | U  | H     | westernChar
0x013430 | R  | H     | westernChar
0x014400 | U  | H     | westernChar
0x014680 | R  | H     | westernChar
0x016FE0 | U  | H | V | ideographic
0x018B00 | R  | H     | westernChar
0x01B000 | U  | H | V | katakana
0x01B001 | U  | H | V | hiragana
0x01B130 | R  | H     | westernChar
0x01B170 | U  | H | V | ideographic
0x01B300 | R  | H     | westernChar
0x01D000 | U  | H     | westernChar
0x01D200 | R  | H     | westernChar
0x01D2E0 | U  | H | V | ideographic
0x01D300 | U  | H     | westernChar
0x01D380 | R  | H     | westernChar
0x01D800 | U  | H     | westernChar
0x01DAB0 | R  | H     | westernChar
0x01F000 | U  | H | V | ideographic
0x01F200 | Tu | H | V | ideographic
0x01F202 | U  | H | V | ideographic
0x01F800 | R  | H | V | ideographic
0x01F900 | U  | H | V | ideographic
0x01FA70 | R  | H     | westernChar
0x020000 | U  | H | V | ideographic
0x02FFFE | R  | H | V | unknown
0x030000 | U  | H | V | ideographic
0x03FFFE | R  | H | V | unknown
0x040000 | R  | H     | westernChar
0x0F0000 | U  | H | V | ideographic
0x0FFFFE | R  | H | V | unknown
0x100000 | U  | H | V | ideographic
0x10FFFE | R  | H | V | unknown
0x110000
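One way to consume a table in this shape, sketched in Python. It assumes that each row gives the start of a range that runs up to the next row's code point, and that a bare code point (like the final 0x110000) only closes the last range; that reading is my assumption, not something the table states. The sample is a short contiguous excerpt of the rows above, closed with a bare 0x0000A7 for the demo.

```python
import bisect
from typing import List, Optional, Set, Tuple

# Contiguous excerpt of the table; the bare last line closes the excerpt
# the same way 0x110000 closes the full table.
SAMPLE = """\
0x000020 | R  | H     | justifyingSpace
0x000021 | R  | H     | westernChar
0x000080 | R  | H | V | unknown
0x000085 | R  | H | V | lineEdge
0x000086 | R  | H | V | unknown
0x0000A0 | R  | H     | justifyingSpace
0x0000A1 | R  | H     | westernChar
0x0000A7
"""

Row = Optional[Tuple[Set[str], str]]  # (orientation flags, class) or end marker


def parse(table: str) -> Tuple[List[int], List[Row]]:
    starts: List[int] = []
    rows: List[Row] = []
    for line in table.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if not fields[0]:
            continue
        starts.append(int(fields[0], 16))
        if len(fields) == 1:
            rows.append(None)  # bare code point: end of the previous range
            continue
        rest = fields[2:]      # fields[1] is the UAX50 orientation, unused here
        flags = {f for f in rest if f in ("H", "V")}
        cls = next(f for f in rest if f not in ("H", "V"))
        rows.append((flags, cls))
    return starts, rows


STARTS, ROWS = parse(SAMPLE)


def lookup(cp: int, vertical: bool) -> Optional[str]:
    """Class of a code point, or None if it falls outside the parsed ranges."""
    i = bisect.bisect_right(STARTS, cp) - 1
    if i < 0 or ROWS[i] is None:
        return None
    flags, cls = ROWS[i]
    # A blank H or V column means "ideographic" for that orientation.
    return cls if ("V" if vertical else "H") in flags else "ideographic"
```

For example, `lookup(0x41, vertical=False)` falls in the range opened by 0x000021 and yields `westernChar`, while the same code point in vertical text yields `ideographic` because that row has no V.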

===========

kidayasuo commented 3 years ago

Eric, thank you very much for your proposal.

I have several comments / questions regarding your proposal. I would greatly appreciate it if you could clarify.

grapheme cluster vs grapheme cluster occurrence

What is the difference between A and an occurrence of A in this context? (maybe this is simply a novice English question)

it distinguishes the class used in horizontal and in vertical texts

What is the benefit of having different classes between horizontal and vertical? I would appreciate it if you could elaborate here a bit.

The locale method has the downside that authors are not always tagging their text appropriately (either not at all, or not carefully on punctuation).

I could not figure out what “not carefully on punctuation” means…

but it also means that one can specify different glues for e.g. ideographic/inseparable_emDash and ideographic/inseparable_twoDotLeader, or specify different glues for inseparable_emDash/inseparable_twoDotLeader and inseparable_emDash/inseparable_ellipsis.

I can see why separating each inseparable makes sense, because that way you can construct a state machine. Regarding your second reasoning, i.e. ability to define different glues, do you have concrete examples of why such ability is a good thing?

As noted earlier, markup should always be available to influence the determination of the glue. Thus there is no need for such a Unicode property to be perfect; it does however need to be easily accessible and fairly stable.

I am not sure if I understood the discussion here. There will be systems / applications where such markup is not possible (due to limitation of the underlying engine, or limitation of the UI), and if so, existence of a markup could not be the reason why the property does not need to be perfect. I am not necessarily saying it needs to be perfect, as it can’t be.

The classes are only one part of the final visual appearance: the glue settings also come into play, so it is worth discussing those a bit, as they may influence the design of the classes.

Do you have examples of the glue design giving influence on the design of the character classes?

I think the best way forward is to recommend that for paragraphs using the spacing model discussed here, that glue be controlled by the spacing model (i.e. the mojikumi tables) and that text-indent be set to 0.

Do you suggest that there will be a part of text where this model is applied and another part of text where it is not applied, instead of defining everything in one spacing model? If you have multiple models, especially in one document, it would become harder to, for example, set a uniform style, e.g. indentation.

Columns:

I have not yet carefully looked at the character class assignments you proposed but how did you handle JLReq classes that are dependent on the layout, such as ruby base (cl-20, 21, 22, 23, 24, 25, 28, 29, 30)?

0x000021 | R | H | westernChar
0x000080 | R | H | V | unknown

Is it correct to interpret that code points between 0x000021 and 0x000080 follow properties specified by 0x000021?

Lastly it would be great if you could provide a cross reference between JLReq classes and classes in your proposal.

Thank you!

emuller-amazon commented 3 years ago

grapheme cluster vs grapheme cluster occurrence

What is the difference between A and an occurrence of A in this context? (maybe this is simply a novice English question)

In the string "moto", we have the graphemes "m", "o" and "t". The grapheme "o" occurs twice. In this particular example, there is no reason to treat the two occurrences differently, but it may be useful in other cases. For example, a U+2019 could occur (along with a U+2018) to quote some Japanese text, in which case that occurrence could be cl-02 closing bracket, and there could be another occurrence, in the same line, where it is an apostrophe in an English word (e.g. in "don’t"); that other occurrence could be a cl-27 western character.

What is the benefit of having different classes between horizontal and vertical? I would appreciate it if you could elaborate here a bit.

Take the case of U+00C6 Æ. It is my understanding that when in horizontal text (or sideways in vertical text), the most likely desired behavior is cl-27 western char, whereas when it is upright in vertical text, the most likely desired behavior is cl-19 ideographic character.

I could not figure out what “not carefully on punctuation” means…

Consider some text: [japanese]“[english]”[japanese]. If we use the locale to determine the class of the occurrences of “ and ”, then we are making a distinction between:

<span xml:lang='ja'>[japanese]<span xml:lang='en'>“[english]”</span>[japanese]</span>

and

<span xml:lang='ja'>[japanese]“<span xml:lang='en'>[english]</span>”[japanese]</span>

i.e. on whether the quotes are inside or outside the English span. I don't think we can rely on authors to master the difference between the two, especially if they use a wysiwyg editor, where the difference is difficult to "see".
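To show the distinction mechanically, here is a small Python sketch that reports the language in effect for each quote mark in the two variants. It uses the standard `html.parser` module as a convenient tokenizer for this snippet; the `LangTracker` class and `quote_langs` helper are names made up for the illustration, and a real XML toolchain would handle `xml:lang` inheritance itself.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class LangTracker(HTMLParser):
    """Record, for every text character, the xml:lang in effect."""

    def __init__(self) -> None:
        super().__init__()
        self.stack: List[Optional[str]] = []            # one entry per open element
        self.tagged: List[Tuple[str, Optional[str]]] = []

    def handle_starttag(self, tag, attrs):
        inherited = self.stack[-1] if self.stack else None
        self.stack.append(dict(attrs).get("xml:lang") or inherited)

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        lang = self.stack[-1] if self.stack else None
        self.tagged.extend((ch, lang) for ch in data)


def quote_langs(markup: str) -> List[Optional[str]]:
    parser = LangTracker()
    parser.feed(markup)
    parser.close()
    return [lang for ch, lang in parser.tagged if ch in "“”"]


inside = ("<span xml:lang='ja'>[japanese]<span xml:lang='en'>"
          "“[english]”</span>[japanese]</span>")
outside = ("<span xml:lang='ja'>[japanese]“<span xml:lang='en'>"
           "[english]</span>”[japanese]</span>")

inside_langs = quote_langs(inside)
outside_langs = quote_langs(outside)
```

The same visible text gives quotes tagged `en` in the first variant and `ja` in the second, so a locale-based rule would space them differently.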

I can see why separating each inseparable makes sense, because that way you can construct a state machine.

Yes. In JLREQ, right now, this is handled by a footnote. But every footnote means an "exception" in the code.

Regarding your second reasoning, i.e. ability to define different glues, do you have concrete examples of why such ability is a good thing?

More flexibility for the book designer can only be a good thing. As a concrete example, consider the pair <ideograph> <emdash>. The glue between those two can grow for justification purposes. It seems reasonable to have the glue between <vertical kana repeat mark lower half> and <emdash> grow in the same way; but because they are currently both inseparable, you can't express that (or you have to use a footnote...).

As noted earlier, markup should always be available to influence the determination of the glue. Thus there is no need for such a Unicode property to be perfect; it does however need to be easily accessible and fairly stable.

I am not sure if I understood the discussion here. There will be systems / applications where such markup is not possible (due to limitation of the underlying engine, or limitation of the UI), and if so, existence of a markup could not be the reason why the property does not need to be perfect. I am not necessarily saying it needs to be perfect, as it can’t be.

I definitely have a bias for environments with markup (whether edited directly - e.g. HTML in a text editor - or indirectly - e.g. InDesign, with UI to effectively edit the underlying markup). But I also think it is the dominant use case (even here in GitHub I have some markup). And I am not sure one can go very far without markup, especially in Japanese typography.

I do however think that both plain text and markup need something accessible and stable more than something perfect.

Do you have examples of the glue design giving influence on the design of the character classes?

Keeping in mind that I excluded line breaking, the classes serve only to determine what glue to use. So if you completely ignore the glue, there is no need to have classes. A bit more concrete: if Hiragana and Katakana always have the same glue behavior, then there is no need to have two different classes. When defining JLREQ, the classes are a consequence of the glues (of course, when implementing JLREQ, the glues are a consequence of the classes)

Do you suggest that there will be a part of text where this model is applied and another part of text where it is not applied, instead of defining everything in one spacing model? If you have multiple models, especially in one document, it would become harder to set, for example, a uniform style, e.g. indentation.

The use case I have in mind is mixed-language documents, e.g. Japanese books with extensive (paragraph-level) quotes in English, or dual-language books (e.g. left page in Japanese, right page in English).

Nobody is forced to use different mojikumi settings (or even Japanese vs. non Japanese spacing) in a document, but some people need to.

I have not yet carefully looked at the character class assignments you proposed but how did you handle JLReq classes that are dependent on the layout, such as ruby base (cl-20, 21, 22, 23, 24, 25, 28, 29, 30)?

In general, the context will first determine the class of an occurrence. When the context does not provide the answer, then we go to the mapping character → class. So when we have ...X<ruby>...</ruby>..., at the position between X and <ruby>, the class on the left comes from the class of X, and the class on the right comes from <ruby>, i.e. this is a cl-27 / cl-22. In this particular case, the markup had another purpose, but one could have markup where the only effect would be to set the classes.

To be more complete, because we are talking presentation, CSS rather than HTML would be the proper place, and the default HTML stylesheet would have something like ruby { jlreq-class: cl22 }. (well; it's a bit more complicated, but you get the idea).
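The lookup order described above (context first, then the character → class mapping) could be sketched as follows. This is only an illustration: `Occurrence`, `context_class`, `resolve_class` and the tiny `CHAR_TO_CLASS` table are hypothetical names and data, not part of JLREQ or of any existing implementation.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical, tiny character -> class mapping, for illustration only.
CHAR_TO_CLASS = {"X": "cl-27", "漢": "cl-19"}


@dataclass
class Occurrence:
    text: str
    # Class imposed by the context (markup or styling), if any.
    context_class: Optional[str] = None


def resolve_class(occ: Occurrence) -> str:
    """Use the class given by the context; otherwise fall back to the mapping."""
    if occ.context_class is not None:
        return occ.context_class
    return CHAR_TO_CLASS.get(occ.text[:1], "unknown")


def classes_at_boundary(left: Occurrence, right: Occurrence) -> tuple:
    """Classes on the left and right of the position between two occurrences."""
    return resolve_class(left), resolve_class(right)


# ...X<ruby>...</ruby>...: the ruby element gets its class from the context.
print(classes_at_boundary(Occurrence("X"), Occurrence("漢字", context_class="cl-22")))
# ('cl-27', 'cl-22')
```

In a CSS-based system the `context_class` value would come from the stylesheet rather than from a constructor argument.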

0x000021 | R | H | westernChar
0x000080 | R | H | V | unknown

Is it correct to interpret that code points between 0x000021 and 0x000080 follow properties specified by 0x000021?

Yes.
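A minimal sketch of that interpretation, assuming each row gives the first code point of a range that extends up to (but not including) the next row. Only the two rows quoted above are used, the other columns are left out, and `TABLE` and `lookup_class` are hypothetical names.

```python
from bisect import bisect_right

# (first code point of the range, class); rows must be sorted by code point.
TABLE = [
    (0x000021, "westernChar"),
    (0x000080, "unknown"),
]
STARTS = [start for start, _ in TABLE]


def lookup_class(code_point: int):
    """Return the class of the last row whose start is <= code_point."""
    i = bisect_right(STARTS, code_point) - 1
    if i < 0:
        return None  # before the first row: nothing is specified
    return TABLE[i][1]


print(lookup_class(ord("A")))  # westernChar
print(lookup_class(0x7F))      # westernChar
print(lookup_class(0x80))      # unknown
```

Storing only range starts keeps the table compact, at the cost of a binary search per lookup.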

Eric.

PS: thanks for helping with the translation

(Notes from @xfq: I made some edits to this comment to make the examples clearer.)

kidayasuo commented 3 years ago

Thank you, Eric, for the clarifications. They cleared up most of my questions.

Regarding the necessity of markup vs. spacing by Unicode property (which you explained does not need to be perfect), I always have a desire to bring up the quality of "everyday text", such as email / github (for some people :) / notes, etc. where people generate and read text every day and almost never use markup. In that sense I am hoping that the layout will be pretty good without any markup.

Regarding the mixed language case, do you think it is possible to cover spacing of all languages, including English, in this single model? It would make things much simpler.

kidayasuo commented 3 years ago

oh, @emuller-amazon san one more. A few people asked if the "glue" in your proposal and "glue" concept used in TeX (or variations of it) are the same thing. If there are differences a comparison might help people understand your proposal better. thanks

emuller-amazon commented 3 years ago

Re: glue.

Yes, the concept is similar to TeX, but needs some refinement for Japanese, to account for JLREQ's priority between the various pairs of classes. That is, we need some way to model the fact that the glues in, e.g., cl-04/cl-21 expand before the glues in cl-19/cl-19 (i.e. the color coding in appendices D & E).

What I ended up with is that the width of a glue is a piecewise linear, monotonic, increasing, continuous function of some parameter t. This parameter t corresponds to how much the glue is compressed (t < 0) or expanded (t > 0). At t=0, the width is given by appendix B. I arbitrarily decided that the first step of expansion is between t=0 and t=1, the second step (red in App. E) is between t=1 and t=2, the third step (blue in App. E) is between t=2 and t=3, etc. Similarly for the compression and App. D.

The width of a letter is just a constant function.

Then you can sum up all the glues and letters on a line, by doing the point-wise addition of those functions, and you have the width L(t) of that line. It really captures all the (scalar) width the line could take, depending on how much you compress or expand it.

The addition of piecewise linear, monotonic, ... functions is also a piecewise linear, monotonic, ... function. To justify the line to a target width W, you just need to figure out the value of t_justified such that L(t_justified) = W, and that's easy because of the nature of the functions. Then, for each glue in the line, its width is w(t_justified).
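The model above could be sketched like this. It is an illustration under assumptions, not the actual implementation: `PiecewiseLinear`, `line_width` and `justify` are hypothetical names, the glue widths (in em) are made up rather than taken from the JLREQ appendices, and each function is held constant outside its breakpoints.

```python
from bisect import bisect_right


class PiecewiseLinear:
    """Continuous, non-decreasing width function given by (t, width) breakpoints."""

    def __init__(self, points):
        self.ts = [float(t) for t, _ in points]
        self.ws = [float(w) for _, w in points]

    def __call__(self, t):
        if t <= self.ts[0]:
            return self.ws[0]
        if t >= self.ts[-1]:
            return self.ws[-1]
        i = bisect_right(self.ts, t) - 1
        t0, t1 = self.ts[i], self.ts[i + 1]
        w0, w1 = self.ws[i], self.ws[i + 1]
        return w0 + (w1 - w0) * (t - t0) / (t1 - t0)


def line_width(items, t):
    """L(t): point-wise sum of all letters and glues on the line."""
    return sum(f(t) for f in items)


def justify(items, target):
    """Return t such that L(t) == target, or None if the target is out of reach."""
    ts = sorted({t for f in items for t in f.ts})
    if not line_width(items, ts[0]) <= target <= line_width(items, ts[-1]):
        return None
    for t0, t1 in zip(ts, ts[1:]):
        w0, w1 = line_width(items, t0), line_width(items, t1)
        if w0 <= target <= w1:
            if w1 == w0:
                return t0
            return t0 + (target - w0) * (t1 - t0) / (w1 - w0)
    return ts[0]  # only one breakpoint: the line has a single possible width


letter = PiecewiseLinear([(0, 1.0)])            # constant function
punct = PiecewiseLinear([(-1, 0.0), (0, 0.5)])  # compressible, not expandable
step1 = PiecewiseLinear([(0, 0.0), (1, 0.25)])  # expands in the first step
step2 = PiecewiseLinear([(1, 0.0), (2, 0.25)])  # expands only in the second step

line = [letter] * 8 + [punct] + [step1] * 2 + [step2] * 4
t = justify(line, 9.5)
print(t, step1(t), step2(t))  # 1.5 0.25 0.125
```

Because L(t) is linear between consecutive breakpoints, it is enough to evaluate it at the breakpoints and interpolate within the segment that contains W. At t = 1.5 the first-step glues are already fully expanded while the second-step glues are halfway, which is the priority behavior described for appendices D and E.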

My copy of JIS X 4051 is at the office, but IIRC, it has an appendix that has a whole bunch of equations, that correspond to appendix B, D, E of JLREQ. Those equations are a bit convoluted, but what I described is just a restatement of those equations.

There is a little complication because some of the glues are not continuous: the glue between a punctuation mark and the line end can only be 0 or 1/2em, nothing in between. But it's only at the end of lines, so I special-cased it. I think it should be possible to deal with non-continuous functions, but I have not done that yet.
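One possible way to special-case the line-end glue, shown only as a sketch: try each allowed value, justify the rest of the line to the remaining width, and keep the choice that deforms the line least. The selection rule and all names here (`solve_rest`, `justify_with_line_end_glue`) are assumptions, and the rest of the line is reduced to a single compression step and a single expansion step to keep the example short.

```python
LINE_END_GLUE_CHOICES = (0.0, 0.5)  # in em: the only widths allowed at line end


def solve_rest(target, natural, shrink, stretch):
    """Return t in [-1, 1] for a simplified one-step model, or None if unreachable.

    The rest of the line has width natural + t * stretch for t >= 0
    and natural + t * shrink for t < 0.
    """
    if target == natural:
        return 0.0
    amount = shrink if target < natural else stretch
    if amount <= 0:
        return None
    t = (target - natural) / amount
    return t if -1.0 <= t <= 1.0 else None


def justify_with_line_end_glue(target, natural, shrink, stretch):
    """Pick the line-end glue that leaves the rest of the line least deformed."""
    best = None
    for end_glue in LINE_END_GLUE_CHOICES:
        t = solve_rest(target - end_glue, natural, shrink, stretch)
        if t is None:
            continue
        if best is None or abs(t) < abs(best[1]):
            best = (end_glue, t)
    return best  # (line-end glue width, t) or None


print(justify_with_line_end_glue(20.0, 19.75, 0.5, 1.0))  # (0.0, 0.25)
print(justify_with_line_end_glue(20.0, 19.5, 0.5, 1.0))   # (0.5, 0.0)
```

Other selection rules are conceivable, for example always preferring the 1/2em value when it is reachable.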

All that also works well with ruby (including jukugo, and all the various overhang conditions), and I think it even works well with multiple ruby lines but I have not tried yet. Exercise left to the reader. Hint: model ruby as a graph where nodes are positions on the main line and on the ruby line(s), and arcs represent constraints of the form "this position must be to the left of that position". Compute the width of the graph much like the width of the line above, and use a point-wise max() for those nodes that have multiple incoming arcs. Solve as before. Use that t_justified for the longest path(s) in the graph, and justify/align independently the pieces that are not on the longest path (typically, ruby texts or ruby bases).
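A rough sketch of the hint, under simplifying assumptions: one ruby annotation wider than its base, no overhang, a single expansion step, and a numeric bisection instead of an exact piecewise-linear solve. The node names, widths and helper functions are all hypothetical.

```python
def glue(t, natural=0.0, expanded=0.25):
    """Glue growing linearly from natural (t=0) to expanded (t=1), clamped outside."""
    return natural + (expanded - natural) * min(max(t, 0.0), 1.0)


def positions(order, arcs, t):
    """Leftmost feasible position of each node; order must be topological."""
    pos = {order[0]: 0.0}
    for node in order[1:]:
        # Point-wise max over all incoming "must be to the left of" constraints.
        pos[node] = max(pos[src] + width(t) for src, dst, width in arcs if dst == node)
    return pos


def solve_t(order, arcs, target, lo=0.0, hi=1.0, iterations=60):
    """Bisection on t, valid because the total width is non-decreasing in t."""
    def total(t):
        return positions(order, arcs, t)[order[-1]]

    if not total(lo) <= target <= total(hi):
        return None
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if total(mid) < target:
            lo = mid
        else:
            hi = mid
    return hi


order = ["start", "base_l", "base_r", "end"]
arcs = [
    ("start", "base_l", lambda t: 1.0 + glue(t)),  # preceding letter + glue
    ("base_l", "base_r", lambda t: 1.0),           # ruby base on the main line
    ("base_l", "base_r", lambda t: 1.5),           # ruby text, wider than the base
    ("base_r", "end", lambda t: glue(t) + 1.0),    # glue + following letter
]

t = solve_t(order, arcs, 3.75)
print(t, positions(order, arcs, t)["end"])  # 0.5 3.75
```

Here the ruby text is on the longest path, so the base (1.0 wide within a 1.5 slot) would then be aligned independently, as the hint suggests.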

emuller-amazon commented 3 years ago

Re: plain text vs. markup

Of course, I would want something that works for plain text as well. Even with markup, there is a strong desire to minimize the quantity of markup that is needed, so there is no opposition on the goal between markup and plain text.

But we also need to realize that all the "virtual" classes of JLREQ (cl-18 math operator, etc; more or less, the classes that have to be deduced from the context) will be the result of either markup or heuristics, and for plain text we only have heuristics. What I am not excited about is spending too much time building complicated (and therefore delicate) heuristics (but somebody else can), and I would not be happy if the support of those heuristics makes the markup case more complicated. That could happen if the author of markup ends up asking himself all the time "does the heuristic work in this case, or do I need to add markup".

emuller-amazon commented 3 years ago

Re: mixed languages.

I don't think we have an effective technology for Arabic justification (yes, OpenType has the JSTF table, but I don't think it is viable). So we need to set that aside.

But other than that, all the other justification systems I know of are subsets of the Japanese case: the glue stuff I described earlier can be used to deal with space and letterspacing of Latin, which is pretty much what's used everywhere outside of Japanese/Chinese. Just rename cl-27 to "non-japanese characters" (of course, something better, but you get the idea).
