[META] Reorganize character classes and its adoption of Unicode based definition

himorin commented 4 years ago

During the Jan 2020 JL-TF F2F meeting, the reorganization and upgrade of character classes was discussed (email in Japanese). Afterwards, we developed the list of possible discussion and research items below. This is a META issue to track activities (incl. sub-issues) and possible action items. Pointers to discussions and important inputs (or summaries) are added at the bottom of this initial comment. During coordination, it was pointed out that updating the JLreq WG Note itself via individual issues/PRs is not a good way forward, and that having a separate document for the next character classes would make our work easier (a detailed plan will be proposed by @kidayasuo).

1. Adoption of Unicode-based definitions: could we reach an answer while discussing points 3+?
   a. Relationship between CL-01/02 and Pi, Ps / Pe, Pf, including the width of punctuation marks
   b. How it can be expanded to generic character classes like CL-19 or CL-27
   c. Character classes that are difficult to define using Unicode properties, such as unit marks, and their redefinition using Unicode line breaking properties
2. Reorganization (removal and/or unification) of JLreq character classes
   a. JWT report (2019/3) by Binn-sensei, and removal of character classes defined by context, by Kida-san
   b. Also refer to InDesign character classes, and issues from over-simplification in Illustrator
3. Inclusion/separation of marks from a functional point of view: toward guidelines for correct orthography
   a. U+2016, U+30A0
   b. Box drawing characters
   c. Marks that are not simply categorized as Kanji or Kana
   d. Composite glyphs like "ヨリ", "コト", etc.
   e. U+301C / Wave Dash, U+FF5E / Fullwidth Tilde, U+30FC / Katakana-Hiragana Prolonged Sound Mark
4. Horizontal/vertical glyphs of marks, marks having multiple code points for vertical and horizontal use, U+FFXX: in combination with font/CSS related issues
   a. "く" characters, double minute (Double Vertical Line), U+2702
   b. Makeshift plan by Murata-san, Text Shaping WG
5. Half and full width, and what punctuation marks really are: starting from how to handle spacing in punctuation marks
   a. How to describe removal of spacing in punctuation marks, and how to extend it to Unicode
   b. In real font designs, most punctuation marks are implemented as full width, so we need to consider a more appropriate description to include in JLreq as a Note(?)
   c. In reality, EAW=Narrow is not half width; is there a new sweet spot for describing character alignment in mixed text?
6. Guidelines for full-width implementation of characters whose language has no concept of "full" width
   a. Possible guideline based on full-width European characters to be used only in Japanese text
   b. How to handle/upgrade groups of characters that are usually implemented as full-width in Japanese fonts, like Greek or Cyrillic
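
As a rough illustration of the first item (the relationship between CL-01/02 and Pi, Ps / Pe, Pf), the following minimal sketch looks up the Unicode General_Category of a few brackets and quotation marks with Python's standard `unicodedata` module. The function name `candidate_class` and the mapping "Ps or Pi suggests CL-01, Pe or Pf suggests CL-02" are assumptions made for illustration only; they are not a definition taken from JLreq or from Unicode.

```python
import unicodedata

# Assumed first approximation (not a JLreq definition):
# Ps/Pi suggest an opening bracket, Pe/Pf a closing bracket.
OPENING_GC = {"Ps", "Pi"}
CLOSING_GC = {"Pe", "Pf"}


def candidate_class(ch):
    """Return a hypothetical JLreq class candidate derived from General_Category."""
    gc = unicodedata.category(ch)
    if gc in OPENING_GC:
        return "cl-01 candidate"
    if gc in CLOSING_GC:
        return "cl-02 candidate"
    return None


if __name__ == "__main__":
    # Corner brackets, fullwidth parentheses, double quotation marks, hiragana A.
    for ch in "\u300c\u300d\uff08\uff09\u201c\u201d\u3042":
        print(f"U+{ord(ch):04X}", unicodedata.category(ch), candidate_class(ch))
```

Such a mapping says nothing about the width of the marks, which is exactly the part of item 1.a that General_Category alone cannot answer.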

JL-TF meetings (agenda, notes, todos)

2020-10-20 (1st)
- agenda, minutes English / Japanese
- issues and actions from meeting: https://github.com/w3c/jlreq/issues/241
- todo
  - Nat to write up a note describing the issues (or discrepancies) that implementing JLReq layout in a globalized layout engine would cause.
  - Kida to translate the note from Eric Muller regarding expanding JLReq character classes to Unicode.
  - Binn-sensei to write an essay on the history of typesetting for Kanji numerals (to be as printed media)

Inputs in mail list or github

Inputs from Eric Muller, 2020/Oct/18, github issue #242, Japanese version by Kida-san

himorin commented 4 years ago

For point 2.a, Binn-sensei pointed out two possible baselines:

a. Define 'Character class' to describe the general layout method of an individual character, and do not include classes for specific formatting (like ruby), which should be described under each formatting method.
b. The layout between a text block with a specific layout method and the text before/after it could differ from the layout given by the original character classes. Keep additional 'Character classes' to be used for defining such points.

kidayasuo commented 4 years ago

Shimono-san, thank you very much for bringing this up and making a summary.

Expanding JLReq to Unicode, or in more generic sense making JLReq interoperable with Unicode, I think is the biggest challenge in bringing JLReq to the next level (or the next major version). It is about making it future compatible.

It is rather a complex task as Shimono-san outlined. JLReq's character class is a combination of static property of characters and the context, where the character is used. We need to separate the context from the static property. It is a major architectural conversion which requires many rewrites.

Then we would redefine JLReq character classes using Unicode character properties. There might be cases where the current Unicode properties are not sufficient to differentiate the necessary behaviours.

In the process we might find cases where JLReq can be simplified (especially because the next major version will be devoted to digital text). I also believe there will be cases where we need clearer ideas on how each character, especially symbols, is to be used. That will lead to some guideline-like descriptions in the document (we need to be careful here, because we are not in a position to define the orthography of the language).
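
The separation of context from static property described above can be sketched as two layers: a lookup that depends only on the character, and an overlay that depends on where the character is used. Everything named here (`static_class`, `effective_class`, `CONTEXT_OVERLAY`, the class labels, and the choice of warichu brackets as the example of a context-defined class) is a hypothetical illustration, not the design proposed in this thread.

```python
import unicodedata
from typing import Optional


def static_class(ch: str) -> str:
    """Class derived from the character alone, with no knowledge of context."""
    gc = unicodedata.category(ch)
    if gc in ("Ps", "Pi"):
        return "opening-bracket"
    if gc in ("Pe", "Pf"):
        return "closing-bracket"
    return "other"


# Hypothetical overlay: (context, static class) -> contextual class.
CONTEXT_OVERLAY = {
    ("warichu", "opening-bracket"): "warichu-opening-bracket",
    ("warichu", "closing-bracket"): "warichu-closing-bracket",
}


def effective_class(ch: str, context: Optional[str] = None) -> str:
    """Apply the context overlay on top of the static class, if one is defined."""
    base = static_class(ch)
    return CONTEXT_OVERLAY.get((context, base), base)


if __name__ == "__main__":
    print(effective_class("\uff08"))             # static class only
    print(effective_class("\uff08", "warichu"))  # same character, context applied
```

With this split, only `static_class` would need to be expressed in Unicode terms; the overlay stays a layout-side concern.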

kidayasuo commented 4 years ago

I proposed an online meeting to discuss Bin-sensei's proposal for separating the context classes.

macnmm commented 4 years ago

One issue I see with the idea of adopting Unicode Character Property as a descriptor for use in JLReq:

JLReq mojikumi classes (and JIS X 4051 mojikumi classes) are a grouping of characters according to spacing convention and the need to differentiate spacing rules among characters that are the same semantic type, e.g. U+FF08（ and U+300C「. Both those characters are broadly categorized as Opening Punctuation, but the spacing rules can differ, so in mojikumi classifications they are distinct. I am not sure if the intent of this proposal is to introduce such granularity into the Unicode Character Property just for the sake of supporting Japanese publishing spacing rules, but if not, then I think conversion to using them in JLReq will be a lossy conversion. Unicode unification of punctuation and certain Latin and Cyrillic and Greek characters to one code point, whereas historically in Japanese fonts such characters were distinct (and their encoding in SJIS distinct from that in ASCII), has caused a similar lossy problem when composing text in various Japanese fonts of different vintages. Some fonts have U+201C ” as a full-width SJIS-like glyph, others treat that codepoint as proportional, and the mojikumi spacing rules are different (the classes are different), yet cannot be expressed in Unicode alone.
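
The loss of granularity described above can be seen directly in the Unicode data. This minimal sketch uses Python's standard `unicodedata` module to print General_Category and East_Asian_Width for the code points named in the comment; the helper name `static_properties` is an assumption for illustration. U+FF08 and U+300C share the same General_Category (Ps), and U+201C has East_Asian_Width "A" (Ambiguous), so these properties alone do not say whether a font renders it full-width or proportional.

```python
import unicodedata


def static_properties(ch):
    """Return (General_Category, East_Asian_Width) for a single character."""
    return unicodedata.category(ch), unicodedata.east_asian_width(ch)


if __name__ == "__main__":
    for ch in ("\uff08", "\u300c", "\u201c"):
        gc, eaw = static_properties(ch)
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: gc={gc}, eaw={eaw}")
```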

xfq commented 3 years ago

Summary of the 2020-10-20 meeting: English / Japanese

xfq commented 3 years ago

Before modifying the jlreq text, we may need to be clear if our plan is to modify the existing jlreq, or to rewrite jlreq to be digital-oriented and international-oriented and use the new Unicode based definition here. If the latter is the goal, what is the general structure of the new document? (Maybe a separate issue is needed.)

himorin commented 3 years ago

> Before modifying the jlreq text, we may need to be clear if our plan is to modify the existing jlreq, or to rewrite jlreq to be digital-oriented and international-oriented and use the new Unicode based definition here.

For this, we aim to rewrite JLreq with international-oriented definitions; this modification needs to be treated as a next edition (or at least an amendment). As for digital-oriented, some items are related, but there is no solid (or even rough) idea for now.

> If the latter is the goal, what is the general structure of the new document? (Maybe a separate issue is needed.)

For this META issue, no large restructuring is planned. We expect it to introduce modifications to some (sub-)sections of the main text and appendix.

macnmm commented 3 years ago

> It is rather a complex task as Shimono-san outlined. JLReq's character class is a combination of static property of characters and the context, where the character is used. We need to separate the context from the static property. It is a major architectural conversion which requires many rewrites.

I agree that one way we can make the classifications of characters in JLReq more compatible with those in Unicode is to separate the static property so that the static properties necessary for Japanese layout can be expressed in more universal (or Unicode-compatible) terms. This would seem to mean the Unicode terms should be expanded to include such nuances for Japanese, and then for other languages with specific or unique layout rules as well.

As to revising the description of how the contextual nature of Japanese layout rules work, that would seem something that can be expressed in JLReq similarly to how they are already for traditional printing and book typography. We can expand their scope into dynamic digital layout, expressing to an international audience what informs the practices of experts of Japanese layout in any medium, for example, the role of white space between characters, between lines, and flowing around objects as it relates to text.

asmusf commented 3 years ago

I admit I have not read this w/ enough detail, but skimming this discussion it occurred to me that the problem is similar on a meta level to the Unicode IndicSyllabicCategories and IndicPositionalCategories, and of course the Unicode Vertical text properties.

How about proposing a set of categories to Unicode, with some defined as "derived" (some algorithmic combination of existing Unicode properties) and some explicitly assigned as a (more fine-grained) override?

Done that way, the end result would be a reliance on formal Unicode properties, but also, inside Unicode, the established derivations would surface any changes that might be (inadvertently) introduced by changes in the underlying Unicode properties (like general category or line break). If such properties must change in Unicode for some reason, it would be possible to adjust the derivation or attach an override to keep the layout properties unchanged. Conversely, if/when the layout properties need to be changed or corrected, that can be done by changing a derivation, changing an override, or changing an underlying Unicode property (if appropriate).

Getting this done may require that a Unicode technical report draft is created that defines the relation between standard Unicode properties and the (partially derived) layout properties.
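
The "derived plus override" idea can be sketched in a few lines. The derivation table, the single override entry, the class labels, and the function names `layout_class` and `check_stability` are all hypothetical and chosen only to show the mechanism; no such property exists in Unicode today. The class is derived from General_Category unless an explicit override exists, and a pinned snapshot is compared against the current derivation to surface any drift caused by changes in the underlying properties.

```python
import unicodedata

# Hypothetical derivation: General_Category -> layout class.
DERIVATION = {"Ps": "opening", "Pi": "opening", "Pe": "closing", "Pf": "closing"}

# Hypothetical explicit override, keyed by code point (illustrative value only).
OVERRIDES = {0x201C: "opening-proportional"}


def layout_class(ch):
    """Return the override if one exists, else the derived class, else 'other'."""
    cp = ord(ch)
    if cp in OVERRIDES:
        return OVERRIDES[cp]
    return DERIVATION.get(unicodedata.category(ch), "other")


def check_stability(snapshot):
    """Compare a pinned {code point: class} snapshot with the current result.

    Returns {code point: (pinned, current)} for every mismatch, so that a
    change in an underlying Unicode property shows up as a reported difference.
    """
    drift = {}
    for cp, pinned in snapshot.items():
        current = layout_class(chr(cp))
        if current != pinned:
            drift[cp] = (pinned, current)
    return drift


if __name__ == "__main__":
    print(layout_class("\u300c"), layout_class("\u201c"))
    print(check_stability({0x300C: "opening", 0x300D: "closing"}))
```

If `check_stability` reports a difference after a Unicode data update, the fix would be either an adjusted derivation or a new override, matching the two options described above.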

xfq commented 3 years ago

@asmusf It is indeed a good suggestion for the task force to consider (although strictly speaking it's out of the scope of "layout requirements"), and there are similar suggestions in other issues as well (see the "Classes as a Unicode property" section in #242). I think we can discuss this idea in future JLReq meetings (and/or in GitHub).

himorin commented 3 years ago

existing issues:

https://github.com/w3c/jlreq/issues/166 (same as this)
https://github.com/w3c/jlreq/issues/185 (code point vs display for punctuation)

w3c / jlreq

[META] Reorganize character classes and its adoption of Unicode based definition #240