n8willis / opentype-shaping-documents

Documentation of OpenType shaping behavior

[Myanmar] Syllable matching and punctuation #164

Open wezm opened 3 months ago

wezm commented 3 months ago

I'm working on Myanmar shaping in Allsorts and have a query about how punctuation should be handled in syllable splitting. There are these punctuation characters in the Myanmar character tables but they don't seem to be matched by any rules.

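| Codepoint | Unicode category | Shaping class | Mark-placement subclass | Glyph            |
|:----------|:-----------------|:--------------|:------------------------|:-----------------|
| U+104A    | Punctuation      | null          | null                    | ၊ Little Section |
| U+104B    | Punctuation      | null          | null                    | ။ Section        |
| U+104C    | Punctuation      | null          | null                    | ၌ Locative       |
| U+104D    | Punctuation      | null          | null                    | ၍ Completed      |
| U+104F    | Punctuation      | null          | null                    | ၏ Genitive       |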

I've run my implementation against this text "ပို၍စောစီးစွာပေးပါက" and ၍ is tripping it up. It has no shaping class/rules that match it in the syllable identification details.

There are these two notes though:

Assigned codepoints with a null in the Shaping class column evoke no special behavior from the shaping engine.

and

A sequence that does not match any of these expressions should be regarded as broken. The shaping engine may make a best-effort attempt to shape the broken sequence, but making guarantees about the correctness or appearance of the final result is out of scope for this document.

I'm wondering how these characters should be handled, since their use doesn't feel like a broken expression?

One other note: ။ and ၊ are referenced in the non-terminal `_punc_ = "Little Section" | "Section"`; however, `_punc_` does not appear to be used. Wondering if that's intended?

Edit: I see the following on the OpenType Myanmar page:

Simple non-compounding cluster

<P | S | R | WJ| WS | O | D0 >

Punctuation (P), symbols (S), reserved characters from the Myanmar block (R), word joiner (WJ), white space (WS), and other SCRIPT_COMMON characters (O) contain one character per cluster.

Which suggests ၍ and friends should be accepted as clusters by themselves.
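To make that reading concrete, here is a minimal sketch of what "one character per cluster" could look like for these code points. The function and set names are hypothetical (they are not from Allsorts, HarfBuzz, or the shaping documents), and everything that is not one of the listed punctuation characters is simply passed through as an unsplit run.

```python
# Hypothetical sketch; names are illustrative, not from Allsorts or HarfBuzz.

# Code points listed as Punctuation in the table earlier in this thread.
STANDALONE_PUNCTUATION = {0x104A, 0x104B, 0x104C, 0x104D, 0x104F}


def split_standalone(text):
    """Split text so that each punctuation character is its own cluster.

    Runs of other characters are returned unsplit; a real shaper would
    hand them to the full syllable-matching expressions instead.
    """
    clusters = []
    pending = ""
    for ch in text:
        if ord(ch) in STANDALONE_PUNCTUATION:
            if pending:
                clusters.append(pending)
                pending = ""
            clusters.append(ch)  # one character per cluster
        else:
            pending += ch
    if pending:
        clusters.append(pending)
    return clusters
```

With the start of the test string, `split_standalone("\u1015\u102D\u102F\u104D\u1005")` yields three pieces, with U+104D on its own in the middle.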

n8willis commented 3 months ago

Taking a look now! Thanks for the report & detail here; it's not a page that I think a lot of third-party readers have gone through yet....

n8willis commented 3 months ago

So, just briefly, HarfBuzz merges the punc class in with the generic bases (gb), which would allow them to also match the more complex syllable expressions; it also merges U+104C-104F into a single syllable-modifier / bindu class that includes several things, like Shan tones, that are treated distinctly in the official MS / OTL docs. Those, therefore, match in expressions that are defined for the Shan tones and other modifiers, and don't match where the "symbol" class would (as standalone).

It's not clear to me yet if there is a need for that, or if it got rolled in for simplification. There are several issue threads / discussions from c. 2022 where the original Myanmar shaper in HarfBuzz was getting refined to be more robust (it was originally based on the Indic2 shaper, AIUI) and some of that work involved trying to trim down the overall number of codepoint classes, which was high in comparison to some of its neighbors.

I've found a few language sources to poke into if I can get my head around them, though. Because, to be honest, I started to wonder if Unicode really got it right with calling U+104C-F "punctuation" in the first place. HarfBuzz merging those in with syllable modifiers sounds more like a reasonable re-classification, rather than a "byte-saving optimization"....

wezm commented 3 months ago

Thanks for looking into it

wezm commented 3 months ago

Another query/thing I've run into: in the stage 2 initial reordering step, some characters aren't being tagged, such as those with the NUMBER category and some punctuation like hyphen and en dash.

n8willis commented 3 months ago

As a side note, would it be more useful to drop the usage of the term "syllable"? E.g., in favor of something more technically precise, like "cluster"?

I probably went with syllable initially for reasons of new-reader-familiarity, but that does come at a cost....

wezm commented 3 months ago

As a side note, would it be more useful to drop the usage of the term "syllable"? E.g., in favor of something more technically precise, like "cluster"?

I don't have strong feelings one way or another, but https://learn.microsoft.com/en-us/typography/script-development/myanmar#analyzing-the-characters uses "syllable clusters", "character clusters", and just plain "cluster", so perhaps cluster is the more consistent choice.

wezm commented 2 months ago

I think there's an omission in the matching rules. _sm_ is unreferenced. I think that _v_* in:

Tcomplex= _asat_* Med Vmain Vpost* Pwo* _v_* Z?

should be something like (_v_ | _sm_)* to give:

Tcomplex= _asat_* Med Vmain Vpost* Pwo* (_v_ | _sm_)* Z?
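A quick way to see the difference is to model just the tail of the expression over class tokens. This is only a sketch: the single-letter encoding is made up for illustration, and it assumes U+108D belongs to _sm_.

```python
import re

# Hypothetical single-letter stand-ins for the classes:
#   v = _v_ (Visarga), s = _sm_, Z = Z
CURRENT_TAIL = re.compile(r"v*Z?")         # ... _v_* Z?
PROPOSED_TAIL = re.compile(r"(?:v|s)*Z?")  # ... (_v_ | _sm_)* Z?


def matched_length(pattern, classes):
    """Return how many leading class tokens the pattern consumes."""
    m = pattern.match(classes)
    return m.end() if m else 0


# Tail of the example: U+1038 (_v_) followed by U+108D (assumed _sm_).
tail = "vs"
current = matched_length(CURRENT_TAIL, tail)    # stops before the final token
proposed = matched_length(PROPOSED_TAIL, tail)  # consumes both tokens
```

Here `current` covers only the Visarga, while `proposed` covers the whole tail.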

For this example "င်္က္ကျြွှေို့်ာှီ့ၤဲံ့းႍ" this change would allow the last character to be matched, which it does not currently:

        | U+1004 | Letter    | CONSONANT         | _null_                       | Nga                    |  _ra_         ⎫
        | U+103A | Mark [Mn] | PURE_KILLER       | TOP_POSITION                 | Asat                   |  _asat_       ⎬ Kinzi (K)
        | U+1039 | Mark [Mn] | INVISIBLE_STACKER | _null_                       | Virama                 |  _halant_     ⎭
        | U+1000 | Letter    | CONSONANT         | _null_                       | Ka                     |  C
        | U+1039 | Mark [Mn] | INVISIBLE_STACKER | _null_                       | Virama                 |  _halant_
        | U+1000 | Letter    | CONSONANT         | _null_                       | Ka                     |  C
        | U+103B | Mark [Mc] | CONSONANT_MEDIAL  | RIGHT_POSITION               | Sign Medial Ya         |  _my_         ⎫
        | U+103C | Mark [Mc] | CONSONANT_MEDIAL  | TOP_LEFT_AND_BOTTOM_POSITION | Sign Medial Ra         |  _mr_         ⎬ Med
        | U+103D | Mark [Mn] | CONSONANT_MEDIAL  | BOTTOM_POSITION              | Sign Medial Wa         |  _mw_         ⎟
        | U+103E | Mark [Mn] | CONSONANT_MEDIAL  | BOTTOM_POSITION              | Sign Medial Ha         |  _mh_         ⎭
        | U+1031 | Mark [Mc] | VOWEL_DEPENDENT   | LEFT_POSITION                | Sign E                 |  _matrapre_   ⎫
        | U+102D | Mark [Mn] | VOWEL_DEPENDENT   | TOP_POSITION                 | Sign I                 |  _matraabove_ ⎟
        | U+102F | Mark [Mn] | VOWEL_DEPENDENT   | BOTTOM_POSITION              | Sign U                 |  _matrabelow_ ⎬ Vmain
        | U+1037 | Mark [Mn] | TONE_MARKER       | BOTTOM_POSITION              | Dot Below              |  _db_         ⎟
        | U+103A | Mark [Mn] | PURE_KILLER       | TOP_POSITION                 | Asat                   |  _asat_       ⎭
        | U+102C | Mark [Mc] | VOWEL_DEPENDENT   | RIGHT_POSITION               | Sign Aa                |  _matrapost_  ⎫
        | U+103E | Mark [Mn] | CONSONANT_MEDIAL  | BOTTOM_POSITION              | Sign Medial Ha         |  _mh_         ⎬ Vpost
        | U+102E | Mark [Mn] | VOWEL_DEPENDENT   | TOP_POSITION                 | Sign Ii                |  _matraabove_ ⎟
        | U+1037 | Mark [Mn] | TONE_MARKER       | BOTTOM_POSITION              | Dot Below              |  _db_         ⎭
        | U+1064 | Mark [Mc] | TONE_MARKER       | RIGHT_POSITION               | Tone Sgaw Karen Ke Pho |  _pt_         ⎫
        | U+1032 | Mark [Mn] | VOWEL_DEPENDENT   | TOP_POSITION                 | Sign Ai                |  _a_          ⎟
        | U+1036 | Mark [Mn] | BINDU             | TOP_POSITION                 | Anusvara               |  _a_          ⎬ Pwo
        | U+1037 | Mark [Mn] | TONE_MARKER       | BOTTOM_POSITION              | Dot Below              |  _db_         ⎟
        | U+1038 | Mark [Mc] | VISARGA           | RIGHT_POSITION               | Visarga                |  _v_          ⎭
        | U+108D | Mark [Mn] | TONE_MARKER       | BOTTOM_POSITION              | Sign Shan Council Emphatic Tone|

n8willis commented 2 months ago

That does look correct; I am trying to untangle some other differences between the MS and HB regex categories, though. Sorry I've been less responsive here for a bit; just juggling some other things. Hope to have an update worth looking at shortly. I just don't want to mangle some of the changed bits without understanding why some of the other shapers are doing something different than they did when this was written.

n8willis commented 2 months ago

Question: How does the Allsorts team view the combining of categories, in general? The fact that HarfBuzz does that is one of the reasons it can take a minute to get back up to speed when comparing its regular expressions to the MS script docs' (which don't do that).

It's certainly practical for implementers, no doubt. There might be some middle-of-the-road approach for documenting, like just combining classes that are purely sets of individual characters, but not combining sets of expressions unless they really simplify the final syllable/cluster-matching expressions.

wezm commented 2 months ago

Question: How does the Allsorts team view the combining of categories, in general?

Do you mean things like this?

_consonant_     = `CONSONANT` | `CONSONANT_PLACEHOLDER` - _ra_

and 

C   = _consonant_ | _ra_

If so, I'm not sure there are strong feelings one way or another. It probably does help the implementation be a bit more readable.

n8willis commented 2 months ago

Yeah, a more apropos phrasing would probably be just saying "if any of the category combining causes trip-ups or is confusing, please consider that a bug". In particular, here I was wondering about the subtraction of the _ra_ class from the _consonant_ class. I think that might be the only place I attempted a "difference" operator; coming back to it after some time had elapsed I can't recall why that seemed like a good idea.

n8willis commented 2 months ago

So, there are going to be a couple of changes required for sure. One is that the regular Visarga codepoint actually matches both the _v_ and _sm_ sets, which is somewhat harmless but a bit confusing for the reader. So I would just drop _v_ and put in a category match in the _sm_ definition to capture the other VISARGA-class codepoint(s). So, in your example above, you could just use _sm_ and not worry about the OR.

The one place where I'm not sure that handling explicit Visarga with other _sm_ codepoints wouldn't cause problems is in Sanskrit, because the Vedic Extensions has some overstruck visarga-related signs, and I don't know if those are classified right in the table. Although HarfBuzz seems not to worry about that....

HarfBuzz also updated its medial logic in response to https://github.com/w3c/font-text-cg/issues/43#issuecomment-922914048 to separate Medial Mon La, which is currently grouped in with Medial Ha, but can behave differently. I think that would just be as simple as a _ml_ = U+1060 and changing Med to _my_? _asat_? _mr_? ( (_mw_ _mh_? _ml_? | _mh_ _ml_? | _ml_) _asat_?)?
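As a rough check of that expression, it can be transcribed into an ordinary regex over class tokens. This is a sketch under assumptions: the single-letter encoding and the `is_med` helper are hypothetical, and the code points are the ones named in this thread.

```python
import re

# Hypothetical single-letter encoding of the classes in the Med expression.
CLASS_OF = {
    0x103B: "Y",  # _my_   Sign Medial Ya
    0x103C: "R",  # _mr_   Sign Medial Ra
    0x103D: "W",  # _mw_   Sign Medial Wa
    0x103E: "H",  # _mh_   Sign Medial Ha
    0x1060: "L",  # _ml_   Medial Mon La (the proposed separate class)
    0x103A: "A",  # _asat_ Asat
}

# _my_? _asat_? _mr_? ( (_mw_ _mh_? _ml_? | _mh_ _ml_? | _ml_) _asat_?)?
MED = re.compile(r"Y?A?R?(?:(?:WH?L?|HL?|L)A?)?")


def is_med(codepoints):
    """Return True if the whole code point sequence matches Med."""
    # Anything outside the table maps to "?", which the regex never accepts.
    classes = "".join(CLASS_OF.get(cp, "?") for cp in codepoints)
    return MED.fullmatch(classes) is not None
```

For instance, `is_med([0x103E, 0x1060])` (Ha then Mon La) is accepted, while the reverse order `is_med([0x1060, 0x103E])` is not.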

And I also think I should revisit the merging of some of the non-complex character sets; since _punc_ is unused that should be fixed etc. The MS docs have a few more things that are classified as PUNCTUATION here in with Symbols, but HarfBuzz considers them Generic Bases. Probably doesn't matter, but it might be easier reading with a little cleanup.

wezm commented 2 months ago

here I was wondering about the subtraction of the _ra_ class from the _consonant_ class. I think that might be the only place I attempted a "difference" operator; coming back to it after some time had elapsed I can't recall why that seemed like a good idea.

I will admit that I missed the subtraction initially. Also it's a little curious that _consonant_ subtracts _ra_ but _consonant_ isn't used aside from in C, which adds _ra_ back.
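Spelled out with sets, the round trip looks like this. It is a sketch only: the code points are a small illustrative sample rather than the full character tables, the membership of `RA` and `CONSONANT_PLACEHOLDER` is an assumption, and the parentheses reflect my reading of the definition as "(CONSONANT | CONSONANT_PLACEHOLDER) minus _ra_".

```python
# Illustrative samples only; the real sets come from the character tables.
CONSONANT = {0x1000, 0x1004, 0x101B, 0x105A}  # Ka plus the assumed _ra_ members
CONSONANT_PLACEHOLDER = {0x00A0, 0x25CC}      # assumed placeholder code points
RA = {0x1004, 0x101B, 0x105A}                 # assumed membership of _ra_

# _consonant_ = CONSONANT | CONSONANT_PLACEHOLDER - _ra_
# Python's "-" binds tighter than "|", so the grouping has to be explicit.
consonant = (CONSONANT | CONSONANT_PLACEHOLDER) - RA

# C = _consonant_ | _ra_
C = consonant | RA

# Adding _ra_ back undoes the subtraction whenever _ra_ is a subset of the union.
round_trips = C == (CONSONANT | CONSONANT_PLACEHOLDER)
```

So as long as _consonant_ is only ever used inside C, the subtraction has no observable effect on what C matches.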

n8willis commented 1 month ago

I will admit that I missed the subtraction initially. Also it's a little curious that _consonant_ subtracts _ra_ but _consonant_ isn't used aside from in C, which adds _ra_ back.

I got that initially from the Microsoft documentation and, at the time, HarfBuzz was following it perhaps a bit more closely.

I think that the intent was likely that you would need to have a different definition of "consonant" for the regular expressions than you would use within a consonant-based syllable as you're identifying the base in shaping-stage 1. But that might not be necessary anymore (and perhaps was not necessary then, either...). But I'm rereading it again now to see if I still get it. Since the Kinzi sequence is not ambiguous, the base-finding algorithm can just match it without needing different classes. The consonant-placeholders I'm not as sure about, though.

n8willis commented 1 month ago

@wezm I pushed an update to the identification classes and regular expressions in stage-1, in PR #168. When you have a moment, please take a look and let me know. I retained the _v_ class for VISARGA and merged it in with the existing _sm_ class, rather than just adding visarga to the _sm_ group, because of the Vedic Extensions visargas.

I also added a merge class _G_ that lumps the punctuation class in with the generic bases and the digits; that's the route that HarfBuzz takes as well, and it's slightly simpler than treating the punctuation separately.... Since Microsoft has a different split of symbol vs punctuation, I figure if the simple approach works, it's worth a try.

wezm commented 3 weeks ago

I haven't attempted an implementation yet but the changes look good.

wezm commented 3 weeks ago

I'm working on updating my implementation. One thing I encountered (this isn't new) is in:

All of the left-side dependent-vowel (matra) signs matching this condition in Myanmar can be identified using the _matrapre_ regular-expression class defined in stage 1.

_matrapre_ is defined as `_matrapre_ = MATRA & LEFT_POSITION`; however, `MATRA` (as a class) isn't defined anywhere. Does it equate to the `VOWEL_DEPENDENT` shaping class in the character tables? (That's what I've been doing so far.)
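For reference, this is roughly the interpretation I've been using, as a sketch. It assumes `MATRA` means the `VOWEL_DEPENDENT` shaping class; the table is a handful of rows copied from the example earlier in this thread, and the names are hypothetical.

```python
# (shaping class, mark-placement subclass) for a few rows copied from the
# table earlier in this thread; not the full character tables.
CHARACTER_TABLE = {
    0x1031: ("VOWEL_DEPENDENT", "LEFT_POSITION"),    # Sign E
    0x102D: ("VOWEL_DEPENDENT", "TOP_POSITION"),     # Sign I
    0x102F: ("VOWEL_DEPENDENT", "BOTTOM_POSITION"),  # Sign U
    0x102C: ("VOWEL_DEPENDENT", "RIGHT_POSITION"),   # Sign Aa
    0x103C: ("CONSONANT_MEDIAL", "TOP_LEFT_AND_BOTTOM_POSITION"),  # Sign Medial Ra
}


def is_matrapre(cp):
    """_matrapre_ = MATRA & LEFT_POSITION, assuming MATRA == VOWEL_DEPENDENT."""
    return CHARACTER_TABLE.get(cp) == ("VOWEL_DEPENDENT", "LEFT_POSITION")
```

Under that assumption only U+1031 qualifies among these rows; Sign Medial Ra has a left component but is excluded because its shaping class is `CONSONANT_MEDIAL`.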

wezm commented 2 weeks ago

Some more notes as I progress the implementation:

I noticed a difference between our output and HarfBuzz. This was due to HarfBuzz applying rlig by default. I note that in the default shaping model docs this is enabled as part of the default set. Perhaps it should be added to the Myanmar docs too.

_punc_ only includes Section and Little Section. Should it also include the other characters in the PUNCTUATION ~shaping class~ Unicode category listed in the Myanmar character tables? Such as:

* ၌ Locative

* ၍ Completed

* ၎ Aforementioned

* ၏ Genitive

wezm commented 2 weeks ago

Should characters with shaping class NUMBER be able to be base?

Example: ႐ုံ

U+1090 is matched by (C | _vowel_ | G) as part of G. Given the position in the expression it seems like the intent is that all these can be the base consonant. However in the reordering step only characters with shaping class CONSONANT are considered when determining what to assign POS_BASE_CONSONANT to.
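The mismatch can be shown with a small sketch. It is not the documented base-finding algorithm: "first eligible character" is a simplifying assumption, `find_base` is a hypothetical helper, and the class table only holds the code points discussed here.

```python
# Shaping classes for the code points discussed in this thread.
SHAPING_CLASS = {
    0x1000: "CONSONANT",        # Ka
    0x1090: "NUMBER",           # Shan Digit Zero
    0x102F: "VOWEL_DEPENDENT",  # Sign U
    0x1036: "BINDU",            # Anusvara
}


def find_base(cluster, base_classes=("CONSONANT",)):
    """Return the index of the first character eligible to be the base, or None."""
    for i, cp in enumerate(cluster):
        if SHAPING_CLASS.get(cp) in base_classes:
            return i
    return None


example = [0x1090, 0x102F, 0x1036]

# Only CONSONANT is considered, so the digit-led cluster has no base.
strict = find_base(example)
# Also allowing NUMBER picks the digit at index 0.
relaxed = find_base(example, base_classes=("CONSONANT", "NUMBER"))
```

With the CONSONANT-only rule the cluster matches the expression but ends up with nothing tagged as POS_BASE_CONSONANT, which is the situation described above.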

n8willis commented 4 days ago

Some more notes as I progress the implementation:

I noticed a difference between our output and HarfBuzz. This was due to HarfBuzz applying rlig by default. I note that in the default shaping model docs this is enabled as part of the default set. Perhaps it should be added to the Myanmar docs too.

Yeah; if it's found in the text corpus then I'd concur that it ought to be documented (and for the other shapers, too). I think it feels off to document some of those other always-on-in-HB features as necessary, though, in particular the ones that deal with cursive styling. I guess the argument is that if the type designer puts a curs or calt feature in, then it's most likely necessary for getting the correct output back out.

Because rlig and rclt are on-by-default in the MS spec and not meant to be exposed in the UI, they pretty much have to be handled in the default path as non-optional. But it feels to me like it strays somewhat from the mission of describing how script-specific shaping is performed to simply drop in a list of features to be applied universally in just-in-case fashion. But maybe I'm overthinking it; adding a Note: to explain it might be all that's required.

n8willis commented 4 days ago

_punc_ only includes Section and Little Section. Should it also include the other characters in the PUNCTUATION ~shaping class~ Unicode category listed in the Myanmar character tables? Such as:

* ၌ Locative

* ၍ Completed

* ၎ Aforementioned

* ၏ Genitive

I'm looking into this. The ၎ Aforementioned seems to be a different animal, at least in Burmese, but I'd like to get a bit of info about some of the other languages.

n8willis commented 3 days ago

Should characters with shaping class NUMBER be able to be base?

Example: ႐ုံ

* U+1090 // MYANMAR SHAN DIGIT ZERO

* U+102F // MYANMAR VOWEL SIGN U, Mark, Bottom

* U+1036 // MYANMAR SIGN ANUSVARA, Mark, Top

U+1090 is matched by (C | _vowel_ | G) as part of G. Given the position in the expression it seems like the intent is that all these can be the base consonant. However in the reordering step only characters with shaping class CONSONANT are considered when determining what to assign POS_BASE_CONSONANT to.

This is a tricky one. I believe that the inclusion of numbers in the syllable expression is needed in order to handle ordinal-number syllables (1st, etc). But I wouldn't expect that a numeral would get medial or subjoined consonants attached.