Open wezm opened 3 months ago
Taking a look now! Thanks for the report & detail here; it's not a page that I think a lot of third-party readers have gone through yet....
So, just briefly, HarfBuzz merges the punc class in with the generic bases (gb), which would allow them to also match the more complex syllable expressions; it also merges U+104C-104F into a single syllable-modifier / bindu class that includes several things, like Shan tones, that are treated distinctly in the official MS / OTL docs. Those, therefore, match in expressions that are defined for the Shan tones and other modifiers, and don't match where the "symbol" class would (as standalone).
It's not clear to me yet if there is a need for that, or if it got rolled in for simplification. There are several issue threads / discussions from c. 2022 where the original Myanmar shaper in HarfBuzz was getting refined to be more robust (it was originally based on the Indic2 shaper, AIUI) and some of that work involved trying to trim down the overall number of codepoint classes, which was high in comparison to some of its neighbors.
I've found a few language sources to poke into if I can get my head around them, though. Because, to be honest, I started to wonder if Unicode really got it right with calling U+104C-F "punctuation" in the first place. HarfBuzz merging those in with syllable modifiers sounds more like a reasonable re-classification, rather than a "byte-saving optimization"....
Thanks for looking into it
Another query/thing I've run into. In the stage 2 initial reordering step, some characters aren't being tagged, such as those with the `NUMBER` category and some punctuation like hyphen and en dash.
As a side note, would it be more useful to drop the usage of the term "syllable"? E.g., in favor of something more technically precise, like "cluster"?
I probably went with syllable initially for reasons of new-reader-familiarity, but that does come at a cost....
> As a side note, would it be more useful to drop the usage of the term "syllable"? E.g., in favor of something more technically precise, like "cluster"?

I don't have strong feelings one way or another but https://learn.microsoft.com/en-us/typography/script-development/myanmar#analyzing-the-characters uses "syllable clusters", "character clusters", and just plain "cluster", so perhaps cluster is the more consistent choice.
I think there's an omission in the matching rules. _sm_ is unreferenced. I think that _v_* in:
Tcomplex= _asat_* Med Vmain Vpost* Pwo* _v_* Z?
should be something like (_v_ | _sm_)* to give:
Tcomplex= _asat_* Med Vmain Vpost* Pwo* (_v_ | _sm_)* Z?
For this example "င်္က္ကျြွှေို့်ာှီ့ၤဲံ့းႍ" this change would allow the last character to be matched, which it does not currently:
| U+1004 | Letter | CONSONANT | _null_ | Nga | _ra_ ⎫
| U+103A | Mark [Mn] | PURE_KILLER | TOP_POSITION | Asat | _asat_ ⎬ Kinzi (K)
| U+1039 | Mark [Mn] | INVISIBLE_STACKER | _null_ | Virama | _halant_ ⎭
| U+1000 | Letter | CONSONANT | _null_ | Ka | C
| U+1039 | Mark [Mn] | INVISIBLE_STACKER | _null_ | Virama | _halant_
| U+1000 | Letter | CONSONANT | _null_ | Ka | C
| U+103B | Mark [Mc] | CONSONANT_MEDIAL | RIGHT_POSITION | Sign Medial Ya | _my_ ⎫
| U+103C | Mark [Mc] | CONSONANT_MEDIAL | TOP_LEFT_AND_BOTTOM_POSITION | Sign Medial Ra | _mr_ ⎬ Med
| U+103D | Mark [Mn] | CONSONANT_MEDIAL | BOTTOM_POSITION | Sign Medial Wa | _mw_ ⎟
| U+103E | Mark [Mn] | CONSONANT_MEDIAL | BOTTOM_POSITION | Sign Medial Ha | _mh_ ⎭
| U+1031 | Mark [Mc] | VOWEL_DEPENDENT | LEFT_POSITION | Sign E | _matrapre_ ⎫
| U+102D | Mark [Mn] | VOWEL_DEPENDENT | TOP_POSITION | Sign I | _matraabove_ ⎟
| U+102F | Mark [Mn] | VOWEL_DEPENDENT | BOTTOM_POSITION | Sign U | _matrabelow_ ⎬ Vmain
| U+1037 | Mark [Mn] | TONE_MARKER | BOTTOM_POSITION | Dot Below | _db_ ⎟
| U+103A | Mark [Mn] | PURE_KILLER | TOP_POSITION | Asat | _asat_ ⎭
| U+102C | Mark [Mc] | VOWEL_DEPENDENT | RIGHT_POSITION | Sign Aa | _matrapost_ ⎫
| U+103E | Mark [Mn] | CONSONANT_MEDIAL | BOTTOM_POSITION | Sign Medial Ha | _mh_ ⎬ Vpost
| U+102E | Mark [Mn] | VOWEL_DEPENDENT | TOP_POSITION | Sign Ii | _matraabove_ ⎟
| U+1037 | Mark [Mn] | TONE_MARKER | BOTTOM_POSITION | Dot Below | _db_ ⎭
| U+1064 | Mark [Mc] | TONE_MARKER | RIGHT_POSITION | Tone Sgaw Karen Ke Pho | _pt_ ⎫
| U+1032 | Mark [Mn] | VOWEL_DEPENDENT | TOP_POSITION | Sign Ai | _a_ ⎟
| U+1036 | Mark [Mn] | BINDU | TOP_POSITION | Anusvara | _a_ ⎬ Pwo
| U+1037 | Mark [Mn] | TONE_MARKER | BOTTOM_POSITION | Dot Below | _db_ ⎟
| U+1038 | Mark [Mc] | VISARGA | RIGHT_POSITION | Visarga | _v_ ⎭
| U+108D | Mark [Mn] | TONE_MARKER | BOTTOM_POSITION | Sign Shan Council Emphatic Tone|
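To make the difference concrete, here is a minimal Python sketch of just the tail of Tcomplex (the `_v_* Z?` part). The one-letter codes and the helper names are made up for illustration, and it assumes, as the proposal above does, that U+108D is classified as _sm_. It is not taken from any shaper's source.

```python
import re

# Hypothetical one-letter codes for the classes in the tail of Tcomplex.
CODES = {"_v_": "v", "_sm_": "s", "Z": "z"}

# Only the tail of the expression is modelled: "_v_* Z?" before the
# change and "(_v_ | _sm_)* Z?" after it.
TAIL_BEFORE = re.compile(r"v*z?")
TAIL_AFTER = re.compile(r"[vs]*z?")


def encode(classes):
    """Turn a list of class names into a string of one-letter codes."""
    return "".join(CODES[name] for name in classes)


def matched_length(pattern, classes):
    """Return how many leading classes the pattern consumes."""
    match = pattern.match(encode(classes))
    return match.end() if match else 0


# The last two characters of the example: U+1038 (_v_), then U+108D
# (assumed here to be _sm_).
tail = ["_v_", "_sm_"]
print(matched_length(TAIL_BEFORE, tail))
print(matched_length(TAIL_AFTER, tail))
```

With the original tail only the Visarga is consumed and U+108D is left over; with the alternation both are consumed.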
That does look correct; I am trying to untangle some other differences between the MS and HB regex categories, though. Sorry I've been less responsive here for a bit; just juggling some other things. Hope to have an update worth looking at shortly. I just don't want to mangle some of the changed bits without understanding why some of the other shapers are doing something different than they did when this was written.
Question: How does the Allsorts team view the combining of categories, in general? The fact that HarfBuzz does that is one of the reasons it can take a minute to get back up to speed when comparing its regular expressions to the MS script docs' (which don't do that).
It's certainly practical for implementers, no doubt. There might be some middle-of-the-road approach for documenting, like just combining classes that are purely sets of individual characters, but not combining sets of expressions unless they really simplify the final syllable/cluster-matching expressions.
> Question: How does the Allsorts team view the combining of categories, in general?

Do you mean things like this?
_consonant_ = `CONSONANT` | `CONSONANT_PLACEHOLDER` - _ra_
and
C = _consonant_ | _ra_
If so, I'm not sure there are strong feelings one way or another. It probably does help the implementation be a bit more readable.
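For what it's worth, the two definitions above can be sketched with plain set operations. This is only an illustration: the code points are a tiny hand-picked sample (Ka and Nga come from the example table, the placeholder member is hypothetical), and it assumes the difference applies to the whole union rather than only to `CONSONANT_PLACEHOLDER`.

```python
# Tiny illustrative samples; the real sets come from the character tables.
CONSONANT = {0x1000, 0x1004}        # Ka, Nga
CONSONANT_PLACEHOLDER = {0x25CC}    # hypothetical member, for illustration
RA = {0x1004}                       # Nga, tagged _ra_ in the example table

# _consonant_ = CONSONANT | CONSONANT_PLACEHOLDER - _ra_
# Assumed reading: subtract _ra_ from the union of the two categories.
consonant = (CONSONANT | CONSONANT_PLACEHOLDER) - RA

# C = _consonant_ | _ra_
C = consonant | RA

print(sorted(hex(cp) for cp in consonant))
print(sorted(hex(cp) for cp in C))
```

Under that reading, `C` ends up identical to the plain union of the two categories, so the subtraction only matters where _consonant_ is used on its own.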
Yeah, a more apropos phrasing would probably be just saying "if any of the category combining causes trip-ups or is confusing, please consider that a bug". In particular, here I was wondering about the subtraction of the _ra_ class from the _consonant_ class. I think that might be the only place I attempted a "difference" operator; coming back to it after some time had elapsed I can't recall why that seemed like a good idea.
So, there are going to be a couple of changes required for sure. One is that the regular Visarga codepoint actually matches both the _v_ and _sm_ sets, which is somewhat harmless but a bit confusing for the reader. So I would just drop _v_ and put in a category match in the _sm_ definition to capture the other `VISARGA`-class codepoint(s). So, in your example above, you could just use _sm_ and not worry about the OR.
The one place where I'm not sure that handling explicit Visarga with other _sm_ codepoints wouldn't cause problems is in Sanskrit, because the Vedic Extensions block has some overstruck visarga-related signs, and I don't know if those are classified right in the table. Although HarfBuzz seems not to worry about that....
HarfBuzz also updated its medial logic in response to https://github.com/w3c/font-text-cg/issues/43#issuecomment-922914048 to separate Medial Mon La, which is currently grouped in with Medial Ha, but can behave differently. I think that would just be as simple as adding _ml_ = U+1060 and changing Med to _my_? _asat_? _mr_? ( (_mw_ _mh_? _ml_? | _mh_ _ml_? | _ml_) _asat_?)?
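As a quick sanity check of that proposed Med expression, here is a minimal Python sketch that transliterates it into a regular expression over one-letter class codes. The codes and the `is_med` helper are invented for the illustration; only the shape of the expression comes from the proposal above.

```python
import re

# Hypothetical one-letter codes for the medial-related classes.
CODES = {
    "_my_": "y",
    "_asat_": "a",
    "_mr_": "r",
    "_mw_": "w",
    "_mh_": "h",
    "_ml_": "l",
}

# _my_? _asat_? _mr_? ( (_mw_ _mh_? _ml_? | _mh_ _ml_? | _ml_) _asat_?)?
MED = re.compile(r"y?a?r?(?:(?:wh?l?|hl?|l)a?)?")


def is_med(classes):
    """Return True if the whole class sequence matches the Med expression."""
    encoded = "".join(CODES[name] for name in classes)
    return MED.fullmatch(encoded) is not None


print(is_med(["_my_", "_mr_", "_mw_", "_mh_"]))  # medials from the example table
print(is_med(["_ml_"]))                           # Medial Mon La on its own
print(is_med(["_mh_", "_mw_"]))                   # Ha before Wa
```

The first two sequences are accepted and the third is rejected, since the expression only allows _mh_ after _mw_, not before it.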
And I also think I should revisit the merging of some of the non-complex character sets; since _punc_ is unused, that should be fixed, etc. The MS docs have a few more things that are classified as PUNCTUATION here in with Symbols, but HarfBuzz considers them Generic Bases. Probably doesn't matter, but it might be easier reading with a little cleanup.
> here I was wondering about the subtraction of the _ra_ class from the _consonant_ class. I think that might be the only place I attempted a "difference" operator; coming back to it after some time had elapsed I can't recall why that seemed like a good idea.

I will admit that I missed the subtraction initially. Also it's a little curious that _consonant_ subtracts _ra_ but _consonant_ isn't used aside from in C, which adds _ra_ back.
> I will admit that I missed the subtraction initially. Also it's a little curious that _consonant_ subtracts _ra_ but _consonant_ isn't used aside from in C, which adds _ra_ back.

I got that initially from the Microsoft documentation and, at the time, HarfBuzz was following it perhaps a bit more closely.
I think that the intent was likely that you would need to have a different definition of "consonant" for the regular expressions than you would use within a consonant-based syllable as you're identifying the base in shaping-stage 1. But that might not be necessary anymore (and perhaps was not necessary then, either...). But I'm rereading it again now to see if I still get it. Since the Kinzi sequence is not ambiguous, the base-finding algorithm can just match it without needing different classes. The consonant-placeholders I'm not as sure about, though.
@wezm I pushed an update to the identification classes and regular expressions in stage-1, in PR #168. When you have a moment, please take a look and let me know. I retained the _v_ class for `VISARGA` and merged it in with the existing _sm_ class, rather than just adding visarga to the _sm_ group, because of the Vedic Extensions visargas.
I also added a merge class _G_ that lumps the punctuation class in with the generic bases and the digits; that's the route that HarfBuzz takes as well, and it's slightly simpler than treating the punctuation separately.... Since Microsoft has a different split of symbol vs punctuation, I figure if the simple approach works, it's worth a try.
I haven't attempted an implementation yet but the changes look good.
I'm working on updating my implementation. One thing I encountered (this isn't new) is in:

> All of the left-side dependent-vowel (matra) signs matching this condition in Myanmar can be identified using the matrapre regular-expression class defined in stage 1.

_matrapre_ is defined as _matrapre_ = `MATRA` & `LEFT_POSITION`; however, `MATRA` (as a class) isn't defined anywhere. Does it equate to the `VOWEL_DEPENDENT` shaping class in the character tables? (That's what I've been doing so far.)
Some more notes as I progress the implementation:
I noticed a difference between our output and HarfBuzz. This was due to HarfBuzz applying `rlig` by default. I note that in the default shaping model docs this is enabled as part of the default set. Perhaps it should be added to the Myanmar docs too.
_punc_ only includes Section and Little section. Should it also include the other characters in the PUNCTUATION Unicode category listed in the Myanmar character tables? Such as:
* ၌ Locative
* ၍ Completed
* ၎ Aforementioned
* ၏ Genitive
Should characters with shaping class `NUMBER` be able to be base?
Example: ႐ုံ
* U+1090 // MYANMAR SHAN DIGIT ZERO
* U+102F // MYANMAR VOWEL SIGN U, Mark, Bottom
* U+1036 // MYANMAR SIGN ANUSVARA, Mark, Top

U+1090 is matched by (C | _vowel_ | G) as part of G. Given the position in the expression it seems like the intent is that all these can be the base consonant. However, in the reordering step only characters with shaping class `CONSONANT` are considered when determining what to assign `POS_BASE_CONSONANT` to.
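To illustrate the mismatch, here is a rough Python sketch of a base search that only looks at the `CONSONANT` shaping class. The `find_base` helper is hypothetical, and picking the first `CONSONANT` in the cluster is an assumption made for the sketch rather than something the docs state.

```python
def find_base(shaping_classes):
    """Return the index of the first CONSONANT in the cluster, or None."""
    for index, shaping_class in enumerate(shaping_classes):
        if shaping_class == "CONSONANT":
            return index
    return None


# Shaping classes for the example cluster: U+1090, U+102F, U+1036.
shan_zero_cluster = ["NUMBER", "VOWEL_DEPENDENT", "BINDU"]

# The same marks on a consonant such as U+1000, for comparison.
ka_cluster = ["CONSONANT", "VOWEL_DEPENDENT", "BINDU"]

print(find_base(shan_zero_cluster))
print(find_base(ka_cluster))
```

The digit-initial cluster matches the syllable expression through G but ends up with no character tagged as the base, whereas the consonant-initial one gets index 0.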
> Some more notes as I progress the implementation:
> I noticed a difference between our output and HarfBuzz. This was due to HarfBuzz applying `rlig` by default. I note that in the default shaping model docs this is enabled as part of the default set. Perhaps it should be added to the Myanmar docs too.

Yeah; if it's found in the text corpus then I'd concur that it ought to be documented (and for the other shapers, too). I think it feels off to document some of those other always-on-in-HB features as necessary, though, in particular the ones that deal with cursive styling. I guess the argument is that if the type designer puts a `curs` or `calt` feature in, then it's most likely necessary for getting the correct output back out.
Because `rlig` and `rclt` are on-by-default in the MS spec and not meant to be exposed in the UI, they pretty much have to be handled in the default path as non-optional. But it feels to me like it strays somewhat from the mission of describing how script-specific shaping is performed to simply drop in a list of features to be applied universally in just-in-case fashion. But maybe I'm overthinking it; adding a `Note:` to explain it might be all that's required.
> _punc_ only includes Section and Little section. Should it also include the other characters in the PUNCTUATION ~shaping class~ Unicode category listed in the Myanmar character tables? Such as:
> * ၌ Locative
> * ၍ Completed
> * ၎ Aforementioned
> * ၏ Genitive

I'm looking into this. The ၎ Aforementioned seems to be a different animal, at least in Burmese, but I'd like to get a bit of info about some of the other languages.
> Should characters with shaping class `NUMBER` be able to be base?
> Example: ႐ုံ
> * U+1090 // MYANMAR SHAN DIGIT ZERO
> * U+102F // MYANMAR VOWEL SIGN U, Mark, Bottom
> * U+1036 // MYANMAR SIGN ANUSVARA, Mark, Top
>
> U+1090 is matched by (C | _vowel_ | G) as part of G. Given the position in the expression it seems like the intent is that all these can be the base consonant. However in the reordering step only characters with shaping class `CONSONANT` are considered when determining what to assign `POS_BASE_CONSONANT` to.

This is a tricky one. I believe that the inclusion of numbers in the syllable expression is needed in order to handle ordinal-number syllables (`1st`, etc). But I wouldn't expect that a numeral would get medial or subjoined consonants attached.
I'm working on Myanmar shaping in Allsorts and have a query about how punctuation should be handled in syllable splitting. There are these punctuation characters in the Myanmar character tables but they don't seem to be matched by any rules.
U+104A
U+104B
U+104C
U+104D
U+104F
I've run my implementation against this text "ပို၍စောစီးစွာပေးပါက" and ၍ is tripping it up. It has no shaping class/rules that match it in the syllable identification details.
There are these two notes though:
and
I'm wondering how these characters should be handled, since their use doesn't feel like a broken expression?
One other note: ။ and ၊ are referenced in the non-terminal `_punc_ = "Little Section" | "Section"`; however, `_punc_` does not appear to be used. Wondering if that's intended?
Edit: I see the following on the OpenType Myanmar page:
Which suggests ၍ and friends should be accepted as a cluster by themselves.
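A rough Python sketch of what I have in mind: when the syllable matcher finds nothing at a position, emit that single character as its own cluster instead of treating the run as broken. The matcher here is a deliberately crude stand-in (a letter in U+1000..U+1021 followed by combining marks) and is nothing like the real expressions; both function names are made up for the illustration.

```python
import unicodedata


def toy_syllable_length(text, start):
    """Crude stand-in for the real matcher; returns 0 when nothing matches."""
    if not (0x1000 <= ord(text[start]) <= 0x1021):
        return 0
    end = start + 1
    # Consume any combining marks that follow the letter.
    while end < len(text) and unicodedata.category(text[end]).startswith("M"):
        end += 1
    return end - start


def split_clusters(text, match_length=toy_syllable_length):
    """Split text into clusters, falling back to one character per cluster."""
    clusters = []
    pos = 0
    while pos < len(text):
        # A zero-length match means "no syllable here": take one character.
        length = match_length(text, pos) or 1
        clusters.append(text[pos:pos + length])
        pos += length
    return clusters


# U+1015 U+102D U+102F, then U+104D (the punctuation), then U+1005.
sample = "\u1015\u102d\u102f\u104d\u1005"
print([[hex(ord(ch)) for ch in cluster] for cluster in split_clusters(sample)])
```

With this fallback, U+104D comes out as a one-character cluster between the two syllables rather than stalling the splitter.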