w3c / font-text-cg

UTN#11 versus OpenType Myanmar shaping #43

Open simoncozens opened 3 years ago

simoncozens commented 3 years ago

UTN#11 ("Representing Myanmar in Unicode") specifies a suggested canonical order for storing syllabic elements, as well as some fairly sensible constraints on the syllable structure. The OpenType Myanmar shaper, however, performs fairly minimal reordering: kinzi, medial ra, and pre-base vowels go before the consonant, and A VBlw becomes VBlw A. OpenType also has a very loosely constrained syllabic structure.

The upshot of this is that equivalent sequences are not reordered and so produce different output:

$ hb-shape ~/work/myanmar/Noto/NSM-2.ttf -u '1000 102B 1036'
[ka=0+1124|_tall_aa=0+267|anusvara=0@206,374+0]

$ hb-shape ~/work/myanmar/Noto/NSM-2.ttf -u '1000 1036 102B'
[ka=0+1299|anusvara=0@-202,0+0|_tall_aa=0+267]

It would make sense for the shaper behaviour to match the syllable pattern of UTN#11, and perform a strong canonical reordering.
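One narrow reading of "equivalent" can be checked mechanically: whether Unicode normalization treats the two orders as the same string. The sketch below uses only Python's standard `unicodedata` module; the helper names are made up for illustration, and the result depends on the Unicode data shipped with the interpreter. If normalization leaves the two orders distinct, any equivalence has to come from UTN#11's suggested order or from the shaper, not from the text encoding itself.

```python
import unicodedata

# The two sequences from the hb-shape runs above:
# KA + TALL AA + ANUSVARA, stored in both orders.
seq_a = "\u1000\u102B\u1036"
seq_b = "\u1000\u1036\u102B"

def describe(text):
    """Return (code point, canonical combining class) pairs for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.combining(ch)) for ch in text]

def canonically_equivalent(a, b):
    """True if the two strings are identical after NFD normalization."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

print(describe(seq_a))
print(describe(seq_b))
print(canonically_equivalent(seq_a, seq_b))
```

Normalization only reorders adjacent marks with differing non-zero combining classes, so the printed classes show directly whether it could ever swap these two signs.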

tiroj commented 3 years ago

Although Microsoft opted to maintain its dedicated Myanmar shaper, my understanding is that the cluster model and reordering it uses are close to those of USE, and Andrew Glass was at one stage considering passing Myanmar to USE.*

I sort of agree that it makes sense for a dedicated shaping engine to perform ordering according to UTN#11, but in general glyph ordering for display is often usefully less strict than character order normalisation, and a generic cluster model such as that employed by USE needs to be quite flexible. We’ve been bitten plenty of times by canonical ordering being too strict and then encountering real-world exceptions to that ordering.


*Which is why this test cluster shows up in USE presentations: [image]

lianghai commented 3 years ago

… The upshot of this is that equivalent sequences …

You need to explain in what sense these are “equivalent”.

simoncozens commented 3 years ago

Andrew Glass was at one stage considering passing Myanmar to USE.

Unfortunately it looks like this was tried but rejected. (https://github.com/harfbuzz/harfbuzz/pull/1773)

I say "unfortunately" because I found another discrepancy between actual usage and the Microsoft spec. The sequence medial la / medial ha does occur in Mon, but is disallowed by current shapers. This is because the Microsoft spec puts both Medial Ha and Mon La in the same (MH) group, and allows only one MH per cluster.

Not sure how to fix this: one option is to move medial la to its own group; another is to allow MH MH? instead of MH within the cluster definition.
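The second option can be illustrated with a toy cluster pattern. This is only a sketch: the category map, the one-consonant cluster shape, and the two regexes are simplifications invented here for illustration, not the Microsoft cluster definition or the HarfBuzz grammar.

```python
import re

# Hypothetical, heavily reduced category map: one consonant plus the two
# characters that the Microsoft spec places in the MH group.
CATEGORY = {
    0x1000: "C",   # MYANMAR LETTER KA
    0x103E: "MH",  # MYANMAR CONSONANT SIGN MEDIAL HA
    0x1060: "MH",  # MYANMAR CONSONANT SIGN MON MEDIAL LA
}

# Toy cluster shapes (not the real grammar): a consonant, then the MH slot.
ONE_MH = re.compile(r"C( MH)?")        # current rule: at most one MH
TWO_MH = re.compile(r"C( MH( MH)?)?")  # proposed rule: MH MH?

def is_single_cluster(pattern, codepoints):
    """True if the whole sequence matches the pattern as one cluster."""
    cats = " ".join(CATEGORY[cp] for cp in codepoints)
    return pattern.fullmatch(cats) is not None

la_ha = [0x1000, 0x1060, 0x103E]  # ka + medial la + medial ha
print(is_single_cluster(ONE_MH, la_ha))  # False
print(is_single_cluster(TWO_MH, la_ha))  # True
```

Because both characters share the MH category, `MH MH?` also admits either order of the pair and a doubled medial ha; moving medial la to its own group would be the stricter of the two options.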

A third, and potentially more future-proof, solution would be to reopen the USE/mym3 idea.

simoncozens commented 3 years ago

@ohbendy, can you definitively confirm that medial la-ha is a real thing? I only ask because in UTN#11, @mhosken has [U+103E, U+1060] in a mutually exclusive "Medial H" group, just like in the Microsoft cluster definition. If la and ha can both appear in a cluster, then both sources will need to change.

ohbendy commented 3 years ago

Ha yeah I checked these recently. Apparently medial La and medial Ha have never been possible in the Mon language, but Old Burmese has the sequence 1039 101C 103E (so the medial La isn't the Mon medial La encoded at 1060). However, it appears that Asho Chin has the sequence 1060 103E, as in the last line here:

[Screenshot 2021-09-20 at 13 57 27]

I also noticed the Padauk font contains that ligature as 103E_1060 (since the order of medials otherwise follows alphabetic order, I wonder why it's not 1060_103E) and 103D_1060; I'm not certain which language has that sequence.

We also find 103D 103D in the Tai languages of Northeast India and Northwest Burma, since 103D occurs as a vowel sign in those languages, and can be reduplicated.

ohbendy commented 3 years ago

Also just checked UTN#11 version 5, which Martin sent me last year for comments. Here we find:

[Screenshot 2021-09-20 at 14 09 18]

And for Asho Chin: [Screenshot 2021-09-20 at 14 09 51]

It's odd to me that the medial La gets stored after the Wa or Ha; that doesn't follow alphabetic order, and I'd bet linguistically it's not strictly correct either.

simoncozens commented 3 years ago

Excellent, thanks. I'm going to raise a query/issue in MicrosoftTypography; will fix in HarfBuzz too.

simoncozens commented 3 years ago

HarfBuzz now supports medial ha - medial la. :-)