[Arabic] Perform mark reordering earlier?

adrianwong commented 3 years ago

Per UTR53, Arabic-specific mark reordering is meant to occur after Unicode (NFD) normalisation. Would it make sense to move the mark reordering step to earlier in the shaping process, i.e. prior to any feature application? (I think that would effectively make it the 1st of the 7 stages described here.)

n8willis commented 2 years ago

Having just reread through HarfBuzz's normalization, I'm not convinced that moving AMTRA reordering up would solve a known problem, but it seems conceivable that it could alter the current behavior.

Basically, it MUST be done before GPOS (that's the obvious one). The Microsoft script-development description doesn't suggest that it would interact with GSUB, but those descriptions are really terse.

However, HarfBuzz gives wide latitude to what people might do in ccmp, and in part that's because of usages like decomposing ijam after NFD to allow for more precise placement with regard to marks — and several designers I've talked to say they do this. So that (stage 1) seems fairly important to leave in first position. Stage 2 (joining state) doesn't interact, nor would stch (stage 3) unless something odd is happening with the abbreviation mark.

GSUB in stage 4–5 starts to get murkier, though. If a font actually used mset it could interact; possibly so could some others, like switching to an alternate form.... And possibly including some actual mark-reordering. So moving the AMTRA reordering past stages 4 or 5 might change behavior vs HarfBuzz....

Not saying that for sure; I'm looking into it.

It does seem like Unicode wants to preserve a distinction between this mark reordering — being a "transient" operation for rendering purposes — and the existing normalization algorithms. As in this thread: https://github.com/w3c/alreq/issues/143

khaledhosny commented 2 years ago

HarfBuzz reorders the marks during normalization, before applying any OpenType features:

$ hb-shape -V /dev/null -u 0651,064E
trace: start reorder    buffer: <U+0651=0|U+064E=0>
trace: end reorder  buffer: <U+0651=0|U+064E=0>
trace: start table GSUB buffer: [gid0=0|gid0=0]
trace: end table GSUB   buffer: [gid0=0|gid0=0]
[gid0=0@-1000,0+0|gid0=0@-1000,0+0]

$ hb-shape -V /dev/null -u 064E,0651
trace: start reorder    buffer: <U+064E=0|U+0651=0>
trace: end reorder  buffer: <U+0651=0|U+064E=0>
trace: start table GSUB buffer: [gid0=0|gid0=0]
trace: end table GSUB   buffer: [gid0=0|gid0=0]
[gid0=0@-1000,0+0|gid0=0@-1000,0+0]

This means that the glyph stream order during lookup application is always the same regardless of the input order, so e.g. fonts that have mark ligatures don’t have to match both orders.
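To illustrate why both input orders in the traces above end up as U+0651, U+064E, here is a minimal Python sketch. It is not HarfBuzz's code: it assumes only the shadda rule of UTR #53 (after NFD, move any ccc=33 mark to the front of its run of combining marks) and deliberately omits the MCM steps of the full algorithm. The function name `reorder_shadda_first` is hypothetical.

```python
import unicodedata

SHADDA_CCC = 33  # canonical combining class of U+0651 ARABIC SHADDA


def reorder_shadda_first(text):
    """Partial sketch of UTR #53: NFD, then move shadda to the start of each mark run.

    Assumption: only the shadda step is modeled; the MCM steps are left out.
    """
    nfd = unicodedata.normalize("NFD", text)
    out = []
    run = []  # current run of characters with non-zero ccc

    def flush():
        # Stable partition: shadda marks first, everything else in NFD order.
        shaddas = [c for c in run if unicodedata.combining(c) == SHADDA_CCC]
        others = [c for c in run if unicodedata.combining(c) != SHADDA_CCC]
        out.extend(shaddas + others)
        run.clear()

    for ch in nfd:
        if unicodedata.combining(ch) != 0:
            run.append(ch)
        else:
            flush()
            out.append(ch)
    flush()
    return "".join(out)


# Both input orders converge on shadda followed by fatha.
for s in ("\u0651\u064E", "\u064E\u0651"):
    print([f"U+{ord(c):04X}" for c in reorder_shadda_first(s)])
```

Because the result is identical for either input order, a font's lookups only ever need to match the one sequence.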

khaledhosny commented 2 years ago

The same happens for other normalizations, like using NFC or NFD form. Fonts then can override the engine choice in ccmp or other features.

khaledhosny commented 2 years ago

I don’t think other layout engines implement UTR53. For the second string above, Uniscribe will insert a dotted circle between the two marks while keeping the input order, and CoreText will happily render them in the specified order (and depending on the font, they might or might not be rendered overlapping in this case).

khaledhosny commented 2 years ago

Correction: CoreText will reorder the marks (though I needed to test with the marks applied to an Arabic letter, not standalone, for the reordering to happen), and it seems to apply the reordering early, since my substitutions are also applied (they take the corrected order as input). Uniscribe/DirectWrite also do not insert a dotted circle in this case, but still do not reorder the marks.

n8willis commented 2 years ago

> Correction: CoreText will reorder the marks (though I needed to test with the marks applied to an Arabic letter, not standalone, for the reordering to happen), and it seems to apply the reordering early, since my substitutions are also applied (they take the corrected order as input). Uniscribe/DirectWrite also do not insert a dotted circle in this case, but still do not reorder the marks.

Ah; I was about to ask how one might test it if nobody else implements it....

We've always had a bit of a specifier-tension to grapple with in spots where HarfBuzz does "more than just NFD/NFC" in its normalizer. Makes perfect sense from an efficiency standpoint, but as with decomposing multi-part matras in Indic2, it is important to make sure that there's enough emphasis on what's Unicode normalization vs what's not, so that newcomers don't overlook something by assuming that their favorite Unicode library is handling things.
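As a small illustration of that distinction, the Python sketch below uses only the standard-library `unicodedata` module (a stand-in for "your favorite Unicode library", not anything HarfBuzz uses). It shows that plain NFD or NFC puts fatha (ccc=30) before shadda (ccc=33), which is the opposite of the order in the HarfBuzz trace earlier in this thread. The helper name `normalized_codepoints` is hypothetical.

```python
import unicodedata


def normalized_codepoints(text, form):
    """Return the code points of `text` after Unicode normalization `form`."""
    return [f"{ord(c):04X}" for c in unicodedata.normalize(form, text)]


# Canonical combining classes that drive canonical ordering.
print(unicodedata.combining("\u064E"))  # fatha
print(unicodedata.combining("\u0651"))  # shadda

# Standard normalization sorts by ccc, so fatha ends up before shadda
# whichever order was typed; no UTR #53 reordering happens here.
for form in ("NFD", "NFC"):
    for s in ("\u0651\u064E", "\u064E\u0651"):
        print(form, normalized_codepoints(s, form))
```

So a shaping implementation that relies on a generic normalizer alone would still need a separate step to get shadda first.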

Anyway, so if CoreText doesn't do the reordering of the marks unless preceded by a letter, that also suggests it happens in-or-near-to normalization, because it's looking for a Starter?

khaledhosny commented 2 years ago

I don’t understand the Core Text logic. The behavior varies with the base character:

space: no re-ordering.
U+0640: marks re-ordered, but GPOS anchor positioning does not seem to be applied (very odd, and only when using hb-view and Pages; in TextEdit the positioning is fine).
U+0628: marks re-ordered and positioning is fine.

$ hb-shape Amiri-Regular.otf -u 20,064E,0651,20,0640,064E,0651,20,0628,064E,0651 --shaper=coretext --no-positions --no-clusters
[uni064E.small|uni0651|uni0628|space|uni064E.small|uni0651|uni0640.1|space|uni0651|uni064E|space]

$ hb-view Amiri-Regular.otf -u 20,064E,0651,20,0640,064E,0651,20,0628,064E,0651 --shaper=coretext

[Image: hb-view output]

khaledhosny commented 2 years ago

https://twitter.com/nedley/status/1424042568791183361

ErwinDenissen commented 2 years ago

> I don’t think other layout engines implement UTR53

The layout engine used within FontCreator has implemented TR53. I hope it works as expected:

[Image: Amiri]

n8willis commented 2 years ago

So, based on what @khaledhosny has said regarding both HB and CoreText (as well as @ErwinDenissen ), I made an edit in #136 that I would appreciate any feedback on. Currently it's Arabic-specific, but if it looks good would be propagated to the other scripts in the shared model.

It moves the TR 53 / MCM reordering to as early as possible in this doc, and hopefully maintains a clear distinction between it and generic Unicode NFD/NFC / Ccc reordering, which I believe is important. There's also an attempt to communicate a bit of the Unicode "transient"-ness that the TR itself discusses, which I hope is clear.

It does also still maintain mark-transient-reordering as distinct from ccmp, even though those are things that interact in the broad sense of "normalization", for better or worse.

The big things I would appreciate any notes on are broken cross-references or unclear wording.

n8willis commented 2 years ago

Oh, I also added another small note to the ccmp paragraph about common decompositions (which was devoid of examples before), as I gradually understand tiny bits more about Arabic font engineering. Thanks again to Khaled.

bobh0303 commented 2 years ago

> HarfBuzz reorders the marks during normalization, before applying any OpenType features:
> $ hb-shape -V /dev/null -u 0651,064E

Khaled, is the above use of /dev/null just an informal abbreviation of the options actually passed to hb-shape or is this a valid syntax?

khaledhosny commented 2 years ago

> Khaled, is the above use of /dev/null just an informal abbreviation of the options actually passed to hb-shape or is this a valid syntax?

HarfBuzz will happily shape an empty font. In this case, all I wanted was to check the re-ordering behavior, which is font-independent, so this was enough.

n8willis commented 2 years ago

Pushed the related changes out to other scripts covered by the general-Arabic model. I would still value any feedback on these, of course, but if there are no objections I don't see a reason to hold off on merging them.

n8willis commented 2 years ago

Okay; this change set has been merged, so I will close the issue. I never did get any feedback on Syriac or other specific scripts, so if any future readers run into problems there, please feel free to open a new issue.

n8willis / opentype-shaping-documents

[Arabic] Perform mark reordering earlier? #122