[Indic] Input masks for features

simoncozens commented 4 years ago

We need to be a bit careful with wording for things like this:

> The rkrf feature replaces "Consonant,Halant,Ra" sequences with the "Rakaar"-ligature form of the consonant glyph.

(and all the other feature descriptions). The rkrf feature does whatever the heck lookups the font designer has put in the rkrf feature; if they want to use it for conjuncts or presentation forms, that's up to them.

The MS shaper pages sometimes specify input contexts for these features (so for rkrf it says "The input context for the rakaar feature always consists of the full form consonant + halant + Ra.") but the OT spec does not. And HarfBuzz implements masking for some of the features (blwf, half, pref, for example) but not others (nukt, akhn, and indeed rkrf); it processes the full buffer for these latter features. So I think some more clarity is needed here.

But it's also not clear to me whether input context masks are actually needed for correct shaping if the features in a font are implemented "as intended" (e.g. half used for half forms, as opposed to the designer shoving in whatever substitutions they like). For example, consonant-halant is always going to be pre-base, so

feature half { sub dvKA dvVirama by dvK; } half;

will only match pre-base consonants whether or not the shaping engine restricts the context using a mask.

dscorbett commented 4 years ago

Consonant–halant is not always pre-base. For example, it can appear at the end of a word or before a ZWNJ. So the substitution does depend on context. Context masks are therefore needed for correct shaping in the Indic shaper.
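
To make the counterexample concrete, here is a minimal Python sketch. It is not HarfBuzz code: the glyph names come from the dvKA/dvVirama/dvK rule quoted above, while `apply_half`, `dvTA`, and the hand-written masks are hypothetical. It applies the single half rule to a glyph run, either everywhere or only where a per-glyph mask allows it.

```python
from typing import List, Optional

# The rule from the example above: sub dvKA dvVirama by dvK;
HALF_RULE = (("dvKA", "dvVirama"), "dvK")


def apply_half(glyphs: List[str], mask: Optional[List[bool]] = None) -> List[str]:
    """Apply the single half-form rule left to right.

    If mask is given, the rule fires only where every matched glyph has
    its mask bit set (a toy stand-in for a shaper's feature mask).
    """
    (first, second), replacement = HALF_RULE
    out: List[str] = []
    i = 0
    while i < len(glyphs):
        matches = glyphs[i:i + 2] == [first, second]
        allowed = mask is None or all(mask[i:i + 2])
        if matches and allowed:
            out.append(replacement)
            i += 2
        else:
            out.append(glyphs[i])
            i += 1
    return out


# Word-final KA + Virama: no consonant follows, so KA itself is the base.
final = ["dvKA", "dvVirama"]
final_mask = [False, False]  # assumed: nothing here is pre-base

# KA + Virama followed by a base consonant (hypothetical glyph dvTA).
prebase = ["dvKA", "dvVirama", "dvTA"]
prebase_mask = [True, True, False]  # assumed: only the pre-base pair is enabled
```

Without a mask, `apply_half(final)` turns the word-final sequence into the half form; with `final_mask` it is left alone. For `prebase` the result is the same either way, which is why the mask only looks redundant when every consonant-halant pair happens to be pre-base.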

simoncozens commented 4 years ago

Thanks - that's helpful; I didn't spot the final_halant_group because the C and the H were in different rules! So... all the more reason to be absolutely clear on what the context masks are.

n8willis commented 4 years ago

> The rkrf feature does whatever the heck lookups the font designer has put in the rkrf feature; if they want to use it for conjuncts or presentation forms, that's up to them.

Well, I guess I fundamentally don't agree with taking this approach to describing features since, by that logic, none of the features or glyph classes would have a meaning. (E.g., if the font designer wants, all of the consonant GlyphIDs could be empty, marks in the TOP_POSITION subclass could be positioned twelve em-squares below the baseline, the Yaphala substitution could use Latin letters, etc etc. — there'd literally be no end to those possibilities.)

I feel like the words of caution you allude to would be appropriate in another context, such as more general "advice for developers doing text processing or handling font binaries" or something. But this repository is meant to be a "description of behavior" so I think it's appropriate to focus on what the behavior is intended to be. Saying anything-can-hypothetically-do-anything in an effort to draw a circle around all possible cases would leave the descriptions not really saying anything.

n8willis commented 4 years ago

That (preceding comment) having been said, the need to determine and state where input contexts are mandatory is certainly a valid issue.

n8willis commented 3 years ago

Here's a list of the "context statements" from the Indic2 scripts, as gleaned in a rather naïve scour of the Microsoft docs. Not all are described this way in every script, and there are a handful of divergences. Where a feature is listed without any wording, that just means the wording was the same as in the previous example.

It also doesn't include sort-of-but-not-quite references to input context. For example, the abvm and blwm features are always described with "The best method for encoding this feature in an OpenType font is to use a chaining context positioning lookup that triggers mark-to-base and mark-to-mark attachments for above-base marks" (or "below-base marks"), which suggests some context but is not precise.

<bng2>
nukt: The input context for the nukt feature always consists of the full form of the consonant.
akhn: The input context for the akhand feature always consists of the full form of the consonant.
rphf: The input context for the Reph feature always consists of the full form of Ra + Halant.
vatu: The input context for the 'vatu' feature consists of a consonant (in full or half form) + vattu glyph.
init: All initial forms must be based on an input context consisting of the full form of consonants.

<dev2>
nukt
akhn
rphf
rkrf: The input context for the rakaar feature always consists of the full form consonant + halant + Ra.

<gjr2>
nukt
akhn
rphf
rkrf

<gur2>
nukt
akhn
rphf

<knd2>
nukt
akhn
rphf

<mlm2>
nukt
akhn
rphf
blwf: The input context for the below-base feature consists of Halant + La, preceded by a consonant.
pstf: The input context for the post-base feature consists of Halant + Consonant, preceded by the base glyph.

<ory2>
nukt
akhn
rphf
blwf: The input context for the below-base feature consists of Halant + Consonant, preceded by a consonant.

<tml2>
nukt
akhn
rphf

<tel2>
nukt
akhn
rphf

<sinh>
(none)
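
One way to read these statements is as backtrack/match pairs. The Python sketch below encodes three of them that way and checks a run of glyph categories against them. The category names, `InputContext`, `is_a`, and `context_matches` are this sketch's own shorthand (assumptions, not identifiers from the MS pages or the OT spec), and the categories are assigned by hand rather than derived from a font.

```python
from typing import Dict, List, NamedTuple, Tuple


class InputContext(NamedTuple):
    backtrack: Tuple[str, ...]  # categories required before the match
    match: Tuple[str, ...]      # categories the feature's input consists of


# Transcribed from the Microsoft wording listed above.
CONTEXTS: Dict[Tuple[str, str], InputContext] = {
    ("dev2", "rkrf"): InputContext((), ("Consonant", "Halant", "Ra")),
    ("mlm2", "blwf"): InputContext(("Consonant",), ("Halant", "La")),
    ("ory2", "blwf"): InputContext(("Consonant",), ("Halant", "Consonant")),
}


def is_a(category: str, wanted: str) -> bool:
    """Ra and La are consonants too, so they satisfy a 'Consonant' slot."""
    return category == wanted or (wanted == "Consonant" and category in ("Ra", "La"))


def context_matches(ctx: InputContext, cats: List[str], pos: int) -> bool:
    """True if ctx.match starts at cats[pos] and ctx.backtrack precedes it."""
    start = pos - len(ctx.backtrack)
    end = pos + len(ctx.match)
    if start < 0 or end > len(cats):
        return False
    wanted = ctx.backtrack + ctx.match
    return all(is_a(c, w) for c, w in zip(cats[start:end], wanted))
```

For example, `context_matches(CONTEXTS[("mlm2", "blwf")], ["Consonant", "Halant", "La"], 1)` is true, while the same match at position 0 of `["Halant", "La"]` fails because there is no preceding consonant to satisfy the backtrack.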

behdad commented 3 years ago

> I didn't spot the final_halant_group because the C and the H were in different rules!

It's not just the final halant.

Basically, Indic text looks like: C,H,C,H,...,C,H,C. One of these is considered Base. The C,H'es that come before base form half forms. The ones after Base form below or post forms. Now. I do believe ALL of this could be implemented without much shaper logic IF reverse contextual lookups were possible (the simple one in GSUB doesn't count since it can't ligate). If lookup-direction flags are implemented, that still would be possible. One would start scanning backward, form the post / below forms from H,C sequences. Then see the Base, and C,H sequences after that would become half. That's I suppose how AAT fonts do it. But that's not in OpenType.

Also note that the Indic shaper limits some of those features to within a syllable only.
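
The C,H,C,H,...,C picture can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the base index is passed in by hand (choosing it is script-specific shaper logic not covered here), and `label_forward`, `label_backward`, and the label strings are hypothetical names.

```python
from typing import List


def label_forward(cats: List[str], base: int) -> List[str]:
    """Label each element of a C,H,C,H,...,C run relative to the base.

    cats holds "C" (consonant) or "H" (halant); base is the index of the
    base consonant, supplied by the caller.
    """
    if cats[base] != "C":
        raise ValueError("base must point at a consonant")
    labels = []
    for i in range(len(cats)):
        if i == base:
            labels.append("base")
        elif i < base:
            labels.append("half")        # C,H pairs before the base
        else:
            labels.append("below/post")  # H,C pairs after the base
    return labels


def label_backward(cats: List[str], base: int) -> List[str]:
    """Same labels, produced by scanning from the end toward the start."""
    labels = [""] * len(cats)
    seen_base = False
    for i in range(len(cats) - 1, -1, -1):
        if i == base:
            labels[i] = "base"
            seen_base = True
        else:
            # Until the base is reached, everything seen is post-base.
            labels[i] = "half" if seen_base else "below/post"
    return labels


run = ["C", "H", "C", "H", "C"]
```

With `base=2`, both functions give `["half", "half", "base", "below/post", "below/post"]` for `run`. The backward version just mirrors the scanning order described above; it does not model how any real shaper or AAT font implements it.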

rajeeshknambiar commented 3 years ago

> Basically, Indic text looks like: C,H,C,H,...,C,H,C. One of these is considered Base. The C,H'es that come before base form half forms. The ones after Base form below or post forms. Now. I do believe ALL of this could be implemented without much shaper logic IF reverse contextual lookups were possible (the simple one in GSUB doesn't count since it can't ligate). If lookup-direction flags are implemented, that still would be possible. One would start scanning backward, form the post / below forms from H,C sequences. Then see the Base, and C,H sequences after that would become half. That's I suppose how AAT fonts do it. But that's not in OpenType.

Just a side note related to this observation. After trying various ‘prescribed’ GSUB shaping forms, I ended up defining a definitive mlm2 feature set consisting solely of akhn (with the exception of a couple of pstf forms for Ya and Va to make Uniscribe/DirectWrite work); reference https://gitlab.com/rit-fonts/malayalam-shaping/-/blob/main/features/mlm2-gsub.fea. Every time I attempted a rewrite, I came to the same conclusion: developing Indic shaping rules would have been much easier if they were applied backwards (from the end of the syllable to the start).

n8willis commented 3 years ago

I put some potential backtrack/match/lookahead info into little tables in #135.

Curious if anyone finds that approach helpful in context. I went with Malayalam for the first swing since it includes some backtrack and some variation in what is required.

Please note that at present I'm MORE interested in whether this method of including the info is helpful than in the specifics of the markup (which is going to be tricky), and please also note that I just used the MS definitions in this example. If it's worth pursuing I'd shake out the details of what's actually needed in each feature.

n8willis commented 3 years ago

@simoncozens No pressure, but as this was originally your question I would be most interested if you think the approach in the PR seems like a good way forward.

simoncozens commented 3 years ago

Yes, that looks nice - thank you!

n8willis commented 3 years ago

PR #135 now contains the equivalent tables for the other Indic2 scripts as listed in the comment above ... which is to say, "as the MS doc pages put them."

However, those are clearly imperfect (among other things, they include a context description for Reph in Tamil, which shouldn't be using the feature(?) in the first place).

Anyone with access is welcome to review those commits and flag anything you know to be incorrect, or to suggest alternate phrasing/formatting of the table elements, which is non-trivial.

There is also the wider-open question of which of the other features should have a similar table treatment but are missing one in the MS pages.

My first guess would be that there's no point in worrying about adding such tables to the presentation forms features, as a whole. And GPOS features might similarly be not worth the added complication. Although if there are presentation features that benefit from one, we should add them (init, for example, benefits from it in Bengali, but that's always been such an exception anyway...).

n8willis commented 2 years ago

I went ahead and merged the set as it stands (stood) in the PR, since there have not been any changes to the input-context descriptions in the MS pages (circa November 2021 edits of those).

Like I mentioned in the previous comment, it's clear even to me that there's inconsistency in those MS pages on that point, but I think that amounts to a bigger issue. The ideal fix would start upstream there, I guess. Alternatively, we could re-evaluate whether the un-discussed contexts are actually predictable enough to call them out in tables despite the MS pages not specifying them.

For now, though, if anything's unclear it'd be good to hear about it.

n8willis commented 2 years ago

Closing this for now; please open a new issue to discuss "new" context-descriptor info that supersedes the MS pages, or to discuss fixes to the algorithms / OpenType models themselves, of course.

n8willis / opentype-shaping-documents

[Indic] Input masks for features #110