n8willis / opentype-shaping-documents

Documentation of OpenType shaping behavior

Needing guidance for ambiguity about `init` feature scope in Bengali #104

Open n8willis opened 3 years ago

n8willis commented 3 years ago

At present, the Microsoft script-guidance page for Bengali (beng and bng2 script tags) states that the init GSUB feature should only apply to left-side matra glyphs when they appear in the word-initial position. (And should not apply to other letters even when they appear in word-initial position.)

The fact that init applies at all is somewhat of an outlier, since otherwise the feature is primarily designed to work with Arabic and other cursive, joining scripts.

But the wording was different up through at least December 2017 (visible in this Wayback Machine link), saying instead "This feature takes nominal (full) forms of consonants and produces initial forms when the glyph is at the beginning of a word" even though the example image is a left-side matra.

The new wording comes from a change proposed by John Hudson in 2016, which followed a TypeDrawers discussion thread in which people indicated varying levels of expectation about whether or not init (and fina) should be implemented for other scripts.

It certainly seems like some fonts may exist that exploit those features for (e.g.) cursive styles of Latin. For shaping-engine authors, however, the more particular question is whether it's right or wrong to apply the feature to letters other than left-side matras.

Sticking strictly to the spec, it would be a "today no" but there may be old fonts in the wild that expect it, so perhaps a note of guidance ought to go in somewhere.

khaledhosny commented 3 years ago

I don’t think HarfBuzz or Uniscribe ever applied these features to other scripts, so the presence of such fonts is largely irrelevant.

khaledhosny commented 3 years ago

Or Core Text, or probably anything other than InDesign.

lianghai commented 3 years ago

I also believe the current patched wording correctly reflects both shapers’ ability and fonts’ usage/expectation of this OTL feature in reality. Shaping stuff other than Bangla vowel signs with automatically applied init was never considered generally doable by font producers.

n8willis commented 3 years ago

So, if I'm following both of you correctly:

(a) there's no value to worrying about any scripts other than bng2 (and those already handled as expected by the Arabic shaper model of course)

(b) There's an outside chance somebody has had decorative or other funky fonts for latn or whatnot that exploited the feature in some apps. At most, that would be worth footnoting, since the real fix there would be to update the font.

(c) People probably didn't apply init substitutions to non–vowel-signs in bng2 fonts or, if they did, we don't have any notable examples to refer to. So my take would again be that we could footnote that (in errata or whatever) and if the day arises in some distant future when people demand more, we'll know that they didn't imagine the whole thing and future us will be fascinated.

lianghai commented 3 years ago

(Still trying to grasp your attitude towards this issue. I often find it difficult to switch from a font developer’s point of view to a shaper developer’s…)

(a) …

Arabic and other Arabic-cursive-joining scripts, scripts handled by the USE, and beng and bng2: afaik (I don’t really know this particular area so well) these should be the only three groups that Microsoft’s “script development specs” have cleared to expect that shapers apply the init feature, and apply it with proper masking.

(b) …

Yes, or just relaxing the script restriction in the spec. From a shaper’s point of view, the footnote, if any, should probably be based on whether there’s any known shaper out there actually applying the init feature with the intended masking behavior.

(c) …

@tiroj should be able to provide more confirmation on this.

tiroj commented 3 years ago

Backing up a bit: the revised init (medi, fina, and isol) feature descriptions came about because the old ones didn't accurately describe what shaping engines were actually doing. The original feature descriptions talked about applying the features based on word-position, but word-position analysis is not how Arabic, Syriac, etc. shaping is performed, because scripts with normative cursive joining behaviour include both joining and non-joining characters mid-word, so what matters is the joining properties of adjacent letters, not their position in the word. So the feature descriptions were rewritten to specify that they relate directly to implementation of Unicode ArabicShaping.txt properties, and not to word analysis. [I noted at the time that this left open the possibility of defining new features specifically to apply word-positional or e.g. line-positional forms, independent of joining behaviour.]
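To make the joining-versus-word-position point concrete, here is a minimal Python sketch of selecting a positional feature from the joining types of neighbouring letters. The function name `positional_features` and the tiny `JOINING_TYPE` table are illustrative assumptions, not any shaper's code; the sketch ignores transparent marks, join-causing characters such as ZWJ, and left-joining letters.

```python
# Joining types for a handful of letters, as listed in Unicode's ArabicShaping.txt.
# R = right-joining (joins only to the preceding letter), D = dual-joining.
JOINING_TYPE = {
    0x0627: "R",  # ALEF
    0x0628: "D",  # BEH
    0x062F: "R",  # DAL
    0x0644: "D",  # LAM
}


def positional_features(codepoints):
    """Return 'isol', 'init', 'medi' or 'fina' for each letter.

    Simplified: code points missing from the table are treated as
    non-joining ("U"); marks and ZWJ/ZWNJ are not handled.
    """
    types = [JOINING_TYPE.get(cp, "U") for cp in codepoints]
    features = []
    for i, t in enumerate(types):
        prev_t = types[i - 1] if i > 0 else "U"
        next_t = types[i + 1] if i + 1 < len(types) else "U"
        # A letter joins backward only if the previous letter joins forward.
        joins_prev = t in ("D", "R") and prev_t == "D"
        # Only dual-joining letters join forward.
        joins_next = t == "D" and next_t in ("D", "R")
        if joins_prev and joins_next:
            features.append("medi")
        elif joins_prev:
            features.append("fina")
        elif joins_next:
            features.append("init")
        else:
            features.append("isol")
    return features


# BEH ALEF BEH LAM: the second BEH is mid-word but still takes 'init',
# because the preceding ALEF does not join forward.
print(positional_features([0x0628, 0x0627, 0x0628, 0x0644]))
```

In this model `init` never depends on where the word begins, which is why the word-based wording in the old feature descriptions did not match what engines do.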

The Bengali case is the one-off exception to the general rule that init applies only to ArabicShaping.txt joining properties. This is because all beng and bng2 shaping engines apply the feature based on analysis of U+0DC7 and U+0DC8 occurring at the beginning of a word. This is the only standardised and specified case in which word-positional analysis is performed by shaping engines. When Indic shaping was being worked out at Microsoft, it was noted that writing and typography had this feature in which word-initial forms of these vowel signs did not have a spur on the left side, so rather than either requiring this to be handled with contextual substitutions (which wouldn't work within the broken context range of Indic shapers) or defining a specific Bengali feature, the init feature was specified for this purpose. So the exceptional use for beng and bng2 is retained; note that bng3 for USE processing would not be able to use init for this purpose, and would need to implement the substitution contextually in the GSUB lookup.
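As a rough illustration of the Bengali exception, the sketch below flags which characters of a word would receive the `init` mask. It assumes the vowel signs in question are Bengali E and AI (U+09C7 and U+09C8), and that the word has already been split into syllables by an earlier step; `init_mask_positions` and `WORD_INITIAL_MATRAS` are hypothetical names, not part of any shaping engine.

```python
# Assumption: the left-side vowel signs that take a word-initial form are
# U+09C7 (vowel sign E) and U+09C8 (vowel sign AI). Pass another set to experiment.
WORD_INITIAL_MATRAS = frozenset({"\u09C7", "\u09C8"})


def init_mask_positions(word_syllables, matras=WORD_INITIAL_MATRAS):
    """Return (syllable_index, char_index) pairs that would receive 'init'.

    `word_syllables` is one word, already split into syllables by a
    (hypothetical) segmentation step, each syllable in logical order.
    Only the first syllable of the word is ever considered.
    """
    if not word_syllables:
        return []
    first = word_syllables[0]
    return [(0, i) for i, ch in enumerate(first) if ch in matras]


# KA + E, KA + E: only the vowel sign in the first syllable is flagged.
print(init_mask_positions(["\u0995\u09C7", "\u0995\u09C7"]))  # [(0, 1)]
```

Because the decision depends on the syllable's position in the word rather than on anything inside the syllable, a contextual lookup confined to a single syllable could not make it, which matches the motivation described above.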

Yes, I think there probably are some Latin cursive style fonts that tried to use init, medi, fina, and isol for letter shaping. To my knowledge, they wouldn't have worked very widely, and their makers probably would have instead chosen to use contextual substitutions if they cared about the fonts working across a range of platforms and applications.

tiroj commented 3 years ago

PS. If Microsoft had decided to require the Bengali initial vowel sign substitution to be handled using contextual substitution, rather than via init, they would presumably have very quickly realised that their context range was broken for Indic scripts, and might have fixed it. So it is a great pity that they didn't.

adrianwong commented 3 years ago

I should note that HarfBuzz does the following, which I read as "if a left matra is not word-initial and the preceding character falls outside a range of General Category classes, apply the init feature to the left matra anyway."

The two relevant commits (here and here) lead me to believe that this was done to imitate Uniscribe's behaviour.

Allsorts now follows suit, so perhaps this should be formalised.

(Apologies - this is not related to init's feature scope, but feels like an appropriate place to post this.)

n8willis commented 3 years ago

> When Indic shaping was being worked out at Microsoft, it was noted that writing and typography had this feature in which word-initial forms of these vowel signs did not have a spur on the left side, so rather than either requiring this to be handled with contextual substitutions (which wouldn't work within the broken context range of Indic shapers) or defining a specific Bengali feature, the init feature was specified for this purpose. So the exceptional use for beng and bng2 is retained; note that bng3 for USE processing would not be able to use init for this purpose, and would need to implement the substitution contextually in the GSUB lookup.

Many thanks for the clarifications. If it's not beating-a-dead-tangent (and solely to put a tiny piece of my mind at ease), was the c.2017 allusion to init applying to word-initial consonants just a simple typo / transcription slip? Or was there some rationale that later got corrected?

> Yes, I think there probably are some Latin cursive style fonts that tried to use init, medi, fina, and isol for letter shaping. To my knowledge, they wouldn't have worked very widely, and their makers probably would have instead chosen to use contextual substitutions if they cared about the fonts working across a range of platforms and applications.

Check.

n8willis commented 3 years ago

> I should note that HarfBuzz does the following, which I read as "if a left matra is not word-initial and the preceding character falls outside a range of General Category classes, apply the init feature to the left matra anyway."

Okay, so it's basically saying consider it a "word start" if there's a non-letter-and-non-mark codepoint before it (plus related whatnot).... That definitely makes sense; numerals and punctuation and so on.
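That reading can be sketched in a few lines of Python using the standard `unicodedata` module. The set of categories treated as "continuing the word" (letters and marks only) is an assumption taken from the paraphrase above, not a copy of HarfBuzz's actual range, and `is_word_start` is a hypothetical helper name.

```python
import unicodedata


def is_word_start(text, index):
    """True if the character at `index` begins a word under this heuristic.

    Assumption: a preceding letter (L*) or mark (M*) continues the word;
    anything else (digit, punctuation, space), or no predecessor at all,
    starts a new one.
    """
    if index == 0:
        return True
    previous_category = unicodedata.category(text[index - 1])
    return previous_category[0] not in ("L", "M")


# Bengali digit ONE before KA + vowel sign E: treated as a word start.
print(is_word_start("\u09E7\u0995\u09C7", 1))  # True
# Bengali letter AA before KA + vowel sign E: not a word start.
print(is_word_start("\u0986\u0995\u09C7", 1))  # False
```

Here `index` is the start of the syllable in logical order, since the vowel sign follows its consonant in the encoded text and is only moved to the left during shaping.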

Would that be a situation that ought to already get handled before it gets to the shaper, though? As in, it's part of segmenting the text run. Doesn't mean the shaping engine shouldn't be aware of it, of course. Just a question about what the standard MO is.

tiroj commented 3 years ago

It’s sort of typical of the OTL feature specifications that the init feature would state—at least before the rewrite, and still for beng and bng2—that the feature should be applied to a word-initial glyph without actually specifying how a word-initial glyph is to be determined. The HarfBuzz behaviour sounds sensible, and probably is what Uniscribe/DWrite does too, but so far as I know this is among the implementation details that are nowhere specified.

n8willis commented 3 years ago

One last question: @tiroj, you say init was applied to Bengali because there was evidence of the treatment in real text on U+09C7 and U+09C8 (presuming U+0DC7 and U+0DC8 were just a slip of the finger). Should U+09BF definitely be excluded?

I could imagine that the difference in shapes between the matras would make a distinction in everyday practice, but I'm a little wary of being so prescriptive.

tiroj commented 3 years ago

I’ve not seen the ikar get a word-initial form. The ductus of the letter is different from the ekar shape, so doesn't lend itself in the same way to a spurless head line connection. That said, I am unsure whether shaping engines would make the distinction, or if they would simply process the init lookup for the first glyph in a word, regardless of what that glyph is.
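Whichever way engines behave, a substitution only happens when the glyph is also covered by the font's lookup. The following Python sketch models that interaction under stated assumptions: the glyph names, the `init_lookup` mapping, and the `apply_init` helper are all hypothetical, and the shaper is assumed to mask the first glyph of the reordered word without checking which vowel sign it is.

```python
def apply_init(glyphs, masked, init_lookup):
    """Substitute glyphs that are both masked for 'init' and covered by the lookup."""
    return [
        init_lookup.get(name, name) if i in masked else name
        for i, name in enumerate(glyphs)
    ]


# Hypothetical font data: the init lookup covers only the e-kar.
init_lookup = {"ekar": "ekar.init"}

# Glyph order after reordering, with the first glyph of the word masked.
print(apply_init(["ekar", "ka"], {0}, init_lookup))  # ['ekar.init', 'ka']
print(apply_init(["ikar", "ka"], {0}, init_lookup))  # ['ikar', 'ka']
```

Under this model, a font that supplies no `init` form for the ikar gets the same output whether or not the engine distinguishes it from the ekar.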

Something one does see in some Bengali fonts, notably in display and headline types, is word-final forms of iikar where the head line does not extend to the right of the letter. This needs to be handled using contextual substitutions, but support is hampered by the cluster boundary model still applied to GSUB in Microsoft and Adobe engines even for the rclt feature.