n8willis / opentype-shaping-documents

Documentation of OpenType shaping behavior

[Indic] Post-base and below-base consonant ordering in base consonant search #66

Closed: adrianwong closed this issue 3 years ago

adrianwong commented 5 years ago

There is a note in the OpenType base consonant search algorithm that I'd glossed over previously, which is that "post-base forms have to follow below-base forms".

From the corpus testing, there appear to be some syllables out in the wild that encode the two in the reverse order (post-base before below-base), in which case HarfBuzz takes the out-of-order post-base consonant as the base.

E.g. U+0995 (Ka - Base), U+09CD (Halant), U+09AC (Ba), U+09CD (Halant), U+09AF (Ya)

vs

U+0995 (Ka), U+09CD (Halant), U+09AF (Ya - Base), U+09CD (Halant), U+09AC (Ba)
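To make the difference concrete, here is a minimal Python sketch of a backwards base search that applies the "post-base forms have to follow below-base forms" restriction. It is an illustration only, not HarfBuzz's actual code: the `POSITION` table and the `find_base` function are hypothetical names, the table covers just the three consonants above, and Reph, ZWJ/ZWNJ, and pre-base reordering Ra are ignored.

```python
HALANT = 0x09CD

# Hypothetical position table, limited to the consonants in this example.
POSITION = {
    0x0995: "base",   # Ka: no below-base or post-base form
    0x09AC: "below",  # Ba: has a below-base form
    0x09AF: "post",   # Ya: has a post-base form
}


def find_base(syllable):
    """Scan from the end of the syllable and return the base consonant's index."""
    seen_below = False
    base = None
    for i in range(len(syllable) - 1, -1, -1):
        pos = POSITION.get(syllable[i])
        if pos is None:
            continue  # halant or other non-consonant: keep scanning
        base = i  # fall back to the earliest consonant reached so far
        if pos == "base":
            break  # no below-base or post-base form: this is the base
        if pos == "post" and seen_below:
            break  # post-base form before a below-base form: treated as base
        if pos == "below":
            seen_below = True
    return base


in_order = [0x0995, HALANT, 0x09AC, HALANT, 0x09AF]      # Ka, Ba, Ya
out_of_order = [0x0995, HALANT, 0x09AF, HALANT, 0x09AC]  # Ka, Ya, Ba

print(find_base(in_order))      # index of Ka
print(find_base(out_of_order))  # index of Ya
```

With the first sequence the search runs all the way back to Ka; with the second it stops at Ya, because a below-base Ba has already been seen after it.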

lianghai commented 5 years ago

This is quite a thought-provoking issue.

In terms of the Bangla writing system’s orthography, when a basign (or a rasign) and a yasign coexist in an akshara, it’s indeed unlikely that the phonetic sequence is anything besides /…by…/ (or /…ry…/), so this restriction makes sense to some extent. But I don’t think it’s appropriate to make it a static restriction at the shaper level.

It feels like some dynamic-property mechanism is necessary. This issue also feels related to the reordering behavior of vowel signs and conjoining forms, but I can’t yet tell what the connection is.

Note that in other scripts it’s quite likely for a below-base conjoining sign to be encoded after a post-base conjoining sign (although the industry tends to avoid such situations because they are confusable), such as Telugu ర్క్ర, where the rasign is shaped as a below-base form, visually either under the base or under the post-base kasign. In practice, it seems many (if not all) fonts form all of Telugu’s conjoining signs in the blwf feature, but this just comes down to the problem that the difference between a post-base conjoining sign and a below-base conjoining sign is simply not clear in many scripts, and OTL’s distinction between them is quite arbitrary (and mixed with reordering considerations).
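For reference, the Telugu cluster can be broken down into its code points with a few lines of Python. This is only a sketch, and it assumes the sequence is Ra, virama, Ka, virama, Ra, as the description of the base, kasign, and rasign suggests; `CLUSTER` and `describe` are names made up for this example.

```python
import unicodedata

# Assumed encoding of the cluster: Ra + virama + Ka + virama + Ra.
CLUSTER = "\u0C30\u0C4D\u0C15\u0C4D\u0C30"


def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


for code_point, name in describe(CLUSTER):
    print(code_point, name)
```

Here the final Ra is the conjoining sign that is encoded after the Ka.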

n8willis commented 3 years ago

In light of Liang's comments, I've added this to the errata. In particular, I noted that the Microsoft script-development spec contains identical wording for all scripts, which seems to be demonstrably an overstatement and is likely just in line with the other "overly identical" sections shared between scripts.