n8willis / opentype-shaping-documents

Documentation of OpenType shaping behavior

[Indic] Syllable-tail Avagraha? #60

Closed adrianwong closed 5 years ago

adrianwong commented 5 years ago

The regex in our Indic spec states that a syllable tail can contain up to three Avagraha characters.

However, HarfBuzz's syllable tail regex has up to three As (A{0,3}), where A (Anudatta?) is equivalent to INDIC_SYLLABIC_CATEGORY_CANTILLATION_MARK. INDIC_SYLLABIC_CATEGORY_AVAGRAHA is considered a symbol.

Just wondering if there may be an error in our interpretation here, as our spec doesn't permit Avagrahas to stand alone as their own syllable, whereas the major shaping engines do.
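To make the difference concrete, here is a toy sketch of the two readings. It is not HarfBuzz's actual grammar (which is a much larger state machine), and the category table, the single-letter codes, and the `segment` helper are all made up for illustration. It assumes only what is stated above: `A` stands for cantillation marks such as Anudatta, a tail allows `A{0,3}`, and Avagraha is treated as a symbol that forms its own cluster.

```python
import re

# Hypothetical, hand-picked category table for this sketch only.
# Letters: C = consonant, M = matra, A = cantillation mark, S = symbol.
CATEGORY = {
    "\u0915": "C",  # DEVANAGARI LETTER KA
    "\u0939": "C",  # DEVANAGARI LETTER HA
    "\u093E": "M",  # DEVANAGARI VOWEL SIGN AA
    "\u0951": "A",  # DEVANAGARI STRESS SIGN UDATTA (cantillation mark)
    "\u0952": "A",  # DEVANAGARI STRESS SIGN ANUDATTA (cantillation mark)
    "\u093D": "S",  # DEVANAGARI SIGN AVAGRAHA (treated as a symbol)
}

# Simplified cluster pattern: a consonant, an optional matra, and a tail of
# up to three cantillation marks; or a symbol standing alone.
SYLLABLE = re.compile(r"CM?A{0,3}|S")


def segment(text):
    """Split text into clusters using the toy pattern above."""
    # One category letter per code point; unknown characters become "X".
    cats = "".join(CATEGORY.get(ch, "X") for ch in text)
    clusters = []
    pos = 0
    while pos < len(text):
        match = SYLLABLE.match(cats, pos)
        # Anything the pattern cannot start with becomes a one-character cluster.
        end = match.end() if match else pos + 1
        clusters.append(text[pos:end])
        pos = end
    return clusters
```

Under this reading, an Avagraha after a consonant-plus-Anudatta cluster is split off on its own instead of being absorbed into the preceding tail, and a fourth consecutive cantillation mark falls outside the `A{0,3}` tail.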

lianghai commented 5 years ago

Avagraha does not really fall in the structure of a typical akshara. Instead, it’s more like Tamil aytham ஃ (btw, its Unicode name “TAMIL SIGN VISARGA” is a misnomer), as it does not have a clear dependency on either side’s base, and is more of an outlier in this fundamental analysis of akshara-based segmentation. Also, avagraha is often understood by users as a punctuation mark.

The limitation of “up to three Avagraha characters” in a syllable tail doesn’t make much sense either. Generally it’s just better to treat a structure as an independent character if no specific shaping interaction is needed for it, otherwise users just face too many arbitrary limitations.

I think I’ve also seen it being used without an immediately preceding akshara, for example, “शिवोऽहम्” written as “शिवो ऽहम्” instead.

n8willis commented 5 years ago

I think that all of HarfBuzz's "regexp is allowed up to N consecutive characters" instances are only there to limit the size of the memory used when processing an actual string to something finite.

What @adrianwong found is a typo on my part, from the "A" category getting mixed up between Anudatta & Avagraha. It's worth fixing, especially since doing so would both (a) let Avagraha stand alone and (b) not make it confusing to deal with in a syllable, as @lianghai notes. [Apologies for introducing yet another "(a)" in the preceding sentence....]

behdad commented 5 years ago

> I think that all of HarfBuzz's "regexp is allowed up to N consecutive characters" instances are only there to limit the size of the memory used when processing an actual string to something finite.

Not really. They are there to emulate closely what Uniscribe allows, plus allowing what we thought should be allowable, while disallowing grossly wrong sequences...

We used to limit number of consonants to 4 or 5 because Uniscribe did that. We wanted to keep our test failure numbers down... Recently I removed that number to save binary size...

adrianwong commented 5 years ago

We currently map ISC_Cantillation_Mark to vedic, whereas HarfBuzz currently maps ISC_Cantillation_Mark to anudatta. They've also dropped the vedic category as of this commit.

(A cursory search on Google tells me that anudatta is a vedic accent, along with udatta and svarita.)

Both our specification and OpenType's currently reference vedic signs when:

n8willis commented 5 years ago

I have a proposed fix in #82 -- take a look when you get a chance!

n8willis commented 5 years ago

Fixed by #82