Closed adrianwong closed 5 years ago
Avagraha does not really fit the structure of a typical akshara. Instead, it’s more like Tamil aytham ஃ (btw, its Unicode name “TAMIL SIGN VISARGA” is a misnomer), as it does not have a clear dependency on the base on either side, and is more of an outlier in this fundamental analysis of akshara-based segmentation. Also, avagraha is often understood by users as a punctuation mark.
The limitation of “up to three Avagraha characters” in a syllable tail doesn’t make much sense either. Generally it’s just better to treat a structure as an independent character if no specific shaping interaction is needed for it, otherwise users just face too many arbitrary limitations.
I think I’ve also seen it being used without an immediately preceding akshara, for example, “शिवोऽहम्” written as “शिवो ऽहम्” instead.
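To make the two spellings concrete, here is a small sketch that reports which character sits immediately before the avagraha (U+093D) in each form. The helper name `char_before_avagraha` is made up for illustration; only Python's standard `unicodedata` module is used.

```python
import unicodedata

AVAGRAHA = "\u093D"  # DEVANAGARI SIGN AVAGRAHA

# The two spellings from the comment above, written with escapes so the
# code points are unambiguous.
joined = "\u0936\u093F\u0935\u094B\u093D\u0939\u092E\u094D"   # शिवोऽहम्
spaced = "\u0936\u093F\u0935\u094B \u093D\u0939\u092E\u094D"  # शिवो ऽहम्

def char_before_avagraha(text):
    """Return the Unicode name of the character right before the first avagraha."""
    i = text.index(AVAGRAHA)
    return unicodedata.name(text[i - 1]) if i > 0 else None

print(char_before_avagraha(joined))
print(char_before_avagraha(spaced))
```

In the spaced form the avagraha follows a plain space, so a rule that only accepts it in a syllable tail has no base to attach it to.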
I think that all of HarfBuzz's "regexp is allowed up to N consecutive characters" instances are only there to limit the size of the memory used when processing an actual string to something finite.
What @adrianwong found is a typo on my part from the "A" category getting mixed up between Anudatta & Avagraha. Especially since fixing that would both (a) let Avagraha stand alone and (b) not make it confusing to deal with in a syllable, as @lianghai notes. [Apologies for introducing yet another "(a)" in the preceding sentence....]
> I think that all of HarfBuzz's "regexp is allowed up to N consecutive characters" instances are only there to limit the size of the memory used when processing an actual string to something finite.
Not really. They are there to emulate closely what Uniscribe allows, plus allowing what we thought should be allowable, while disallowing grossly wrong sequences...
We used to limit number of consonants to 4 or 5 because Uniscribe did that. We wanted to keep our test failure numbers down... Recently I removed that number to save binary size...
We currently map ISC_Cantillation_Mark to vedic, whereas HarfBuzz currently maps ISC_Cantillation_Mark to anudatta. They've also dropped the vedic category as of this commit.
(A cursory search on Google tells me that anudatta is a vedic accent, along with udatta and svarita.)
Both our specification and OpenType's currently reference vedic signs when:
I have a proposed fix in #82 -- take a look when you get a chance!
Fixed by #82
The regex in our Indic spec states that a syllable tail can contain up to three Avagraha characters.
However, HarfBuzz's syllable tail regex has up to three As (A{0,3}), where A (Anudatta?) is equivalent to INDIC_SYLLABIC_CATEGORY_CANTILLATION_MARK. INDIC_SYLLABIC_CATEGORY_AVAGRAHA is considered a symbol.

Just wondering if there may be an error in our interpretation here, as our spec doesn't permit Avagrahas to stand alone as their own syllable, whereas the major shaping engines do.
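A rough sketch of the difference, assuming a heavily simplified category table and a toy syllable pattern. Neither is the real spec regex or HarfBuzz's state machine; the letters C, M, A, S and the `segment` helper are hypothetical stand-ins. Here A means a cantillation mark such as anudatta, and avagraha is classed as a symbol that forms its own cluster.

```python
import re

# Hypothetical one-letter categories for a few Devanagari characters.
CATEGORY = {
    "\u0915": "C",  # DEVANAGARI LETTER KA (consonant)
    "\u094B": "M",  # DEVANAGARI VOWEL SIGN O (matra)
    "\u0952": "A",  # DEVANAGARI STRESS SIGN ANUDATTA (cantillation mark)
    "\u093D": "S",  # DEVANAGARI SIGN AVAGRAHA (treated as a symbol)
}

# Toy pattern: consonant, optional matra, then a tail of up to three A marks;
# alternatively a symbol standing alone as its own cluster.
SYLLABLE = re.compile(r"CM?A{0,3}|S")

def segment(text):
    """Split text into clusters using the toy pattern above."""
    cats = "".join(CATEGORY.get(ch, "X") for ch in text)
    clusters = []
    pos = 0
    while pos < len(cats):
        m = SYLLABLE.match(cats, pos)
        # Anything the pattern rejects becomes a one-character cluster.
        end = m.end() if m else pos + 1
        clusters.append(text[pos:end])
        pos = end
    return clusters

# Avagraha after "ko" ends up in a cluster of its own.
print(segment("\u0915\u094B\u093D"))
# A fourth anudatta falls outside the A{0,3} tail.
print(segment("\u0915" + "\u0952" * 4))
```

With this reading, the "up to three" limit applies only to cantillation marks, and an avagraha never needs a preceding base.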