n8willis / opentype-shaping-documents

Documentation of OpenType shaping behavior

[Indic] Sinhala base consonant search (spec/Uniscribe vs. HarfBuzz) #123

Open adrianwong opened 3 years ago

adrianwong commented 3 years ago

HarfBuzz takes a slightly different approach from Uniscribe and this spec. Consider the syllable (taken from our corpus):

ක්‍ග්යේ

U+0D9A SINHALA LETTER ALPAPRAANA KAYANNA (Consonant)
U+0DCA SINHALA SIGN AL-LAKUNA
U+200D ZERO WIDTH JOINER
U+0D9C SINHALA LETTER ALPAPRAANA GAYANNA (Consonant)
U+0DCA SINHALA SIGN AL-LAKUNA
U+0DBA SINHALA LETTER YAYANNA (Consonant)
U+0DDA SINHALA VOWEL SIGN DIGA KOMBUVA

Following this spec, we start at the end of the syllable and work backward until we find the consonant U+0DBA SINHALA LETTER YAYANNA. It is not immediately preceded by a ZWJ, so it is the base.

The left matra U+0DD9 SINHALA VOWEL SIGN KOMBUVA (via decomposition of U+0DDA SINHALA VOWEL SIGN DIGA KOMBUVA) then moves up prior to this base, giving us:

[Image: sinh-base-uniscribe-allsorts]

From a quick read of the HarfBuzz source code, what it appears to be doing is starting at the beginning of the syllable and taking the last consonant that is not immediately preceded by a ZWJ.

Therefore, U+0D9A SINHALA LETTER ALPAPRAANA KAYANNA is the base, as the base consonant search is subsequently terminated on encountering the U+200D ZWJ, U+0D9C GAYANNA pair. This gives us:

[Image: sinh-base-harfbuzz]
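To make the difference concrete, here is a minimal Python sketch of the two searches as they are described in this comment. It is not taken from the HarfBuzz or Uniscribe sources: the function names are made up for illustration, and `is_consonant` is a simplifying assumption that treats the whole U+0D9A..U+0DC6 range as consonants.

```python
ZWJ = 0x200D

def is_consonant(cp):
    # Assumption: treat the Sinhala consonant range U+0D9A..U+0DC6 as consonants.
    return 0x0D9A <= cp <= 0x0DC6

def base_search_from_end(syllable):
    """Spec-style search as described above: walk backward from the end and
    return the index of the first consonant not immediately preceded by ZWJ."""
    for i in range(len(syllable) - 1, -1, -1):
        preceded_by_zwj = i > 0 and syllable[i - 1] == ZWJ
        if is_consonant(syllable[i]) and not preceded_by_zwj:
            return i
    return None

def base_search_from_start(syllable):
    """HarfBuzz-style search as described above: walk forward, remember the
    last consonant seen, and stop at the first consonant preceded by ZWJ."""
    base = None
    for i, cp in enumerate(syllable):
        if is_consonant(cp):
            if i > 0 and syllable[i - 1] == ZWJ:
                break
            base = i
    return base

# The syllable from this comment.
SYLLABLE = [0x0D9A, 0x0DCA, 0x200D, 0x0D9C, 0x0DCA, 0x0DBA, 0x0DDA]

print(hex(SYLLABLE[base_search_from_end(SYLLABLE)]))    # 0xdba (YAYANNA)
print(hex(SYLLABLE[base_search_from_start(SYLLABLE)]))  # 0xd9a (KAYANNA)
```

The two functions disagree on this input only because the forward search stops at the ZWJ, GAYANNA pair before it ever reaches YAYANNA.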

n8willis commented 3 years ago

So I've been trying to get my head around this. As Pathum has noted, the upstream specs are extremely underspecified in some places; the most recent MS docs are undoubtedly better than they used to be, but they still don't give anywhere near as much detail for Sinhala as they do in the multi-script Indic2 spec.

It does seem like U+0DBA takes on post-base form via post, regardless of the base search. Are we sure that the difference seen in this test string is about the "start at the beginning" method and not that post-base form?

I mean, either way, the base-search text ought to be clearer. But explicitly saying "skip consonants that take on post-base form" would be in line with what we say in the Indic2 docs, and that is an easier solution than figuring out why there's an unexpected mismatch in how the searches terminate.

adrianwong commented 3 years ago

> It does seem like U+0DBA takes on post-base form via post, regardless of the base search. Are we sure that the difference seen in this test string is about the "start at the beginning" method and not that post-base form?

> "skip consonants that take on post-base form"

Did you mean pstf, and not post? My understanding is that pstf is used in Sinhala for a different purpose (or it's meant to anyway), i.e. for splitting multi-part matras.

If it is vatu we're actually referring to, the U+0DBA in this example should not form the post-base Yansaya because it is not preceded by a ZWJ. Besides, HarfBuzz doesn't check for post-base forms during the Sinhala base consonant search.

adrianwong commented 3 years ago

> But explicitly saying "skip consonants that take on post-base form" would be in line with what we say in the Indic2 docs, and that is an easier solution than figuring out why there's an unexpected mismatch in how the searches terminate.

Agreed, and a quick test shows that this appears to be how DirectWrite handles it. Using a variant of the example in the original post:

ක්‍ග්‍යෙ

U+0D9A SINHALA LETTER ALPAPRAANA KAYANNA (Consonant)
U+0DCA SINHALA SIGN AL-LAKUNA
U+200D ZERO WIDTH JOINER
U+0D9C SINHALA LETTER ALPAPRAANA GAYANNA (Consonant)
U+0DCA SINHALA SIGN AL-LAKUNA
U+200D ZERO WIDTH JOINER
U+0DBA SINHALA LETTER YAYANNA (Consonant)
U+0DD9 SINHALA VOWEL SIGN KOMBUVA

gives us:

[Image: uniscribe-sinhala]

My interpretation of this output is that U+0DBA takes on post-base form, but U+0D9C doesn't despite being preceded by a ZWJ, therefore U+0D9C is the base (indicated by the U+0DD9 matra moving before it).

(HarfBuzz still places the matra at the start of the syllable.)
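For illustration, a rough Python sketch of the "skip consonants that take on post-base form" search, under loud assumptions: a real shaper would decide whether a consonant takes a post-base form by testing the font's lookups, whereas here it is approximated by matching the sequence AL-LAKUNA, ZWJ, YAYANNA. The names `POST_BASE_CANDIDATES`, `takes_post_base_form` and `base_search_skipping_post_base` are hypothetical, and `is_consonant` is a simplified range check.

```python
ZWJ = 0x200D
AL_LAKUNA = 0x0DCA
# Assumption: only YAYANNA (Yansaya) is treated as a post-base candidate here.
POST_BASE_CANDIDATES = {0x0DBA}

def is_consonant(cp):
    # Assumption: treat the Sinhala consonant range U+0D9A..U+0DC6 as consonants.
    return 0x0D9A <= cp <= 0x0DC6

def takes_post_base_form(syllable, i):
    # Approximation: a candidate consonant preceded by AL-LAKUNA + ZWJ.
    return (
        i >= 2
        and syllable[i] in POST_BASE_CANDIDATES
        and syllable[i - 1] == ZWJ
        and syllable[i - 2] == AL_LAKUNA
    )

def base_search_skipping_post_base(syllable):
    """Walk backward from the end and return the index of the first
    consonant that does not take on a post-base form."""
    for i in range(len(syllable) - 1, -1, -1):
        if is_consonant(syllable[i]) and not takes_post_base_form(syllable, i):
            return i
    return None

# The variant syllable from this comment, and the one from the original post.
VARIANT = [0x0D9A, 0x0DCA, 0x200D, 0x0D9C, 0x0DCA, 0x200D, 0x0DBA, 0x0DD9]
ORIGINAL = [0x0D9A, 0x0DCA, 0x200D, 0x0D9C, 0x0DCA, 0x0DBA, 0x0DDA]

print(hex(VARIANT[base_search_skipping_post_base(VARIANT)]))    # 0xd9c (GAYANNA)
print(hex(ORIGINAL[base_search_skipping_post_base(ORIGINAL)]))  # 0xdba (YAYANNA)
```

Under these assumptions the sketch picks U+0D9C as the base for the variant string and U+0DBA for the string in the original post, which matches the bases inferred from the renderings in this thread.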

Nice insight/intuition, @n8willis!