Open adrianwong opened 3 years ago
So I've been trying to get my head around this (as Pathum has noted, the upstream specs are extremely underspecified in some places; undoubtedly the most recent MS docs are better than they used to be, but they still don't give anywhere near as much detail in Sinhala as they do in the multi-script Indic2 spec).
It does seem like U+0DBA takes on post-base form via post, regardless of the base search. Are we sure that the difference seen in this test string is about the "start at the beginning" method and not that post-base form?
I mean, either way, the base-search text ought to be clearer. But explicitly saying "skip consonants that take on post-base form" would be in line with what we say in the Indic2 docs, and that is an easier solution than figuring out why there's an unexpected mismatch in how the searches terminate.
It does seem like U+0DBA takes on post-base form via post, regardless of the base search. Are we sure that the difference seen in this test string is about the "start at the beginning" method and not that post-base form?
"skip consonants that take on post-base form"
Did you mean pstf, and not post? My understanding is that pstf is used in Sinhala for a different purpose (or it's meant to anyway), i.e. for splitting multi-part matras.
If it is vatu we're actually referring to, the U+0DBA in this example should not form the post-base Yansaya because it is not preceded by a ZWJ. Besides, HarfBuzz doesn't check for post-base forms during the Sinhala base consonant search.
But explicitly saying "skip consonants that take on post-base form" would be in line with what we say in the Indic2 docs, and that is an easier solution than figuring out why there's an unexpected mismatch in how the searches terminate.
Agreed, and a quick test shows that this appears to be how DirectWrite handles it. Using a variant of the example in the original post:
ක්ග්යෙ
U+0D9A SINHALA LETTER ALPAPRAANA KAYANNA (Consonant)
U+0DCA SINHALA SIGN AL-LAKUNA
U+200D ZERO WIDTH JOINER
U+0D9C SINHALA LETTER ALPAPRAANA GAYANNA (Consonant)
U+0DCA SINHALA SIGN AL-LAKUNA
U+200D ZERO WIDTH JOINER
U+0DBA SINHALA LETTER YAYANNA (Consonant)
U+0DD9 SINHALA VOWEL SIGN KOMBUVA
gives us:
My interpretation of this output is that U+0DBA takes on post-base form, but U+0D9C doesn't despite being preceded by a ZWJ, therefore U+0D9C is the base (indicated by the U+0DD9 matra moving before it).
(HarfBuzz still places the matra at the start of the syllable.)
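One way to read this result is as a backwards search that skips any consonant which both follows a ZWJ and takes a post-base form. The Python sketch below only illustrates that interpretation; it is not DirectWrite's or HarfBuzz's actual code. The function names, the consonant range check, and the POST_BASE_FORMS set (containing only U+0DBA, the one case discussed here) are assumptions made for the example.

```python
ZWJ = 0x200D

# Assumption: only U+0DBA (YAYANNA) is treated as taking a post-base form,
# because it is the only case discussed in this thread.
POST_BASE_FORMS = {0x0DBA}


def is_consonant(cp):
    # Approximation: Sinhala consonants sit in the range U+0D9A..U+0DC6.
    return 0x0D9A <= cp <= 0x0DC6


def find_base_skipping_post_base(syllable):
    """Search from the end; skip ZWJ-joined consonants that take post-base form."""
    for i in range(len(syllable) - 1, -1, -1):
        cp = syllable[i]
        if not is_consonant(cp):
            continue
        after_zwj = i > 0 and syllable[i - 1] == ZWJ
        if after_zwj and cp in POST_BASE_FORMS:
            continue  # post-base form, so keep looking for the base
        return i
    return None  # no base found (not expected for a well-formed syllable)


# The variant string from this comment, as code points.
variant = [0x0D9A, 0x0DCA, 0x200D, 0x0D9C, 0x0DCA, 0x200D, 0x0DBA, 0x0DD9]
base = find_base_skipping_post_base(variant)
print(f"base: U+{variant[base]:04X} at index {base}")
```

For the variant string this picks U+0D9C (index 3) as the base, which matches the interpretation above: U+0DBA is skipped, while U+0D9C is kept even though a ZWJ precedes it.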
Nice insight/intuition, @n8willis!
HarfBuzz adopts a slightly different approach to Uniscribe / this spec. Consider the syllable (taken from our corpus):
Following this spec, we start at the end of the syllable and search backwards until we find the consonant U+0DBA SINHALA LETTER YAYANNA. It is not immediately preceded by a ZWJ, therefore it is the base.
The left matra U+0DD9 SINHALA VOWEL SIGN KOMBUVA (via decomposition of U+0DDA SINHALA VOWEL SIGN DIGA KOMBUVA) then moves to just before this base, giving us:
From a quick read of the HarfBuzz source code, what it appears to be doing is starting at the beginning of the syllable and taking the last consonant that is not immediately preceded by a ZWJ.
Therefore, U+0D9A SINHALA LETTER ALPAPRAANA KAYANNA is the base, as the base consonant search is subsequently terminated on encountering the U+200D ZWJ, U+0D9C GAYANNA pair. This gives us:
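To make the difference concrete, here is a rough Python sketch of the two searches as described in this post. It is not taken from the spec or from the HarfBuzz source. The function names and the consonant range check are assumptions, and the code point sequence is a hypothetical one chosen to fit the description above (a ZWJ precedes U+0D9C but not U+0DBA), not necessarily the exact string from the corpus.

```python
ZWJ = 0x200D


def is_consonant(cp):
    # Approximation: Sinhala consonants sit in the range U+0D9A..U+0DC6.
    return 0x0D9A <= cp <= 0x0DC6


def base_from_end(syllable):
    """Spec-style search: walk backwards and return the first consonant
    that is not immediately preceded by a ZWJ."""
    for i in range(len(syllable) - 1, -1, -1):
        if is_consonant(syllable[i]) and not (i > 0 and syllable[i - 1] == ZWJ):
            return i
    return None


def base_from_start(syllable):
    """Forward search as described for HarfBuzz: remember each consonant,
    and stop at the first consonant immediately preceded by a ZWJ."""
    base = None
    for i, cp in enumerate(syllable):
        if not is_consonant(cp):
            continue
        if i > 0 and syllable[i - 1] == ZWJ:
            break  # ZWJ + consonant pair terminates the search
        base = i
    return base


# Hypothetical syllable: KAYANNA, AL-LAKUNA, ZWJ, GAYANNA, AL-LAKUNA,
# YAYANNA, DIGA KOMBUVA.
syllable = [0x0D9A, 0x0DCA, 0x200D, 0x0D9C, 0x0DCA, 0x0DBA, 0x0DDA]

for name, search in (("from end", base_from_end), ("from start", base_from_start)):
    i = search(syllable)
    print(f"{name}: base U+{syllable[i]:04X} at index {i}")
```

On this input the backwards search selects U+0DBA and the forward search selects U+0D9A, so the left matra would be reordered to a different position in each case.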