n8willis / opentype-shaping-documents

Documentation of OpenType shaping behavior

halant + length in Kannada script #163

Open devosb opened 7 months ago

devosb commented 7 months ago

Some smaller languages written in Kannada script use a visible U+0CCD KANNADA SIGN VIRAMA as a vowel sign instead of a vowel killer (which normally is either visible or becomes invisible and causes consonant stacking).

One problem occurs when U+0CD5 KANNADA LENGTH MARK is used to distinguish vowel lengths. The difficulty arises when a language has different vowels than the Kannada language (written in Kannada script). The issue is that U+0CCD followed by U+0CD5 produces a dotted circle in OpenType shaping engines. That makes sense for the Kannada language: why would you use a vowel killer after a vowel sign? But for other languages, the dotted circle is not desired.

A second problem occurs when the virama should represent a vowel: if the virama occurs between two consonants, it becomes invisible and a consonant stack results. Both of these problems are discussed in a document on the Tulu language written in Kannada script.

A similar issue occurs with another Dravidian language, Malayalam. When writing the Malayalam language in the Malayalam script, a virama can be used as a vowel killer, or as a vowel (samvrittokarama). The discussion of the half-u and the chandrakkala (virama) gives more details.

While the above issues might be a bug in the OpenType shaping engines, I thought it would be best to have a consensus of how to handle these issues before filing multiple bug reports. Please accept my apologies if I could have posted this in a better location and/or copied different people.

@behdad @dscorbett @jfkthame @xadxura @PeterCon @LornaSIL

dscorbett commented 7 months ago

The Kannada script development spec allows one virama after the vowel signs: {M}+[N]+[H]. For example, here is <U+0C95, U+0CD5, U+0CCD> ⟨ಕೕ್⟩ in Noto Sans Kannada in HarfBuzz: ಕೕ್. It’s not the most obvious code point order for Kannada, but it is consistent with how other Indic scripts work in OpenType.
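To make the two orderings concrete, here is a small Python sketch (standard library only) that lists the code points in each sequence. The base consonant KA for the reported order and all variable and function names are assumptions made for illustration. The script only inspects the strings; it says nothing about how a particular shaper or font will render them.

```python
import unicodedata

KA = "\u0C95"           # KANNADA LETTER KA (assumed base consonant)
VIRAMA = "\u0CCD"       # KANNADA SIGN VIRAMA
LENGTH_MARK = "\u0CD5"  # KANNADA LENGTH MARK

# Order described in the original report: virama, then length mark.
reported = KA + VIRAMA + LENGTH_MARK
# Order shown in the reply above: length mark, then virama.
suggested = KA + LENGTH_MARK + VIRAMA


def describe(text):
    """Return one 'U+XXXX NAME' string per code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


for label, text in (("reported", reported), ("suggested", suggested)):
    print(label, describe(text))
```

Printing the names this way is a quick check that a test string really has the intended order before blaming the font or the shaping engine.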

Separate syllables with ZWNJ to avoid a conjunct. Unicode recommends this convention for many Indic scripts, including Malayalam. For example, here is <U+0C95, U+0CCD, U+200C, U+0C95> ⟨ಕ್‌ಕ⟩: ಕ್‌ಕ
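As a rough illustration of the ZWNJ convention, the Python sketch below inserts U+200C after each virama that is directly followed by a consonant. The helper name `keep_virama_visible` is hypothetical, and treating U+0C95..U+0CB9 as the set of Kannada consonants is a simplifying assumption. It is only appropriate for text in which every such virama is meant to stay visible; text that mixes vowel-like and vowel-killing viramas would need a per-word decision.

```python
import re

VIRAMA = "\u0CCD"  # KANNADA SIGN VIRAMA
ZWNJ = "\u200C"    # ZERO WIDTH NON-JOINER

# Assumption: U+0C95..U+0CB9 approximates "Kannada consonant".
# The lookahead keeps the following consonant out of the match.
_VIRAMA_BEFORE_CONSONANT = re.compile(f"{VIRAMA}(?=[\u0C95-\u0CB9])")


def keep_virama_visible(text):
    """Insert ZWNJ after each virama that directly precedes a consonant."""
    return _VIRAMA_BEFORE_CONSONANT.sub(VIRAMA + ZWNJ, text)


sample = "\u0C95\u0CCD\u0C95"  # KA, VIRAMA, KA
print([f"U+{ord(ch):04X}" for ch in keep_virama_visible(sample)])
```

A virama that is already followed by ZWNJ, or that ends the string, is left unchanged, so running the helper twice gives the same result as running it once.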

devosb commented 6 months ago

Thanks for pointing out what is, as you say, not the most obvious code point order. I had not realised that. I tested some other fonts: Noto Serif Kannada and Nirmala UI had a clash, while Tunga and Tiro Kannada displayed like your example of Noto Sans Kannada. So if this sequence gets used, I might file bug reports against the fonts that need fixing.

I realised I can get the desired visual form using ZWNJ. However, as discussed in section 4 on page 4 of the Tulu document, ZWNJ is ignored by some text processes. On the other hand, Unicode section 23.2, Layout Controls, under Cursive Connection and Ligatures, has a paragraph called Filtering Joiner and Non-joiner (which I am only just reading now). While that paragraph starts with the same conclusion as the Tulu document, it goes on to say that, for Indic scripts in particular, ZWJ and ZWNJ should generally be considered for many text processes.

So is Unicode 23.2 enough that I should use ZWJ and ZWNJ to denote orthographic differences that reflect different phonetics and therefore different words? Or should a variation sequence or other mechanism (such as a new code point) be proposed to handle this situation?
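The concern about filtering can be shown with a short Python sketch. The function `strip_joiners` is a hypothetical stand-in for any text process that ignores ZWJ and ZWNJ (for example, a search index that drops them); it is not taken from any particular library.

```python
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200D"   # ZERO WIDTH JOINER

with_zwnj = "\u0C95\u0CCD\u200C\u0C95"  # KA, VIRAMA, ZWNJ, KA (visible virama)
conjunct = "\u0C95\u0CCD\u0C95"         # KA, VIRAMA, KA (consonant stack)


def strip_joiners(text):
    """Hypothetical filter: drop ZWJ and ZWNJ, as some text processes do."""
    return text.replace(ZWNJ, "").replace(ZWJ, "")


# The raw strings differ, so a process that keeps joiners can tell them apart.
print(with_zwnj == conjunct)  # False
# After filtering, the two spellings collapse into one.
print(strip_joiners(with_zwnj) == strip_joiners(conjunct))  # True
```

If the two spellings represent different words, any process that filters joiners this way would treat them as the same word.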

dscorbett commented 6 months ago

Besides section 23.2, section 12.1 “Devanagari” subsection “Alternative Forms of Cluster-Initial RA” gives an example where ZWJ is orthographically significant in an Indic script, and UAX #31 section 2.3 “Layout and Format Control Characters” gives examples of significant ZWJ and ZWNJ, including a Malayalam example which is similar to the case of this Tulu vowel. It would therefore be consistent and reasonable for Unicode to recommend ZWNJ for Tulu in Kannada script, instead of a new code point or other mechanism. Unicode isn’t always consistent between scripts, though, so you can’t know for sure till the standard specifically says so.