w3c / ilreq

Former repo for Indic Layout Requirements. See new repo at
https://github.com/w3c/iip/

When does the ABNF work for Tamil consonant clusters? #31

Closed r12a closed 6 years ago

r12a commented 7 years ago

The document largely gives the impression that the ABNF rules indicate what must be kept together for "text segmentation, line breaking, drop letter, letter spacing in horizontal text and vertical text representation".

However, is that true for Tamil? Consonant clusters in Tamil don't interact with left-positioned vowel signs in the same way as Devanagari or Bengali conjuncts. Here are some examples i took from the UDHR.

  1. in these words the left-positioned vowel appears between the two consonants in a cluster:

    யாவற்றையும் yāvaṟṟaiyum

    கௌரவத்தையும் kauravattaiyum

    அசிரத்தையும் அவற்றை acirattaiyum avaṟṟai

    ஏற்கப்பெற்று ēṟkappeṟṟu

    எல்லோரும் ellōrum

  2. in these the vowel shaping interacts only with the final consonant:

    செயல்களுக்கு ceyalkaḷukku

    கேட்டுக் kēṭṭuk

The table of examples of the ABNF doesn't include this type of cluster, only conjuncts such as க்ஷ, ஶ்ரீ, and ஸ்ரீ, which are special because they ligate.

So, given examples such as those in the list above, is it or is it not normal to keep consonant clusters together in Tamil for text segmentation, line breaking, drop letter, letter spacing in horizontal text and vertical text representation?

miloush commented 7 years ago

What do you mean by "normal" (and is "normal" relevant)? Also what do you mean by "must" - that rendering would be broken, or that the text would feel strange to the reader? How far back in time are we considering?

I have seen drop caps made of syllables, or just the consonant of a syllable. I have seen vertical text by syllables. I don't think I have seen a drop cap consisting of the vowel mark only, but I wouldn't be that shocked.

Let me point out that there was a script reform in the late seventies, before which consonants interacted with left-positioned vowel signs, mostly with AI. Some fonts still use those ligatures, or offer them as historic ligatures.

Richard57 commented 7 years ago

As far as I can tell, the ABNF works for Tamil in Tamil script when there is no pulli (U+0BCD TAMIL SIGN VIRAMA) in sight. You can get a flavour of how Tamils feel about their script from TACE16 (a.k.a. TUNE). See the invective at https://en.wikipedia.org/wiki/Tamil_All_Character_Encoding. The only conjuncts I am aware of are those involving <kṣ> க்ஷ <U+0B95, U+0BCD, U+0BB7> and 'shri' ஸ்ரீ <U+0BB8, U+0BCD, U+0BB0, U+0BC0>. Otherwise, U+0BCD terminates an orthographic syllable.
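To make the code point sequences quoted above easier to inspect, here is a minimal Python sketch that prints the Unicode name of each character in the two conjuncts. The `SEQUENCES` table and the `describe` helper are illustrative names introduced here, not part of the thread or of any specification.

```python
import unicodedata

# The two conjunct sequences quoted in the comment above, written as
# escapes so the code points are unambiguous.
SEQUENCES = {
    "kssa": "\u0B95\u0BCD\u0BB7",        # <U+0B95, U+0BCD, U+0BB7>
    "shri": "\u0BB8\u0BCD\u0BB0\u0BC0",  # <U+0BB8, U+0BCD, U+0BB0, U+0BC0>
}


def describe(text):
    """Return (code point, Unicode name) pairs for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


for label, seq in SEQUENCES.items():
    print(label, seq)
    for code_point, name in describe(seq):
        print("  ", code_point, name)
```

In both sequences the second character is U+0BCD TAMIL SIGN VIRAMA (the pulli), which is the character the rest of the discussion turns on.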

Tamil seems to be a good example of an abugida as a neosyllabary.

The ABNF, which doesn't even work for Sanskrit in Devanagari, also fails massively for varga-distinguishing Sanskrit in Tamil script. Subscript or superscript numbers are used to distinguish the 4 plosive vargas, for which there is mostly only a single letter in Tamil. For examples of this scheme, one can look at http://sanskritdocuments.org/tamil/by-category/krishna.php.

r12a commented 7 years ago

> As far as I can tell, the ABNF works for Tamil in Tamil script when there is no pulli (U+0BCD TAMIL SIGN VIRAMA) in sight.

Yes. In my comment i tried to distinguish between consonant clusters and conjuncts, where the latter involves special shaping and the hiding of the VIRAMA, because i assumed that that's where the difference lies.

If this is an appropriate distinction for application of the ABNF rules, however, there is presumably a problem, since if one were to apply a font to Tamil that contains shaping based on the older forms of the script (mentioned by @miloush), the ABNF would be relevant for sequences of characters for which it wasn't relevant before.

Such a reliance on the shape of the text is not described in the document, which i think is problematic. (It's also problematic for the general concept of grapheme clusters in Unicode, which should count as one unit the whole of a conjunct sequence such as ஶ்ரீ but not a Tamil consonant cluster such as த்தை.)

Richard57 commented 7 years ago

@miloush only mentions ligatures of vowels and consonants. The reason that they might be relevant is a natural reluctance to break a ligature.

I believe the potential problem with fonts is more likely to apply to Devanagari, where the deliberate appearance of a halant should normally signal the end of an orthographic syllable, than to Tamil. It is not for nothing that UAX#29 cautions that the tailoring of grapheme clusters may be font dependent. Malayalam may be an interesting study in this regard.

r12a commented 6 years ago

Let me try to make my question clearer. It is only about situations where the pulli is visible.

Given a word such as யாவற்றையும் (yāvaṟṟaiyum), should the break points for text segmentation, line breaking, drop letter (if the conjunct appeared at the start of the text), letter spacing in horizontal text, and vertical text representation conform to this?

A: [image: screen shot 2017-12-06 at 09 21 15]

or this?

B: [image: screen shot 2017-12-06 at 09 21 32]

The latter is what the ilreq document currently suggests.
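The two candidate segmentations can be sketched in Python. This is a toy illustration, not the ilreq ABNF and not UAX#29: it assumes, based on how the replies discuss the two images, that option A ends a unit at a visible pulli and option B keeps the consonant after a pulli in the same unit. The `segment` function and its `join_after_pulli` flag are hypothetical names.

```python
import unicodedata

PULLI = "\u0BCD"  # TAMIL SIGN VIRAMA


def segment(text, join_after_pulli):
    """Split text into units, attaching combining marks to their base.

    With join_after_pulli=True, a letter that follows a pulli is kept in
    the same unit (assumed option B); with False, the pulli ends the unit
    (assumed option A).
    """
    units = []
    for ch in text:
        category = unicodedata.category(ch)
        is_mark = category.startswith("M")  # vowel signs and the pulli
        joins = (
            join_after_pulli
            and bool(units)
            and units[-1].endswith(PULLI)
            and category == "Lo"
        )
        if units and (is_mark or joins):
            units[-1] += ch
        else:
            units.append(ch)
    return units


# yāvaṟṟaiyum, written as escapes: YA AA VA RRA PULLI RRA AI YA U MA PULLI
word = "\u0BAF\u0BBE\u0BB5\u0BB1\u0BCD\u0BB1\u0BC8\u0BAF\u0BC1\u0BAE\u0BCD"

print(" | ".join(segment(word, join_after_pulli=False)))  # assumed A
print(" | ".join(segment(word, join_after_pulli=True)))   # assumed B
```

Under these assumptions the only difference between the two outputs is whether the doubled RRA with its AI sign forms one unit or two.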

A similar question arises when fonts don't produce certain conjuncts, for one reason or another, or where a ZWNJ is added to prevent a conjunct forming. Where are the break points for the following? Are they:

C: [image: screen shot 2017-12-06 at 09 50 50]

or

D: [image: screen shot 2017-12-06 at 09 51 06]

Given that for a more typical rendering of the text the break points, as described in the ilreq doc, would be:

E: [image: screen shot 2017-12-06 at 09 51 18]
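The ZWNJ case mentioned in this comment can be made concrete with a short Python sketch. It uses a Devanagari sequence as an assumed example (the text in the images is not available here) and inserts U+200C ZERO WIDTH NON-JOINER after each virama that is followed by a letter, which is the conventional way to request a visible virama instead of a conjunct. The helper name `block_conjuncts` is hypothetical.

```python
import unicodedata

VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA
ZWNJ = "\u200C"    # ZERO WIDTH NON-JOINER


def block_conjuncts(text):
    """Insert ZWNJ after every virama that is followed by a letter."""
    out = []
    for i, ch in enumerate(text):
        out.append(ch)
        nxt = text[i + 1:i + 2]  # empty string at the end of the text
        if ch == VIRAMA and nxt and unicodedata.category(nxt) == "Lo":
            out.append(ZWNJ)
    return "".join(out)


kssa = "\u0915\u094D\u0937"  # KA, VIRAMA, SSA: normally shaped as a conjunct
blocked = block_conjuncts(kssa)

for ch in blocked:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

Whether the resulting sequence should then be segmented as one unit or two is exactly the open question in this comment; the sketch only shows what the underlying character sequence looks like.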

miloush commented 6 years ago

@r12a your Tamil example is only interesting because the doubled consonant results in a phonetic change, but I don't see any reason why B should be preferred over A. Even the script supports A, as otherwise you would expect the ai sign to be in front of the first ற.

Note that you can really find pretty much any breaking for vertical text around: த்தகங்கள்

In my experience, A is consistent with caret stops when editing documents. Either way, is there a reason to not just follow/refer UAX#29 Unicode Text Segmentation?

I don't have enough experience with Devanagari, but from a technical point of view C makes more sense to me, especially if there is a ZWNJ.

r12a commented 6 years ago

> the doubled consonant results in a phonetic change, but I don't see any reason why B should be preferred over A

Well, yes, that's exactly my point. :) The ilreq document currently suggests that only B is correct, and i'm asking whether that is true.

> is there a reason to not just follow/refer UAX#29 Unicode Text Segmentation?

UAX#29 currently doesn't produce E for Devanagari (which is what the ilreq doc requires). It produces something more like C. But UAX#29 is about to change, so that by default a whole consonant cluster will be seen as a unit (i.e. E). The effect of that upcoming change is not completely clear, however, for scripts like Tamil, or Devanagari when the virama is showing. I'm looking for someone to provide expert advice for what would be expected in those situations.

Richard57 commented 6 years ago

If the visible viramas in C are all produced by ZWNJ, then the grapheme cluster boundaries will remain as the breaks in C. However, CLDR is not the right place to preserve A; Tamil pulli should be removed from the category of virama. I believe A is also appropriate for Sanskrit in Tamil script, but do we expect browsers to look up a Sanskrit locale for the rendering of shlokas? Tamil K.SSA and SH.RII are problems, for their consonants do belong in the same grapheme cluster.
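The position described in this comment (the pulli ends a unit, except in K.SSA and SH.RII) can be illustrated with a small regular expression in Python. This is a sketch under stated assumptions, not CLDR or UAX#29 behaviour: the `CONJUNCTS` list contains only the sequences named in this thread, the `MARKS` range is a simplification of the Tamil combining marks, and `segment` is a hypothetical helper.

```python
import re

# Tamil combining marks (simplified): anusvara, the vowel signs up to
# and including the pulli, and the AU length mark.
MARKS = "\u0B82\u0BBE-\u0BCD\u0BD7"

# Conjuncts named in this thread whose consonants stay in one unit.
CONJUNCTS = [
    "\u0B95\u0BCD\u0BB7",        # KA, pulli, SSA
    "\u0BB8\u0BCD\u0BB0\u0BC0",  # SA, pulli, RA, vowel sign II
    "\u0BB6\u0BCD\u0BB0\u0BC0",  # SHA, pulli, RA, vowel sign II
]

# A unit is a listed conjunct or any single character, followed by any
# number of combining marks. Conjuncts are tried first.
UNIT = re.compile(
    "(?:" + "|".join(CONJUNCTS) + "|.)[" + MARKS + "]*", re.DOTALL
)


def segment(text):
    """Split text so a pulli ends a unit except in the listed conjuncts."""
    return UNIT.findall(text)


print(segment("\u0BA4\u0BCD\u0BA4\u0BC8"))        # TA, pulli, TA, AI
print(segment("\u0BB8\u0BCD\u0BB0\u0BC0"))        # the 'shri' conjunct
```

A fixed exception list like this is the simplest way to express the idea, but it is font-independent, so it would not account for fonts that ligate additional sequences.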

r12a commented 6 years ago

I am closing this issue in favour of https://github.com/w3c/iip/issues/18