w3c / iip

Documenting gaps and requirements for support of Indic languages on the Web and in eBooks.
https://w3c.github.io/iip/

Does the ilreq ABNF work for consonant clusters that don't form conjuncts? #18

Open r12a opened 6 years ago

r12a commented 6 years ago

This issue is carried over from an unanswered issue at https://github.com/w3c/ilreq/issues/31

In the following i distinguish between consonant clusters and conjuncts, where the latter involves special shaping and the hiding of the VIRAMA, because that's where the difference lies afaict.

Section 2 of the ilreq doc, Indic orthographic syllable boundaries, contains a set of ABNF rules for indicating syllable boundaries, which are referred to for many applications, such as vertical text, line wrapping, initial-letter styling, etc. The examples include Tamil; however (with the exception of க்ஷ, ஶ்ரீ, and ஸ்ரீ), modern consonant clusters in Tamil don't form conjuncts in the same way as, say, Devanagari or Bengali. Instead, Tamil simply places a pulli (virama) dot above a consonant that has no following vowel, eg. கேட்டுக்.

Given a Tamil word such as யாவற்றையும் (yāvaṟṟaiyum), should the break points for text segmentation, line breaking, drop letter (if the cluster appeared at the start of the text), letter spacing in horizontal text, and vertical text representation conform to this:

A: [screenshot not reproduced]

or this?

B: [screenshot not reproduced]

The latter is what the ilreq document currently suggests.

A similar question arises when fonts don't produce certain conjuncts in other scripts, for one reason or another, or where a ZWNJ is added to prevent a conjunct forming. Where are the break points for the following? Are they:

C: [screenshot not reproduced]

or

D: [screenshot not reproduced]

Given that for a more typical rendering of the text the break points, as described in the ilreq doc, would be:

E: [screenshot not reproduced]

UAX#29 currently doesn't produce E for Devanagari, even though E is what the ilreq doc requires. It produces something more like C. But UAX#29 is about to change, so that by default a whole consonant cluster will be seen as a unit (ie. E). The effect of that upcoming change is not completely clear, however, for scripts like Tamil, or for Devanagari when the virama is showing. I'm looking for someone to provide expert advice on what would be expected in those situations.

A reliance on the shape of the text is not described in the ilreq document, which i think is problematic. (It's also problematic for the general concept of grapheme clusters in Unicode, which should treat the whole of a conjunct sequence such as ஶ்ரீ as one unit, but not a Tamil consonant cluster such as த்தை.)

akshatsj commented 6 years ago

I am not a native Tamil speaker, but I have worked on this issue at ICANN's Neo-Brahmi Generation Panel, where similar work was undertaken to identify validation rules for prospective domain name labels. Those validation rules were, in a way, doing syllable boundary identification and enforcing proper akshar formation in Indian-language domain names. As I see it, a native Tamil speaker would say that there are only two conjuncts in Tamil, ksha and shree, and none apart from these. That is, their interpretation of a valid conjunct cluster is limited to these two. Any other CHC combination (where C is a consonant and H is the virama/pulli) is probably CH | C (two separate akshars), and the user expects the cursor to stop, a line to break, and a drop cap to end after the H. My thoughts.

So, the ILreq ABNF may need to be changed to accommodate this requirement.
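
To make the two candidate behaviours concrete, here is a minimal Python sketch. It is not the ilreq ABNF itself: the character classes are deliberately simplified, and the function name `segment` and the policy names `cluster` and `pulli` are hypothetical, chosen only for illustration. The `cluster` policy keeps consonants joined by a pulli in one unit; the `pulli` policy breaks after every pulli except inside க்ஷ, ஶ்ரீ and ஸ்ரீ, as suggested above.

```python
import re

# Simplified Tamil character classes (assumptions for illustration only;
# a real rule set would also need to handle ZWJ/ZWNJ, aytham, digits, etc.).
C = "[\u0B95-\u0BB9]"    # consonants
H = "\u0BCD"             # pulli (virama)
V = "[\u0B85-\u0B94]"    # independent vowels
VS = "[\u0BBE-\u0BCC]"   # dependent vowel signs
M = "\u0B82"             # anusvara

# "cluster": any run of consonant+pulli stays attached to the next consonant.
CLUSTER = re.compile(
    f"(?:{C}{H})*{C}(?:{H}|{VS}?{M}?)|{V}{M}?|.", re.DOTALL
)

# "pulli": break after a pulli, except in the conjuncts named in this thread.
KSHA = "\u0B95\u0BCD\u0BB7"            # க்ஷ
SHRI = "[\u0BB6\u0BB8]\u0BCD\u0BB0\u0BC0"  # ஶ்ரீ or ஸ்ரீ
PULLI = re.compile(
    f"{KSHA}{VS}?{M}?|{SHRI}{M}?|{C}(?:{H}|{VS}?{M}?)|{V}{M}?|.", re.DOTALL
)


def segment(text, policy="cluster"):
    """Split text into units under one of the two hypothetical policies."""
    pattern = CLUSTER if policy == "cluster" else PULLI
    return pattern.findall(text)


word = "யாவற்றையும்"
print(segment(word, "cluster"))  # ['யா', 'வ', 'ற்றை', 'யு', 'ம்']
print(segment(word, "pulli"))    # ['யா', 'வ', 'ற்', 'றை', 'யு', 'ம்']
```

Under these assumptions the two policies differ only at ற்றை, while ஶ்ரீ stays a single unit under both.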

r12a commented 6 years ago

And would you say that the same rule applies for Devanagari text such as the examples above, where the virama is explicitly shown? (This is where things become difficult, because the explicit virama may be a side-effect of the font, rather than an encoding difference, but i'd like to see if we can at least first clarify what the user would expect.)

r12a commented 6 years ago

This issue was discussed in a meeting.

r12a commented 5 months ago

See also https://github.com/w3c/ilreq/issues/31