w3c / iip

Documenting gaps and requirements for support of Indic languages on the Web and in eBooks.
https://w3c.github.io/iip/

Tamil conjuncts are not selected as a single unit when styling initials #116

Open r12a opened 3 years ago

r12a commented 3 years ago

When the start of a line contains a consonant cluster that uses a conjunct (rather than a visible virama), ::first-letter should highlight the whole cluster. Modern Tamil usually has only two of these conjuncts; however, one of them can be created in two ways (making a total of 3 clusters to test).

This doesn't work well if segmentation relies on Unicode grapheme clusters, since a conjunct with two consonants will be parsed as two grapheme clusters (the first ending after the virama, and the second starting with the second consonant and including any following vowel-signs or other combining characters).

For these situations it is necessary to tailor the segmentation algorithm, so that it recognises the whole consonant cluster plus any attached vowel-signs or combining characters as a single unit. This is a particular issue for Tamil, since all other clusters are typically decomposed and show the virama.
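
As an illustration of the tailoring described above, here is a minimal Python sketch. It is not how any browser implements ::first-letter. The `default_clusters` function only approximates Unicode grapheme clusters (a base character plus any following characters of general category Mn, Mc or Me), and the `CONJUNCTS` table and all function names are assumptions made for this example. The code point sequences used for KSHA and the two forms of SHRI are likewise an assumption about which three clusters are meant.

```python
import unicodedata

# Assumed code point sequences for the three clusters to test.
CONJUNCTS = (
    "\u0B95\u0BCD\u0BB7",        # KA + pulli + SSA (KSHA)
    "\u0BB8\u0BCD\u0BB0\u0BC0",  # SA + pulli + RA + vowel sign II (SHRI)
    "\u0BB6\u0BCD\u0BB0\u0BC0",  # SHA + pulli + RA + vowel sign II (SHRI)
)


def _is_mark(ch):
    """True for combining marks (rough stand-in for grapheme extenders)."""
    return unicodedata.category(ch) in ("Mn", "Mc", "Me")


def _consume_marks(text, i):
    """Return the index just past any combining marks starting at i."""
    while i < len(text) and _is_mark(text[i]):
        i += 1
    return i


def default_clusters(text):
    """Approximate untailored segmentation: base + following marks."""
    clusters = []
    i = 0
    while i < len(text):
        end = _consume_marks(text, i + 1)
        clusters.append(text[i:end])
        i = end
    return clusters


def tailored_clusters(text):
    """Like default_clusters, but keeps listed conjuncts as one unit."""
    clusters = []
    i = 0
    while i < len(text):
        end = i + 1
        for conjunct in CONJUNCTS:
            if text.startswith(conjunct, i):
                end = i + len(conjunct)
                break
        # Attach any further vowel signs or combining characters.
        end = _consume_marks(text, end)
        clusters.append(text[i:end])
        i = end
    return clusters


# The example word from this issue, written with escapes for clarity.
WORD = "\u0BB6\u0BCD\u0BB0\u0BC0\u0BA8\u0B95\u0BB0\u0BCD"
print(default_clusters(WORD)[0] == "\u0BB6\u0BCD")
print(tailored_clusters(WORD)[0] == "\u0BB6\u0BCD\u0BB0\u0BC0")
```

With the untailored function the first unit ends after the pulli, which matches the Blink and WebKit behaviour reported below. With the tailored function the first unit is the whole conjunct.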

Specs:

css-text-3: CSS uses the concept of 'typographic character unit', rather than grapheme cluster, in its specs, with the explanation that the cases just described go beyond the scope of the grapheme cluster concept and that implementations should provide appropriate support. The spec doesn't provide details about the support needed for each language.

The Unicode Consortium made some attempts to address this issue, but they have so far not yielded results. CLDR now flags up a few scripts for which conjuncts are common. Tamil is not among them.

Tests & results: Interactive test: When ::first-letter is applied to Tamil, the browser will select the KSHA and SHRI conjuncts as a single unit.
Gecko produces the expected result. Blink and WebKit select only the first consonant + pulli.

Browser bug reports: Chromium, WebKit

Priority: The priority here is advanced, since the impact of the failures cited here on the user is likely to be very small, especially since they can resort to markup in the rare cases where the conjuncts are not properly handled. Not many words begin with the conjuncts tested. (One example is ஶ்ரீநகர்.)
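
The markup workaround mentioned above could be generated along the following lines. This is only a sketch: the `wrap_initial` function, the `initial` class name and the `CONJUNCTS` table are hypothetical, and the page's stylesheet would need to style that class itself instead of relying on ::first-letter.

```python
import html
import unicodedata

# Assumed code point sequences for the KSHA and SHRI conjuncts.
CONJUNCTS = (
    "\u0B95\u0BCD\u0BB7",        # KA + pulli + SSA (KSHA)
    "\u0BB8\u0BCD\u0BB0\u0BC0",  # SA + pulli + RA + vowel sign II (SHRI)
    "\u0BB6\u0BCD\u0BB0\u0BC0",  # SHA + pulli + RA + vowel sign II (SHRI)
)


def wrap_initial(text, class_name="initial"):
    """Wrap the initial unit of text in a span (hypothetical helper)."""
    if not text:
        return ""
    end = 1
    for conjunct in CONJUNCTS:
        if text.startswith(conjunct):
            end = len(conjunct)
            break
    # Keep any attached vowel signs or combining characters with the initial.
    while end < len(text) and unicodedata.category(text[end]) in ("Mn", "Mc", "Me"):
        end += 1
    first, rest = html.escape(text[:end]), html.escape(text[end:])
    return f'<span class="{class_name}">{first}</span>{rest}'


# The example word from this issue, written with escapes for clarity.
print(wrap_initial("\u0BB6\u0BCD\u0BB0\u0BC0\u0BA8\u0B95\u0BB0\u0BCD"))
```

The span then covers the whole conjunct, so it can be styled as a single initial regardless of how the browser segments the text.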

r12a commented 3 years ago

The first comment in this issue contains text that will automatically appear in one or more gap-analysis documents as a subsection with the same title as this issue. Any edits made to that comment will be immediately available in the document. Proposals for changes or discussion of the content can be made in comments below this point.

Relevant gap analysis documents include: _Tamil_

xfq commented 2 years ago

Added links to bug reports.