w3c / iip

Documenting gaps and requirements for support of Indic languages on the Web and in eBooks.
https://w3c.github.io/iip/

Independent vowels are confusing #95

Open r12a opened 4 years ago

r12a commented 4 years ago

Like other Indic scripts, Gurmukhi has independent vowels that can be visualised as a sequence of 2 code points, although Unicode provides a precomposed code point for each independent vowel. The precomposed code points and the decomposed sequences that may be rendered to look the same are not canonically equivalent in Unicode, so text that looks identical can fail to match, which may be problematic for users who are unaware of the difference.
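As a rough illustration of the non-equivalence described above, the following Python sketch compares the precomposed ਆ (U+0A06) with the look-alike sequence ਅ + ਾ (U+0A05 U+0A3E) under each Unicode normalization form. The helper name `canonically_equivalent` is made up for this example, and the result depends on the Unicode data shipped with the Python build in use.

```python
import unicodedata

# Precomposed independent vowel: U+0A06 GURMUKHI LETTER AA
precomposed = "\u0A06"
# Look-alike sequence: U+0A05 GURMUKHI LETTER A + U+0A3E GURMUKHI VOWEL SIGN AA
decomposed = "\u0A05\u0A3E"


def canonically_equivalent(a: str, b: str) -> bool:
    """Hypothetical helper: canonical equivalence means identical NFD forms."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# None of the four normalization forms unifies the two spellings.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    same = unicodedata.normalize(form, precomposed) == unicodedata.normalize(form, decomposed)
    print(form, same)

# U+0A06 carries no decomposition mapping, so normalization cannot help here.
print(repr(unicodedata.decomposition(precomposed)))
print(canonically_equivalent(precomposed, decomposed))
```

For contrast, a pair such as "é" (U+00E9) and "e" + U+0301 does compare equal under the same helper, because Unicode defines a canonical decomposition for it.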

This is particularly pronounced for Gurmukhi because in principle independent vowels are (visually) a vowel carrier plus a vowel sign. For more information see Standalone vowels.

Searching Google for the word ਅਾਲੂ (potato), where the initial 'a' sound is composed of 2 code points, rather than the precomposed code point recommended by Unicode, produces 2,570 pages, compared to 361,000 using the precomposed character. While this is small in comparison (0.7%), it is large enough to indicate an issue.

Browsers should be able to recognise the decomposed sequences and treat them as equivalent to the precomposed code points for sorting, search, collation, etc.
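One way such equivalence might be approximated for search is to fold the decomposed sequences to their precomposed code points before comparing. The Python sketch below is only an assumption about how this could be done, not a description of what any browser does. The `FOLD` table and the `fold_vowels` function are hypothetical names, and the table is deliberately incomplete: it lists only pairs mentioned in this thread.

```python
import unicodedata

# Illustrative, incomplete mapping (assumption): decomposed look-alike -> precomposed.
FOLD = {
    "\u0A05\u0A3E": "\u0A06",  # Gurmukhi ਅ + ਾ -> ਆ
    "\u0A73\u0A41": "\u0A09",  # Gurmukhi ੳ + ੁ -> ਉ
    "\u0905\u093E": "\u0906",  # Devanagari अ + ा -> आ
    "\u0D12\u0D3E": "\u0D13",  # Malayalam ഒ + ാ -> ഓ
}


def fold_vowels(text: str) -> str:
    """Hypothetical search key: NFC first, then fold the sequences listed in FOLD."""
    text = unicodedata.normalize("NFC", text)
    for sequence, precomposed in FOLD.items():
        text = text.replace(sequence, precomposed)
    return text


typed = "\u0A05\u0A3E\u0A32\u0A42"    # ਅਾਲੂ spelled with 2 code points for the vowel
recommended = "\u0A06\u0A32\u0A42"    # ਆਲੂ spelled with the precomposed vowel

print(typed == recommended)                            # False
print(fold_vowels(typed) == fold_vowels(recommended))  # True
```

Folding at comparison time leaves the stored text untouched, which matters if the original spelling needs to be preserved or reported back to the author.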

Many fonts produce a dotted circle or fail to correctly align the glyphs of the decomposed sequence, which helps reduce this issue by making the non-recommended spelling visible. However, some fonts do not (such as the Gurmukhi MN Mac system font).

r12a commented 4 years ago

The first comment in this issue contains text that will automatically appear in the Gurmukhi gap-analysis document as a subsection with the same title as this issue. Any edits made to that comment will be immediately available in the document. Proposals for changes or discussion of the content can be made in comments below this point.

lianghai commented 4 years ago

This is not a Gurmukhi-specific issue. All Indic scripts encoded with the ISCII model suffer from this issue, for example, Devanagari आ ≠ अ + ा. It is a mismatch between phonetic segmentation (the basis of the ISCII model) and graphic segmentation of text.

The only special aspect of Gurmukhi is that the three vowel-sign carriers (not “independent vowels”) recognized by the native analysis as letters are not all used as independent vowels (ie, ੳ and ੲ are native vowel-sign-carrier letters that are not independent vowels).

But from a confusability point of view, ੳ and ੲ are just two directly encoded letters (and thus accessible to users when inputting). Gurmukhi ਉ ≠ ੳ + ੁ isn’t much different from Malayalam ഓ ≠ ഒ + ാ.

r12a commented 1 year ago

Rewrote Kulpreet's original text. Will add links to the upcoming Gurmukhi layout page when it is available.