Open mikeday opened 5 years ago
We have also identified many uses of Latin combining characters applied to Indic text:
U+0300 grave (ta)
U+0301 acute (ta, ml)
U+0302 circumflex (ta)
U+0304 macron (ml)
U+0305 overline (ml)
U+0306 breve (ml)
U+0308 diaeresis (ta, te)
U+030C caron (ml)
U+0310 candrabindu (hi)
U+0315 comma above right (pa)
U+031D up tack below (ta)
U+0332 low line (te)
U+033A inverted bridge below (ml)
U+0346 bridge above (ta)
U+0360 double tilde (hi)
U+0368 small letter c (hi)
U+036B small letter m (hi)
U+036C small letter r (hi)
(The language in which we saw each one used is given in parentheses.)
With some sample usage:
[A] + [U+300]
[A] + [U+301]
[Aa] + [U+302]
[Bha] + [U+304]
[Ma] + [U+305]
[Sa] + [U+306]
[A] + [U+308]
[Va] + [U+30C]
[Ma] + [Sign Aa] + [U+310]
[Va] + [Sign Aa] + [Ra] + [U+315]
[Va] + [Sign Aa] + [Llla] + [Virama] + [Ta] + [Virama] + [Ta] + [Sign U] + [Ka] + [Lla] + [Virama] + [U+31D]
[Na] + [Sign Aa] + [Ra] + [Da] + [Sign I] + [U+332]
[A] + [Va] + [Ka] + [Sign Aa] + [Sha] + [Anusvara] + [U+33A]
[A] + [U+346]
[Ka] + [U+360] + [Ra]
[Sha] + [Sign Aa] + [U+368] + [Kha] + [Ta]
[Pha] + [Sign Aa] + [Ya] + [La] + [Sign O] + [U+36B] + [Dda] + [Ka]
[A] + [Va] + [U+36C] + [Dha]
The question is whether these combining characters should be "transparent" for the purposes of syllable identification instead of breaking the syllable.
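To make the two options concrete, here is a minimal Python sketch of a toy syllable splitter with a switch for transparency. It is not the shaping engine's real syllable grammar: the rule it uses (a letter starts a new syllable unless it follows a virama, and any mark joins the current syllable) is a deliberate simplification, and the names `split_syllables` and `is_latin_combining` are made up for illustration.

```python
import unicodedata

VIRAMA_CCC = 9  # canonical combining class shared by the Indic viramas


def is_latin_combining(ch):
    """True for the Combining Diacritical Marks block, U+0300..U+036F."""
    return 0x0300 <= ord(ch) <= 0x036F


def split_syllables(text, transparent=True):
    """Toy splitter: a letter starts a new syllable unless it follows a virama."""
    syllables, current, prev = [], "", None
    for ch in text:
        if is_latin_combining(ch):
            if transparent:
                current += ch  # keep the mark, leave `prev` untouched
            else:
                if current:
                    syllables.append(current)
                syllables.append(ch)  # the mark becomes its own cluster
                current, prev = "", None
            continue
        is_mark = unicodedata.category(ch).startswith("M")
        after_virama = prev is not None and unicodedata.combining(prev) == VIRAMA_CCC
        if current and (is_mark or after_virama):
            current += ch
        else:
            if current:
                syllables.append(current)
            current = ch
        prev = ch
    if current:
        syllables.append(current)
    return syllables


# [Ma] + [Sign Aa] + [U+310], using Devanagari code points
sample = "\u092E\u093E\u0310"
print(split_syllables(sample, transparent=True))
print(split_syllables(sample, transparent=False))
```

With `transparent=True` the sample stays a single syllable; with `transparent=False` the candrabindu is split off into a cluster of its own. Because `prev` is not updated for a transparent mark, a mark placed after a virama would not stop the following consonant from joining the conjunct.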
[Regarding the Latin combining marks] It looks like they occur at the ends of syllables (namely, not between a consonant and a dependent vowel sign), so it seems like skipping them would have no ill effect. Whether or not skipping them results in what the user is expecting to see is, naturally, a different question ... but the user is playing with fire there.
The ones that are combining lower-case letters might be there to serve as footnote/superscript-type annotations ... is that possible to check?
All the combining Latin letters I checked were misencoded pre-base vowel signs, like <U+036B U+0921> for <U+0921 U+093F>.
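A check along these lines could flag such cases in the corpus. This is a rough Python sketch, not code from the corpus repository: it reports any combining Latin small letter (U+0363..U+036F) that sits immediately before a letter in the U+0900..U+0DFF range, which is where a misencoded pre-base vowel sign would land. The function name `find_prebase_suspects` is hypothetical, and the sketch only detects; it does not guess which vowel sign was intended.

```python
import unicodedata


def find_prebase_suspects(text):
    """Return (index, mark, following letter) for each suspicious pair."""
    suspects = []
    for i in range(len(text) - 1):
        mark, following = text[i], text[i + 1]
        is_latin_letter_mark = 0x0363 <= ord(mark) <= 0x036F
        # Devanagari through Sinhala; "Lo" limits this to base letters
        is_indic_letter = (
            0x0900 <= ord(following) <= 0x0DFF
            and unicodedata.category(following) == "Lo"
        )
        if is_latin_letter_mark and is_indic_letter:
            suspects.append((i, f"U+{ord(mark):04X}", f"U+{ord(following):04X}"))
    return suspects


# [Pha] + [Sign Aa] + [Ya] + [La] + [Sign O] + [U+36B] + [Dda] + [Ka]
word = "\u092B\u093E\u092F\u0932\u094B\u036B\u0921\u0915"
print(find_prebase_suspects(word))
```

For this sample the only hit is the pair <U+036B U+0921>, the same sequence quoted above.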
I googled unassigned code points in the Gurmukhi block and, for some of them, found Unicode Technical Committee proposals for additional characters.
To make progress on Indic shaping, we've assembled a corpus of words and syllables by scraping Wikipedia for the ten Indic languages we plan to support (hi.wikipedia.org, bn.wikipedia.org, etc.).
That has given us 22803 unique syllables for Hindi, 10404 for Bengali, and so on, which we can use as test cases for shaping.
The code for this is located at https://github.com/yeslogic/corpus
However we have found some oddities in the Wikipedia text, such as the use of many Indic codepoints that are officially unassigned:
Bengali:
\u{9b1} \u{9b3} \u{9c9} \u{9e4} \u{9e5}
Gurmukhi:
\u{a0b} \u{a0c} \u{a11} \u{a37} \u{a3b} \u{a3d} \u{a43} \u{a52} \u{a53} \u{a54} \u{a58} \u{a5f} \u{a60} \u{a61} \u{a64}
Gujarati:
\u{a92} \u{aa9} \u{ad8} \u{add} \u{ae4} \u{ae5} \u{af3} \u{af5}
Oriya:
\u{b34} \u{b49} \u{b54} \u{b58} \u{b5a} \u{b5b} \u{b5e} \u{b64} \u{b65}
Tamil:
\u{b8b} \u{b96} \u{b97} \u{b98} \u{b9b} \u{b9d} \u{ba0} \u{ba1} \u{ba2} \u{ba5} \u{ba6} \u{ba7} \u{bab} \u{bac} \u{bad} \u{bbc} \u{bc9} \u{be0}
Telugu:
\u{c50} \u{c5b} \u{c64}
Kannada:
\u{cbb} \u{cc9} \u{cf5}
Malayalam:
\u{d49}
Sinhala:
\u{d80} \u{d81} \u{d84} \u{d97} \u{d98} \u{d99} \u{db2} \u{dbc} \u{dbe} \u{dbf} \u{dc7} \u{dc8} \u{dc9} \u{dcb} \u{dcc} \u{dcd} \u{dce} \u{dd5} \u{dd7} \u{de0} \u{de1} \u{de2} \u{de3} \u{de4} \u{de5} \u{df0} \u{df1} \u{df5} \u{df6} \u{df7} \u{df8} \u{df9} \u{dfa} \u{dfb} \u{dfc} \u{dfd} \u{dfe} \u{dff}
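For anyone wanting to reproduce lists like these, one possible approach is sketched below in Python. It is an assumption about method, not the code in the corpus repository: it treats a code point as unassigned when `unicodedata.category` reports `Cn`, and groups hits by Unicode block. The answer depends on the Unicode version bundled with the Python build, so code points assigned in later Unicode releases will drop out of the results. The name `unassigned_in_use` is made up for illustration.

```python
import unicodedata

# Unicode block ranges for the scripts listed above
BLOCKS = {
    "Bengali": (0x0980, 0x09FF),
    "Gurmukhi": (0x0A00, 0x0A7F),
    "Gujarati": (0x0A80, 0x0AFF),
    "Oriya": (0x0B00, 0x0B7F),
    "Tamil": (0x0B80, 0x0BFF),
    "Telugu": (0x0C00, 0x0C7F),
    "Kannada": (0x0C80, 0x0CFF),
    "Malayalam": (0x0D00, 0x0D7F),
    "Sinhala": (0x0D80, 0x0DFF),
}


def unassigned_in_use(text):
    """Map block name -> sorted unassigned code points that occur in text."""
    found = {}
    for ch in text:
        if unicodedata.category(ch) != "Cn":
            continue  # assigned in this Python's Unicode database
        for name, (low, high) in BLOCKS.items():
            if low <= ord(ch) <= high:
                found.setdefault(name, set()).add(ord(ch))
    return {name: sorted(cps) for name, cps in found.items()}


# Two assigned letters mixed with \u{9b1} and \u{a0b} from the lists above
text = "\u0995\u09B1\u0A15\u0A0B"
for block, cps in unassigned_in_use(text).items():
    print(block, " ".join(f"\\u{{{cp:x}}}" for cp in cps))
```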