Wikipedia issues

mikeday commented 5 years ago

To make progress on Indic shaping, we've assembled a corpus of words and syllables by scraping Wikipedia for the ten Indic languages we plan to support (hi.wikipedia.org, bn.wikipedia.org, etc.).

That has given us 22803 unique syllables for Hindi, 10404 for Bengali, and so on, which we can use as test cases for shaping.

The code for this is located at https://github.com/yeslogic/corpus

However, we have found some oddities in the Wikipedia text, such as the use of many code points in the Indic blocks that are officially unassigned:

Bengali:

\u{9b1} \u{9b3} \u{9c9} \u{9e4} \u{9e5}

Gurmukhi:

\u{a0b} \u{a0c} \u{a11} \u{a37} \u{a3b} \u{a3d} \u{a43} \u{a52} \u{a53} \u{a54} \u{a58} \u{a5f} \u{a60} \u{a61} \u{a64}

Gujarati:

\u{a92} \u{aa9} \u{ad8} \u{add} \u{ae4} \u{ae5} \u{af3} \u{af5}

Oriya:

\u{b34} \u{b49} \u{b54} \u{b58} \u{b5a} \u{b5b} \u{b5e} \u{b64} \u{b65}

Tamil:

\u{b8b} \u{b96} \u{b97} \u{b98} \u{b9b} \u{b9d} \u{ba0} \u{ba1} \u{ba2} \u{ba5} \u{ba6} \u{ba7} \u{bab} \u{bac} \u{bad} \u{bbc} \u{bc9} \u{be0}

Telugu:

\u{c50} \u{c5b} \u{c64}

Kannada:

\u{cbb} \u{cc9} \u{cf5}

Malayalam:

\u{d49}

Sinhala:

\u{d80} \u{d81} \u{d84} \u{d97} \u{d98} \u{d99} \u{db2} \u{dbc} \u{dbe} \u{dbf} \u{dc7} \u{dc8} \u{dc9} \u{dcb} \u{dcc} \u{dcd} \u{dce} \u{dd5} \u{dd7} \u{de0} \u{de1} \u{de2} \u{de3} \u{de4} \u{de5} \u{df0} \u{df1} \u{df5} \u{df6} \u{df7} \u{df8} \u{df9} \u{dfa} \u{dfb} \u{dfc} \u{dfd} \u{dfe} \u{dff}
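For anyone who wants to run a similar check on their own text, here is a minimal Python sketch. It is not the code from the corpus repository. The names `BLOCKS`, `unassigned_by_block` and `format_escapes` are made up for illustration, and it assumes that "unassigned" means general category `Cn` in the Unicode version bundled with the interpreter.

```python
import unicodedata
from collections import defaultdict

# Unicode block ranges for the scripts discussed in this thread.
BLOCKS = {
    "Devanagari": (0x0900, 0x097F),
    "Bengali": (0x0980, 0x09FF),
    "Gurmukhi": (0x0A00, 0x0A7F),
    "Gujarati": (0x0A80, 0x0AFF),
    "Oriya": (0x0B00, 0x0B7F),
    "Tamil": (0x0B80, 0x0BFF),
    "Telugu": (0x0C00, 0x0C7F),
    "Kannada": (0x0C80, 0x0CFF),
    "Malayalam": (0x0D00, 0x0D7F),
    "Sinhala": (0x0D80, 0x0DFF),
}


def unassigned_by_block(text):
    """Map block name -> sorted list of unassigned code points seen in text."""
    found = defaultdict(set)
    for ch in text:
        # "Cn" is the general category for unassigned code points
        # (assumption: this matches what "officially unassigned" means here).
        if unicodedata.category(ch) != "Cn":
            continue
        cp = ord(ch)
        for name, (lo, hi) in BLOCKS.items():
            if lo <= cp <= hi:
                found[name].add(cp)
                break
    return {name: sorted(cps) for name, cps in found.items()}


def format_escapes(code_points):
    """Render code points in the \\u{...} notation used in the lists above."""
    return " ".join(f"\\u{{{cp:x}}}" for cp in code_points)


if __name__ == "__main__":
    # Bengali KA, followed by two code points from the lists above.
    sample = "\u0995\u09b1\u0d49"
    for block, cps in unassigned_by_block(sample).items():
        print(f"{block}: {format_escapes(cps)}")
```

The result depends on `unicodedata.unidata_version`: a code point assigned in a recent Unicode release is still reported as unassigned by an interpreter that ships older data.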

mikeday commented 5 years ago

We have also identified many uses of Latin combining characters applied to Indic text:

U+0300 grave (ta)
U+0301 acute (ta, ml)
U+0302 circumflex (ta)
U+0304 macron (ml)
U+0305 overline (ml)
U+0306 breve (ml)
U+0308 diaeresis (ta, te)
U+030C caron (ml)
U+0310 candrabindu (hi)
U+0315 comma above right (pa)
U+031D up tack below (ta)
U+0332 low line (te)
U+033A inverted bridge below (ml)
U+0346 bridge above (ta)
U+0360 double tilde (hi)
U+0368 small letter c (hi)
U+036B small letter m (hi)
U+036C small letter r (hi)

(The language in which we saw each mark used is given in parentheses.)

With some sample usage:

[A] + [U+300]
[A] + [U+301]
[Aa] + [U+302]
[Bha] + [U+304]
[Ma] + [U+305]
[Sa] + [U+306]
[A] + [U+308]
[Va] + [U+30C]
[Ma] + [Sign Aa] + [U+310]
[Va] + [Sign Aa] + [Ra] + [U+315]
[Va] + [Sign Aa] + [Llla] + [Virama] + [Ta] + [Virama] + [Ta] + [Sign U] + [Ka] + [Lla] + [Virama] + [U+31D]
[Na] + [Sign Aa] + [Ra] + [Da] + [Sign I] + [U+332]
[A] + [Va] + [Ka] + [Sign Aa] + [Sha] + [Anusvara] + [U+33A]
[A] + [U+346]
[Ka] + [U+360] + [Ra]
[Sha] + [Sign Aa] + [U+368] + [Kha] + [Ta]
[Pha] + [Sign Aa] + [Ya] + [La] + [Sign O] + [U+36B] + [Dda] + [Ka]
[A] + [Va] + [U+36C] + [Dha]

The question is whether these combining characters should be "transparent" for the purposes of syllable identification instead of breaking the syllable.
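To make the "transparent" option concrete, here is a rough Python sketch of one way it could work: pull the Latin combining marks out before syllable identification and remember where they were, so they can be reattached afterwards. This is only an illustration of the idea, not a proposal for the shaping documents. The function names are hypothetical, and it assumes that only the Combining Diacritical Marks block (U+0300 to U+036F) is treated this way.

```python
# Combining Diacritical Marks block, which contains every mark listed above.
MARKS_LO, MARKS_HI = 0x0300, 0x036F


def is_latin_combining_mark(ch):
    """True if ch lies in the Combining Diacritical Marks block."""
    return MARKS_LO <= ord(ch) <= MARKS_HI


def split_transparent_marks(text):
    """Separate the Latin combining marks from the rest of the text.

    Returns (base, marks): base is the text with the marks removed, and
    marks is a list of (index, mark) pairs, where index is the number of
    base characters that precede the mark.
    """
    base = []
    marks = []
    for ch in text:
        if is_latin_combining_mark(ch):
            marks.append((len(base), ch))
        else:
            base.append(ch)
    return "".join(base), marks


if __name__ == "__main__":
    # [Ka] + [U+360] + [Ra] from the samples above, assuming Devanagari.
    base, marks = split_transparent_marks("\u0915\u0360\u0930")
    print([f"U+{ord(c):04X}" for c in base])
    print([(i, f"U+{ord(m):04X}") for i, m in marks])
```

Syllable identification would then run on `base` alone, so a mark in the middle of a sequence would no longer split it. Whether the reattached marks render the way the author intended is a separate question.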

n8willis commented 5 years ago

[Regarding the Latin combining marks] It looks like they occur at the ends of syllables (namely, not between a consonant and a dependent vowel sign), so it seems like skipping them would have no ill effect. Whether or not skipping them results in what the user is expecting to see is, naturally, a different question ... but the user is playing with fire there.

The ones that are combining lower-case letters might be there to serve as footnote/superscript-type annotations ... is that possible to check?

dscorbett commented 5 years ago

All the combining Latin letters I checked were misencoded pre-base vowel signs, like <U+036B U+0921> for <U+0921 U+093F>.
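A check along these lines could be scripted roughly as follows. This is a hedged sketch, not the method actually used here: the function name and the `INDIC_LO`/`INDIC_HI` range are assumptions for illustration. It flags any combining Latin small letter (U+0363 to U+036F) that sits directly before an Indic letter, which is the pattern in the example above. It does not attempt a repair, since only the one correspondence is given in this thread.

```python
import unicodedata

# Combining Latin small letters a, e, i, o, u, c, d, h, m, r, t, v, x.
LATIN_LETTER_LO, LATIN_LETTER_HI = 0x0363, 0x036F
# Devanagari through Sinhala (assumption: the blocks discussed in this thread).
INDIC_LO, INDIC_HI = 0x0900, 0x0DFF


def find_suspect_latin_letters(text):
    """Return (index, mark, following_letter) for each suspicious pair."""
    hits = []
    for i in range(len(text) - 1):
        mark, nxt = text[i], text[i + 1]
        if not LATIN_LETTER_LO <= ord(mark) <= LATIN_LETTER_HI:
            continue
        # "Lo" covers Indic consonants and independent vowels.
        if INDIC_LO <= ord(nxt) <= INDIC_HI and unicodedata.category(nxt) == "Lo":
            hits.append((i, mark, nxt))
    return hits


if __name__ == "__main__":
    # <U+036B U+0921>, the misencoded form of <U+0921 U+093F>.
    for index, mark, letter in find_suspect_latin_letters("\u036b\u0921"):
        print(index, f"U+{ord(mark):04X}", f"U+{ord(letter):04X}")
```

Running it over the corpus would show how many of the combining-letter occurrences fit this pattern, and leave the rest to be checked by hand.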

epigraphic commented 4 years ago

I googled unassigned code points in the Gurmukhi block and, for some of them, found Unicode Technical Committee proposals for additional characters.

n8willis / opentype-shaping-documents

Wikipedia issues #37