keymanapp / keyboards

Open Source Keyman keyboards

[itrans_devanagari_hindi] [itrans_bengali] Indic Phonetic ITRANS keyboards bug? (Devanagari and Bangla) #2249

Open milindchakraborty opened 1 year ago

milindchakraborty commented 1 year ago

I use the Hindi Devanagari Phonetic (ITRANS) keyboard, Vedic Sanskrit Devanagari Phonetic (ITRANS) for Devanagari, and Bengali Phonetic (ITRANS) for Bangla. I just noticed that the first of the following problems extends to other ITRANS keyboards too.

  1. How can I type the punctuation symbols [, ], {, }, \, | with the above keyboards without turning the keyboard off? At present they render as ऎ, ꣾ, ऒ, ॵ, ॲ, because the Keyman keyboard assigns those Unicode code points to these keys. Similarly, how do I type #, $, %, ^, &, *, _, + without getting ॐ, ₹, ࿕ (also, what is this?), ्, ☸, ❀? (If this is a bug and can't be done at present, I suggest "` + the character in question" to get the literal character, as Kannada WinScript (NLCI) does, or something similar.) I can't type #, $, %, ^, &, *, _, + in Bangla either; I get ওঁ, ₹, ࿕, ্, ☸, ❀. So I think this needs to be looked at across the ITRANS keyboards in general.

  2. Another bug I want to bring to your notice: inputs such as .D, .Dh, .y, .k, .kh, .g, .j, .f, .n, .r, .L render ড়, ঢ়, য় (Bangla) and ड़, ढ़, य़, क़, ख़, ग़, ज़, फ़, ऩ, ऱ, ऴ (Devanagari) as "letter + nuqta", although many of these letters have their own code points in Unicode. The keyboard therefore emits two characters where one would do, which is a practice Unicode disapproves of. In Bangla it emits ড/ঢ/য + ় (U+09BC), and in Devanagari it emits ड/ढ/य/क/ख/ग/ज/फ/न/र/ळ + ़ (U+093C). I believe this needs to be fixed so that the following characters can be input as single code points instead.

     In Devanagari: क़ (U+0958), ख़ (U+0959), ग़ (U+095A), ज़ (U+095B), ड़ (U+095C), ढ़ (U+095D), फ़ (U+095E), य़ (U+095F), ऩ (U+0929), ऱ (U+0931), ऴ (U+0934). Some more letters that are needed: व़ (व + ़), ट़ (ट + ़), द़ (द + ़), त़ (त + ़), थ़ (थ + ़), च़ (च + ़), छ़ (छ + ़). Honestly, apart from the characters above, which have their own code points, there should be a way to add a nuqta to any other letter when the need arises. Such letters are often used for phonologically accurate representations of various languages.

     In Bangla: ড় (U+09DC), ঢ় (U+09DD), য় (U+09DF), জ় (জ + ়), ফ় (ফ + ়), ক় (ক + ়), খ় (খ + ়), গ় (গ + ়), ল় (ল + ়), ভ় (ভ + ়). [Kindly add an option to add a nuqta to other letters too, in case one needs it for phonetic representation; for example, জ় isn't supported in Bangla, but it is often used by media houses like Anandabazar Patrika; ল় and ভ় are used to represent the retroflex l and the English v sounds in phonological representation.]

     Vowel signs should attach to the nuqta-ed letter: typing, say, .li or .ve should give ল়ি, व़े.

  3. Also in ITRANS Bangla, how do we type র‍্য (Bengali Ra + ZWJ + VIRAMA + Bengali Ya) and অ্য (Bengali A + VIRAMA + Bengali Ya)? র‍্যা (ræ) and অ্যা (æ) are very necessary character sequences in Bangla, thus র‍্য and অ্য are necessary; I suggest r+y for র‍্য and a+y for অ্য.

  4. In the Hindi keyboard, in Sanskrit mode (Ctrl+Alt+0), typing nj gives ञ्ज्; typing nc gives nc (I don't know why); but typing nch gives ञ्च्. I would like to know why there are no key sequences for ञ्छ् and ञ्झ्.
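
Requests 1 and 2 above can be sketched in a few lines of Python. This is only an illustration of the requested behaviour, not the keyboards' actual Keyman source: the rule table `RULES`, the `ESCAPE` key, and the function `convert` are all hypothetical names, and the table covers just four of the key sequences mentioned in this issue.

```python
# Hypothetical sketch of the requested behaviour; the real keyboards are
# written in the Keyman keyboard language and have many more rules.
ESCAPE = "`"  # assumed escape key, as suggested in point 1

# A tiny, assumed rule table: three Bangla letters emitted as single
# code points (point 2) and one symbol key that is currently remapped.
RULES = {
    ".Dh": "\u09DD",  # BENGALI LETTER RHA
    ".D": "\u09DC",   # BENGALI LETTER RRA
    ".y": "\u09DF",   # BENGALI LETTER YYA
    "$": "\u20B9",    # INDIAN RUPEE SIGN
}
MAX_KEY = max(len(key) for key in RULES)


def convert(keys: str) -> str:
    """Map a key sequence to output text, longest rule first."""
    out = []
    i = 0
    while i < len(keys):
        # Escape key followed by any character: emit that character as-is.
        if keys[i] == ESCAPE and i + 1 < len(keys):
            out.append(keys[i + 1])
            i += 2
            continue
        # Try the longest possible rule at this position, then shorter ones.
        for length in range(MAX_KEY, 0, -1):
            chunk = keys[i:i + length]
            if chunk in RULES:
                out.append(RULES[chunk])
                i += len(chunk)
                break
        else:
            # No rule matched: pass the key through unchanged.
            out.append(keys[i])
            i += 1
    return "".join(out)


print(convert("$"), convert("`$"), convert(".Dh"))
```

With this table, `$` alone gives the rupee sign, while a backtick followed by `$` gives a literal `$`, which is the escape behaviour proposed in point 1.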

mcdurdin commented 1 year ago

@Shreeshrii: as these are your keyboards, would you like to respond to this feedback?

MakaraSok commented 1 year ago

Original post: https://community.software.sil.org/t/hindi-devanagari-phonetic-itrans-and-vedic-sanskrit-devanagari-phonetic-itrans-bug/7580

devosb commented 1 year ago

It should not matter whether one or two codepoints are used for nukta forms. Unicode specifies that the two forms are canonically equivalent. In the case of Indic nuktas, even NFC (which usually generates a composed form, if possible) decomposes the single-codepoint nukta forms. That is, NFC(U+0958) = NFC(U+0915, U+093C) = U+0915, U+093C.

For Indic vowel signs, generally the multi-part vowel signs, such as U+09CB, are split into parts (U+09C7, U+09BE) for rendering. So I don't think Unicode would have any problem with using two codepoints even if a single codepoint was available. I have heard, years ago, of some Latin script spellcheckers not working with decomposed characters (two codepoints in that case) when the spellcheckers would work with a single codepoint. But that is a bug in the application, not a preference of Unicode.
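
The normalization behaviour described above can be checked with Python's standard `unicodedata` module. This is a minimal sketch: the helper `codepoints` is a hypothetical name introduced for illustration, and the results come from the Unicode data bundled with the Python build in use.

```python
import unicodedata


def codepoints(text: str) -> str:
    """Format a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# Devanagari QA: precomposed U+0958 versus KA + NUKTA (U+0915 U+093C).
qa_single = "\u0958"
qa_pair = "\u0915\u093C"
nfc_qa_single = unicodedata.normalize("NFC", qa_single)
nfc_qa_pair = unicodedata.normalize("NFC", qa_pair)

# Bengali vowel sign O: U+09CB versus its two parts (U+09C7 U+09BE).
o_single = "\u09CB"
o_parts = "\u09C7\u09BE"
nfd_o_single = unicodedata.normalize("NFD", o_single)
nfc_o_parts = unicodedata.normalize("NFC", o_parts)

print("NFC(U+0958)        =", codepoints(nfc_qa_single))
print("NFC(U+0915 U+093C) =", codepoints(nfc_qa_pair))
print("NFD(U+09CB)        =", codepoints(nfd_o_single))
print("NFC(U+09C7 U+09BE) =", codepoints(nfc_o_parts))
```

Both NFC results for the nukta letter are the two-codepoint sequence, because the precomposed nukta letters U+0958 to U+095F are excluded from composition. The vowel sign behaves differently under normalization: NFD splits U+09CB into its parts and NFC recombines them.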

milindchakraborty commented 1 year ago

The issue is not how ড়, ঢ়, য় look with a nuqta, but that in Bangla these are not nuqta forms at all: they are independent letters that have been listed separately in the alphabet since as early as the nineteenth century. ড-ড়, ঢ-ঢ়, য-য় are different letters; the latter are not the result of adding a nuqta to the former.

In Hindi, I can understand keeping the nuqta separate for क़, ख़, ग़, ज़, फ़: under the standardisation by the Central Hindi Directorate, the nuqta on these five letters has been rendered redundant, so dictionaries are to return the same result whether a word is written with or without it. Both फ़रिश्ता, ज़मीन and फरिश्ता, जमीन are to be considered equally valid according to them. But that argument holds only for Hindi; many other languages written in Devanagari do maintain the difference, cannot use the two forms interchangeably, and interpret each pair as completely unrelated sounds. And even in Hindi it does not apply to ड़ and ढ़, nor does it apply to the Bangla ড়, ঢ়, য়. Note that ভয়, আড়, দৃঢ় can never be typed or interpreted as ভয, আড, দৃঢ. Moreover, the concept of nuqta (as in z, f, ḷ) is new to Bengali, introduced to add a few extra phonemes, and the aforementioned letters aren't nuqta-ed letters.

Second, this creates too many anomalies in digital texts, especially in digital corpora: when some people use a single codepoint and others use a pair to write the same word, NLP results become erroneous. Another, rather more problematic, example is writing গো as গ + ে + া instead of গ + ো. And in the case of Bangla, rendering ড়, ঢ়, য় with a nuqta is plainly wrong.
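
The corpus concern can be illustrated with a short Python sketch that counts the same word typed two ways, before and after NFC normalization. The function name `normalize_token` and the two-item sample "corpus" are assumptions made for this example. Normalizing at processing time only addresses the mismatch in stored text; it does not settle which form the keyboard should emit.

```python
import unicodedata
from collections import Counter


def normalize_token(token: str) -> str:
    """Bring a token to NFC so canonically equivalent spellings compare equal."""
    return unicodedata.normalize("NFC", token)


# The word ভয় typed with the single code point U+09DF,
# and typed as YA + NUKTA (U+09AF U+09BC).
bhoy_single = "\u09AD\u09DF"
bhoy_pair = "\u09AD\u09AF\u09BC"

# গো typed with vowel sign O (U+09CB), and with its two parts (U+09C7 U+09BE).
go_single = "\u0997\u09CB"
go_parts = "\u0997\u09C7\u09BE"

corpus = [bhoy_single, bhoy_pair, go_single, go_parts]

raw_counts = Counter(corpus)
normalized_counts = Counter(normalize_token(token) for token in corpus)

print("distinct tokens before normalization:", len(raw_counts))
print("distinct tokens after normalization: ", len(normalized_counts))
```

Before normalization the four strings count as four different tokens; after normalization they collapse to two, one per word.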

LornaSIL commented 1 year ago

It would be easy to fix, but I'd prefer the keyboard author to decide what should happen.

milindchakraborty commented 1 year ago

Are there any updates happening?

LornaSIL commented 11 months ago

I have not yet heard back. Sorry!

mcdurdin commented 11 months ago

I see @Shreeshrii has not been active on GitHub since May 2022, so we may need to move forward without their involvement?

LornaSIL commented 11 months ago

@milindchakraborty This was more difficult than I anticipated. I admit I hadn't looked at the source files :)

If you would be willing to test what I've done, you can download the itrans_devanagari_hindi.kmp from my dropbox.

I didn't do everything you asked because I'm a bit afraid of breaking the existing implementation. What I've done:

Let me know how it goes. If this one looks good, then I would try to address the bangla one.

gsghyd commented 2 days ago

These problems were anticipated and avoided more than a decade ago in ISIS Bangla/Bengali (since deprecated) and are also absent from Gautami Bangla/Bengali: https://keyman.com/keyboards/gautami_bangla_bengali.