indic-transliteration / indic_transliteration_py

Python package for indic script transliteration
MIT License

Loss of invertibility for र्ऋ (slp1 'rf') #75

Open drdhaval2785 opened 2 years ago

drdhaval2785 commented 2 years ago

What I did

from indic_transliteration import sanscript

s = 'नैर्ऋती'
print(s)
s1 = sanscript.transliterate(s, 'devanagari', 'slp1')
print(s1)
s2 = sanscript.transliterate(s1, 'slp1', 'devanagari')
print(s2)

Result

नैर्ऋती
nErftI
नैरृती

The display of these two Devanagari words differs across browsers and devices, so I pasted them into https://unicode.scarfboy.com/?s=%E0%A4%A8%E0%A5%88%E0%A4%B0%E0%A5%8D%E0%A4%8B%E0%A4%A4%E0%A5%80++%E0%A4%A8%E0%A5%88%E0%A4%B0%E0%A5%83%E0%A4%A4%E0%A5%80

The output is as follows: (screenshot: Screenshot_2022-09-21_18-12-49)

The result should be identical
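
Since the screenshot does not survive in plain text, the same comparison can be reproduced with Python's standard `unicodedata` module. This is only a minimal sketch: the helper name `describe` is made up for illustration, and the two strings are written as escapes taken from the URL above so that the result does not depend on how a font renders them.

```python
import unicodedata

def describe(text):
    """Return one 'U+XXXX : NAME' line per code point in text."""
    return [f"U+{ord(ch):04X} : {unicodedata.name(ch, 'UNKNOWN')}" for ch in text]

# Input string (ra + virama + independent vowel letter vocalic r)
original = "\u0928\u0948\u0930\u094D\u090B\u0924\u0940"
# Round-trip output (ra + dependent vowel sign vocalic r)
round_trip = "\u0928\u0948\u0930\u0943\u0924\u0940"

for label, text in (("original", original), ("round trip", round_trip)):
    print(label)
    for line in describe(text):
        print("  " + line)

# The stored data differs, independent of rendering.
print("identical:", original == round_trip)
```

The original has seven code points (RA, SIGN VIRAMA, LETTER VOCALIC R in the middle), while the round-trip output has six (RA, VOWEL SIGN VOCALIC R).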

vvasuki commented 2 years ago

This is not a transliteration (→ production of correct unicode devanAgarI encoding) problem. Rather it is a deficiency of the encoding you are using (SLP).

You see, र्+ऋ (ra virAma ऋ) and र + ृ (ra + R-mAtrA) have exactly the same encoding in SLP1 ('rf'). If you want to distinguish between them, don't use a deficient encoding system. (नैरृत) is the phonetically correct sequence:

U+0928 : DEVANAGARI LETTER NA
U+0948 : DEVANAGARI VOWEL SIGN AI
U+0930 : DEVANAGARI LETTER RA
U+0943 : DEVANAGARI VOWEL SIGN VOCALIC R
U+0924 : DEVANAGARI LETTER TA

drdhaval2785 commented 2 years ago

I think I couldn't make the question clear enough. This is not about a deficiency of the SLP1 encoding system. It is about a glyph which does not exist in the Devanagari script, but was introduced into Devanagari because of Unicode. 'rf' has always been written as र्+ऋ (ra virAma ऋ) in Devanagari, and has never been written as र + ृ (ra + R-mAtrA). No pre-Unicode book is known to have used र + ृ (ra + R-mAtrA). I would be happy to learn of such an occurrence, if any.

Therefore, what I am asking for is that the transliteration package should

  1. convert rf to र्+ऋ (ra virAma ऋ) (slp1->Devanagari)
  2. convert र्+ऋ (ra virAma ऋ) and र + ृ (ra + R-mAtrA) to 'rf' (Devanagari->slp1)
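
Point 1 could be approximated outside the package by post-processing its Devanagari output. The following is only a sketch of that idea, not the package's behavior or API: the function name `prefer_ra_virama_vocalic_r` is hypothetical, and the blanket replacement assumes every र + ृ should become र्+ऋ, including after conjuncts ending in र, which this thread does not settle. Point 2 needs no extra step if, as the result above suggests, the Devanagari-to-SLP1 direction already maps both spellings to 'rf'.

```python
# Code points involved (names as given by the Unicode character database)
RA = "\u0930"                # DEVANAGARI LETTER RA
VIRAMA = "\u094D"            # DEVANAGARI SIGN VIRAMA
LETTER_VOCALIC_R = "\u090B"  # DEVANAGARI LETTER VOCALIC R (independent vowel)
SIGN_VOCALIC_R = "\u0943"    # DEVANAGARI VOWEL SIGN VOCALIC R (dependent sign)

def prefer_ra_virama_vocalic_r(devanagari_text):
    """Hypothetical post-processor: rewrite ra + vowel sign vocalic r
    as ra + virama + independent vocalic r. Other text is untouched."""
    return devanagari_text.replace(
        RA + SIGN_VOCALIC_R,
        RA + VIRAMA + LETTER_VOCALIC_R,
    )

# What the package currently returns for 'nErftI' (slp1 -> devanagari)
package_output = "\u0928\u0948\u0930\u0943\u0924\u0940"
# The spelling this issue asks for
requested = "\u0928\u0948\u0930\u094D\u090B\u0924\u0940"

print(prefer_ra_virama_vocalic_r(package_output) == requested)
```

In use, the helper would wrap the result of `sanscript.transliterate(s1, 'slp1', 'devanagari')`.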

vvasuki commented 2 years ago

As per unicode conventions, र्+ऋ is represented by र + ृ (ra + R-mAtrA), same as any other consonant - vowel combination. It does not make any sense whatsoever to make an exception just for र + ृ, as far as representation in terms of bits goes. The job of a transliteration package is to give the correct bits as per unicode - that's it. How these bits are shown as pixels on your screen is the job of the "font" and "renderer" - well beyond the scope of this project. So, if you feel strongly about how this is shown, you should raise bugs against fonts and rendering engines. (Please close if you are convinced.)

drdhaval2785 commented 2 years ago

It is not the rendering that I am concerned about. I am concerned about the actual stored data. In a lossless transliteration scheme, I would reasonably expect the round trip to return the same result.

The supposed र + ृ (ra + R-mAtrA) way of writing does not exist in Devanagari as far as I know.

In the present state of the transliteration package, if I feed in the wrong, non-existent representation, the conversion is fully reversible. If I feed in the correct way of writing the data, it becomes irreversible.

drdhaval2785 commented 2 years ago

If anyone can show any pre-unicode book which has used the glyph to represent ‘rf’, I would rest my case.

vvasuki commented 2 years ago

> If anyone can show any pre-unicode book which has used the glyph to represent ‘rf’, I would rest my case.

We're arguing about slightly different cases then :-)

My case: unicode convention is fine. Showing on screen is not unicode's business. This package does its job properly (converting to unicode devanAgarI).

Your case: I don't like the way my computer shows this rf. It is unicode's fault.

Please switch to my case, as described above, and argue for or against it.