sanskrit-lexicon / LRV

Convert the data of L R Vaidya Sanskrit-English dictionary to CDSL format

transliteration error #6

Closed drdhaval2785 closed 2 years ago

drdhaval2785 commented 2 years ago

I am not sure whether this has something to do with indic_transliteration package which I am using.

A

Screenshot_2022-09-21_18-04-53

B

Screenshot_2022-09-21_18-06-29

drdhaval2785 commented 2 years ago

@vvasuki, do you know any reason why this may happen?

vvasuki commented 2 years ago

Your problem is unclear. What was the input?

vvasuki commented 2 years ago

Oh I see - https://github.com/indic-transliteration/indic_transliteration_py/issues/75 . This again underscores what I have said earlier: it is backward and confusing (being polite here) to store devanAgarI data in SLP or some other approximation when Unicode devanAgarI is perfectly adequate and clear. If devanAgarI was good enough to store the data on book pages, there is no reason to magically decide that it is inadequate for storing on a hard disk. (Of course, I am not talking about what you use internally in a sandhi package or such.)

drdhaval2785 commented 2 years ago

I completely understand the drawbacks of storing Devanagari data in SLP1 or any other non-Devanagari encoding. But its practical utility far outweighs the problems it poses. For example,

  1. There is no terminal that displays Devanagari text well. I used to use Konsole on an Ubuntu-based system, which rendered Devanagari relatively well, but when I moved to a different system, the same package did not render it well.
  2. Separating consonants and vowels in Unicode Devanagari data is a nontrivial task, but it is very much required for string manipulation.
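To illustrate why the separation in point 2 takes some work, here is a minimal sketch that splits a Devanagari string into consonant and vowel units using only the standard library. The function name `split_phonemes` and the small vowel-sign table are assumptions made for this illustration; the sketch ignores nukta, anusvara, visarga, accents, and long vocalic vowels, and it is not how any particular package does it.

```python
VIRAMA = "\u094d"
INHERENT_A = "\u0905"

# Dependent vowel signs mapped to the matching independent vowel letters.
# Deliberately incomplete: a simplified table for illustration only.
MATRA_TO_VOWEL = {
    "\u093e": "\u0906",  # aa
    "\u093f": "\u0907",  # i
    "\u0940": "\u0908",  # ii
    "\u0941": "\u0909",  # u
    "\u0942": "\u090a",  # uu
    "\u0943": "\u090b",  # vocalic r
    "\u0947": "\u090f",  # e
    "\u0948": "\u0910",  # ai
    "\u094b": "\u0913",  # o
    "\u094c": "\u0914",  # au
}


def is_consonant(ch):
    """True for the Devanagari consonant letters KA (U+0915) to HA (U+0939)."""
    return "\u0915" <= ch <= "\u0939"


def split_phonemes(text):
    """Split text into bare consonants (letter + virama) and independent vowels."""
    units = []
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if is_consonant(ch):
            units.append(ch + VIRAMA)
            if nxt == VIRAMA:
                i += 2  # explicit virama: no vowel follows this consonant
            elif nxt in MATRA_TO_VOWEL:
                units.append(MATRA_TO_VOWEL[nxt])
                i += 2  # vowel sign attached to the consonant
            else:
                units.append(INHERENT_A)
                i += 1  # no sign at all: the inherent "a"
        else:
            units.append(ch)  # independent vowels and anything unhandled
            i += 1
    return units


print(split_phonemes("\u0927\u0935\u0932"))
```

The point of the sketch is that a vowel can be present in three ways (inherent, as a dependent sign, or as an independent letter), so a plain character-by-character loop is not enough.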

Whereas the drawbacks which I have seen are as follows:

  1. Difficult to comprehend
  2. Not very soothing to the native reader of Sanskrit, who would prefer Devanagari, nor to the non-native reader, who would prefer IAST

drdhaval2785 commented 2 years ago

Another drawback of processing data in Unicode Devanagari is that Python 2 did not handle Devanagari Unicode strings natively. One had to use some hacks to make it work, such as using the codecs package with an explicit encoding, or marking the string as u'string'.
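For reference, a minimal sketch of one common way to keep such code working on both Python 2 and Python 3: `u`-prefixed literals plus `io.open` with an explicit encoding. The helper names and the file name `sample.txt` are hypothetical and chosen only for this illustration.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def write_text(path, text):
    """Write a Unicode string as UTF-8; io.open takes encoding= on Python 2 and 3."""
    with io.open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


def read_text(path):
    """Read a UTF-8 file back into a Unicode string."""
    with io.open(path, "r", encoding="utf-8") as handle:
        return handle.read()


# The u prefix is required on Python 2 and accepted on Python 3.3+.
word = u"\u0928\u093f\u0930\u0943\u0924\u093f"

tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.txt")  # hypothetical file name
write_text(sample_path, word)
restored = read_text(sample_path)

print(restored == word)
print(len(restored))
```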

I am not sure about other languages, but there may be similar cases where ASCII-only text is well supported and the full Unicode range is not.

Therefore, handling Unicode data has been more difficult than handling ASCII-only data, at least for me.

vvasuki commented 2 years ago

Well, these are easily surmountable problems.

drdhaval2785 commented 2 years ago

I need to see data printed to terminal for some debugging, every now and then.

> convert them just for processing within your scripts; but keep the data unambiguous - an exact (or better) copy of book data.

This is exactly what I did, and that is where this problem arose. Andhrabharati sent the data in Devanagari. I converted it to SLP1 in a script for computational purposes. The calculations were done, and when I used the same package to convert the result back to Devanagari, the data differed from the original.

> Stop using obsolete tech like python2.

I do not use Python 2, but I needed to support it because the server on which this code runs did not have Python 3 for quite some time, or its default python was Python 2. I have not tracked the current status of Python on the server. So we were more or less forced to keep the code compatible with both Python 2 and Python 3.

drdhaval2785 commented 2 years ago

Let me clarify the situation with a clear example.

Item to encode in Devanagari

Screenshot_2022-09-22_09-31-32

Case 1 - Wrong way of encoding in my opinion

  1. Devanagari - \u0930\u0943
  2. Devanagari->slp1 by indic_transliteration package - rf
  3. slp1->Devanagari by indic_transliteration package - \u0930\u0943

If I encode the data wrongly, the round trip returns the same data, so no issue shows up.

Case 2 - Correct way of encoding in my opinion

  1. Devanagari - \u0930\u094d\u090b
  2. Devanagari->slp1 - rf
  3. slp1->Devanagari - \u0930\u0943

Even though I entered the data with proper encoding in Devanagari, the round-trip gives different data.
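The two encodings in Case 1 and Case 2 can be inspected with the standard library alone. This is a small sketch, with `describe` as a hypothetical helper name, that prints the code points and their Unicode names and checks whether NFC normalization makes the two strings equal.

```python
import unicodedata

case_1 = "\u0930\u0943"        # Case 1: letter RA + vowel sign
case_2 = "\u0930\u094d\u090b"  # Case 2: letter RA + virama + independent vowel


def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [("U+%04X" % ord(ch), unicodedata.name(ch)) for ch in text]


for label, text in (("Case 1", case_1), ("Case 2", case_2)):
    print(label)
    for code, name in describe(text):
        print("  ", code, name)

# Compare the two strings after canonical composition (NFC).
same_after_nfc = (
    unicodedata.normalize("NFC", case_1) == unicodedata.normalize("NFC", case_2)
)
print("Equal after NFC:", same_after_nfc)
```

The comparison comes out unequal, so as far as Unicode normalization is concerned these are two distinct strings, and any equivalence between them has to be decided by the application.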

I feel that we are discouraging people from encoding data properly. In the end I would accept that it is better to encode it as \u0930\u0943 to avoid the hassle.

vvasuki commented 2 years ago

When I want to search nirRti, in my head I say "nakaara, ikaara, repha, Rkaara, .." and not "repha virAma RkAra", and so will type \u0930\u0943 , and not insert a virAma in between. If I were to follow your opinion of "correct", the text won't show up in search. In other words, what is intuitively correct to you, is not correct from my view (described here).
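The search argument can be checked with a plain substring test. In this sketch the two stored spellings of the word are constructed by hand for illustration; only the query \u0930\u0943 comes from the comment above.

```python
# The same word stored in the two competing encodings (built by hand here).
stored_with_vowel_sign = "\u0928\u093f\u0930\u0943\u0924\u093f"
stored_with_virama = "\u0928\u093f\u0930\u094d\u090b\u0924\u093f"

# What a user would type for "repha, Rkaara": RA followed by the vowel sign.
query = "\u0930\u0943"

found_in_vowel_sign_form = query in stored_with_vowel_sign
found_in_virama_form = query in stored_with_virama

print(found_in_vowel_sign_form)  # True
print(found_in_virama_form)      # False
```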

The SLP round trip test is no good given its deficiency. It would be like converting dhaval to Tamil and back to get thaval! Also, you should not invent your own encodings - that's why we have the "unicode standard". If you are doubtful, post a challenge to the unicode mailing list, and get their feedback.
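The round-trip loss can be reproduced with a toy transliterator. The sketch below is not the indic_transliteration implementation: `toy_to_slp1`, `toy_from_slp1`, and `find_lossy` are hypothetical names, and the tables cover only the few letters needed here. It shows that once two Devanagari spellings map to the same SLP1 string, no inverse can recover both, and it offers one way to flag such strings before processing.

```python
VIRAMA = "\u094d"
CONSONANTS = {"\u0928": "n", "\u0930": "r", "\u0924": "t"}
VOWELS = {"\u0905": "a", "\u0907": "i", "\u090b": "f"}   # independent letters
MATRAS = {"\u093f": "i", "\u0943": "f"}                  # dependent signs

SLP_CONSONANTS = {v: k for k, v in CONSONANTS.items()}
SLP_VOWELS = {v: k for k, v in VOWELS.items()}
SLP_MATRAS = {v: k for k, v in MATRAS.items()}


def toy_to_slp1(text):
    """Toy Devanagari -> SLP1 for the handful of letters in the tables above."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            if nxt == VIRAMA:
                i += 2
            elif nxt in MATRAS:
                out.append(MATRAS[nxt])
                i += 2
            else:
                out.append("a")  # inherent vowel
                i += 1
        elif ch in VOWELS:
            out.append(VOWELS[ch])
            i += 1
        else:
            raise ValueError("character not covered by the toy tables: %r" % ch)
    return "".join(out)


def toy_from_slp1(text):
    """Toy SLP1 -> Devanagari; a vowel after a consonant becomes a vowel sign."""
    out = []
    after_consonant = False
    for ch in text:
        if ch in SLP_CONSONANTS:
            if after_consonant:
                out.append(VIRAMA)  # consonant cluster
            out.append(SLP_CONSONANTS[ch])
            after_consonant = True
        elif ch == "a":
            if not after_consonant:
                out.append(SLP_VOWELS["a"])
            after_consonant = False  # inherent "a" needs no sign
        elif ch in SLP_VOWELS:
            out.append(SLP_MATRAS[ch] if after_consonant else SLP_VOWELS[ch])
            after_consonant = False
        else:
            raise ValueError("character not covered by the toy tables: %r" % ch)
    if after_consonant:
        out.append(VIRAMA)
    return "".join(out)


def find_lossy(words):
    """Return the words that do not survive the Devanagari -> SLP1 -> Devanagari trip."""
    return [w for w in words if toy_from_slp1(toy_to_slp1(w)) != w]


samples = ["\u0930\u0943", "\u0930\u094d\u090b", "\u0928\u093f\u0930\u0943\u0924\u093f"]
print([toy_to_slp1(w) for w in samples])
print(find_lossy(samples))
```

Running a check like `find_lossy` over the source data before converting would at least list the entries whose original spelling cannot be restored from SLP1 alone.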

drdhaval2785 commented 2 years ago

Ok.

For the time being I have applied patches at various places in my code to work around this problem, so I am closing this issue. I am keeping the indic_transliteration package issue open in case someone looks at it some day.