Closed drdhaval2785 closed 2 years ago
@vvasuki, do you know any reason why this may happen?
Your problem is unclear. What was the input?
Oh I see - https://github.com/indic-transliteration/indic_transliteration_py/issues/75 . Again underscores what I have said earlier, it is backward and confusing (being polite here) to store devanAgarI data in SLP or whatever approximation, when unicode devanAgarI is perfectly adequate and clear. If someone used devanAgarI to store data in book pages, there is no reason to magically determine that it is inadequate for storing on a hard disk. (Of course, I am not talking about what you use internally in a sandhi package or such.)
I completely understand the drawback of storing Devanagari data in SLP1 or any other non-Devanagari encoding. But its practical utility far outweighs the problem it poses. For example, Konsole on an Ubuntu-based system used to render Devanagari relatively well, but when I switched systems, the same package did not render it well.
Separating. Whereas the drawbacks which I have seen are as follows:
Another drawback of processing data in Devanagari Unicode is that python2 did not support Devanagari Unicode strings natively. One had to use some hacks to make it work, like using the codecs package and specifying the encoding, or marking the string as u'string'.
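A minimal sketch of the kind of workaround described above, written so that the same file runs under both python2 and python3. It uses io.open rather than the codecs package mentioned in the comment (a deliberate substitution, since io.open behaves the same on both versions), and the helper names write_text, read_text and demo_round_trip are hypothetical, made up for this illustration.

```python
# -*- coding: utf-8 -*-
# Makes plain string literals behave like u'...' on python2; no-op on python3.
from __future__ import unicode_literals

import io
import os
import tempfile


def write_text(path, text):
    """Write text to path as UTF-8, on python2 and python3 alike."""
    with io.open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


def read_text(path):
    """Read UTF-8 text from path and return a unicode string."""
    with io.open(path, "r", encoding="utf-8") as handle:
        return handle.read()


def demo_round_trip():
    """Return True if a Devanagari word survives a write/read cycle."""
    word = "\u0928\u093f\u0930\u0943\u0924\u093f"  # the word nirRti
    path = os.path.join(tempfile.mkdtemp(), "sample.txt")
    write_text(path, word)
    return read_text(path) == word
```

With the encoding stated explicitly at every file boundary, the Devanagari text itself needs no special handling inside the script.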
I am not sure about other languages. But there may be cases where ASCII-only text is better supported and the full Unicode range is not.
Therefore, handling Unicode data has been more difficult than handling ASCII-only data, at least for me.
Well, these are easily surmountable problems
I need to see data printed to terminal for some debugging, every now and then.
convert them just for processing within your scripts; but keep the data unambiguous - an exact (or better) copy of book data.
This is exactly what I did. And that is where this problem arose. Andhrabharati sent data in Devanagari. I converted it to SLP1 in a script for computational purposes. Calculations were done, and when I used the same package to convert it back to Devanagari, there was a difference in the data.
Stop using obsolete tech like python2.
I do not use python2, but needed to support it because the server on which this code runs did not have python3 for quite some time, or its default python was python2. I have not tracked the current status of python on the server. So we were more or less forced to keep the code compatible with both python2 and python3.
Let me clarify the situation with a clear example.
Devanagari - \u0930\u0943
Devanagari->slp1 by indic_transliteration package - rf
slp1->Devanagari by indic_transliteration package - \u0930\u0943
If I encode data wrongly, there is no issue.
Devanagari - \u0930\u094d\u090b
Devanagari->slp1 - rf
slp1->Devanagari - \u0930\u0943
Even though I entered the data with proper encoding in Devanagari, the round-trip gives different data.
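The two inputs in the example above can be inspected with the standard library alone. This is a small sketch, not part of indic_transliteration; the names describe, same_after_nfc, matra_form and virama_form are made up for illustration. It shows that the two sequences are made of different code points and that Unicode NFC normalization does not merge them, which is why a round trip through a single SLP1 spelling (rf) cannot return both.

```python
import unicodedata


def describe(text):
    """Return the Unicode character name of every code point in text."""
    return [unicodedata.name(ch) for ch in text]


def same_after_nfc(a, b):
    """Return True if a and b are identical after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# First case from the example: consonant followed by a dependent vowel sign.
matra_form = "\u0930\u0943"
# Second case: consonant, virama, then the independent vowel letter.
virama_form = "\u0930\u094d\u090b"

for label, text in (("matra", matra_form), ("virama", virama_form)):
    print(label, describe(text))

print("equal after NFC:", same_after_nfc(matra_form, virama_form))
```

Since both forms map to the same two SLP1 letters, the information about which form was used is lost at the first conversion, before the conversion back even starts.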
I feel that we are discouraging people from encoding data properly.
In the end, I accept that it is better to encode it as \u0930\u0943 to avoid the hassle.
When I want to search nirRti, in my head I say "nakaara, ikaara, repha, Rkaara, .." and not "repha virAma RkAra", and so will type \u0930\u0943 , and not insert a virAma in between. If I were to follow your opinion of "correct", the text won't show up in search. In other words, what is intuitively correct to you, is not correct from my view (described here).
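The search concern can be sketched as a folding step applied to both the stored text and the query before comparing. This is only an illustration under stated assumptions: fold_vocalic_r and contains are hypothetical helpers, not functions of indic_transliteration, and the rule covers only the one case discussed in this thread (virama followed by the independent vocalic R letter), not every independent vowel.

```python
import re

# Devanagari consonants KA..HA, then VIRAMA (U+094D), then the independent
# letter VOCALIC R (U+090B).
_VIRAMA_VOCALIC_R = re.compile("([\u0915-\u0939])\u094d\u090b")


def fold_vocalic_r(text):
    """Rewrite consonant + virama + vocalic R letter as consonant + vowel sign."""
    return _VIRAMA_VOCALIC_R.sub("\\1\u0943", text)


def contains(haystack, needle):
    """Substring search that ignores the difference between the two spellings."""
    return fold_vocalic_r(needle) in fold_vocalic_r(haystack)


# The word nirRti in both spellings.
typed_query = "\u0928\u093f\u0930\u0943\u0924\u093f"
stored_text = "\u0928\u093f\u0930\u094d\u090b\u0924\u093f"

print(typed_query in stored_text)         # plain search misses it
print(contains(stored_text, typed_query))  # folded search finds it
```

Folding at comparison time leaves the stored data untouched, so either spelling can be kept on disk while search still matches.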
The SLP round-trip test is no good, given its deficiency. It would be like converting dhaval to Tamil and back to get thaval! Also, you should not invent your own encodings; that's why we have the Unicode standard. If you are doubtful, post a challenge to the Unicode mailing list and get their feedback.
Ok.
For the time being I have applied patches in my code at various places to overcome this problem. So closing this issue. Keeping the indic_transliteration package issue open, lest someone should look at it some day.
I am not sure whether this has something to do with indic_transliteration package which I am using.