Transcoding Bug? - Githubissues

funderburkjim commented 2 years ago

Between the earlier versions (L0 or L1) and the consolidated (L2) version of INM, the Devanagari spelling differs for 6 metalines:

L0 : <L>7504<pc>496-1<k1>नैरृत<k2>नैरृत  and similarly for 7505, 7506
L2 : <L>7504<pc>496-1<k1>नैर्ऋत<k2>नैर्ऋत  and similarly for 7505, 7506

L0: <L>7507<pc>496-2<k1>नैरृति<k2>नैरृति 
L2: <L>7507<pc>496-2<k1>नैर्ऋति<k2>नैर्ऋति

L0: <L>7713<pc>519-1<k1>निरृति<k2>निरृति<h>1   and similarly for 7714
L2: <L>7713<pc>519-1<k1>निर्ऋति<k2>निर्ऋति<h>1

The transcoding to SLP1 is the same for each Devanagari:

नैरृत -> nErfta -> नैरृत
नैर्ऋत -> nErfta -> नैरृत

The fact that 2 Devanagari spellings get converted to the same slp1 spelling breaks invertibility of transcoding.
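The collision can be reproduced with a minimal sketch in Python. The lookup tables and the function name `deva_to_slp1` below are illustrative assumptions covering only the characters in this example; they are not the contents of deva_slp1.xml or the Cologne implementation.

```python
# Toy Devanagari -> SLP1 transcoder (illustrative assumption, not deva_slp1.xml).
CONSONANTS = {"\u0928": "n", "\u0930": "r", "\u0924": "t"}    # NA, RA, TA
VOWEL_SIGNS = {"\u0948": "E", "\u0943": "f", "\u093F": "i"}   # signs AI, VOCALIC R, I
VOWEL_LETTERS = {"\u090B": "f", "\u0907": "i"}                # letters VOCALIC R, I
VIRAMA = "\u094D"


def deva_to_slp1(text):
    """Transcode a Devanagari string made only of the characters above."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            if nxt in VOWEL_SIGNS:
                out.append(VOWEL_SIGNS[nxt])  # consonant + dependent vowel sign
                i += 1
            elif nxt == VIRAMA:
                i += 1                        # bare consonant, no vowel emitted
            else:
                out.append("a")               # inherent vowel
        elif ch in VOWEL_LETTERS:
            out.append(VOWEL_LETTERS[ch])     # independent vowel letter
        else:
            raise ValueError("character not covered by this sketch: %r" % ch)
        i += 1
    return "".join(out)


WITH_SIGN = "\u0928\u0948\u0930\u0943\u0924"          # RA + VOWEL SIGN VOCALIC R
WITH_VIRAMA = "\u0928\u0948\u0930\u094D\u090B\u0924"  # RA + VIRAMA + LETTER VOCALIC R

print(deva_to_slp1(WITH_SIGN), deva_to_slp1(WITH_VIRAMA))
```

Both inputs produce the same SLP1 string because the sign and the letter for vocalic R share the single SLP1 character `f`, and the virama leaves no trace in the output.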

Are the two Devanagari spellings 'equivalent' or do they represent different Sanskrit words?

What other pairs of Devanagari Unicode strings transcode to the same slp1 spelling?

Should we consider this to be a bug in the transcoding file deva_slp1.xml ?
If so, what is the solution?

Andhrabharati commented 2 years ago

You had landed on a good point @funderburkjim!

I would like to bring your attention to one of my recent posts on the same point- https://github.com/sanskrit-lexicon/csl-devanagari/issues/34#issuecomment-968267219

Before I say something, or we decide to do something, I request you to kindly post a message to Peter Scharf himself about this matter.

Andhrabharati commented 2 years ago

From my side, I had taken the L2 text to be the correct form, as per the customary (but undocumented) norm.

Andhrabharati commented 2 years ago

And this is one of the very few Classical (post-Vedic) Sanskrit words where a vowel letter (here, ऋ) appears inside a word (i.e., not at the beginning). [AFAIK, Vedic Sanskrit has quite a few words of this type; but that is a completely different "domain" altogether.]

funderburkjim commented 2 years ago

Yes, I will request Peter's input on this.
I know you only by user name @Andhrabharati . What is your name?

Andhrabharati commented 2 years ago

And the word formation is thus- निः + ऋत > निर् + ऋत > निर्ऋत.

Andhrabharati commented 2 years ago

> Yes, I will request Peter's input on this. I know you only by user name @Andhrabharati . What is your name?

K Nagabhushana Rao, and my mail id is knbrao@gmail.com

funderburkjim commented 2 years ago

Ok. Thanks Mr. Rao.

Andhrabharati commented 2 years ago

> The fact that 2 Devanagari spellings get converted to the same slp1 spelling breaks invertibility of transcoding.

As I understand it, this issue arises because slp1 has no separate encoding for the mAtra (dependent vowel sign) character; it just uses the same letter as for the corresponding vowel.

I had brought this point to the notice of @drdhaval2785 some time back, when a standalone 'i' mAtra in a PWG entry was wrongly turned into the vowel letter on the round trip from slp1 (base Cologne text) to Devanagari (my correction) and back to slp1 (Dhaval's back conversion).

Andhrabharati commented 2 years ago

It was during the time when I was filling the '??' missing portions in all CDSL texts.

Andhrabharati commented 2 years ago

And this is the mail transaction on the point-

'------------------

good, so I got some brain-storming challenging work for the 'invertibility school' (i.e. Jim and you) now.

so far as I remember from what I read, slp1 talks only about vowels, consonants and accents, but not about the mAtrAs. (I might have missed it, if it is mentioned somewhere.)

and though rare in dictionaries, the mAtrAs need to be handled, esp. in textbooks etc. which are the basic learning books.

On Wed, 8 Sep 2021, 17:27 Dhaval Patel, drdhaval2785@gmail.com wrote:

It is one of those irreversible changes while going to Devanagari and coming back. Will have to work out something to avoid such error again.

Dr. Dhaval Patel

On Wed, 8 Sep 2021, 17:20 Nagabhushana Rao K, knbrao@gmail.com wrote:

Just seen in the updated list of github repo-

1217767-1796राणिराणि
Old: {#राणि#}¦, hier und in {#पैलादि#} ist der Haken über dem ({#ि#}) abgebrochen.
New: {#राणि#}¦, hier und in {#पैलादि#} ist der Haken über dem ({#इ#}) abgebrochen.
(The wording means "the hook above is broken")

Here the correction is talking about the missing mAtrA mark, not the vowel. See the correction text in V.7 and the existing text in V.6-333 that is referred to.

BTW, I did not get a message for this; I just happened to notice it after the repo update on my local desktop.

Andhrabharati commented 2 years ago

And then I had suggested this-

Just one idea: if the Cologne processes agree to allow hex character notation, this ' ि ' might be written as '&#x93F;', and it should then be possible to retain it across all conversions.

Though it looks a bit odd in the text file, the html display would be alright.
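A small Python sketch of what such a hex character reference looks like and how it decodes for display. The helper name `to_hex_reference` is made up for illustration, and the sketch says nothing about how the Cologne processes would have to protect the reference during transcoding.

```python
import html

VOWEL_SIGN_I = "\u093F"  # DEVANAGARI VOWEL SIGN I, the isolated mAtrA


def to_hex_reference(ch):
    """Return the HTML hexadecimal character reference for one character."""
    return "&#x{:X};".format(ord(ch))


reference = to_hex_reference(VOWEL_SIGN_I)
decoded = html.unescape(reference)  # what a browser would display

print(reference)
print(decoded == VOWEL_SIGN_I)
```

The reference is plain ASCII in the text file, and it identifies the sign unambiguously because it names the code point rather than the sound.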

funderburkjim commented 2 years ago

Here is a curious example with two different fonts.

Look up 'nirfta' (slp1) in ap90, with Devanagari output. This is using the Siddhanta1 font.

Now copy and paste the devanagari from that display into this comment, and make a snapshot:

निरृत

[Image not preserved: snapshot of the pasted Devanagari as displayed in this comment]

I am using the Edge browser, and inspecting the Devanagari shows that the Nirmala UI font is being used.

There appear to be some gremlins somewhere playing games with us!

funderburkjim commented 2 years ago

unicode comparison

The Devanagari strings are copied from the first comment above and pasted into a Python program which shows each Unicode character.

INPUT = नैर्ऋति
0928 | न | DEVANAGARI LETTER NA
0948 | ै | DEVANAGARI VOWEL SIGN AI
0930 | र | DEVANAGARI LETTER RA
094D | ् | DEVANAGARI SIGN VIRAMA     <<<< This is the difference
090B | ऋ | DEVANAGARI LETTER VOCALIC R
0924 | त | DEVANAGARI LETTER TA
093F | ि | DEVANAGARI VOWEL SIGN I

INPUT = नैरृति
0928 | न | DEVANAGARI LETTER NA
0948 | ै | DEVANAGARI VOWEL SIGN AI
0930 | र | DEVANAGARI LETTER RA
0943 | ृ | DEVANAGARI VOWEL SIGN VOCALIC R
0924 | त | DEVANAGARI LETTER TA
093F | ि | DEVANAGARI VOWEL SIGN I
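The program used for the listing above was not posted. A minimal sketch that produces the same kind of output with the standard `unicodedata` module might look like this (the function name `describe` is an assumption):

```python
import unicodedata


def describe(text):
    """Return one 'CODE | char | NAME' line per Unicode character in text."""
    return [
        "{:04X} | {} | {}".format(ord(ch), ch, unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


SAMPLES = [
    "\u0928\u0948\u0930\u094D\u090B\u0924\u093F",  # with VIRAMA + LETTER VOCALIC R
    "\u0928\u0948\u0930\u0943\u0924\u093F",        # with VOWEL SIGN VOCALIC R
]

for sample in SAMPLES:
    print("INPUT =", sample)
    for line in describe(sample):
        print(line)
    print()
```

Listing code points this way sidesteps the font question entirely, since two strings that render alike in one font can still be told apart.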

Andhrabharati commented 2 years ago

yes; the virAma character is inserted by me to retain the vowel character as is, instead of being treated as a mAtra character.

funderburkjim commented 2 years ago

Does SLP1 have a character for Virama?
The current transcoding from slp1 to Devanagari inserts a DEVANAGARI SIGN VIRAMA where appropriate.

Looking at https://sanskritlibrary.org/Sanskrit/pub/lies_sl.pdf at pdf p. 173 and the table at pdf p. 218, it appears that SLP1 uses the exclamation mark ! for virama.

The transcoding xml files do not currently use ! at all. It might be possible to augment slp1_deva.xml so that नैर्ऋति (first example above) would correspond to 'nEr!fti'.

Not sure if this augmentation would be simple, or if it is worthwhile to spend time perfecting slp1_deva.xml and the inverse deva_slp1.xml.
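To make the proposed augmentation concrete, here is a rough Python sketch of an SLP1 to Devanagari transcoder that treats `!` as an explicit virama. It rests on several assumptions: the reading of `!` as virama is only what the PDF appears to say, the tables cover just the characters in this example, and nothing here reflects the actual format or state machine of slp1_deva.xml.

```python
# Toy SLP1 -> Devanagari transcoder with '!' as explicit virama (assumption).
CONSONANTS = {"n": "\u0928", "r": "\u0930", "t": "\u0924"}
VOWELS = {  # slp1 vowel -> (independent letter, dependent sign)
    "a": ("\u0905", ""),        # inherent vowel has no sign
    "i": ("\u0907", "\u093F"),
    "f": ("\u090B", "\u0943"),
    "E": ("\u0910", "\u0948"),
}
VIRAMA = "\u094D"
EXPLICIT_VIRAMA = "!"


def slp1_to_deva(text):
    out = []
    pending = False  # True while the last consonant still awaits its vowel
    for ch in text:
        if ch in CONSONANTS:
            if pending:
                out.append(VIRAMA)          # consonant cluster
            out.append(CONSONANTS[ch])
            pending = True
        elif ch in VOWELS:
            letter, sign = VOWELS[ch]
            out.append(sign if pending else letter)
            pending = False
        elif ch == EXPLICIT_VIRAMA:
            if not pending:
                raise ValueError("'!' must follow a consonant")
            out.append(VIRAMA)              # forces the next vowel to be a letter
            pending = False
        else:
            raise ValueError("character not covered by this sketch: %r" % ch)
    if pending:
        out.append(VIRAMA)                  # word-final consonant
    return "".join(out)


print(slp1_to_deva("nErfti"))
print(slp1_to_deva("nEr!fti"))
```

With the extra character, the two Devanagari spellings get distinct SLP1 spellings, which is what invertibility would require; whether that is desirable is a separate question.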

drdhaval2785 commented 2 years ago

I think it is not about inserting a viraama.

It is about an idiosyncrasy of fonts. Siddhanta shows the same Devanagari representation well, and some other fonts do not. I guess we should not add viraama where none actually exists, just to educate a dumb font. Better to use a good font.

Andhrabharati commented 2 years ago

I go with Dhaval's suggestion.

And this is clearly visible in the screenshot in my cited post-- the rendering is good in one font and not in another font (highlighted) in the same screen!

funderburkjim commented 2 years ago

Also of interest is Apte's Devanagari

As shown above, this agrees with the glyphs generated by the Siddhanta1 font from the Unicode string without virama. This is consonant with the comments from @drdhaval2785 and @Andhrabharati above.

Peter commented

proper Devanagari ... should not use a virama except at the end of speech.

Peter also offered to share his current transcoding files (slp1_deva.xml and deva_slp1.xml), and I requested that he also share the corresponding Java code from Ralph Bunker, which his software uses to interpret the xml. Peter thinks that the non-invertibility that my PHP/Python implementation encounters in this example is solved by Ralph's implementation. It would be good if my implementation agreed with Ralph's. Time will tell whether this compatibility can be accomplished.

I don't see the need for doing anything immediately on this issue, so am closing it.

gasyoun commented 2 years ago

> we should not add viraama where none exists actually, just to educate a dumb font

Agree.

sanskrit-lexicon / INM

Transcoding Bug? #3
