sanskrit-lexicon / COLOGNE

Development of http://www.sanskrit-lexicon.uni-koeln.de/

A very peculiar error associated with 'accent' marking preceding a vowel #407

Closed Andhrabharati closed 1 year ago

Andhrabharati commented 1 year ago

Just by chance landed at an MW entry, which led me in identifying this error!!

Occurrences of accent marks followed by vowels (in the text data) show a few types of display errors, like (a) missing letters or (b) taking huge time etc., where they do not instead call for textual corrections (typing or printing errors). This is seen in the following works: CAE, CCS, GRA, INM, MCI, MW, PW, PWG, PWKVN, SCH and STC.

In INM, MCI etc., the '/' character is not an accent mark, yet the timing issue is still noticed.

Incidentally, the first search result display for these is taking a "HELL of a lot of time" (sorry for the bad wording!!), from a few tens of seconds to a few minutes in some cases [apparently going into an 'eternal' loop].

The KRM, BOP etc., which have ^X^ as the superscript notation, seem to have no such error.

This case-insensitive regex may be used to get the list (in SLP1):

[^<][/\\\^][fxaiueo]
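A minimal sketch of how the regex quoted above could be run over the lines of a dictionary file, assuming Python's `re` module. The helper name `find_suspect_lines` and the sample lines are illustrative only, not part of the CDSL tooling; the first sample line is the PWG metaline quoted later in this thread.

```python
import re

# The pattern from the comment, compiled case-insensitively so that the
# SLP1 long vowels and diphthongs (F, X, A, I, U, E, O) are matched too.
ACCENT_BEFORE_VOWEL = re.compile(r"[^<][/\\\^][fxaiueo]", re.IGNORECASE)

def find_suspect_lines(lines):
    """Return (line_number, line) pairs for lines containing the pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if ACCENT_BEFORE_VOWEL.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits

# Illustrative input: one line with an accent before a vowel, one line with
# an accent before a consonant, and one line that only has a closing tag.
sample = [
    "<L>9201<pc>1-0689<k1>AreaGa<k2>Are/aGa",
    "{#sva\\stim#}",
    "<ls>ṚV. 6,1,12.</ls>",
]
for number, line in find_suspect_lines(sample):
    print(number, line)
```

The leading `[^<]` keeps closing tags such as `</ls>` out of the results, since there the `/` is preceded by `<`.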

[I recall reporting a similar error [involving the '|' (slp) character] in my early days of looking at MW at CDSL. It was corrected by @funderburkjim a few months after I posted the issue, once a reminder addressed directly to him was posted.]

Andhrabharati commented 1 year ago

I am not sure if this error occurs in the other works as well.

funderburkjim commented 1 year ago

@Andhrabharati Please provide a specific example so I can reproduce the error.

Andhrabharati commented 1 year ago

One quick example for the missing letter after the 'accent marker'--

image

<L>9201<pc>1-0689<k1>AreaGa<k2>Are/aGa
{#Are/aGa#}¦ ({#Are + aGa#}) <lex>adj.</lex> {%wovon Uebel fern ist%}: {#A\rea^GA a\sme Ba\drA sO^Srava\sAni^ santu#} 
<ls>ṚV. 6,1,12.</ls> {#sva\stim#} 
<ls n="ṚV. 6,">56,6.</ls>
<LEND>

image

[As already indicated earlier, I have seen that many of the matches for the regex given above needed textual corrections in the files.]

Andhrabharati commented 1 year ago

image

image

image

funderburkjim commented 1 year ago

This problem is peculiar to the PW dictionaries. Here is a little test:

slp1 = Are/agra, deva = आरे॑अग्र, deva1 = आरे꣫ग्र
slp1 = ita/Uti, deva = इत॑ऊति, deva1 = इत꣫ूति
slp1 = go/agra, deva = गो॑अग्र, deva1 = गो꣫ग्र
slp1 = go/fjIka, deva = गो॑ऋजीक, deva1 = गो꣫ृजीक

Based on the small test:

My memory is that the PWx transcoding was developed in order to display the accents (notably udAtta) in the manner of Böhtlingk.
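In the little test above, the faulty deva1 strings show a dependent vowel sign (matra) directly after the accent mark where the deva column has an independent vowel letter. A rough way to spot such strings is sketched below, assuming Python's `unicodedata` module; the function names are made up for illustration. It flags a vowel sign that is not directly preceded by a consonant letter, so it catches `ita/Uti` and `go/fjIka` but not the cases where a short `a` simply vanished (`Are/agra`, `go/agra`), because an inherent `a` leaves no sign behind.

```python
import unicodedata

def is_consonant(ch):
    # Devanagari consonant letters KA..HA occupy U+0915..U+0939.
    return "\u0915" <= ch <= "\u0939"

def is_vowel_sign(ch):
    # Dependent vowel signs carry "VOWEL SIGN" in their Unicode name.
    return "VOWEL SIGN" in unicodedata.name(ch, "")

def orphan_vowel_signs(text):
    """Return (index, char) for vowel signs not directly preceded by a consonant.

    Caveat: a consonant followed by a nukta and then a vowel sign would also
    be flagged; that case is ignored in this sketch.
    """
    orphans = []
    for i, ch in enumerate(text):
        if is_vowel_sign(ch) and (i == 0 or not is_consonant(text[i - 1])):
            orphans.append((i, ch))
    return orphans

# deva1 strings copied from the little test in this comment.
for slp1, deva1 in [("ita/Uti", "इत꣫ूति"), ("go/fjIka", "गो꣫ृजीक")]:
    for index, ch in orphan_vowel_signs(deva1):
        print(slp1, index, unicodedata.name(ch))
```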

@Andhrabharati Can you find a link to the repository and issue where this PWG devanagari was discussed?

The task now is to correct slp1_deva1.xml.

Andhrabharati commented 1 year ago

Is this (https://github.com/sanskrit-lexicon/PWG/issues/5#issuecomment-900759930) the one you wanted, @funderburkjim?

Or this one (https://github.com/sanskrit-lexicon/PWG/issues/5#issuecomment-895404247)?

Andhrabharati commented 1 year ago

BTW, my post above is not just about the missing letters; it is also about the typo/print errors and the timing issue, which I had seen in many works as listed above.

See for example,

image

and

image

[the errors are either in the metaline or the following headerline HW entry; and sometimes in the body matter as well.]

funderburkjim commented 1 year ago

changes to slp1_deva1.xml

After this change, the little test looks correct for deva1:

slp1 = Are/agra, deva = आरे॑अग्र, deva1 = आरे꣫अग्र
slp1 = ita/Uti, deva = इत॑ऊति, deva1 = इत꣫ऊति
slp1 = go/agra, deva = गो॑अग्र, deva1 = गो꣫अग्र
slp1 = go/fjIka, deva = गो॑ऋजीक, deva1 = गो꣫ऋजीक
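In the four corrected rows, the deva and deva1 columns differ only in which accent mark is used. Taking that as an assumption (it is read off these four rows, not stated as a rule in the thread), a small regression check could compare the two columns after removing the accent. The sketch below is in Python with illustrative function names; the strings are copied from the two test listings in this thread.

```python
# Accent marks as they appear in the deva and deva1 columns of the test.
DEVA_UDATTA = "॑"
DEVA1_UDATTA = "꣫"

def strip_accent(text, accent):
    """Remove every occurrence of the given accent mark."""
    return text.replace(accent, "")

def agrees(deva, deva1):
    """True when deva and deva1 are identical apart from the accent mark."""
    return strip_accent(deva, DEVA_UDATTA) == strip_accent(deva1, DEVA1_UDATTA)

# (slp1, deva, deva1) rows before and after the change to slp1_deva1.xml.
before = [
    ("Are/agra", "आरे॑अग्र", "आरे꣫ग्र"),
    ("ita/Uti", "इत॑ऊति", "इत꣫ूति"),
]
after = [
    ("Are/agra", "आरे॑अग्र", "आरे꣫अग्र"),
    ("ita/Uti", "इत॑ऊति", "इत꣫ऊति"),
]

for label, rows in (("before", before), ("after", after)):
    for slp1, deva, deva1 in rows:
        print(label, slp1, "ok" if agrees(deva, deva1) else "MISMATCH")
```

Unlike a check for stray vowel signs, this comparison also catches the rows where a short `a` went missing, since the stripped strings then differ in length.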

The pwg, pw, and pwkvn displays are changed, including in simple-search

image

I will consider this part of the n-fold problem of this issue finished.

funderburkjim commented 1 year ago

[^<][/\\\^][fxaiueo] corrections in MW

27 such lines were found and reviewed. 25 of them were corrected (see csl-orig commit above); the other 2 required no correction.

funderburkjim commented 1 year ago

(b) taking huge time etc.

@Andhrabharati please provide a specific example (or a couple of examples) so I can reproduce the problem.

funderburkjim commented 1 year ago

@Andhrabharati thanks for the two PWG accent Devanagari references. These were what I was looking for.

funderburkjim commented 1 year ago

count of the regex

count.txt counts the instances matching the regex [^<][/\\^][fxaiueoFXAIUEO] in the 37 dictionaries of csl-orig.

There are 0 instances in 19 of the dictionaries. The next step would be to look for errors (and non-error patterns) in the others.
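For reference, a per-dictionary count of this kind could be produced roughly as follows. This is only a sketch in Python: how count.txt was actually generated is not shown in the thread, and the dictionary codes and sample texts below are placeholders rather than real csl-orig data.

```python
import re
from collections import Counter

# The regex from this comment (explicit upper and lower case vowels).
PATTERN = re.compile(r"[^<][/\\^][fxaiueoFXAIUEO]")

def count_matches(texts):
    """Map each dictionary code to its number of non-overlapping matches."""
    counts = Counter()
    for code, text in texts.items():
        counts[code] = len(PATTERN.findall(text))
    return counts

# Placeholder input: code -> full text of that dictionary.
texts = {
    "dict_a": "Are/aGa go/agra",
    "dict_b": "<ls>x</ls>",
}
counts = count_matches(texts)
zero = [code for code, n in counts.items() if n == 0]
for code, n in sorted(counts.items()):
    print(code, n)
print("no instances:", zero)
```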

Andhrabharati commented 1 year ago

The count can be reduced further by adding a negated caret class after the [fxaiueo] class, which excludes the superscript notation (^X^). Also add the numerals to the initial bracket.

[^<0-9][/\\^][fxaiueoFXAIUEO][^\^]

BTW, I think the GRA instances are mostly print errors, not typos. They seem to have the accent mark preceding the vowel, not after the vowel (which is the regular way).
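The effect of the narrower regex can be checked on a few strings. The sketch below, in Python, compares the two patterns from this thread; apart from `Are/aGa`, the sample strings are made up for illustration. The third pattern, using a negative lookahead, is not from the thread; it is shown only because the trailing `[^\^]` consumes one character and therefore cannot match when the vowel is the last character of the text being searched.

```python
import re

BROAD = re.compile(r"[^<][/\\^][fxaiueoFXAIUEO]")
NARROW = re.compile(r"[^<0-9][/\\^][fxaiueoFXAIUEO][^\^]")
# Hypothetical variant: same intent, but the "not followed by ^" test
# does not consume a character.
NARROW_LOOKAHEAD = re.compile(r"[^<0-9][/\\^][fxaiueoFXAIUEO](?!\^)")

samples = [
    "Are/aGa",   # accent before a vowel: matched by all three
    "M^e^ X",    # superscript notation ^X^: only the broad pattern matches
    "6/a ",      # digit before the slash: only the broad pattern matches
    "</ab>",     # closing tag: matched by none
    "go/a",      # vowel at the very end: NARROW misses it
]

for text in samples:
    flags = [
        name
        for name, pattern in (
            ("broad", BROAD),
            ("narrow", NARROW),
            ("lookahead", NARROW_LOOKAHEAD),
        )
        if pattern.search(text)
    ]
    print(repr(text), flags)
```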

Andhrabharati commented 1 year ago

> (b) taking huge time etc.
>
> @Andhrabharati please provide a specific example (or a couple of examples) so I can reproduce the problem.

@funderburkjim I can no longer reproduce the timing error that I had noticed for the words found with the earlier regex.

I did notice it for two days when I posted the issue; and the search for any other word(s) was instantaneous at CDSL and every other site was normal. So, I am sure it was not a network/connection issue at that time.

Probably, we may ignore the timing issue for now. [If it ever occurs again, it can then be looked into.]

funderburkjim commented 1 year ago

changes completed

Changes made are in the various files in the directory issues/407/changes/.