sanskrit-lexicon / CORRECTIONS

Correction history for Cologne Sanskrit Lexicon

PW IAST corrections #419

Open funderburkjim opened 6 years ago

funderburkjim commented 6 years ago

In the PW dictionary, a relatively small number of words appear in IAST spellings; for example: image

Some of these have spelling errors in the Cologne digitization: image

This issue is devoted to correcting such spelling errors.

gasyoun commented 6 years ago

Some of these have spelling errors

This is a result of manual checking, right?

funderburkjim commented 6 years ago

<is> tag.

These cases are identified in the current digitization by the <is> tag. The reason Thomas originally coded these words is that, as the print example shows, they appear with wide letter spacing. Thomas's original coding was converted to the current <is> xml-type tag: <is>Agastiya</is>.

the cases of <is> tag

There are 4858 distinct text instances of the <is> tag. We want to find spelling errors. It is expected that many of these 4858 instances are spelled correctly. One way to make a separation into cases which are probably correctly spelled and cases which possibly are incorrectly spelled is to make use of a list of known correctly spelled words. For this purpose, we are using the headwords of MW (193,000 distinct such headwords).

After converting the IAST words to lower case, and then transcoding from IAST to slp1, we can compare to the list of MW headwords. The result is that 3273 of the words are recognized as MW headwords (and are therefore probably correctly spelled), while 1585 of the words are not so recognized and therefore need further examination.
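The separation described above can be sketched roughly as follows. This is only an illustration, not the program actually used: the function names `iast_to_slp1` and `split_by_mw` are made up here, and the transcoding table is the standard IAST/SLP1 correspondence typed in by hand. Note that the greedy digraph matching will misread a genuine a+i or a+u hiatus as the diphthongs ai/au.

```python
import unicodedata

# IAST -> SLP1 table (hand-typed; digraphs are matched before single letters).
_IAST_TO_SLP1 = {
    "ai": "E", "au": "O",
    "kh": "K", "gh": "G", "ch": "C", "jh": "J",
    "ṭh": "W", "ḍh": "Q", "th": "T", "dh": "D", "ph": "P", "bh": "B",
    "ā": "A", "ī": "I", "ū": "U", "ṛ": "f", "ṝ": "F", "ḷ": "x", "ḹ": "X",
    "ṃ": "M", "ḥ": "H", "ṅ": "N", "ñ": "Y",
    "ṭ": "w", "ḍ": "q", "ṇ": "R", "ś": "S", "ṣ": "z",
}
# Normalise the keys so composed and decomposed input behave the same.
IAST_TO_SLP1 = {unicodedata.normalize("NFC", k): v
                for k, v in _IAST_TO_SLP1.items()}


def iast_to_slp1(word):
    """Lower-case an IAST word and transcode it to SLP1."""
    text = unicodedata.normalize("NFC", word).lower()
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in IAST_TO_SLP1:
            out.append(IAST_TO_SLP1[pair])
            i += 2
        else:
            # Letters that are the same in both schemes pass through.
            out.append(IAST_TO_SLP1.get(text[i], text[i]))
            i += 1
    return "".join(out)


def split_by_mw(iast_words, mw_headwords):
    """Split words into (recognized, unrecognized) against a set of SLP1 headwords."""
    recognized, unrecognized = [], []
    for word in iast_words:
        slp1 = iast_to_slp1(word)
        target = recognized if slp1 in mw_headwords else unrecognized
        target.append((word, slp1))
    return recognized, unrecognized
```

Here `mw_headwords` is assumed to be a Python set of SLP1 headword strings loaded beforehand.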

These two lists are in this gist

Each line shows

There is also an html file for the nonmw list. This contains a link to PW basic display for each PW headword where the questionable IAST spelling occurs.

funderburkjim commented 6 years ago

Suggestion for correction

Make a local copy of the pwis_notmw.txt file, and also of the pwis_notmw.html file.

Indicate corrections in the pwis_notmw.txt file by adding a 4th field with the correct spelling in SLP1 form.

A post-processing program can convert the SLP1 correction back to IAST. It is probably easier (for @drdhaval2785, at least) to enter the correction in SLP1 than to type the diacritics required in many of the IAST spellings.

Then submit the corrected file back to me. I'll convert these to standard 'updateByLine' old/new corrections for PW, and install the corrections.

Don't worry about whether the correction is a typo or print error. Probably almost all are typos.

drdhaval2785 commented 6 years ago

One more possibility to reduce the list.

  1. Unique German (or French?) tendency to use 'k' instead of 'c'.

E.g. pracetas - praketas, paYcagavya - paNkagavya, etc.

If we replace k with c and the resulting word is found in the MW headword list, it can be listed as auto-corrected.
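A minimal sketch of this k-to-c check, assuming the MW headwords are available as a set of SLP1 strings (the function names are hypothetical). It tries each single k as well as all of them together. It only covers the plain substitution: praketas is caught, but paNkagavya would also need N changed to Y and is not handled here.

```python
def k_to_c_candidates(slp1):
    """Spellings obtained by changing one 'k' (or all of them) to 'c'."""
    positions = [i for i, ch in enumerate(slp1) if ch == "k"]
    candidates = {slp1[:i] + "c" + slp1[i + 1:] for i in positions}
    if positions:
        candidates.add(slp1.replace("k", "c"))
    return candidates


def autocorrect_k_to_c(slp1, mw_headwords):
    """Return the candidate spellings that are known MW headwords, sorted."""
    return sorted(c for c in k_to_c_candidates(slp1) if c in mw_headwords)
```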

More observations to reduce the list will be enumerated as and when I encounter such tendencies that are manageable programmatically.

drdhaval2785 commented 6 years ago

screenshot_20180723-142317_samsung internet

funderburkjim commented 6 years ago

Autocorrection 1

pwis_notmw1.txt has been added to the gist.

This contains the same list of 1585 words as in pwis_notmw.txt , but with 179 autocorrections. The autocorrections are generated by the rules:

These rules were applied to slp1 spelling of each of the 1585; if one of the rules resulted in a new spelling which matched an MW headword, then this was indicated in the output (pwis_notmw1.txt) by

@drdhaval2785 This should help a bit, by autocorrecting 11% of the cases. You could download the pwis_notmw1.txt and work from it.

gasyoun commented 6 years ago

'k4'; if the typist missed the accent, it would be just k

That explains a lot.

11% of the cases

Well done, well done.

Dhaval, thanks again for being back. This one still remains the major dictionary. Not widely used in India, because people tend to forget German, but the most academic one up to now.

funderburkjim commented 6 years ago

Autocorrection 2

This is based on an idea in the article How to Write a Spelling Corrector by Peter Norvig.

Consider the example of yAjNavalkya, in slp1 spelling.

The idea is

  1. Find candidate spellings which are an 'edit distance' of 1 from the original spelling (i.e., obtained by replacing one character, removing one character, or inserting one character). There are 1127 such spellings, mostly nonsense: aAjNavalkya, yaAjNavalkya, yjNavalkya, etc.
  2. Check each of these spellings against the list of known MW headwords.
    • Declare success (and mark as (Auto1)) if there is exactly 1 known spelling among the candidates
    • Declare possible success (and mark as (Auto1X)) if there is more than 1 known spelling among the candidates
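The two steps can be sketched along the lines of Norvig's `edits1`, leaving out his transposition edits since only replace, remove and insert are described here. The 49-letter SLP1 alphabet below is an assumption (14 vowels, M, H, and 33 consonants), and `suggest` is a made-up name. With that alphabet the raw, non-deduplicated candidate count for the 11-letter yAjNavalkya is 11 deletions + 11×48 replacements + 12×49 insertions = 1127.

```python
# Assumed SLP1 alphabet: 14 vowels, anusvara/visarga, 33 consonants.
SLP1_ALPHABET = "aAiIuUfFxXeEoOMHkKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzsh"


def edits1(word):
    """All spellings one delete, replace or insert away (duplicates kept)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    replaces = [left + c + right[1:]
                for left, right in splits if right
                for c in SLP1_ALPHABET if c != right[0]]
    inserts = [left + c + right
               for left, right in splits
               for c in SLP1_ALPHABET]
    return deletes + replaces + inserts


def suggest(word, mw_headwords):
    """Return (known candidates, tag) following the Auto1 / Auto1X rule."""
    known = sorted(set(edits1(word)) & mw_headwords)
    if len(known) == 1:
        tag = "(Auto1)"
    elif len(known) > 1:
        tag = "(Auto1X)"
    else:
        tag = ""
    return known, tag
```

For instance, `suggest("agastiya", mw_headwords)` would give the candidates agastIya and agastya with tag (Auto1X), provided both are in the headword set.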

Results:

See pwis_notmw2.txt

gasyoun commented 6 years ago

N 1 n X,O,x,o,F,f,nI,nO,nA,nE,an,in,nU,ni,na,nf,nu,A,E,I,U,a,e,nF,i,no,u (Auto1X)

How can one help here?

Visṇu 1 visRu visru,vizRu (Auto1X)
Vrṣṇi 1 vrzRi vArzRi,vfzRi (Auto1X)
Yogint 1 yogint yogin,yoginI,yoginy (Auto1X)

The method is simple. The results - promising. What's the wanted output format?

funderburkjim commented 6 years ago

LevAuto

An additional step of autosuggestion was carried out on the remaining 300+ items of pwis_notmw2.txt for which the previous steps produced no suggestions.

An example will illustrate the conceptually simple process. One of these 300+ is Maṅguśrī 1 maNguSrI. Now consider the unknown spelling maNguSrI in light of all MW headwords, and find the headword or headwords which are closest in spelling to maNguSrI. Here, the closest headwords are those with minimal Levenshtein edit distance. Thus we must examine the edit distance of each of the (approximately 200,000) MW headwords from the word maNguSrI, and choose those headwords with the smallest edit distance. This list is used for the suggestion. In this case, the answer turns out to be the headwords aNgurI, maNgura, maDuSrI, maYjuSrI, and the list contains what is almost surely the right spelling correction: maYjuSrI.
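For illustration only, the same selection can be written as a brute-force scan with a textbook dynamic-programming Levenshtein distance. This is not the Levenshtein-automaton approach actually used for the LevAuto results; it is just the slow, direct version of the idea, and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, replace)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # replace or match
        previous = current
    return previous[-1]


def closest_headwords(word, mw_headwords):
    """Return (minimal distance, sorted headwords at that distance)."""
    best, closest = None, []
    for headword in mw_headwords:
        d = levenshtein(word, headword)
        if best is None or d < best:
            best, closest = d, [headword]
        elif d == best:
            closest.append(headword)
    return best, sorted(closest)
```

Scanning every headword this way costs one distance computation per headword per unknown word, which is why something faster is wanted for a lexicon of this size.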

The results are shown in pwis_notmw3.txt. The 300+ suggestions generated by this minimal edit distance technique are marked with (LevAuto).

While this technique is conceptually simple, it is computationally complex. In fact, the notmw3 LevAuto suggestions were generated by applying a Levenshtein Automaton built on top of the Pynini Python library developed by Kyle Gorman. The details of my application are in this pynini-learn repository.

As mentioned there, the current implementation does not appear efficient enough to be very useful with such a large 'lexicon' as the 200,000 MW headword list. Gorman held out the possibility of a more efficient algorithm in this comment.

funderburkjim commented 6 years ago

How can one help here? The method is simple. The results - promising. What's the wanted output format?

There is now a file in the PWK repository where corrections can be entered: pwis_notmw3_correctionform.txt. Here is a link to a brief readme.

@drdhaval2785 If you already have corrections in some other format, I'll be glad to transfer them to the pwis_notmw3_correctionform.txt file.

@gasyoun Does this procedure satisfy your needs?

drdhaval2785 commented 3 years ago

The corrections have still not been prepared / installed. Saw "Metron. Agastiya's." on a webpage today.

funderburkjim commented 3 years ago

Example

The readme at https://github.com/sanskrit-lexicon/PWK/tree/master/pw_iast gives some background on anticipated usage. The objective is to change various suspicious spellings in the PW dictionary to modern IAST spellings.

Here's how I would proceed to deal with the 'Agastiya' example.

open pwis_notmw3_correctionform.txt

Link is https://github.com/sanskrit-lexicon/PWK/blob/master/pw_iast/pwis_notmw3_correctionform.txt

Find 'Agastiya':

Case 0023: Agastiya 4 agastiya : Corrected_SLP1=
; Suggestion method: (Auto1X)   Corrected by: 
; Suggestions: agastIya,agastya

open pwis_notmw.html in browser.

Link is https://sanskrit-lexicon.github.io/PWK/pwis_notmw.html

and find 'Agastiya':

Agastiya | 4 | agastiya | OrvaSeya kalaSaBU kumBaBU kumBasaMBava

The 4 words (OrvaSeya, kalaSaBU, kumBaBU, kumBasaMBava) are the SLP1 spellings of the headwords under which the suspicious word Agastiya appears.

Examine instances

First, look up OrvaSeya in the PW dictionary using one of the displays: image

Examine scanned image to see what print actually is:

image

Decide modern IAST spelling:

I would decide 'Agastya' --- PW's IAST typically uses 'j' when modern IAST uses 'y'.

Examine other uses:

kalaSaBU  image

kumBaBU  image

kumBasaMBava  image

Choose Answer

All the cases are the same: print has 'Agastja'; the modern form is 'Agastya'. The current pw.txt digitization has 'Agastiya'. The solution is to change it to 'Agastya'.

Fill in Correctionform for Case 23

Edit pwis_notmw3_correctionform.txt (https://github.com/sanskrit-lexicon/PWK/blob/master/pw_iast/pwis_notmw3_correctionform.txt):

Case 0023: Agastiya 4 agastiya : Corrected_SLP1= Agastya
; Suggestion method: (Auto1X)   Corrected by: funderburkjim
; Suggestions: agastIya,agastya

Commit the change (commit message = 'Case 23').
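For whoever later installs the corrections, reading the filled-in form back could look roughly like the sketch below. It assumes every case has exactly the three-line layout shown above; the regular expressions, the dictionary keys and the name `parse_correction_form` are all made up for this illustration and are not part of the PWK repository.

```python
import re

# Patterns for the three lines of one case, as shown in the example above.
CASE_RE = re.compile(r"^Case (\d+): (.+?) (\d+) (\S+) : Corrected_SLP1=(.*)$")
METHOD_RE = re.compile(r"^; Suggestion method:\s*(\S*)\s+Corrected by:\s*(.*)$")
SUGGEST_RE = re.compile(r"^; Suggestions:\s*(.*)$")


def parse_correction_form(text):
    """Parse correction-form text into a list of dicts, one per case."""
    cases = []
    current = None
    for line in text.splitlines():
        m = CASE_RE.match(line)
        if m:
            current = {
                "case": int(m.group(1)),
                "iast": m.group(2),
                "count": int(m.group(3)),
                "slp1": m.group(4),
                "corrected": m.group(5).strip(),
                "method": "",
                "corrected_by": "",
                "suggestions": [],
            }
            cases.append(current)
            continue
        if current is None:
            continue
        m = METHOD_RE.match(line)
        if m:
            current["method"] = m.group(1)
            current["corrected_by"] = m.group(2).strip()
            continue
        m = SUGGEST_RE.match(line)
        if m:
            current["suggestions"] = [s for s in m.group(1).split(",") if s]
    return cases
```

Cases whose "corrected" field is still empty are simply the ones nobody has worked on yet.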

funderburkjim commented 3 years ago

installing corrections

Filling in the correction form does not install the corrections to pw.txt. Installation would be a separate step done by either @drdhaval2785 or @funderburkjim .

This is a slow process, but looks reliable.

There are 1585 cases.

The end result would be an improvement to modern IAST.

gasyoun commented 3 years ago

I would decide 'Agastya' --- PW's IAST typically uses 'j' when modern IAST uses 'y'.

Exactly.

All the cases are the same: print has 'Agastja', Modern form is 'Agastiya'. Current pw.txt digitization has 'Agastiya'.

The modern form is only 'Agastya', not 'Agastiya'.

funderburkjim commented 3 years ago

Wonder if @SergeA would have interest in working on this?

funderburkjim commented 3 years ago

Modern form is 'Agastya' 👍 Have corrected comment.

Andhrabharati commented 3 years ago

If you already have corrections in some other format, I'll be glad to transfer them to the pwis_notmw3_correctionform.txt file.

Wonder if @SergeA would have interest in working on this?

Can I poke-in my nose in this, if @funderburkjim is willing to work on it, if given in 'some other format'? [It's hardly ~2 days' work for me.]

Andhrabharati commented 1 year ago

This is one of the many cases that are "counter" to what was replied by @drdhaval2785 and @gasyoun against my posting somewhere [that I do not get @funderburkjim's response for months together, while others get almost 'immediately'], that my posts are "heavy-meals" and not easily chewable/digestible as are all others' postings.

Here, I just wrote a single sentence, and yet to get some/any response from Jim (for almost 2 years now)!

funderburkjim commented 1 year ago

@Andhrabharati Obviously, I lost track of your question here.

Please provide a couple of examples of what you mean by some other format.

When the current work with you on the Grassmann dictionary is complete, I will examine the feasibility of working with you on this.

Andhrabharati commented 1 year ago

Obviously, my resolutions would be non-slp1 but in plain iast.

I see that some of these are abbr.s with a dot after the ending is-tag. How to deal with them, something like <ab><is>Z.</is></ab>?

Andhrabharati commented 1 year ago

Now that I saw many later posts at this forum, I think the regular

old: yyy
new: zzz

would be the way for me to take it up, wrt the latest pw.txt lines (at csl-orig); probably with just the is-tagged word [yyy/zzz] (as at times the line could be quite long).

Andhrabharati commented 1 year ago

Just had a "look" inside the pw.txt for the <is>-strings and noticed ~10k instances of <is>…</is> strings inside the italics {%…%}; whereas the print has vast majority of them (if not all) in normal-face (font) [& wide-spaced].
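A quick way to list such instances might look like the sketch below. It assumes that an italic span {%…%} opens and closes on the same line and is never nested, which may not hold everywhere in pw.txt; the function name is made up.

```python
import re

ITALIC_RE = re.compile(r"\{%(.*?)%\}")   # one italic span on a single line
IS_RE = re.compile(r"<is>(.*?)</is>")    # one is-tagged word


def is_tags_inside_italics(line):
    """Return the <is> contents that occur inside {%...%} on this line."""
    found = []
    for italic in ITALIC_RE.finditer(line):
        found.extend(IS_RE.findall(italic.group(1)))
    return found
```

Summing `len(is_tags_inside_italics(line))` over all lines of the file would give a count comparable to the ~10k figure above, under the stated assumptions.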

Also many more "bad"-tagging/marking of various types are seen.

This calls for a full overhaul of the data, and I get reminded of the earlier reaction of Thomas, if I say anything more!! [I had stopped working on pw after the <ls> marking those days, seeing Thomas's reaction on my post.]

I see not much worth taking up correction of just the <is>-tagged iast portion. But, is Jim ready/willing now to take up a collaborative work to "bring" a good-shape to pw.txt?

Andhrabharati commented 1 year ago

Did a quick checking for these "notmw" words in MW, and found quite many to be present in MW!

Just a few small changes/additions in the "search pattern" would eliminate all such ones from the list!!

Andhrabharati commented 1 year ago

When I tried looking up some random words in the list, I noticed that the CDSL pwk scan pages are not as clear as my copy. [Probably, this could be a reason for the typos.]

Probably, these scans could be replaced, for the benefit of any and everyone. [This point was discussed elsewhere earlier wherein I mentioned having good scans of the vol.s 2-7, and now I've all the 7 volumes in my possession.]

funderburkjim commented 1 year ago

@Andhrabharati Am ready to begin working with you in this issue related to <is> tag in PW.

  1. You identify a problem with markup re 'is' tag: <is>X</is> within italics {%Y%}.
  2. You suggest the need for some 'new' rules for auto-correction.

I think we should restrict this study to is-tag, if possible (so that this good issue can be solved).

What do you need from me?

Andhrabharati commented 1 year ago

I had progressed much ahead from this IAST part in pw in the past couple of days, @funderburkjim !!

I shall post my work in due course of time, for your perusal.

If you are willing, pl. give me the links to the conversion scripts to Devanagari and iast of the full file, to be run from a command line locally.

I am about to start proofing full headwords, as some errors were noted while working on this pw.txt [BTW, I had already marked grouped entries all throughout the file, just like in GRA and MW.]

Andhrabharati commented 1 year ago

My present work is covering all kinds of markups and listing the abbr. and ls entries.

funderburkjim commented 1 year ago

links to the conversion scripts to Devanagari and iast of the full file, to be run from a command line locally.

I could provide a python script that would run as follows

python pw_convert.py slp1,iast pw.txt pw_iast.txt
python pw_convert.py iast,slp1 pw_iast.txt pw_slp1.txt

This would convert the metaline (k1 and k2) and all the {#X#} from slp1 to iast, and back.
And similarly for 'deva' instead of 'iast'.

Is this what you request?
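To make the intended behaviour concrete, here is a rough sketch of the slp1-to-iast direction for the {#X#} spans only. It is not the pw_convert.py script itself; the table is the usual SLP1/IAST correspondence typed by hand, accent marks and other non-letter characters are passed through unchanged, and the function names are invented for the example.

```python
import re
import unicodedata

# SLP1 letters whose IAST form differs; everything else passes through.
SLP1_TO_IAST = {
    "A": "ā", "I": "ī", "U": "ū", "f": "ṛ", "F": "ṝ", "x": "ḷ", "X": "ḹ",
    "E": "ai", "O": "au", "M": "ṃ", "H": "ḥ",
    "K": "kh", "G": "gh", "N": "ṅ", "C": "ch", "J": "jh", "Y": "ñ",
    "w": "ṭ", "W": "ṭh", "q": "ḍ", "Q": "ḍh", "R": "ṇ",
    "T": "th", "D": "dh", "P": "ph", "B": "bh", "S": "ś", "z": "ṣ",
}

SANSKRIT_SPAN = re.compile(r"\{#(.*?)#\}")


def slp1_to_iast(text):
    """Transcode an SLP1 string to IAST, one character at a time."""
    converted = "".join(SLP1_TO_IAST.get(ch, ch) for ch in text)
    return unicodedata.normalize("NFC", converted)


def convert_line(line):
    """Convert only the {#X#} spans of a line; leave the rest untouched."""
    return SANSKRIT_SPAN.sub(
        lambda m: "{#" + slp1_to_iast(m.group(1)) + "#}", line)
```

The reverse direction needs digraph handling (kh, ai, etc.), so it is not simply this table inverted.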

Andhrabharati commented 1 year ago

Yes, exactly.

Andhrabharati commented 1 year ago

Probably, you could leave the metalines as is, as I am not going to touch that portion.

My reading will be limited to the header and body portions alone.

The metalines would have to be generated from the header portion, as done in case of GRA.

funderburkjim commented 1 year ago

@Andhrabharati. Further discussion can be found in https://github.com/sanskrit-lexicon/PWK/issues/95.

We can leave this #419 issue open until the work in the PWK repository is completed.

Andhrabharati commented 1 year ago

Here are some small pieces from my work, wrt the <is> elements--

* marked items (diff. in CSL & AB texts):

image

abbr. type items:

I see that some of these are abbr.s with a dot after the ending is-tag. How to deal with them, something like <ab><is>Z.</is></ab>?

<is n="Adhyātmarāmāyaṇa">Adhyātmar.</is> <is n="Adhyāya">Adhy.</is> <is n="Agni">A.</is> <is n="Āgnīdhra">Ā.</is> <is n="Agniṣṭoma">A.</is> <is n="Aṅga">A.</is> <is n="Apsaras">A.</is> <is n="Arka">A.</is> <is n="Āśvalāyana">A.</is> <is n="Atharvan">Ath.</is> <is n="Avanti">Av.</is> <is n="Ayodhyā">A.</is> <is n="Bālāhaka">B.</is> <is n="Bhaṇḍīratha">Bh.</is> <is n="Bhārgava">B.</is> <is n="Bhūliṅgā">Bh.</is> <is n="Brahman">B.</is> <is n="Brahman">Br.</is> <is n="Brahmaṇācchaṃsin">Br.</is> <is n="Cakora">C.</is> <is n="Camasa">C.</is> <is n="Dhanvantari">Dh.</is> <is n="Dūrvā">D.</is> <is n="Dvārakaukas">Dv.</is> <is n="Dvāravatī">Dv.</is> <is n="Gandharva">G.</is> <is n="Gaṇeśa">G.</is> <is n="Gārgya">G.</is> <is n="Himālaya">H.</is> <is n="Indra">I.</is> <is n="Jagatī">J.</is> <is n="Jamadagni">J.</is> <is n="Kālī">K.</is> <is n="Kānyakubja">Kānyak.</is> <is n="Kārikā">K.</is> <is n="Karṇāṭa">K.</is> <is n="Kāśi">K.</is> <is n="Kāśmīra">K.</is> <is n="Kosala">K.</is> <is n="Kuḍava">K.</is> <is n="Likhita">L.</is> <is n="Makara">M.</is> <is n="Manu">M.</is> <is n="Marut">M.</is> <is n="Mathurā">M.</is> <is n="Nairañjanā">N.</is> <is n="Nalikā">N.</is> <is n="Narmadā">N.</is> <is n="Nīlakaṇṭha">Nīlak.</is> <is n="Pañcālā">P.</is> <is n="Paphaka">P.</is> <is n="Pavamāna Stotra">P. St.</is> <is n="Puronuvākyā">P.</is> <is n="Pūru">P.</is> <is n="Pūṣan">P.</is> <is n="Rāma">R.</is> <is n="Rāmāyaṇa">R.</is> <is n="Revatī">R.</is> <is n="Śākaṭāyana">Śāk.</is> <is n="Sāman">S.</is> <is n="Śaṅkha">Ś.</is> <is n="Sarasvatī">S.</is> <is n="Savitar">S.</is> <is n="Sāyaṇa">Sāy.</is> <is n="Soma">S.</is> <is n="Śūdra">Ś.</is> <is n="Sumantra">S.</is> <is n="Tārkṣya">T.</is> <is n="Udgātar">U.</is> <is n="Udumbara">U.</is> <is n="Vaṅkara">V.</is> <is n="Vāsudeva">Vās.</is> <is n="Vāyu">V.</is> <is n="Veda">V.</is> <is n="Vidura">V.</is> <is n="Viśvarūpa">V.</is> <is n="Yajus">Y.</is> <is n="Yayāti">Y.</is> <is n="Yuvanāśva">Yuv.</is>

funderburkjim commented 1 year ago

@Andhrabharati I do not find any of your examples of .</is>
Can you provide the line-numbers of your examples?

funderburkjim commented 1 year ago

I should have made previous comment in https://github.com/sanskrit-lexicon/PWK/issues/95

Andhrabharati commented 1 year ago

Pl. see my initial post above --

it isn't .</is> but is </is>. in the CDSL file and I had brought the dot inside the is-tagging (and of course, there were some typos as well that I had corrected).
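The move of the dot described here could be sketched with a single substitution, as below. This is only an assumption about how it might be done, not the procedure actually used: a sentence-final full stop after an unabbreviated is-tagged word would be pulled inside as well, so the output would need manual review.

```python
import re

# Matches an attribute-less <is>...</is> followed directly by a dot.
IS_THEN_DOT = re.compile(r"<is>([^<]+)</is>\.")


def move_dot_inside(text):
    """Rewrite '<is>X</is>.' as '<is>X.</is>' (candidates need review)."""
    return IS_THEN_DOT.sub(r"<is>\1.</is>", text)
```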

funderburkjim commented 1 year ago

How to deal with them, something like Z.?

@Andhrabharati In your pw versions discussed at https://github.com/sanskrit-lexicon/PWK/issues/95, you introduce markup such is <is n="TOOLTIP">Z.</is> (also <ab n="TOOLTIP">Z.</ab>, etc.), and the displays provide html so TOOLTIP is available to users.

This seems to answer the above question.
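As a small illustration of that markup, the pairs of tooltip and abbreviation can be pulled out with a regular expression and grouped, which also shows how ambiguous a bare abbreviation is without the n attribute. The function name is hypothetical, and the pattern assumes the attribute is always written exactly as n="…" with double quotes.

```python
import re
from collections import defaultdict

IS_WITH_TOOLTIP = re.compile(r'<is n="([^"]*)">([^<]*)</is>')


def expansions_by_abbreviation(text):
    """Map each abbreviation to the sorted list of its n="..." expansions."""
    table = defaultdict(set)
    for expansion, abbreviation in IS_WITH_TOOLTIP.findall(text):
        table[abbreviation].add(expansion)
    return {abbr: sorted(names) for abbr, names in table.items()}
```

On the list posted above, a key such as "A." would collect several different expansions, which is the reason the tooltip is needed.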