Open funderburkjim opened 6 years ago
Some of these have spelling errors
This is a result of manual checking, right?
These cases are identified in the current digitization by the <is> tag. The reason Thomas originally coded these words is that, as the print example shows, they appear with wide letter spacing. Thomas's original coding was converted to the current <is> xml-type tag: <is>Agastiya</is>.
There are 4858 distinct text instances of the <is> tag.
We want to find spelling errors.
It is expected that many of these 4858 instances are spelled correctly. One way to separate cases which are probably correctly spelled from cases which are possibly incorrectly spelled is to make use of a list of known correctly spelled words. For this purpose, we are using the headwords of MW (193,000 distinct such headwords).
After converting the IAST words to lower case, and then transcoding from IAST to slp1, we can compare to the list of MW headwords. The result is that 3273 of the words are recognized as MW headwords (therefore probably correctly spelled), while 1585 of the words are not so recognized, and therefore need further examination.
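The separation step just described can be sketched roughly as follows. This is only an illustration, not the program actually used: the function names and the tiny `mw_headwords` set are made up for the example, and the transcoding table covers plain IAST letters only (no accents or other special marks).

```python
import unicodedata

# IAST -> SLP1 table. Two-letter IAST sequences (aspirates, diphthongs) are
# tried before single letters. Keys are NFC-normalized for safe comparison.
_RAW = {
    "ai": "E", "au": "O",
    "kh": "K", "gh": "G", "ch": "C", "jh": "J",
    "ṭh": "W", "ḍh": "Q", "th": "T", "dh": "D", "ph": "P", "bh": "B",
    "ā": "A", "ī": "I", "ū": "U", "ṛ": "f", "ṝ": "F", "ḷ": "x", "ḹ": "X",
    "ṃ": "M", "ḥ": "H", "ṅ": "N", "ñ": "Y", "ṭ": "w", "ḍ": "q", "ṇ": "R",
    "ś": "S", "ṣ": "z",
}
IAST_TO_SLP1 = {unicodedata.normalize("NFC", k): v for k, v in _RAW.items()}


def iast_to_slp1(word):
    """Lower-case an IAST word and transcode it to SLP1 (hypothetical helper)."""
    text = unicodedata.normalize("NFC", word).lower()
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in IAST_TO_SLP1:          # two-letter sequence such as 'kh'
            out.append(IAST_TO_SLP1[pair])
            i += 2
        else:                             # single letter, unchanged if unmapped
            out.append(IAST_TO_SLP1.get(text[i], text[i]))
            i += 1
    return "".join(out)


def split_by_headwords(iast_words, mw_headwords):
    """Split words into (recognized, not recognized) against SLP1 headwords."""
    known, unknown = [], []
    for w in iast_words:
        (known if iast_to_slp1(w) in mw_headwords else unknown).append(w)
    return known, unknown


# Toy data; the real comparison uses the full MW headword list.
mw_headwords = {"agastya", "maYjuSrI"}
known, unknown = split_by_headwords(["Agastya", "Agastiya"], mw_headwords)
```

With the toy data, 'Agastya' lands in the recognized list and 'Agastiya' in the list needing further examination.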
These two lists are in this gist
Each line shows
There is also an html file for the nonmw list. This contains a link to PW basic display for each PW headword where the questionable IAST spelling occurs.
Make a local copy of the pwis_notmw.txt file, and also of the pwis_notmw.html file.
Indicate corrections in the pwis_notmw.txt file by adding a 4th field with the correct spelling in SLP1 form.
A post-processing program can convert the SLP1 correction back to IAST. It is probably easier (for @drdhaval2785, at least) to enter the correction in SLP1 rather than with the diacritics required in many of the IAST spellings.
Then submit back to me the corrected file. I'll convert these to standard 'updateByLine' old/new corrections for PW, and install the corrections.
Don't worry about whether the correction is a typo or print error. Probably almost all are typos.
One more possibility to reduce the list.
E.g. pracetas - praketas, paYcagavya - paNkagavya, etc.
If we replace k with c and then find the word in the MW headword list, it can be listed as auto-corrected.
More observations to reduce the list will be enumerated as and when I encounter such tendencies which are manageable programmatically.
pwis_notmw1.txt has been added to the gist.
This contains the same list of 1585 words as in pwis_notmw.txt, but with 179 autocorrections. The autocorrections are generated by the rules:
k' (k-acute) is used for 'c'. In the original AS coding, this k' would have been written as 'k4'; if the typist missed the accent, it would be just k.
g' was PW's IAST for 'j'.
These rules were applied to the slp1 spelling of each of the 1585 words; if one of the rules resulted in a new spelling which matched an MW headword, then this was indicated in the output (pwis_notmw1.txt) by (Auto) as a fifth field, to distinguish it as an autocorrection.
@drdhaval2785 This should help a bit, by autocorrecting 11% of the cases. You could download the pwis_notmw1.txt and work from it.
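For readers who want to try the same kind of rule-based autocorrection, here is a minimal sketch. It assumes the two rules amount to the SLP1 substitutions k→c and g→j, and that a rule may be applied either at a single position or at every position; how the substitutions were applied in the actual program is not stated in the thread, so treat both points as assumptions. The name `autocorrect` is made up for the example.

```python
# SLP1 letter substitutions modelled on the rules above: a missed accent leaves
# 'k' where 'c' was meant, and 'g' where 'j' was meant.
RULES = [("k", "c"), ("g", "j")]


def autocorrect(slp1_word, mw_headwords):
    """Return MW headwords reachable by applying one rule to the word."""
    candidates = set()
    for old, new in RULES:
        # Apply the rule to every occurrence at once.
        candidates.add(slp1_word.replace(old, new))
        # Apply the rule to each single occurrence on its own.
        for i, ch in enumerate(slp1_word):
            if ch == old:
                candidates.add(slp1_word[:i] + new + slp1_word[i + 1:])
    candidates.discard(slp1_word)  # an unchanged spelling is not a correction
    return sorted(c for c in candidates if c in mw_headwords)


# Toy headword set; the real check uses the full MW headword list.
corrections = autocorrect("praketas", {"pracetas", "agastya"})
```

Here `corrections` holds the single candidate 'pracetas'. Note that in this sketch 'paNkagavya' would not reach 'paYcagavya', because the nasal also differs; that pair would need an extra substitution.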
'k4'; if the typist missed the accent, it would be just k
That explains a lot.
11% of the cases
Well done, well done.
Dhaval, thanks again for being back. This one still remains the major dictionary. Not widely used in India, because people tend to forget German, but the most academic one up to now.
This is based on an idea in the article How to Write a Spelling Corrector by Peter Norvig.
Consider the example of yAjNavalkya, in slp1 spelling.
The idea is
See pwis_notmw2.txt
N 1 n X,O,x,o,F,f,nI,nO,nA,nE,an,in,nU,ni,na,nf,nu,A,E,I,U,a,e,nF,i,no,u (Auto1X)
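As a rough illustration of the technique in Norvig's article, the sketch below generates every string one edit away from an SLP1 word and keeps those that are known headwords. The alphabet string, the function names, and the inclusion of transpositions are assumptions made for this example; the thread does not say which edit types the (Auto1X) step actually used.

```python
# Basic SLP1 letters (no accent marks); an assumption for this sketch.
SLP1_LETTERS = "aAiIuUfFxXeEoOMHkKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzsh"


def edits1(word):
    """All strings one deletion, transposition, replacement or insertion away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in SLP1_LETTERS]
    inserts = [a + c + b for a, b in splits for c in SLP1_LETTERS]
    return set(deletes + transposes + replaces + inserts)


def suggest(word, headwords):
    """Headwords that are exactly one edit away from the word, sorted."""
    return sorted(edits1(word) & set(headwords))


# Toy headword set standing in for the MW headword list.
toy_headwords = {"visru", "vizRu", "yogin", "yoginI", "yoginy", "agastya"}
visnu_suggestions = suggest("visRu", toy_headwords)
yogint_suggestions = suggest("yogint", toy_headwords)
```

With this toy set, 'visRu' yields visru and vizRu, and 'yogint' yields yogin, yoginI and yoginy.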
How can one help here?
Visṇu 1 visRu visru,vizRu (Auto1X)
Vrṣṇi 1 vrzRi vArzRi,vfzRi (Auto1X)
Yogint 1 yogint yogin,yoginI,yoginy (Auto1X)
The method is simple. The results - promising. What's the wanted output format?
An additional step of autosuggestion was carried out on the remaining 300+ items of pwis_notmw2.txt that have no suggestions by the previous steps.
An example will illustrate the conceptually simple process:
One of these 300+ is *Maṅguśrī 1 maNguSrI. Now consider the unknown spelling maNguSrI in light of all MW headwords, and find the headword or headwords which are closest in spelling to maNguSrI. Here, the closest headwords are those with minimal Levenshtein edit distance. Thus we must go through a process of examining the edit distance of each of the (approximately 200,000) MW headwords from the word maNguSrI, and choose those headwords with the smallest possible edit distance from maNguSrI. This list is used for the suggestion. In this case, the answer turns out to be the headwords aNgurI,maNgura,maDuSrI,maYjuSrI. In this case, the suggestion list contains what is almost surely the right spelling correction maYjuSrI.
The results are shown in pwis_notmw3.txt.
The 300+ suggestions generated by this minimal edit distance technique are marked with (LevAuto).
While this technique is conceptually simple, it is computationally complex. In fact, the notmw3 LevAuto suggestions were generated by applying a Levenshtein Automaton built on top of the Pynini python library developed by Kyle Gorman. The details of my application are in this pynini-learn repository.
As mentioned there, the current implementation does not appear efficient enough to be very useful with such a large 'lexicon' as the 200,000 MW headword list. Gorman held out the possibility of a more efficient algorithm in this comment.
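To make the minimal-edit-distance idea concrete, here is a brute-force sketch using the textbook dynamic-programming Levenshtein distance. It is not the Pynini Levenshtein automaton mentioned above, and the function names and the short headword list are made up for the example; it simply compares the unknown word against every headword in turn.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, replace)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # replace or match
        prev = curr
    return prev[-1]


def nearest_headwords(word, headwords):
    """Return (minimal distance, sorted headwords at that distance)."""
    best, best_words = None, []
    for h in headwords:
        d = levenshtein(word, h)
        if best is None or d < best:
            best, best_words = d, [h]
        elif d == best:
            best_words.append(h)
    return best, sorted(best_words)


# Toy list standing in for the roughly 200,000 MW headwords.
toy = ["aNgurI", "maNgura", "maDuSrI", "maYjuSrI", "agastya"]
distance, closest = nearest_headwords("maNguSrI", toy)
```

On the toy list the four headwords named above all sit at distance 2 from maNguSrI. A full scan like this costs one distance computation per headword for every unknown word.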
How can one help here? The method is simple. The results - promising. What's the wanted output format?
There is now a file in the PWK repository where corrections can be entered: pwis_notmw3_correctionform.txt. Here is a link to a brief readme.
@drdhaval2785 If you already have corrections in some other format, I'll be glad to transfer them to the pwis_notmw3_correctionform.txt file.
@gasyoun Does this procedure satisfy your needs?
Corrections still are not prepared / installed.
Saw "Metron. Agastiya's." in webpage today.
The readme at https://github.com/sanskrit-lexicon/PWK/tree/master/pw_iast gives some background on anticipated usage. The objective is to change various suspicious spellings in the PW dictionary to modern IAST spellings.
Here's how I would proceed to deal with the 'Agastiya' example.
link = https://github.com/sanskrit-lexicon/PWK/blob/master/pw_iast/pwis_notmw3_correctionform.txt Find 'Agastiya':
Case 0023: Agastiya 4 agastiya : Corrected_SLP1=
; Suggestion method: (Auto1X) Corrected by:
; Suggestions: agastIya,agastya
Link is https://sanskrit-lexicon.github.io/PWK/pwis_notmw.html
and find 'Agastiya':
Agastiya | 4 | agastiya | OrvaSeya kalaSaBU kumBaBU kumBasaMBava
The 4 words ('OrvaSeya', etc.) are SLP1 spellings of the headwords where the suspicious word Agastiya appears.
First, look up OrvaSeya in PW dictionary using one of the displays
Examine scanned image to see what print actually is:
I would decide 'Agastya' --- PW's IAST typically uses 'j' when modern IAST uses 'y'.
kalaSaBU
kumBaBU
kumBasaMBava
All the cases are the same: print has 'Agastja', modern form is 'Agastya'. Current pw.txt digitization has 'Agastiya'.
Solution is to change to 'Agastya'
Edit [pwis_notmw3_correctionform.txt](https://github.com/sanskrit-lexicon/PWK/blob/master/pw_iast/pwis_notmw3_correctionform.txt):
Case 0023: Agastiya 4 agastiya : Corrected_SLP1= Agastya
; Suggestion method: (Auto1X) Corrected by: funderburkjim
; Suggestions: agastIya,agastya
Commit the change (commit message = 'Case 23').
Filling in the correction form does not install the corrections to pw.txt. Installation would be a separate step done by either @drdhaval2785 or @funderburkjim .
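Anyone who wants to check their filled-in cases before committing could read an entry back with something like the sketch below. It assumes every case has exactly the three-line layout of the 'Case 0023' example and that the IAST word contains no spaces; the regular expressions and the name `parse_case` are made up for this illustration and are not part of the PWK repository.

```python
import re

# Patterns modelled on the three lines of the 'Case 0023' example.
CASE_RE = re.compile(
    r"Case (?P<case>\d+): (?P<iast>\S+) (?P<count>\d+) (?P<slp1>\S+) : "
    r"Corrected_SLP1=\s*(?P<corrected>\S*)"
)
METHOD_RE = re.compile(
    r";\s*Suggestion method:\s*\((?P<method>\w+)\)\s*Corrected by:\s*(?P<by>\S*)"
)
SUGGEST_RE = re.compile(r";\s*Suggestions:\s*(?P<items>.*)")


def parse_case(block):
    """Parse one three-line correction-form entry into a dict."""
    lines = [ln.strip() for ln in block.strip().splitlines()]
    head = CASE_RE.match(lines[0])
    method = METHOD_RE.match(lines[1])
    sugg = SUGGEST_RE.match(lines[2])
    items = sugg.group("items").strip()
    return {
        "case": head.group("case"),
        "iast": head.group("iast"),
        "count": int(head.group("count")),
        "slp1": head.group("slp1"),
        "corrected": head.group("corrected"),   # empty string if not filled in
        "method": method.group("method"),
        "by": method.group("by"),
        "suggestions": [s for s in items.split(",") if s],
    }


filled = """Case 0023: Agastiya 4 agastiya : Corrected_SLP1= Agastya
; Suggestion method: (Auto1X) Corrected by: funderburkjim
; Suggestions: agastIya,agastya"""
entry = parse_case(filled)
```

An entry whose `corrected` field comes back empty has not been resolved yet, which gives a quick way to count how many of the 1585 cases remain.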
This is a slow process, but looks reliable.
There are 1585 cases.
The end result would be improvement to modern IAST
I would decide 'Agastya' --- PW's IAST typically uses 'j' when modern IAST uses 'y'.
Exactly.
All the cases are the same: print has 'Agastja', Modern form is 'Agastiya'. Current pw.txt digitization has 'Agastiya'.
Modern form is 'Agastya', and not 'Agastiya' only.
Wonder if @SergeA would have interest in working on this?
Modern form is 'Agastya' 👍 Have corrected comment.
If you already have corrections in some other format, I'll be glad to transfer them to the pwis_notmw3_correctionform.txt file.
Wonder if @SergeA would have interest in working on this?
Can I poke-in my nose in this, if @funderburkjim is willing to work on it, if given in 'some other format'? [It's hardly ~2 days' work for me.]
This is one of the many cases that are "counter" to what was replied by @drdhaval2785 and @gasyoun against my posting somewhere [that I do not get @funderburkjim's response for months together, while others get almost 'immediately'], that my posts are "heavy-meals" and not easily chewable/digestible as are all others' postings.
Here, I just wrote a single sentence, and yet to get some/any response from Jim (for almost 2 years now)!
@Andhrabharati Obviously, I lost track of your question here.
Please provide a couple of examples of what you mean by "some other format".
When the current work with you on Grassman dictionary is complete, I will examine the feasibility of working with you on this.
Obviously, my resolutions would be non-slp1 but in plain iast.
I see that some of these are abbr.s with a dot after the ending is-tag. How to deal with them, something like <ab><is>Z.</is></ab>?
Now that I saw many later posts at this forum, I think the regular
old: yyy
new: zzz
would be the way for me to take it up, wrt the latest pw.txt lines (at csl-orig); probably with just the is-tagged word [yyy/zzz] (as at times the line could be quite long).
Just had a "look" inside the pw.txt for the <is>-strings and noticed ~10k instances of <is>…</is> strings inside the italics {%…%}; whereas the print has the vast majority of them (if not all) in normal face (font) [& wide-spaced].
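A count of this kind can be reproduced with a short script along these lines. It is only a sketch: it assumes that {%…%} spans do not nest and do not run across line breaks, and the sample line is invented for the example rather than taken from pw.txt.

```python
import re

ITALIC_RE = re.compile(r"\{%(.*?)%\}")   # one italic span, non-greedy
IS_RE = re.compile(r"<is>(.*?)</is>")    # one is-tagged string


def is_tags_in_italics(line):
    """Return the <is> contents that occur inside {%...%} spans on a line."""
    found = []
    for span in ITALIC_RE.findall(line):
        found.extend(IS_RE.findall(span))
    return found


# Invented sample line: one <is> inside italics, one outside.
sample = "{%ein <is>Agastya</is> genannt%} und <is>Indra</is>"
inside = is_tags_in_italics(sample)
```

Summing `len(is_tags_in_italics(line))` over all lines of a file would give the total number of such instances.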
Also many more "bad"-tagging/marking of various types are seen.
This calls for a full overhaul of the data, and I get reminded of the earlier reaction of Thomas, if I say anything more!!
[I had stopped working on pw after the <ls> marking those days, seeing Thomas's reaction on my post.]
I see not much worth taking up correction of just the <is>-tagged iast portion.
But, is Jim ready/willing now to take up a collaborative work to "bring" a good-shape to pw.txt?
Did a quick checking for these "notmw" words in MW, and found quite many to be present in MW!
Just a few small changes/additions in the "search pattern" would eliminate all such ones from the list!!
When tried looking for some random words in the list, noticed that the CDSL pwk scan pages are not so clear, as compared to my copy. [Probably, it could be a reason for the typo errors.]
Probably, these scans could be replaced, for the benefit of any and everyone. [This point was discussed elsewhere earlier wherein I mentioned having good scans of the vol.s 2-7, and now I've all the 7 volumes in my possession.]
@Andhrabharati Am ready to begin working with you in this issue related to <is> tag in PW.
<is>X</is> within italics {%Y%}.
I think we should restrict this study to is-tag, if possible (so that this good issue can be solved).
What do you need from me?
I had progressed much ahead from this IAST part in pw in the past couple of days, @funderburkjim !!
I shall post my work in due course of time, for your perusal.
If you are willing, pl. give me the links to the conversion scripts to Devanagari and iast of the full file, to be run from a command line locally.
I am about to start proofing full headwords, as some errors were noted while working on this pw.txt [BTW, I had already marked grouped entries all throughout the file, just like in GRA and MW.]
My present work is covering all kinds of markups and listing the abbr. and ls entries.
links to the conversion scripts to Devanagari and iast of the full file, to be run from a command line locally.
I could provide a python script that would run as follows
python pw_convert.py slp1,iast pw.txt pw_iast.txt
python pw_convert.py iast,slp1 pw_iast.txt pw_slp1.txt
This would convert the metaline (k1 and k2) and all the {#X#} from slp1 to iast, and back.
And similarly for 'deva' instead of 'iast'
Is this what you request?
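For orientation, the core of such a conversion might look like the sketch below, which handles only the slp1 to iast direction for text inside {#X#}. It is not the actual pw_convert.py: the mapping covers basic SLP1 letters only (accent marks and other special characters pass through unchanged), metalines are not treated, and the function names are made up for the example.

```python
import re

# SLP1 -> IAST for the basic letters; letters that are identical in both
# schemes are simply passed through.
SLP1_TO_IAST = {
    "A": "ā", "I": "ī", "U": "ū", "f": "ṛ", "F": "ṝ", "x": "ḷ", "X": "ḹ",
    "E": "ai", "O": "au", "M": "ṃ", "H": "ḥ",
    "K": "kh", "G": "gh", "N": "ṅ", "C": "ch", "J": "jh", "Y": "ñ",
    "w": "ṭ", "W": "ṭh", "q": "ḍ", "Q": "ḍh", "R": "ṇ",
    "T": "th", "D": "dh", "P": "ph", "B": "bh", "S": "ś", "z": "ṣ",
}

SANSKRIT_RE = re.compile(r"\{#(.*?)#\}")  # one {#...#} span, non-greedy


def slp1_to_iast(text):
    """Transcode SLP1 text to IAST one character at a time."""
    return "".join(SLP1_TO_IAST.get(ch, ch) for ch in text)


def convert_line(line, transcode=slp1_to_iast):
    """Apply the transcoder inside {#...#} only; leave everything else alone."""
    return SANSKRIT_RE.sub(lambda m: "{#" + transcode(m.group(1)) + "#}", line)


converted = convert_line("{#maYjuSrI#} m. N. pr.")
```

Restricting the substitution to the {#X#} spans is what keeps ordinary text such as the 'N.' in the example line untouched.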
Yes, exactly.
Probably, you could leave the metalines as is, as I am not going to touch that portion.
My reading will be limited to the header and body portions alone.
The metalines would have to be generated from the header portion, as done in case of GRA.
@Andhrabharati. Further discussions found in https://github.com/sanskrit-lexicon/PWK/issues/95.
We can leave this #419 issue open until the work in PWK repository completed.
Here are some small pieces from my work, wrt the <is> elements
--
* marked items (diff. in CSL & AB texts):
abbr. type items:
I see that some of these are abbr.s with a dot after the ending is-tag. How to deal with them, something like <ab><is>Z.</is></ab>?
<is n="Adhyātmarāmāyaṇa">Adhyātmar.</is>
<is n="Adhyāya">Adhy.</is>
<is n="Agni">A.</is>
<is n="Āgnīdhra">Ā.</is>
<is n="Agniṣṭoma">A.</is>
<is n="Aṅga">A.</is>
<is n="Apsaras">A.</is>
<is n="Arka">A.</is>
<is n="Āśvalāyana">A.</is>
<is n="Atharvan">Ath.</is>
<is n="Avanti">Av.</is>
<is n="Ayodhyā">A.</is>
<is n="Bālāhaka">B.</is>
<is n="Bhaṇḍīratha">Bh.</is>
<is n="Bhārgava">B.</is>
<is n="Bhūliṅgā">Bh.</is>
<is n="Brahman">B.</is>
<is n="Brahman">Br.</is>
<is n="Brahmaṇācchaṃsin">Br.</is>
<is n="Cakora">C.</is>
<is n="Camasa">C.</is>
<is n="Dhanvantari">Dh.</is>
<is n="Dūrvā">D.</is>
<is n="Dvārakaukas">Dv.</is>
<is n="Dvāravatī">Dv.</is>
<is n="Gandharva">G.</is>
<is n="Gaṇeśa">G.</is>
<is n="Gārgya">G.</is>
<is n="Himālaya">H.</is>
<is n="Indra">I.</is>
<is n="Jagatī">J.</is>
<is n="Jamadagni">J.</is>
<is n="Kālī">K.</is>
<is n="Kānyakubja">Kānyak.</is>
<is n="Kārikā">K.</is>
<is n="Karṇāṭa">K.</is>
<is n="Kāśi">K.</is>
<is n="Kāśmīra">K.</is>
<is n="Kosala">K.</is>
<is n="Kuḍava">K.</is>
<is n="Likhita">L.</is>
<is n="Makara">M.</is>
<is n="Manu">M.</is>
<is n="Marut">M.</is>
<is n="Mathurā">M.</is>
<is n="Nairañjanā">N.</is>
<is n="Nalikā">N.</is>
<is n="Narmadā">N.</is>
<is n="Nīlakaṇṭha">Nīlak.</is>
<is n="Pañcālā">P.</is>
<is n="Paphaka">P.</is>
<is n="Pavamāna Stotra">P. St.</is>
<is n="Puronuvākyā">P.</is>
<is n="Pūru">P.</is>
<is n="Pūṣan">P.</is>
<is n="Rāma">R.</is>
<is n="Rāmāyaṇa">R.</is>
<is n="Revatī">R.</is>
<is n="Śākaṭāyana">Śāk.</is>
<is n="Sāman">S.</is>
<is n="Śaṅkha">Ś.</is>
<is n="Sarasvatī">S.</is>
<is n="Savitar">S.</is>
<is n="Sāyaṇa">Sāy.</is>
<is n="Soma">S.</is>
<is n="Śūdra">Ś.</is>
<is n="Sumantra">S.</is>
<is n="Tārkṣya">T.</is>
<is n="Udgātar">U.</is>
<is n="Udumbara">U.</is>
<is n="Vaṅkara">V.</is>
<is n="Vāsudeva">Vās.</is>
<is n="Vāyu">V.</is>
<is n="Veda">V.</is>
<is n="Vidura">V.</is>
<is n="Viśvarūpa">V.</is>
<is n="Yajus">Y.</is>
<is n="Yayāti">Y.</is>
<is n="Yuvanāśva">Yuv.</is>
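A list like the one above can be read back programmatically, for example to see which abbreviations stand for more than one name. The sketch below assumes the markup has exactly the form `<is n="FULL">ABBR</is>` shown in the list; the function name is made up for the example.

```python
import re
from collections import defaultdict

# Matches <is n="FULL">ABBR</is> as written in the list above.
IS_N_RE = re.compile(r'<is n="([^"]*)">([^<]*)</is>')


def expansions_by_abbreviation(text):
    """Map each abbreviation to the distinct full forms given in n="..."."""
    table = defaultdict(list)
    for full, abbr in IS_N_RE.findall(text):
        if full not in table[abbr]:
            table[abbr].append(full)
    return dict(table)


# Three lines copied from the list above.
sample = '''<is n="Bhārgava">B.</is>
<is n="Brahman">B.</is>
<is n="Brahman">Br.</is>'''
table = expansions_by_abbreviation(sample)
```

In the sample, 'B.' expands to two different names, which shows why the expansion has to be recorded per occurrence rather than in a single global abbreviation table.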
@Andhrabharati I do not find any of your examples of .</is>
Can you provide the line-numbers of your examples?
I should have made previous comment in https://github.com/sanskrit-lexicon/PWK/issues/95
Pl. see my initial post above --
it isn't .</is> but is </is>. in the CDSL file and I had brought the dot inside the is-tagging (and of course, there were some typos as well that I had corrected).
How to deal with them, something like <ab><is>Z.</is></ab>?
@Andhrabharati
In your pw versions discussed at https://github.com/sanskrit-lexicon/PWK/issues/95, you introduce markup such as <is n="TOOLTIP">Z.</is> (also <ab n="TOOLTIP">Z.</ab>, etc.), and the displays provide html so TOOLTIP is available to users.
This seems to answer the above question.
In the PW dictionary, a relatively small number of words appear in IAST spellings; for example
Some of these have spelling errors in the Cologne digitization:
This issue is devoted to correcting such spelling errors.