sanskrit-lexicon / PWK

Sanskrit-Wörterbuch in kürzerer Fassung, 7 Bände Petersburg 1879-1889

crefminusbib - fuzzymatch1 #53

Closed funderburkjim closed 8 years ago

funderburkjim commented 8 years ago

This applies the general Levenshtein edit-distance (fuzzy matching) concept to the problem of matching abbreviations of literary sources in pw.txt to the abbreviations of reference works in the pw bibliography.

Specifically, it takes all the references still unmatched so far (through #45) that appear in crefminusbib.txt. There are 1621 of these.

And it uses the pwbib_abbrv_all.txt list of abbreviations from pwbib0, including those that are additional references (pwbib_new) and previously identified 'unused' abbreviations. There are 522 of these.

For each crefminusbib abbreviation X, an 'edit-distance' D is computed with each known abbreviation Y.
If D is <= 2 (a parameter that could be adjusted in later runs), then Y is added to the list S of suggestions for X.
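The matching step described above can be sketched roughly as follows. This is a minimal illustration, not the script actually used for this issue: the function names `levenshtein` and `suggestions` and the tiny `known` list are assumptions made for the example, and only the threshold D <= 2 comes from the description.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def suggestions(x: str, known: list[str], max_distance: int = 2) -> list[str]:
    """Return every known abbreviation Y whose distance D to x is <= max_distance."""
    return [y for y in known if levenshtein(x, y) <= max_distance]


# Hypothetical, very small stand-in for pwbib_abbrv_all.txt
known = ["KIR", "MIT", "C2ATR", "WEBER"]
print(suggestions("KRRN", known))
print(suggestions("EBEN", known))
```

Every Y within the threshold lands in the list S, so whether a suggestion is really a correction still has to be judged afterwards.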

Finally, for each X a result is posted to one of three output files:

I am tempted to generate corrections based on the 618 likely cases. My hunch is that these are almost surely pw.txt typos and that the likely suggestion is the correct correction for the typo.

But will wait a day or so until I and others have had a chance to look for 'false positives' (like MAT-MIT) in that list of 618.

When these likely cases (as possibly amended by our eyeball examination) are implemented as pw corrections, I'll take a look at further ways to program generation of highly likely corrections.

drdhaval2785 commented 8 years ago

It is a good resource.

Regarding the proposed likely-case corrections, the objectionable entries from the likely.txt file (in old:suggestednew:comment:line format) are

GARUD2A:GAUD2AP:not a reference. Proper name garuqa:<H1>100{gArutmata}1{gArutmata}¦ ²1) •Adj. {%vom Vogel%} ¯GARUD2A {%kommend%} ‹u.s.w.› ²2) (•*m. ¯GAL. ‹und› •n. ‹)› {%Smaragd%} ¯HEMA7DRI.1,305,17. ¯BHA7VAPR.1,97,266.268. ‹In Verbindung mit› #{aSman} ‹als Gegengift› ¯Spr.5910. PW35344
VAIDJAKA:VAIDJABH:A different commentator on kirAtArjunIyam:<H1>100{viSArada}1{viSArada}¦ ²1) •Adj. (•f. #{A}) ³a) {%erfahren , kundig , vertraut%} ‹(von Personen)› ; ‹das Worin oder Womit im› •Loc. ‹oder im Comp. vorangehend.› •Nom.abstr. #{°tva} •n. ¯PAN4K4AD. ³b) {%geschickt , gewandt , dem Zwecke entsprechend%} ‹(von Reden).› ³c) {%schön herbstlich%} ¯VA7SAV.203,1. ³d) {%klaren ~ , heiteren Sinnes%} ¯LALIT.458,18. ³e) {%der Redegabe ermangelnd%} ¯VA7SAV.203,1. ³f) {%dreist , frech%} ¯ebend. ³g) ‹*› = #{SrezWa}. ²2) •m. {%®Mimusops_Elengi%} ¯VAIDJAKA ‹im› •Comm. ‹zu› ¯KIR.5,11. ²3) •*f. #{A} {%eine Art Alhagi%} ¯RA7G4AN.4,57. PW104560

I checked up to entry 120 and found only 2 false positives. So I guess it is extremely safe to do generic changes based on the likely.txt file, @funderburkjim. Even if 5-10 errors creep in, that is better than spending too much time on verifying them. A risk worth taking.

funderburkjim commented 8 years ago

Added VAIDJAKA to pwbib_new, since the printed form (Large initial capital letter + small capital letters) is like that of references.

A small number (4) of non-matches are due to a bug in pwbib_new. For instance, SIDDH.K.ed.TA7R. appeared in pwbib_new with an ending period. This kept it from matching the sortedcref abbreviations, where the period has been removed. Thus these four have been removed from the corrections based on likely.txt:

    102:K4AKR.zu.SUC2R:K4AKR.zu.SUC2R.
    251:R.ed.Bomb:R.ed.Bomb.
    578:NR2S.TA7P.UP:NR2S.TA7P.UP.
    608:SIDDH.K.ed.TA7R:SIDDH.K.ed.TA7R.
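One way to guard against this kind of mismatch is to normalize both lists the same way before comparing. The sketch below is only an illustration under that assumption; the helper name `normalize` and the sample lists are hypothetical and not taken from the actual scripts.

```python
def normalize(abbrev: str) -> str:
    """Strip surrounding whitespace and a single trailing period."""
    abbrev = abbrev.strip()
    return abbrev[:-1] if abbrev.endswith(".") else abbrev


# Hypothetical samples: bibliography entries with the stray ending period,
# and reference abbreviations where the period was already removed.
bib = {normalize(a) for a in ["SIDDH.K.ed.TA7R.", "R.ed.Bomb."]}
cref = ["SIDDH.K.ed.TA7R", "R.ed.Bomb", "KRRN"]

unmatched = [c for c in cref if normalize(c) not in bib]
print(unmatched)
```

With both sides normalized, only the genuinely unknown abbreviation remains unmatched.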

gasyoun commented 8 years ago

Other than these few changes the list is ready to be installed.

GOLEBR:GOP.BR -> COLEBR.Alg.
<ls>GOLEBR.</ls> <noti>Alg. 228.</noti> -> COLEBR.Alg.

KERN:KIR -> leave as is
KRRN:KIR -> leave as is

SIDDH.K:SADDH.P -> SIDDH.K.
TANTRAS:TATTVAS -> leave as is
MANU:KAN2 -> leave as is

VIKRAM:VIRAM -> leave as is
EBEN:WEBER -> ordinary German word, no source here quoted
HALL.N:MALLIN -> HALL in (in = inside, ordinary German word)

Dhaval - KRRN is to be changed to KERN. See capture

funderburkjim commented 8 years ago

Some more false positives:

¯SIDDH.K.@¯SADDH.P.   add SIDDH.K to pwbib_new
¯SATJA.@¯SA7J.   add SATJA to pwbib_new
¯SV.FUR.@¯SV.A7R.  -> ¯SV. ‹für›
¯TRIPR.@¯TRIK. 3 instances.  Related to .GAN2IT.TRIPR.
¯VAIC2JA.@¯VIC2VA  not a reference.
¯VIKRAM.@¯VIRAM  once only, print change to the common ref. VIKRAMA7N5KAK4
¯VISHN2U.@¯VISHN2US.  Not sure. Not changing for now. See image below

image

¯VJA7SA.@¯VA7SAV.  (3 times)  Always as ¯VJA7SA ‹zu› ¯JOGAS..  Adding as new source for now.
¯KALP.@¯KAP.  should be Devanagari #{kalp}
¯K4LAK.@¯K4AKR.  should be KA7LAK4
¯EBEN.@¯WEBER.  eben is German word
¯BHAR.@¯BHAG.  No change. Add to pwbib_new. Maybe same as BHARATA. hw = atyantara

funderburkjim commented 8 years ago

From @gasyoun 's list:

KERN:KIR -> leave as is   Add KERN to pwbib_new
TANTRAS:TATTVAS -> leave as is   Add TANTRAS to pwbib_new
MANU:KAN2 -> leave as is   add MANU to pwbib_new

VIKRAM:VIRAM -> leave as is  see my note above

No additional comment on others, agree.

funderburkjim commented 8 years ago

Our three reviews are now complete. Thanks!

Will begin installation of changes.

funderburkjim commented 8 years ago

A few more false positives and other things noticed during installation:

¯AGHI@¯RAGH@t@   Part of a corruption of  AGNI-P (1 time)
¯BR2H.@¯R2V.  Clearly wrong. Solution will have to be done later.
¯C2AT. => ¯C2ATR.  Can't do this specific change, due to presence of C2AT.BR. Refinement needed. The change needed is to ¯C2AT. ‹Br› ¯3,8,227.3,29. (hw = prasuta).
¯JOGAC2. => ¯JOGAS. is not right.
¯JOGAJ. => ¯JOGAS.
   1. JOGAJ added to pwbib_new. Possibly yogayAjYavalkya
      {ditisutaguru}1{ditisutaguru}¦ •m. {%der Planet Venus%} ¯VARA7H.BR2H.23,6¯JOGAJ.6,7.
   2. Under hw sArvaBOma,
      ¯VJA7SA ‹zu› ¯JOGAJ.1,1.   This one should be JOGAS
¯K4HA7ND. => ¯K4A7N2.  Can't do this everywhere, due to the K4HA7ND.UP abbreviation.  There are a few other cases, such as ¯K4HA7ND. ¯UP.
KA7VJA7D.@¯KA7VJA7L  add KA7VJA7D to pwbib_new, possibly kAvyAdarSa
KAUTUKAS.@¯KAUTUKAR  add KAUTUKAS to pwbib_new, possibly kOtukasarvasva
¯MAHA7BH. => ¯MAHA7B. (202 cases).  This is an error in pwbib0; MAHA7BH is correct. No change to pw.
¯NAISH. => ¯NIGH.  (533 cases). This is also an error in pwbib0; NAISH is correct.
¯NAIGH.@¯NIGH  (21 cases).  Added NAIGH to pwbib_new.  Possibly nEGaRwuka
¯VP^2.@¯VP.^2.   Odd, could not find this in pw.txt.  Maybe an artifact of abbrv.py (clean) ?
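For the cases above where a blanket replacement is blocked by a longer abbreviation (such as K4HA7ND.UP), a guarded replacement is one possible refinement. The sketch below is only an assumption about how that could be done; `replace_abbrev` and its parameters are hypothetical names, and the sample text is made up. Whether a given replacement is correct at all remains an eyeball decision.

```python
import re


def replace_abbrev(text: str, old: str, new: str, longer_suffixes: list[str]) -> str:
    """Replace the reference ¯old. by ¯new. unless a longer abbreviation follows.

    longer_suffixes lists the continuations that make up a different,
    longer abbreviation, e.g. ["UP"] protects K4HA7ND.UP and K4HA7ND. ¯UP.
    """
    blocked = "|".join(re.escape(s) for s in longer_suffixes)
    # Negative lookahead: optional space and macron, then a blocked suffix.
    guard = rf"(?!\s*¯?(?:{blocked})\b)" if blocked else ""
    pattern = re.compile(rf"¯{re.escape(old)}\.{guard}")
    return pattern.sub(lambda m: f"¯{new}.", text)


# Made-up sample line containing all three forms
sample = "¯K4HA7ND.5,2 ¯K4HA7ND.UP.3,1 ¯K4HA7ND. ¯UP.4"
print(replace_abbrev(sample, "K4HA7ND", "K4A7N2", ["UP"]))
```

Only the first reference in the sample is rewritten; the two forms that belong to K4HA7ND.UP are left alone.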

Will have to continue installation tomorrow.

drdhaval2785 commented 8 years ago

@funderburkjim A few more false positives. I am screening the 4-letter ones in Notepad++. They are the most prone to false positives, because with such short strings an edit distance of 2 already covers half the letters.

C2AR:C2ATR:Correction in vEzRava is not proper. It is C2AT.BR. (SatapaTa brAhmaRa). Correction in PUwkAra headword is proper. It is C2ATR.
PASS:VA7S:This is not a resource. It is Pass. in dictionary. change from ¯PASS.->Pass. in pw.txt:<H1>500{kIrtay}1{kIrtay}¦ , #{kIrta/yati} ‹(episch auch •Med.)› ²1) {%Commemorare , gedenken , Erwähnung thun , nennen , aufführen , hersagen , mittheilen , verkünden , erzählen , rühmend erwähnen%} ; ‹mit› •Gen. ‹oder› •Acc.(nur ‹dieser später).› ²2) {%Etwas als Etwas erwähnen , erkläre für , nennen , heissen%} ; ‹mit zwei› •Acc. ¯PASS. {%heissen , gelten für.%} ‹--› •Desid. {%Erwähnung thun wollen%} , ‹mit› •Gen. ¯AIT.A7R.469,19 ‹(› #{na cikIrtayizet} ‹zu lesen).›
PA7T:KA7T:Print has PA7T. I am not sure whether the word refers to pataYjali or kAtyAyana.:<H1>100{SOlkika}1{SOlkika}¦ ²1) •*Adj. ‹von› #{Sulka} ¯PA7T. ‹zu.› ¯P.4,1.104. ¯Va7rtt.13. ²2) •m. {%Zollaufseher , Steuereinnehmer%} ¯Ind.Antiq.7,72.8,302. PW114595
KUHN:KULL:Kuhn is an author. Add in pwbib_new.:<H1>100{lAkza}1{lAkza}¦ •Adj. ¯Ind.St.1,110,7 ‹nach› ¯KUHN. ‹Fehlerhaft für› #{lAkzma} {%an die †Lakshmi gerichtet.%} PW95982
BR2H:R2V:It is always used like ¯BR2H.A7R.UP. There is only one entry where it is an error. `{%Führer%} , ¯BR2H.`->`¯FÜHRER, BR2H`. It is already an entry in pwbib0.txt
GNIP:NIR:It is AGNI-P:<H1>100{kAmadevamaya}1{kAmadevamaya}¦ •Adj. {%den Liebesgott darstellend%} ¯GNIP.37,11. PW26546
BHAR:BHAG:It is a separate commentator ¯BHAR. zu AK.: <H1>100{atyantara}1{*atyantara}¦ •Adj. {%sehr befreundet%} ¯BHAR. zu AK. PW2254

This ends eyeball examination of entries having 4 letters.

drdhaval2785 commented 8 years ago

Eyeball examination of 5 letter entries

KRAKA:K4AKR:It is for K4ARAKA:<H1>500{kzar}1{kzar}¦ , #{kza/rati} ‹(metrisch auch› #{kzarate}) ‹und› #{*kzariti} ²1) {%fliessen , strömen ; von Wassern%} ‹u.s.w.› ²2) {%gleiten.%} ²3) {%zerfliessen , zerrinnen , schwinden , vergehen , zu Nichte werden.%} ²4) {%einer Sache%} (•Abl.) {%verlustig gehen.%} ²5) {%Etwas (•Acc.) strömen , ausströmen , giessen.%} #{mUtram} {%Urin entlassen%} ¯KRAKA.2,4. ²6) ‹ohne Object› {%einen Strom entlassen.%}
NAIGH:NIGH:A new resource. No change required. Total 21 occurrences in pw.txt
K4RKA:K4AKR:It is to be changed to K4ARAKA:<H1>100{cukraka}1{cukraka}¦ ²1) •*n. {%®Rumex_vesicarius.%} ²2) •f. #{cukrikA} ³a) {%®Oxalis_corniculata%} ¯BHA7VAPR.1,283.¯K4ARKA.6,9. ³b) {%*ein best. präparirter saurer Reisschleim%} ¯RA7G4AN.15,89. PW40405
C2IVA:C2IC2:It is to be converted to c2iva. Not a reference:<H1>100{kedAranATa}1{kedAranATa}¦ •m. ‹Bez.› {%des in †Keda7ra verehrten%} ¯C2IVA. PW30636
ME.SH:MEGH:to be changed to MED.SH:<H1>100{sakawAkza}1{sakawAkza}¦ ²1) •*Adj. {%Seitenblicke werfend%} ¯ME.SH.57.¯H.an.4,323 ‹(vgl.› ¯ZACH.Beitr.92). #{°m} •Adv. {%mit einem Seitenblick%} ¯MBH.8,60,42. ²2) •*m. {%®Anogeissus_latifolia%} ¯H.an. PW117171
K4LAK:K4AKR:to be changed to K4A7LAK4:<H1>100{ganDavajrA}1{ganDavajrA}¦ ‹und› #{°vajrI} •f. ‹N.pr. einer Gottheit› ¯K4LAK.3,130.145,4,77,5,16. PW34597
K4ARK:K4AKR: To be changed to K4ARAKA:<H1>100{BOrja}1{BOrja}¦ •Adj. {%von der Birke kommend%} ¯K4ARK.1,3. PW81147
K4ARA:K4AKR: to be changed to K4ARAKA:<H1>100{Bizagvid}1{Bizagvid}¦ •m. {%Arzenei%} ¯K4ARA.5,12,6,12. PW79884
NILAR:NI7LAK:It is part of NI7LAR.UP. but UP has been separated. 
TRIPR:TRIK:This is a subset of ¯GAN2IT.TRIPR. wrongly written as ¯GAN2IT.¯TRIPR.:<H1>100{digjyA}1{digjyA}¦ •f. {%der Azimuth cosinus eines Ortes%} ¯GAN2IT.¯TRIPR.45.fgg. ¯GOLA7DHJ.13,26. PW50084
PAT.P:PR.P:To be changed to PAT.zu.P:<H1>100{SoBanika}1{SoBanika}¦ •m. ‹Bez.› {%einer Art Schauspieler%} ¯PAT.P.3,1,26.¯Va7rtt.15, ‹v.l.› #{SOBika}. PW114450
JALLT:LALIT:To be changed to JOLLY:<H1>100{yAvaddeya}1{yAvaddeya°}¦ •Adv. {%bis zur Abtragung einer Schuld%} ¯JALLT., ‹Schuld. 300.› PW90934
HALL,:HA7LA:No change. HALL is a new resource:<H1>000{raRastamBa}1{raRastamBa}¦ ‹desgl. Nach› ¯HALL,¯VP.22,158. ‹fehlerhaft.› PW92223

This ends eyeball corrections for 5 letter words.

gasyoun commented 8 years ago

Re "¯VISHN2U.@¯VISHN2US. Not sure. Not changing for now. See image below": this seems to be a print error to me, @drdhaval2785?

funderburkjim commented 8 years ago

From @drdhaval2785 's 4-letter list.

Agree with all. Here's how I'm handling the PAT situation.

Re 'PA7T:KA7T:' under SOlkika. PWBIB has PAT. ZU P. == PATAN4G4ALI ZU PA7N2INI. and the example is ¯PA7T. ‹zu.› ¯P.4,1.104. So this must be a print error for ¯PAT.zuP.4,1.104. The form ¯PAT. ‹zu› ¯P. is quite common in pw.txt (approx. 136 instances), and should be changed to ¯PAT.zuP to conform to pwbib0.

Many of these are of the further form like ¯PAT. ‹zu› ¯P.1,1.4, ¯Va7rtt.1.6.

Va7rtt. == Va7rttika. in pwbib.

A small number (4) are like ¯PAT. ‹zu› ¯Va7rtt.1 ‹zu› ¯P.1,1,29. which I am changing (as print change) to ¯PAT.zuP.1,1,29. ‹zu› ¯Va7rtt.1 and similarly for the others.

Finally, there are 3 'naked' ¯PAT. references. I've added a PAT = PataYjali item to pwbib_new for these.
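The common ¯PAT. ‹zu› ¯P. rewrite described above could be done with a single regular expression. This is a sketch under assumptions, not the change as actually installed: the name `merge_pat_zu_p` is hypothetical, and the pattern deliberately leaves the PA7T print error and the ‹zu› ¯Va7rtt. ‹zu› ¯P. forms untouched, since those were handled by hand.

```python
import re

# ¯PAT. followed by ‹zu› and then the reference ¯P. (Panini)
PAT_ZU_P = re.compile(r"¯PAT\.\s*‹zu›\s*¯P\.")


def merge_pat_zu_p(line: str) -> str:
    """Collapse ¯PAT. ‹zu› ¯P. into the pwbib0 form ¯PAT.zuP."""
    return PAT_ZU_P.sub("¯PAT.zuP.", line)


print(merge_pat_zu_p("¯PAT. ‹zu› ¯P.1,1.4, ¯Va7rtt.1.6."))
print(merge_pat_zu_p("¯PAT. ‹zu› ¯Va7rtt.1 ‹zu› ¯P.1,1,29."))
```

The second sample is returned unchanged, because ‹zu› is followed there by ¯Va7rtt. rather than ¯P.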

funderburkjim commented 8 years ago

From @drdhaval2785 's 5-letter list.

 ME.SH   
   ¯ME.SH.57.¯H.an.4,323   -> ¯MED.sh.57.H.an.4,323  
    add MED.sh to pwbib_new.  (no idea what it might be)
    Made 'H' as part of this reference.  It could be a separate reference (not in pwbib). Not sure.

K4LAK:K4AKR:to be changed to K4A7LAK4  . Minor difference, changed to KA7LAK4  (in pwbib0)

No additional comments. Agree with rest.

funderburkjim commented 8 years ago

Re ¯VISHN2U.@¯VISHN2US.

For now, I'll put VISHN2U into pwbib_new, though it may be a print error and should be changed to a non-reference 'Vishn2u'.

funderburkjim commented 8 years ago

Re ¯VP: we have some variants. At some earlier time, we discussed the use of 'VP.^2' in pwbib and in pw.txt, and agreed that we should make use of the magic of Unicode to represent both pwbib and pw.txt more accurately as ¯VP.² . I've changed this in pwbib, and did an analysis of the cases in pw.txt, as follows:

1316 ¯VP    Total number of matches with this 3-character string. It partitions as follows:
 913 ¯VP.#  (# some digit)
   7 ¯VP.<space>
 268 ¯VP.^2   CHANGING to ¯VP.²
  93 ¯VP.²  Already the standard form, no change needed
  33 ¯VP².  CHANGING to ¯VP.²
   2  misc. errors  
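The two CHANGING rows of the table above amount to plain string replacements, and the rows can be checked to add up to the total. A minimal sketch, assuming nothing beyond the forms listed in the table; the function name `normalize_vp` is hypothetical.

```python
def normalize_vp(text: str) -> str:
    """Rewrite ¯VP.^2 and ¯VP². to the standard form ¯VP.²"""
    text = text.replace("¯VP.^2", "¯VP.²")
    text = text.replace("¯VP².", "¯VP.²")
    return text


# Counts copied from the table above; they should sum to the 1316 matches.
partition = {"¯VP.#": 913, "¯VP.<space>": 7, "¯VP.^2": 268,
             "¯VP.²": 93, "¯VP².": 33, "misc": 2}
print(sum(partition.values()))  # 1316

print(normalize_vp("¯VP.^2 4,12 ¯VP².3 ¯VP.² ¯VP.5"))
```

Plain digit references such as ¯VP.5 and the already standard ¯VP.² pass through unchanged.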

There are also several (50+) `¯Vp2` which, judging from a small sample, I think are safe to change to `¯VP².`.

funderburkjim commented 8 years ago

Corrections now installed, and matching code rerun.

Here's the updated scorecard.

bibminuscref.txt still has 0 items. There are 995 cases in crefminusbib.txt (previously there were 1621, so 626 fewer).
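Going by the file names, the two scorecard files read like plain set differences between the abbreviations referenced in pw.txt and those in the bibliography. The sketch below rests on that assumption only; the function name `scorecard` and the sample sets are hypothetical.

```python
def scorecard(cref: set[str], bib: set[str]) -> tuple[list[str], list[str]]:
    """Return (cref minus bib, bib minus cref), each sorted for stable output."""
    cref_minus_bib = sorted(cref - bib)   # referenced in pw.txt, unknown to the bibliography
    bib_minus_cref = sorted(bib - cref)   # in the bibliography, never referenced
    return cref_minus_bib, bib_minus_cref


# Hypothetical sample sets
unmatched, unused = scorecard({"KERN", "KIR", "MBH"}, {"KIR", "MBH"})
print(len(unmatched), len(unused))
```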