sanskrit-lexicon / PWK

Sanskrit-Wörterbuch in kürzerer Fassung, 7 Bände Petersburg 1879-1889

bib minus cref, part 1 #18

Closed drdhaval2785 closed 8 years ago

drdhaval2785 commented 8 years ago

https://github.com/sanskrit-lexicon/PWK/blob/master/pw_ls/pwbib/diffstudy/bibminuscref.xml

This is the file which is being studied here.

These are the entries which were there in pwbib1.txt but not found in cref.

drdhaval2785 commented 8 years ago

1 ANUKRAM.zuR2V - Not found in pw.xml, neither are ANUKRAM and zuR2V separately.

2 KA7TJ(A7JANA) - This is a full form for KA7TJ used at some places like KA7TJ.C2R.

3 KA7TJ.SANA7NAS - Not found in pw.xml

4 SAM5NJ.Up -> SAM5NJ.UP, total 8 occurrences in pw.xml

drdhaval2785 commented 8 years ago

5 VET.(U.) - Correct one. Our pw.xml had mismatched brackets earlier. It should have been fixed now.

6 WEBER,GJOT -> WEBER,G4JOT

7 Spr - Not found. It seems to be an extension for Beitr.

8 DEC2IN->DEC2I7N, 63 matches

drdhaval2785 commented 8 years ago

9 Mat.med There are 7 entries with Mat.Med (capital) and 400 entries with Mat.med (Small). So I guess we should convert all Mat.Med to Mat.med.

EJF: Agree. From a scan check of keruka and taruRa, the 'Mat.Med.' is an OCR error.
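A minimal sketch of how such case variants might be counted, assuming the references sit inside `<ls>` tags as in the pw.xml lines quoted in this issue. The function name `count_case_variants` and the sample text are made up for illustration; this is not the project's actual script.

```python
import re
from collections import Counter

def count_case_variants(text, abbrev):
    """Count each spelling of `abbrev` found at the start of an <ls> element, ignoring case."""
    pattern = re.compile(r"<ls>(" + re.escape(abbrev) + r")", re.IGNORECASE)
    return Counter(m.group(1) for m in pattern.finditer(text))

# Invented sample, not real pw.xml content.
sample = "<ls>Mat.med.</ls> <ls>Mat.Med.</ls> <ls>Mat.med.</ls>"
variants = count_case_variants(sample, "Mat.med")
for spelling, count in variants.most_common():
    print(spelling, count)
```

A strongly lopsided count, like the 400 versus 7 reported above, is a hint that the rare spelling is the error, though the scan remains the final authority.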

drdhaval2785 commented 8 years ago

10 PISCHEL,deGr.pr Usually there are no small letters in 'ls'. Therefore our regex in crefs missed this entry. But there actually is a work referred to by this entry, rightly found by bib1.txt. Let's keep it as it is.
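To illustrate why an entry like this slips through, here is a sketch with two hypothetical patterns: one that accepts only upper-case abbreviation characters and one that also allows lower case. Neither is the regex actually used to build the crefs; the character classes are assumptions based on the abbreviations quoted in this issue.

```python
import re

# Hypothetical strict pattern: capitals, digits, commas, dots and apostrophes only.
STRICT = re.compile(r"<ls>([A-Z0-9.,']+)</ls>")
# Hypothetical relaxed pattern: also allows lower-case letters.
RELAXED = re.compile(r"<ls>([A-Za-z0-9.,']+)</ls>")

# Invented sample built from abbreviations mentioned in this issue.
sample = "<ls>PISCHEL,deGr.pr.</ls> <ls>VISHNUK4ANDRA.</ls>"
strict_hits = STRICT.findall(sample)    # misses the PISCHEL entry
relaxed_hits = RELAXED.findall(sample)  # finds both entries
```

Comparing the two result lists shows which references are lost purely because of lower-case letters.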

drdhaval2785 commented 8 years ago

11 DIVJA7V<AD>->DIVJA7VAD Not a tag. DIVJA7VAD has 112 matches.

drdhaval2785 commented 8 years ago

12 HAEB.Anth->HAEB.ANTH Currently there is only one entry with capitals.

drdhaval2785 commented 8 years ago

13 H.an->H.AN

14 LEUMANNA,Aup.Gl - Not found in pw.xml, nor are its components. @funderburkjim may reverify

15 KA7VJA7->KA7VJA7D; All entries have D at the end. pwbib1.txt wrongly identified the work. It seems to be the kAvyAdarSa of daRqin.

drdhaval2785 commented 8 years ago

16 HILLEBR. The cref regex missed it because this work always has an 'N' suffixed to it, as in HILLEBR.N.

drdhaval2785 commented 8 years ago

17 SADDH.P.4->SADDH. P.4 is page reference. There are other page references too.

drdhaval2785 commented 8 years ago

18 K4ANDRA7LOKA -> Wrong identification of the work by pwbib1.txt. There are only three occurrences of K4ANDRA in the whole of pw.xml.

    Line 29549: <H1><h><key1>kusuta</key1><key2>kusuta</key2></h><body><gram n="m">m.</gram> <i>der Planet Mars</i> <ls>VISHNUK4ANDRA.</ls> <noti>im</noti> <gram n="Comm">Comm.</gram> <noti>zu</noti> <ls>VARA7H.BR2H.2,20.</ls> PW29547</body><tail><L>29545</L><pc>2086-1</pc></tail></H1>
    Line 84050: <H1><h><key1>mah</key1><key2>ma/h</key2><hom>2</hom></h><body><divm type="e" n="1">1)</divm> <gram n="Adj">Adj.</gram> (<gram n="f">f.</gram> <noti>ebenso und</noti> <s>mahI/</s>) <divm type="n" n="a">a)</divm> <i>gross , gewaltig , mächtig , reichlich.</i> <divm type="n" n="b">b)</divm> <i>alt , bejahrt.</i> <divm type="e" n="2">2)</divm> <gram n="f">f.</gram> <s>mahI/</s> <divm type="n" n="a">a)</divm> <i>die Erde.</i> <noti>Als Bez.</noti> <i>der Zahl Eins</i> <ls>G4AN2ITA.K4ANDRAGR.3.</ls> <divm type="n" n="b">b)</divm> <i>Erdboden.</i> <gram n="Pl">Pl.</gram> <ls>SPR.1509.</ls> <divm type="n" n="c">c)</divm> <i>Boden , Grund , Land.</i> <divm type="n" n="d">d)</divm> <i>Reich.</i> <divm type="n" n="e">e)</divm> <i>Erde</i> <noti>als</noti> <i>Stoff.</i> <divm type="n" n="f">f)</divm> <i>Basis eines Dreiecks <noti>oder</noti> einer anderen Figur.</i> <divm type="n" n="g">g)</divm> <gram n="Du">Du.</gram> <i>Himmel und Erde.</i> <divm type="n" n="h">h)</divm> <i>Raum.</i> <divm type="n" ...
    Line 125909: <H1><h><key1>suKArTa</key1><key2>suKArTa</key2></h><body><gram n="m">m.</gram> <i>eine Sache des Wohlbehagens , ~ der Lust.</i> <gram n="Acc">Acc.</gram> (<ls>GAN2IT.26,7.</ls><ls>K4ANDRAGRAH.24,35</ls>) <noti>und</noti> <gram n="Dat">Dat.</gram> <i>der Annehmlichkeit ~ , der Bequemlichkeit wegen , zur Erleichterung.</i> PW125903</body><tail><L>125905</L><pc>7141-3</pc></tail></H1>

In all three occurrences, 'K4ANDRA' stands for 'candra'. No work like candrAloka is referred to here. The first is vizRucandra; the second and third are gaRita candragr(ahaRam??).

drdhaval2785 commented 8 years ago

19 SAM5KSHPAC2 - not found in pw.xml

20 DONNER,PIN2D2->DONNER,Pin2d2; This is the form in pw.xml. Otherwise it may be made capital.

drdhaval2785 commented 8 years ago

21 MAHA7B It is purported to be 'mahABAzya' according to pwbib0.txt (c.f. MAHA7BH for mahABArata). But it is not found in the pw.xml.

@funderburkjim may like to comment on whether this was lost in some programmatic conversion step.

EJF: The original digitization from Thomas is pwbib_orig.txt. pwbib0.txt was created by a program from this, so it is certainly possible that the program did some damage. The program steps to get pwbib0 are described in this readme document. I am treating pwbib0 as the current primary document, the one that will have corrections applied to it.

EJF: I suspect this is in the list of 'extra' bibliographical references (i.e., it appears in the pw bibliographies but is not referred to within the body of the PW dictionary). We probably should develop a list of these, and make use of this list in the crefmatch program, so we won't repeatedly worry why they don't match anything. I wonder why the author B. includes them in the bibliographies.

drdhaval2785 commented 8 years ago

22 C2RIMA7LA7M Not able to locate in pw.xml

drdhaval2785 commented 8 years ago

23 A7RUN2.Up->A7RUN2.UP

24 KAUSH.Up->KAUSH.UP

25 VIKR<OR>->VIKR. pw.xml also needs to be corrected from VIKROR to VIKR (total 52 entries)

drdhaval2785 commented 8 years ago

26 Bydragen - not found in pw.xml

27 PRATIG4N4A7S(U7TRA) refers to PRATIG4N4A7S, which is already there in crefs.

drdhaval2785 commented 8 years ago

28 HARISV - Name of an author einer Tochter Harisva7min's

Line 128158: <H1><h><key1>suSIla</key1><key2>suSIla</key2><hom>2</hom></h><body><divm type="e" n="1">1)</divm> <gram n="Adj">Adj.</gram> <i>von guter Gemüthsart.</i> <ls>SPR.7140</ls> <noti>mit einer unbekannten Nebenbedeutung</noti> ; <noti>vgl.</noti> <s>suSIlavant</s> <gram n="Nom">Nom.</gram>abstr. <s>°tA</s> <gram n="f">f.</gram> <ls>KA7D.2,55,4(65,15</ls>). <divm type="e" n="2">2)</divm> <gram n="m">m.</gram> <noti>N.pr. verschiedener Personen.</noti> <divm type="e" n="3">3)</divm> <gram n="f">f.</gram> <s>A</s> <noti>N.pr.</noti> <divm type="n" n="a">a)</divm> <noti>einer Gattin Kr2shn2a's.</noti> <divm type="n" n="b">b)</divm> <noti>eines Wesens im Gefolge Ra7dha7.</noti> <divm type="n" n="c">c)</divm> <noti>der Gattin Jama's.</noti> <divm type="n" n="d">d)</divm> <noti>einer Tochter Harisva7min's.</noti> PW128152</body><tail><L>128154</L><pc>7170-1</pc></tail></H1>
drdhaval2785 commented 8 years ago

29 gan2a It is not a literary resource I guess. It is referring to gaRapAWa of pARini. Not able to locate it. It is always gaRaratnamahodaDi which comes up.

<H1>100{anuyuktin}1{*anuyuktin}¦ •Adj. •gan2a #{izwAdi}. PW4731

drdhaval2785 commented 8 years ago

30 KA7R->KA7RIKA7 There is only one occurrence. And pw.xml has KA7RIKA7.

drdhaval2785 commented 8 years ago

31 DAC2AK.(1925) It was caught because of wrongly closed brackets. It should be gone now. No intervention needed.

drdhaval2785 commented 8 years ago

32 MA7N2D2Up->MA7N2D2.UP.

33 SVAPNAK4(INTA7MAN2I) - not able to locate in pwbib0.txt

drdhaval2785 commented 8 years ago

34 PRAKRIJA7K(AUMUDI),Hdschr.(AUFRECHT).RA7JENDR.Not->PRAKRIJA7K The rest seems to be explanation in some catalogue of Rajendra Mishra.

drdhaval2785 commented 8 years ago

35 VASISHT2HA,-> not able to locate it in pw.xml. (See the comma). It is used in pwbib0 to separate two editions.

drdhaval2785 commented 8 years ago

36 OppCat->OPP.CAT.

37 K4HA7NDOGJAP-> not able to locate in pw.xml. There is one K4ha7ndogjopanishad in the text.

38 KUHN'SZ->KUHN'S.Z.

gasyoun commented 8 years ago
funderburkjim commented 8 years ago

Re 17 SADDH.P.4->SADDH. I'm not sure what the story is here. Maybe @gasyoun or @zaaf2 or @thomasincambodia can help. There are two entries in the bibliography:

funderburkjim commented 8 years ago

I've added my two cents' worth as 'subcomments' (identified by EJF) to several of Dhaval's initial comments. I've done this through his case 24 saddh. Will continue with the rest another time.

I'm using

funderburkjim commented 8 years ago

@gasyoun Could you check this correction to pwbib0:

old: ; VIKR. dra7v. == KA7LIDA7SA'S VIKRAMORVAC2IYAM nach dra7vidischen Handschriften, herausgegeben von RICHARD PISCHERD in. , Monatsbericht der Königlich Preussischen Akademie der Wissenschaften zu Berlin"1875, S. 609. fgg. (vol. 5)

new

.VIKR. dra7v. == KA7LIDA7SA'S VIKRAMORVAC2IYAM nach dra7vidischen Handschriften, herausgegeben von RICHARD PISCHEL in "Monatsbericht der Königlich Preussischen Akademie der Wissenschaften zu Berlin", 1875, S. 609. fgg. (vol. 5)

gasyoun commented 8 years ago

4te Kapitel = 4th chapter, which means that Dhaval's assumption was wrong. 17 SADDH.P.4->SADDH is not valid.

@funderburkjim RICHARD PISCHERD -> RICHARD PISCHEL

gasyoun commented 8 years ago

Re 1925: every 4-digit number starting with 18.. or 19.. should be left for further examination.
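One way to flag such year-like numbers for later review, sketched under the assumption that they appear as standalone four-digit tokens somewhere in a reference string. The helper name `year_like_numbers` is hypothetical and not part of abbrv.py.

```python
import re

# Four digits starting with 18 or 19, not embedded in a longer run of digits.
YEAR_LIKE = re.compile(r"(?<!\d)(1[89]\d{2})(?!\d)")

def year_like_numbers(reference):
    """Return all year-like numbers found in a reference string."""
    return YEAR_LIKE.findall(reference)

flagged = year_like_numbers("DAC2AK.(1925)")   # year-like, keep for review
ignored = year_like_numbers("SPR.7140")        # ordinary number, not flagged
long_run = year_like_numbers("PW19250")        # part of a longer number, not flagged
```

References that come back with a non-empty list can be written to a separate file instead of being stripped automatically.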

funderburkjim commented 8 years ago

Re 25 VIKR<OR>->VIKR.. Under headword kaYcukIya, I changed the text as follows:

old
¯VIKROR.ED. ¯PISCHEL.661,4.14.664,15.
new
¯VIKR.dra7v.661,4.14.664,15.

The reason is for consistency with the bibliography (pwbib) which shows that VIKR.dra7v is the PISCHEL edition in bibliography:

.VIKR. dra7v. == KA7LIDA7SA'S VIKRAMORVAC2IYAM nach dra7vidischen Handschriften, herausgegeben von RICHARD PISCHERD in. , Monatsbericht der Königlich Preussischen Akademie der Wissenschaften zu Berlin"1875, S. 609. fgg. (vol. 5)

funderburkjim commented 8 years ago

3015 corrections were generated for PW, and have been installed. These are consistent with the 'EJF' comments as shown above.

funderburkjim commented 8 years ago

Minor change to pwbib0.txt, 'Up.' -> 'UP.' These can be viewed as typos, since the printed text always has a lower case capital 'P'.

The ones marked as 'extra' were not mentioned in the issue comments above.

; A7RUN2. Up. -> A7RUN2.  UP. 
; DHJA7NAB. Up.  -> DHJA7NAB. UP.  (extra)
; KAUSH. Up. -> KAUSH.  UP.
; .K4HA7ND. Up.  -> .K4HA7ND. UP.  (extra)
; .NI7LAR. Up. -> .NI7LAR. UP. (extra)
; NR2S Up. -> NR2S UP.  (extra)
; 4 SAM5NJ.Up -> SAM5NJ.UP  in pwbib0
; .TAITT. Up. -> .TAITT. UP. (extra)
; GA7R. Up. -> GA7R. UP. (extra)
; HANUM. Up. -> HANUM. UP. (extra)
; JOGAC2.Up.-> JOGAC2.UP. (extra)
; 32 MA7N2D2 Up. -> MA7N2D2 UP. 
; NA7DAR. Up. -> NA7DAR. UP. (extra)
; TEG4OB. Up. -> TEG4OB. UP. (extra)
; MUN2D2. Up.-> MUN2D2. UP. (extra)
; RA7MAPU7RVAT. Up. -> RA7MAPU7RVAT. UP. (extra)
; KAN2T2HAC2R. Up. -> KAN2T2HAC2R. UP. (extra)
; zu BR2H. A7R Up. -> zu BR2H. A7R UP. (extra) (in text of abbreviation A7NANDAG)
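The 'Up.' -> 'UP.' change is mechanical enough to sketch as a substitution. This assumes 'Up' always appears as a separate token (not inside a longer word), which holds for the lines listed above but has not been checked against the whole of pwbib0.txt. `normalize_up` is a hypothetical helper, not the code that was actually run.

```python
import re

# 'Up' not preceded or followed by a letter, e.g. 'KAUSH. Up.' or 'SAM5NJ.Up'
UP_TOKEN = re.compile(r"(?<![A-Za-z])Up(?![A-Za-z])")

def normalize_up(line):
    """Replace the standalone token 'Up' with 'UP', leaving other words alone."""
    return UP_TOKEN.sub("UP", line)

fixed = [normalize_up(s) for s in ("KAUSH. Up.", "SAM5NJ.Up", "NR2S Up.")]
untouched = normalize_up("Upanishad")  # 'Up' is part of a word here
```

Printing the before/after pairs for review before writing the file back keeps the change easy to audit.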
funderburkjim commented 8 years ago

Additional changes/corrections to pwbib0, per issue cases above.

; 6 WEBER,GJOT. -> WEBER,G4JOT.
; 8 DEC2IN->DEC2I7N
; 11 .DIVJA7V<AD>.  -> .DIVJA7VAD. 
; 14 .LEUMANNA, Aup. Gl. -> .LEUMANN, Aup. Gl.  
; 15  KA7VJA7 (OKALOK4ANA), Hdschr. (AUFRECHT) -> KA7VJA7L ...  
; 16 .HILLEBR. . ->  .HILLEBR. N. 
; 19 .SAM5KSHPAC2 (AM5KARAG4AJA) von MA7DHAVA (AUFRECHT). -> SAM5KSHEPAC2
; 25 VIKR<OR>. -> VIKR.
; 30 KA7R->KA7RIKA7  
; 36 OppCat->OPP.CAT.  
   This is actually written 'OPP.Cat.' in both bibliography and print, but as OPP.CAT in pw.xml.  As a 
    short cut, I propose to change crefmatch to artificially capitalize this to force a match.
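A rough sketch of the proposed short cut, in which selected abbreviations are upper-cased before comparison so that 'OPP.Cat.' in the bibliography matches 'OPP.CAT.' in pw.xml. This is not the crefmatch code itself; `match_key`, `FORCE_UPPER` and the sample lists are hypothetical. Folding case for every abbreviation could hide real mismatches, so the sketch restricts it to an explicit set of exceptions.

```python
# Abbreviations for which an upper-case match is deliberately forced (assumed list).
FORCE_UPPER = {"OPP.Cat."}

def match_key(abbrev):
    """Return the key used when comparing a bibliography abbreviation with pw.xml."""
    return abbrev.upper() if abbrev in FORCE_UPPER else abbrev

# Invented sample data.
bib_abbrevs = ["OPP.Cat.", "Mat.med."]
xml_refs = {"OPP.CAT.", "Mat.med."}
matched = [a for a in bib_abbrevs if match_key(a) in xml_refs]
```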
funderburkjim commented 8 years ago

Above changes to pwbib0 installed (committed) in PWK

funderburkjim commented 8 years ago

A crefmatch rerun now shows that 76% of pwbib0 abbreviations are accounted for, and 83% of sortedcref instances are accounted for. So, we're making some progress!

I've NOT yet dealt with these issues identified as pwbib1 problems:

; 34 .PRAKRIJA7K (AUMUDI), Hdschr. (AUFRECHT). RA7JENDR. Not. ==  pwbib1 problem
; 35 VASISHT2HA,    pwbib1 problem. remove comma.
; 38 KUHN'SZ->KUHN'S.Z.  pwbib1 
; 17 SADDH.P.4->SADDH.P.  pwbib1 change
; Noticed that 'G4' should be converted to 'J' in pwbib1.txt
; 27 PRATIG4N4A7S(U7TRA) refers to PRATIG4N4A7S,  pwbib1 problem

or with these two, identified as needing adjustments to abbrv.py:

; 5 VET.(U.)  see error in abbrv.py
; 31 DAC2AK.(1925)  abbrv.py problem

funderburkjim commented 8 years ago

Here are the items currently identified as abbreviations appearing in the bibliography (pwbib) but having no examples in pw.xml:

21 MAHA7B
22 C2RIMA7LA7M
26 Bydragen 
28 HARISV
29 gan2a
33 SVAPNAK4(INTA7MAN2I) 
14 LEUMANNA,Aup.Gl 

We could call this pwbib_unused.txt, and make use of this list in doing crefmatch.
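A sketch of how such a list could be used, assuming pwbib_unused.txt would hold one abbreviation per line. The function name and the in-memory sample data are hypothetical ('XYZ' is a placeholder, not a real abbreviation), and the real crefmatch program may organise this differently.

```python
def unmatched_abbrevs(bib_abbrevs, cref_abbrevs, unused):
    """Bibliography abbreviations with no cref match, skipping known-unused ones."""
    crefs = set(cref_abbrevs)
    skip = set(unused)
    return [a for a in bib_abbrevs if a not in crefs and a not in skip]

# In practice `unused` would be read from pwbib_unused.txt, one entry per line.
unused = ["MAHA7B", "C2RIMA7LA7M", "Bydragen"]
bib = ["MAHA7B", "VIKR", "Bydragen", "KA7RIKA7", "XYZ"]
crefs = ["VIKR", "KA7RIKA7"]
remaining = unmatched_abbrevs(bib, crefs, unused)  # only 'XYZ' still needs attention
```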

gasyoun commented 8 years ago

Not lower case capital 'P', but small caps "P". Otherwise accepted.

funderburkjim commented 8 years ago

A bit more progress:

['Mat.med','H.an','DAC2AK.(1925)','VET.(U.)',
   'VIKR.dra7v','PISCHEL,deGr.pr','Bibl.ind','KAP.(BALL.)']

gasyoun commented 8 years ago

Hmm, I'm lost. Have you found many real-life abbreviations that are additional to the lists given in the Preface? Do I understand it right? I have lost myself in the terminology and file names, forgive me my misery.

drdhaval2785 commented 8 years ago

I agree. The issue is longer than what is manageable, so the 'part 1 and close' policy seems fine to me. @gasyoun Right now we are using a comparison between pwbib (from Thomas) and sortedcrefs (from Dhaval) to weed out errors in both. That is why correcting each error gives some progress. Earlier 78% or so were matching. Now 85% are matching with these corrections. I guess after 90% matching, we need to manually weed out obviously undeserving entries from sortedcrefs.txt.

Whatever remains in sortedcrefs.txt after these cleanups may be actual list of 'Additions' to bibliography which the author may have overlooked. Even if we don't get any such addition, cleaning is really what matters the most as of now.
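The matching percentages quoted in this thread can be pictured as follows. This is a minimal sketch that assumes both sides are already reduced to plain abbreviation strings and counts distinct abbreviations only; the sample lists are invented, and the real crefmatch rules (for example, counting instances rather than distinct forms) may differ.

```python
def match_stats(bib_abbrevs, sorted_crefs):
    """Percentage of each side that is matched, plus the leftovers on the cref side."""
    bib, crefs = set(bib_abbrevs), set(sorted_crefs)
    if not bib or not crefs:
        raise ValueError("both lists must be non-empty")
    common = bib & crefs
    return {
        "bib_pct": 100.0 * len(common) / len(bib),
        "cref_pct": 100.0 * len(common) / len(crefs),
        "only_in_crefs": sorted(crefs - bib),  # candidates for manual weeding
    }

stats = match_stats(
    ["VIKR", "SADDH", "MAHA7B", "H.AN"],
    ["VIKR", "SADDH", "H.AN", "SPR", "KA7D"],
)
```

The `only_in_crefs` list corresponds to what remains in sortedcrefs.txt after cleanup, that is, either errors still to be fixed or genuine additions to the bibliography.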