sanskrit-lexicon / PWK

Sanskrit-Wörterbuch in kürzerer Fassung, 7 Bände Petersburg 1879-1889

Abbreviations bracket not proper #12

Open drdhaval2785 opened 9 years ago

drdhaval2785 commented 9 years ago

<ls>\(([^\)]+)</ls>\) This regex catches 304 improperly closed brackets in pw.xml, of the form <ls>(blah blah.....</ls>). Note that the bracket opens inside <ls> and closes beyond </ls>. It has to be either (<ls>....</ls>) or <ls>(.....)</ls>. @funderburkjim would you have a look at them?
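A minimal sketch of how the regex above could be applied in Python. The helper name find_misclosed and the sample string are illustrative assumptions; only the pattern itself comes from this comment.

```python
import re

# Pattern from the comment above: a parenthesis that opens right after <ls>
# but is closed only after </ls>.
MISCLOSED = re.compile(r'<ls>\(([^\)]+)</ls>\)')

def find_misclosed(text):
    """Return the text captured inside each misclosed <ls>(...</ls>) occurrence."""
    return MISCLOSED.findall(text)

# Shortened sample modelled on the aputrin entry (hypothetical test input).
sample = '<body><ls>(108,6</ls>), <s>aputriya</s> <ls>MANTRABR.1,4,2.</ls></body>'
hits = find_misclosed(sample)
print(hits)
```

Running the same function over the full text of pw.xml and taking len() of the result would give a count comparable to the 304 reported here, assuming the file is read as one string.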

drdhaval2785 commented 9 years ago

For example, see

<H1><h><key1>aputrin</key1><key2>aputrin</key2></h><body><ls>(108,6</ls>), <s>aputriya</s> <noti>und</noti> <s>aputrya</s> <gram n="Adj">Adj.</gram> <i>sohnlos , kinderlos</i> <ls>MANTRABR.1,4,2.</ls> <ls>C2A7N5KH.GR2HJ.1,18.</ls> PW6536</body><tail><L>6536</L><pc>1077-1</pc></tail></H1>

Specifically <ls>(108,6</ls>) is where I want to draw your attention.

drdhaval2785 commented 9 years ago

At some locations the problem is more extreme, e.g. <ls>(K.</ls>).1,143,1, where all the numbers etc. have ended up outside the tag. It seems to me that this is solvable computationally. The error appears to be in the tagging code that generated this pw.xml. Jim may be able to clarify.

Note - How did I find them ? Sorting the literary resources for https://github.com/sanskrit-lexicon/CORRECTIONS/issues/143 gave me the following.

(4A7rajav
(4Karaka
(4Ra7g4at
(K.
(Mat.Med
(PISCH.

Analysing these entries revealed the flaw. I then devised the regex to catch all of them.

gasyoun commented 9 years ago

This is a tertiary issue, but indeed - pure RegEx magic would help.

funderburkjim commented 9 years ago

note 1. There is an error in the make_xml.py program in constructing the <ls> element.

x = re.sub(u'¯([^¯ \)\]<>]*)',r'<ls>\1</ls>',x)

The error is that the right-paren is among the characters that terminate the scope of the <ls> element, so a closing paren always ends up outside the tag. The following replacement looks about right to me:

 x = re.sub(u'¯([^ <¯]*)',r'<ls>\1</ls>',x)  

I see no harm in using this (and have so modified the construction of pw.xml). This corrects examples such as this under headword akarmaRya:

old:  <ls>R.2,64,33(34</ls>).
new:  <ls>R.2,64,33(34).</ls>
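The difference between the two substitutions can be reproduced with a short sketch. The two patterns are the ones quoted above; the wrapper tag_ls and the raw input string are assumptions, with the input reconstructed from the old/new output shown for akarmaRya.

```python
import re

# Patterns quoted from make_xml.py in this comment.
OLD_PATTERN = r'¯([^¯ \)\]<>]*)'   # stops at ')' and ']', so they fall outside <ls>
NEW_PATTERN = r'¯([^ <¯]*)'        # stops only at space, '<' or the next '¯'

def tag_ls(line, pattern):
    """Wrap each ¯-prefixed literary source in an <ls> element (hypothetical helper)."""
    return re.sub(pattern, r'<ls>\1</ls>', line)

# Assumed raw form of the citation before tagging.
raw = '¯R.2,64,33(34).'
old_out = tag_ls(raw, OLD_PATTERN)
new_out = tag_ls(raw, NEW_PATTERN)
print(old_out)
print(new_out)
```

With this input the old pattern yields the misclosed form and the new pattern yields the corrected one, matching the old/new pair above.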

@drdhaval2785 I wasn't sure how to rerun makeabbrv.sh in PWK/pw_ls/pw_dhaval/abbrvwork/ (where do you put the pw.xml file for rerunning?). New versions of the pw environment (orig, pywork, web1) and of the standard pw downloads have been made, so maybe you should rerun it. For comparison purposes (in case my revised pw.xml has some undesirable properties that I missed), I suggest you keep a copy of the old outputs temporarily. Then, if the new version looks better, post the new files to this Github repository.

note 2: This does not solve all the problems by any means. For example (from pw.txt):

Example 1:
<H1>107{AmraganDaka}1{*AmraganDaka}¦ #{ganDakft} (¯GAL.) #{°ganDaDfk} 
(¯RA7G4AN.)4,21) 
‹und› #{*°ganDaDft} (¯NIGH.PR.) •m. {%eine best. Pflanze.%} PW15248

Example 2:
<H1>000{AsTat}1{A/sTat}¦ ‹3te •Sg. und› #{AsTatAm} ‹)› ¯Bhat2t2.15,91)3te •Du. ‹•Aor. von› 
#{as , asyati}. PW16451

In both of these cases, there are digitization errors.

By one programmatic estimate, there are about 500 of these. Their correction can probably be done without reference to the scans, but still, 500 is a moderately large number of records to change by hand.
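How that estimate was computed is not stated here. One plausible approach, sketched below purely as an assumption, is to flag every record whose round parentheses do not balance. The function name and the test strings (shortened from Example 1) are illustrative only.

```python
def parens_balanced(record):
    """Return True if every ')' closes an earlier '(' and none is left open."""
    depth = 0
    for ch in record:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth < 0:      # a ')' with no matching '('
                return False
    return depth == 0

# Fragment of Example 1: the stray ')' after 4,21 makes it unbalanced.
bad = '#{ganDakft} (¯GAL.) #{°ganDaDfk} (¯RA7G4AN.)4,21)'
good = '#{*°ganDaDft} (¯NIGH.PR.)'
flagged = [r for r in (bad, good) if not parens_balanced(r)]
print(len(flagged))
```

A check like this only locates suspect records; deciding where the missing or extra paren belongs still needs a human or a more specific rule.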

gasyoun commented 9 years ago

500 is a question of weeks, a big task if not done 8 hours every day.

funderburkjim commented 9 years ago

@drdhaval2785 I found where to put pw.xml and have rerun makeabbrv.sh (minor change to abbrv.py re the VP^2)

Generated a set of pw update transactions for bracket cases, and am working through the changes (about 250) by hand. Will take a while.

gasyoun commented 9 years ago

No to hand crafted lists :sleeping:

funderburkjim commented 9 years ago

All (I hope) of the misclosed brackets are now accounted for, due to a batch of corrections installed.

In the process, I made a couple of minor changes to abbrv.py.

pw_ls/pw_dhaval/abbrvwork/makeabbrv.sh was rerun, with the modified pw.xml.

sortedcrefs.txt now has 2725 lines, compared to the former 3339 lines. And none of the lines has any of '()[]' as part of the literary source abbreviation.
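A check of this kind could look like the sketch below. The layout of sortedcrefs.txt is not described in this thread, so the function simply takes a list of abbreviation strings; the name with_brackets and the sample values are assumptions.

```python
BRACKET_CHARS = set('()[]')

def with_brackets(abbreviations):
    """Return the abbreviations that still contain any of ( ) [ ]."""
    return [a for a in abbreviations if BRACKET_CHARS & set(a)]

# Hypothetical sample: the first entry mimics the earlier faulty form '(K.'.
sample_abbrevs = ['(K.', 'MANTRABR.', 'R.']
leftover = with_brackets(sample_abbrevs)
print(leftover)
```

An empty result over the real abbreviation column would correspond to the statement above.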

It would be possible to reduce this number by another 100+ via further adjustment to the 'clean' computation of abbrv.py.

The major next step I think should be an attempt to match the 'known' abbreviations (from pwbib) to those in sortedcrefs.txt, and this will probably be the next thing I'll attend to, after a bit of catchup to some pending correction submission issues.

gasyoun commented 9 years ago

"none of the lines has any of '()[]' as part of the literary source abbreviation" - good news. "attempt to match the 'known' abbreviations (from pwbib) to those in sortedcrefs.txt" - can I take it? And could you take the extraction of prefix+root from PW & PWG, please?