sanskrit-lexicon / PWK

Sanskrit-Wörterbuch in kürzerer Fassung, 7 Bände Petersburg 1879-1889

Capitalization issue in literary resources #35

Closed drdhaval2785 closed 8 years ago

drdhaval2785 commented 8 years ago
¯Katha7s@AhUtavya@AhUtavya@16585:¯KATHA7S:t:Capitalization issue

The correction submitted at https://github.com/sanskrit-lexicon/PWK/issues/34#issuecomment-165970613 raises the possibility that some abbreviations in the literary resources contain lower-case letters.

If they can be corrected in one go, we can save a lot of time that would otherwise go into individual submissions like this one. Maybe a regex like ¯[A-Z0-9]*[a-z]+[^ ]* would work (it matches 7811 entries). The code is at https://github.com/sanskrit-lexicon/PWK/blob/master/pw_ls/pw_dhaval/abbrvwork/capitalize/capital.py and the output is at https://github.com/sanskrit-lexicon/PWK/blob/master/pw_ls/pw_dhaval/abbrvwork/capitalize/cap0.txt

There is one danger, though: some reference abbreviations legitimately contain lower-case letters, e.g.

¯Ind.St.
¯Spr.
¯Lot. de la b. l.

etc. But they are few and far between. If the regex can be modified to exclude them, or a small script written to discard them, we would get a ready-made list for capitalization.
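A minimal sketch of that idea, using the regex quoted above plus a small allow-list. The names `CANDIDATE`, `KEEP_AS_IS` and `capitalization_candidates` are hypothetical, and the allow-list holds only the three examples mentioned above; it is not what capital.py actually does.

```python
import re

# Pattern quoted in the issue: the macron marker, optional capitals/digits,
# at least one lower-case letter, then everything up to the next space.
CANDIDATE = re.compile(r"¯[A-Z0-9]*[a-z]+[^ ]*")

# Hypothetical allow-list of abbreviations that are legitimately mixed-case.
KEEP_AS_IS = ("¯Ind.St.", "¯Spr.", "¯Lot.")


def capitalization_candidates(line):
    """Return mixed-case references in `line` that are not on the allow-list."""
    hits = CANDIDATE.findall(line)
    return [h for h in hits if not h.startswith(KEEP_AS_IS)]
```

Because the pattern stops at the first space, a multi-word abbreviation such as ¯Lot. de la b. l. is only seen as ¯Lot., which is why the allow-list is checked by prefix.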

gasyoun commented 8 years ago

Spr.[uche], right, no need to capitalise. Far too many for capital work, and I should admit it's rather formal work as well.

funderburkjim commented 8 years ago

Wrote a 'capmatch' program. This generated 'change' records from sortedcrefs.txt for those cases where

  1. The cref abbreviation contained some lower-case letter(s)
  2. After capitalization, this abbreviation matched some pwbib abbreviation.

173 cases fit these criteria.

Then the change, from uncapitalized to capitalized, was made wherever it occurred in a literary source segment of pw.txt.

1372 changes to pw.txt.
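The two criteria and the replacement step can be sketched roughly as follows. This is an illustration only, not the actual 'capmatch' program: the function names are hypothetical, and the sketch replaces matches anywhere in the given text rather than only inside literary source segments.

```python
def build_changes(cref_abbrevs, bib_abbrevs):
    """Map each mixed-case cref abbreviation to its capitalized pwbib form."""
    bib = set(bib_abbrevs)
    changes = {}
    for abbr in cref_abbrevs:
        upper = abbr.upper()
        # Criterion 1: the abbreviation contains lower-case letters.
        # Criterion 2: its capitalized form is a known pwbib abbreviation.
        if abbr != upper and upper in bib:
            changes[abbr] = upper
    return changes


def apply_changes(text, changes):
    """Replace every old abbreviation; return the new text and the change count."""
    total = 0
    # Longest first, so a short abbreviation never rewrites part of a longer one.
    for old in sorted(changes, key=len, reverse=True):
        total += text.count(old)
        text = text.replace(old, changes[old])
    return text, total
```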

funderburkjim commented 8 years ago

Here are the 173 change patterns:

4 changes for  ¯C2at.Br. => ¯C2AT.BR.
1 changes for  ¯AIT.Br. => ¯AIT.BR.
8 changes for  ¯Katha7s. => ¯KATHA7S.
11 changes for  ¯C2A7N5KH.Br. => ¯C2A7N5KH.BR.
1 changes for  ¯La. => ¯LA.
1 changes for  ¯Divja7vad. => ¯DIVJA7VAD.
1 changes for  ¯C2kdr. => ¯C2KDR.
2 changes for  ¯Gobh. => ¯GOBH.
1 changes for  ¯Bha7m. => ¯BHA7M.
3 changes for  ¯Ja7g4n4. => ¯JA7G4N4.
5 changes for  ¯La7t2j. => ¯LA7T2J.
1 changes for  ¯AV.G4jot. => ¯AV.G4JOT.
3 changes for  ¯C2A7N5KH.Gr2hj. => ¯C2A7N5KH.GR2HJ.
2 changes for  ¯Nja7jas. => ¯NJA7JAS.
1 changes for  ¯maha7virak4. => ¯MAHA7VIRAK4.
2 changes for  ¯Kalpas. => ¯KALPAS.
1 changes for  ¯Gop.BR. => ¯GOP.BR.
1 changes for  ¯SV.A7r. => ¯SV.A7R.
2 changes for  ¯Gop.Br. => ¯GOP.BR.
1 changes for  ¯C2i7la7n5ka. => ¯C2I7LA7N5KA.
2 changes for  ¯Gr2hja7s. => ¯GR2HJA7S.
7 changes for  ¯Ka7t2h. => ¯KA7T2H.
1 changes for  ¯Dac2ar. => ¯DAC2AR.
9 changes for  ¯Dac2ak. => ¯DAC2AK.
1 changes for  ¯Ka7lak4. => ¯KA7LAK4.
1 changes for  ¯HEM.Jog. => ¯HEM.JOG.
1 changes for  ¯BR2H.A7r.Up. => ¯BR2H.A7R.UP.
1 changes for  ¯HEM.JoG. => ¯HEM.JOG.
1 changes for  ¯BR2H.A7r.UP. => ¯BR2H.A7R.UP.
1 changes for  ¯R2v.Pra7t. => ¯R2V.PRA7T.
4 changes for  ¯Bha7g.P. => ¯BHA7G.P.
4 changes for  ¯A7c2v.C2R. => ¯A7C2V.C2R.
1 changes for  ¯C2AT.Br. => ¯C2AT.BR.
2 changes for  ¯Megh. => ¯MEGH.
76 changes for  ¯Vp. => ¯VP.
2 changes for  ¯G4a7takam. => ¯G4A7TAKAM.
2 changes for  ¯Vara7h.Jogaj. => ¯VARA7H.JOGAJ.
9 changes for  ¯Gal. => ¯GAL.
6 changes for  ¯Pan4k4ad. => ¯PAN4K4AD.
5 changes for  ¯K4HA7ND.Up. => ¯K4HA7ND.UP.
4 changes for  ¯Vop. => ¯VOP.
3 changes for  ¯Pan4k4at. => ¯PAN4K4AT.
6 changes for  ¯G4aim. => ¯G4AIM.
2 changes for  ¯A7c2v.Gr2hj. => ¯A7C2V.GR2HJ.
1 changes for  ¯Ra7G4AT. => ¯RA7G4AT.
1 changes for  ¯Ra7G4AN. => ¯RA7G4AN.
3 changes for  ¯Va7stuv. => ¯VA7STUV.
1 changes for  ¯Pr.P. => ¯PR.P.
3 changes for  ¯Sa7j. => ¯SA7J.
4 changes for  ¯Gola7dhj. => ¯GOLA7DHJ.
86 changes for  ¯HEM.Par. => ¯HEM.PAR.
1 changes for  ¯C2a7c2vata. => ¯C2A7C2VATA.
14 changes for  ¯Lalit. => ¯LALIT.
11 changes for  ¯Suc2r. => ¯SUC2R.
1 changes for  ¯Trik. => ¯TRIK.
4 changes for  ¯Kap. => ¯KAP.
2 changes for  ¯Va7sav. => ¯VA7SAV.
3 changes for  ¯A7rjabh. => ¯A7RJABH.
1 changes for  ¯A7rjav. => ¯A7RJAV.
41 changes for  ¯K4araka. => ¯K4ARAKA.
1 changes for  ¯Ka7vjapr. => ¯KA7VJAPR.
2 changes for  ¯Ma7lav. => ¯MA7LAV.
4 changes for  ¯AV.Paric2. => ¯AV.PARIC2.
2 changes for  ¯Govinda7n. => ¯GOVINDA7N.
1 changes for  ¯B.a.J. => ¯B.A.J.
6 changes for  ¯Ka7ran2d2. => ¯KA7RAN2D2.
3 changes for  ¯Sarvad. => ¯SARVAD.
1 changes for  ¯Mantrabr. => ¯MANTRABR.
1 changes for  ¯K4ha7nd.UP. => ¯K4HA7ND.UP.
4 changes for  ¯Vikrama7n5kak4. => ¯VIKRAMA7N5KAK4.
31 changes for  ¯Bha7vapr. => ¯BHA7VAPR.
1 changes for  ¯Ha7sj. => ¯HA7SJ.
2 changes for  ¯Dhu7rtan. => ¯DHU7RTAN.
2 changes for  ¯BüHL.GUZ. => ¯BÜHL.GUZ.
1 changes for  ¯Taitt.UP. => ¯TAITT.UP.
15 changes for  ¯Ts. => ¯TS.
2 changes for  ¯TS.Pra7t. => ¯TS.PRA7T.
1 changes for  ¯PA7R.Gr2hj. => ¯PA7R.GR2HJ.
145 changes for  ¯R2v. => ¯R2V.
1 changes for  ¯Saddh.P. => ¯SADDH.P.
1 changes for  ¯Med. => ¯MED.
5 changes for  ¯Madanav. => ¯MADANAV.
3 changes for  ¯GAN2IT.Bhagan2. => ¯GAN2IT.BHAGAN2.
4 changes for  ¯AV.Paipp. => ¯AV.PAIPP.
5 changes for  ¯Prasannar. => ¯PRASANNAR.
2 changes for  ¯C2a7n5kh.C2R. => ¯C2A7N5KH.C2R.
54 changes for  ¯Ba7dar. => ¯BA7DAR.
1 changes for  ¯Gan2it.ADHIM. => ¯GAN2IT.ADHIM.
5 changes for  ¯C2A7RN5G.Sam5h. => ¯C2A7RN5G.SAM5H.
2 changes for  ¯Mahi7dh. => ¯MAHI7DH.
1 changes for  ¯Kaush.Up. => ¯KAUSH.UP.
15 changes for  ¯Ak. => ¯AK.
1 changes for  ¯Su7rjas. => ¯SU7RJAS.
5 changes for  ¯C2a7k. => ¯C2A7K.
2 changes for  ¯Prata7par. => ¯PRATA7PAR.
66 changes for  ¯Av. => ¯AV.
3 changes for  ¯Kull. => ¯KULL.
1 changes for  ¯Müller,SL. => ¯MÜLLER,SL.
34 changes for  ¯Ba7lar. => ¯BA7LAR.
1 changes for  ¯Ka7c2i7kh. => ¯KA7C2I7KH.
2 changes for  ¯C2a7rn5g.Sam5h. => ¯C2A7RN5G.SAM5H.
32 changes for  ¯Ka7d. => ¯KA7D.
1 changes for  ¯Ka7t. => ¯KA7T.
7 changes for  ¯ba7dar. => ¯BA7DAR.
1 changes for  ¯C2iva-P. => ¯C2IVA-P.
7 changes for  ¯Ni7lak. => ¯NI7LAK.
9 changes for  ¯Ka7c2. => ¯KA7C2.
22 changes for  ¯A7past. => ¯A7PAST.
8 changes for  ¯Vaita7n. => ¯VAITA7N.
1 changes for  ¯HARSHAk4. => ¯HARSHAK4.
2 changes for  ¯Mudra7r. => ¯MUDRA7R.
1 changes for  ¯KAN2T2HAC2R.Up. => ¯KAN2T2HAC2R.UP.
1 changes for  ¯Nir. => ¯NIR.
3 changes for  ¯C2am5k. => ¯C2AM5K.
2 changes for  ¯Ratnam. => ¯RATNAM.
1 changes for  ¯Harshak4. => ¯HARSHAK4.
1 changes for  ¯Un2a7dis. => ¯UN2A7DIS.
2 changes for  ¯Suparn2. => ¯SUPARN2.
1 changes for  ¯K4an2d2ak. => ¯K4AN2D2AK.
3 changes for  ¯AIT.Up. => ¯AIT.UP.
44 changes for  ¯Mbh. => ¯MBH.
1 changes for  ¯Vag4rak4k4h. => ¯VAG4RAK4K4H.
2 changes for  ¯Kir. => ¯KIR.
8 changes for  ¯Nja7jam. => ¯NJA7JAM.
2 changes for  ¯Ait.A7r. => ¯AIT.A7R.
1 changes for  ¯Prab. => ¯PRAB.
1 changes for  ¯SUc2r. => ¯SUC2R.
2 changes for  ¯Ragh. => ¯RAGH.
2 changes for  ¯A7c2v.C2r. => ¯A7C2V.C2R.
7 changes for  ¯C2ic2. => ¯C2IC2.
1 changes for  ¯PA7R.GR2hj. => ¯PA7R.GR2HJ.
40 changes for  ¯Hema7dri. => ¯HEMA7DRI.
1 changes for  ¯Prij. => ¯PRIJ.
1 changes for  ¯Bhag. => ¯BHAG.
5 changes for  ¯Bhat2t2. => ¯BHAT2T2.
2 changes for  ¯C2Ic2. => ¯C2IC2.
1 changes for  ¯Jogas. => ¯JOGAS.
2 changes for  ¯Suc2R. => ¯SUC2R.
3 changes for  ¯Agni-P. => ¯AGNI-P.
2 changes for  ¯C2kDr. => ¯C2KDR.
1 changes for  ¯RA7g4an. => ¯RA7G4AN.
8 changes for  ¯Maitr.S. => ¯MAITR.S.
2 changes for  ¯Ka7m.Ni7tis. => ¯KA7M.NI7TIS.
2 changes for  ¯Viddh. => ¯VIDDH.
1 changes for  ¯BHA7vapr. => ¯BHA7VAPR.
13 changes for  ¯Hariv. => ¯HARIV.
1 changes for  ¯Kauc2. => ¯KAUC2.
1 changes for  ¯Sa7mav.Br. => ¯SA7MAV.BR.
4 changes for  ¯Nj.K. => ¯NJ.K.
1 changes for  ¯Sa7mav.BR. => ¯SA7MAV.BR.
6 changes for  ¯Ta7n2d2ja-br. => ¯TA7N2D2JA-BR.
6 changes for  ¯Ma7n.C2R. => ¯MA7N.C2R.
5 changes for  ¯Gaut. => ¯GAUT.
1 changes for  ¯Bhog4a-K4ar. => ¯BHOG4A-K4AR.
1 changes for  ¯Kuma7rasv. => ¯KUMA7RASV.
1 changes for  ¯Lokapr. => ¯LOKAPR.
8 changes for  ¯Vs. => ¯VS.
1 changes for  ¯Vjutp. => ¯VJUTP.
1 changes for  ¯MBh. => ¯MBH.
1 changes for  ¯Sam5nj.Up. => ¯SAM5NJ.UP.
3 changes for  ¯Ven2i7s. => ¯VEN2I7S.
1 changes for  ¯Tattvas. => ¯TATTVAS.
5 changes for  ¯Ra7g4at. => ¯RA7G4AT.
221 changes for  ¯Ra7g4an. => ¯RA7G4AN.
1 changes for  ¯GoBH. => ¯GOBH.
1 changes for  ¯Shad2v.Br. => ¯SHAD2V.BR.
2 changes for  ¯Sam5hitopan. => ¯SAM5HITOPAN.
4 changes for  ¯Mr2k4k4h. => ¯MR2K4K4H.
5 changes for  ¯Hem.Par. => ¯HEM.PAR.
3 changes for  ¯C2ulbas. => ¯C2ULBAS.
1 changes for  ¯AMR2T.Up. => ¯AMR2T.UP.
3 changes for  ¯A7past.C2R. => ¯A7PAST.C2R.
9 changes for  ¯A7PAST.C2r. => ¯A7PAST.C2R.

funderburkjim commented 8 years ago

PWK programs rerun.

13 remain in bibminuscref, two fewer than at #26

Some progress in abbrvlist matching.

Previously (#26) 64092 out of 73111 cases (87.6%)

Now, 65479 out of 73117 cases (89.6%).

Needless to say, the print was not checked individually in these cases. However, from previous work, I think it highly probable that all of these were indeed typos (OCR-type errors) in pw.txt.

drdhaval2785 commented 8 years ago

It is a good presumption to treat them as typos, because their capitalized counterparts are already present in pw.txt.

So one of the two forms must be wrong, or at least inconsistent.

It seems that these literary resource corrections are turning out to be the largest set of corrections inserted into the Cologne dictionaries in the recent past.

Good work @funderburkjim.

funderburkjim commented 8 years ago

The next step will be to try something similar, but this time focusing first on cases where there are errors in the AS-numbers of pw citations. Then, it might be appropriate to try a 'fuzzy match' approach to see if that can identify some more likely corrections to pw.

drdhaval2785 commented 8 years ago

The files in https://github.com/sanskrit-lexicon/PWK/tree/master/pw_ls/pwbib/diffstudy/correctionsubmission have been updated based on corrections. Also http://sanskrit-lexicon.github.io/PWK/cmbsub.html (Decreased from 2245 to 2044 lines) and http://sanskrit-lexicon.github.io/PWK/cbisub.html have been regenerated.

The script to run is https://github.com/sanskrit-lexicon/PWK/blob/master/pw_ls/pw_dhaval/abbrvwork/stdabbrv.sh

@funderburkjim Can you incorporate this shell file into your process for regenerating sortedcrefs.txt and the related files? That way you can handle the whole process, up to deriving the standard format for correction submission, and I don't have to regenerate it. The actual submission file, crefminusbibsubmission.txt, is a manual copy-paste of cmbsub.txt, so machine regeneration of cmbsub.txt would not alter our precious crefminusbibsubmission.txt. Once you do this, I will close this issue.

drdhaval2785 commented 8 years ago

@funderburkjim It seems that 5 records have been missed.

After correction installations, I reran the script capredo.sh.

The output is

¯Bi7g4ag.@¯BI7G4AG.@t@capitalization
¯Ka7c2.@¯KA7C2.@t@capitalization
¯C2am5k.@¯C2AM5K.@t@capitalization
¯Maha7vi7rak4.@¯MAHA7VI7RAK4.@t@capitalization
¯R2v.@¯R2V.@t@capitalization

In total, 5 entries still need correction. Please complete this correction.

funderburkjim commented 8 years ago

Glad you recomputed. Here are the changes generated. Not yet installed.

¯Bi7g4ag.@¯BI7G4AG.@t@capitalization
; ¯Ka7c2.@¯KA7C2.@t@capitalization PREVIOUSLY CORRECTED
;¯C2am5k.@¯C2AM5K.@t@capitalization PREVIOUSLY CORRECTED
¯Maha7vi7rak4.@¯MAHA7VI7RAK4.@t@capitalization
;¯R2v.@¯R2V.@t@capitalization PREVIOUSLY CORRECTED
¯CAUS.R2v.@‹Caus.› ¯R2V.@t@ 'Caus.' = Causative, not a reference

funderburkjim commented 8 years ago

Corrections now installed.
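For reference, the change records exchanged above appear to follow an old@new@type@note layout, with a leading ';' marking a line that needs no further action. A hedged sketch of a reader for that layout follows; the field names and the `Change` and `parse_change` names are assumptions made for illustration, not part of the PWK scripts.

```python
from typing import NamedTuple, Optional


class Change(NamedTuple):
    old: str   # text currently in pw.txt
    new: str   # proposed replacement
    kind: str  # e.g. "t" in the records above
    note: str  # free-text comment


def parse_change(line: str) -> Optional[Change]:
    """Parse one record; return None for blank lines and ';' comment lines."""
    line = line.strip()
    if not line or line.startswith(";"):
        return None
    # Split on the first three '@' only, so the note may itself contain '@'.
    old, new, kind, note = line.split("@", 3)
    return Change(old, new, kind, note.strip())
```

A line with fewer than four fields raises `ValueError` here, which seems preferable to silently skipping a malformed record.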

funderburkjim commented 8 years ago

Rechecked Dhaval's list of 5 against the current pw.txt. Found two that I missed:

¯C2am5k
¯R2v

These were missed since they do not end in a period in pw.txt. Will correct as part of next batch.
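One way to catch such cases is to treat the trailing period as optional when searching. The sketch below is only an illustration (`capitalize_loose` is a hypothetical name); it assumes that a period followed by a letter means the abbreviation continues, as in ¯R2v.Pra7t., while a period followed by a digit begins the citation numbers.

```python
import re


def capitalize_loose(text, abbrev):
    """Upper-case `abbrev` in `text`, with or without its trailing period."""
    stem = abbrev.rstrip(".")
    pattern = re.compile(
        re.escape(stem)
        + r"(?:\.(?![A-Za-z])"   # a period that does not lead into another part
        + r"|(?![A-Za-z0-9.]))"  # or no period, and the abbreviation ends here
    )
    return pattern.sub(lambda m: m.group(0).upper(), text)
```

With this pattern, a search for ¯R2v. rewrites both "¯R2v " and "¯R2v.10,2" but leaves "¯R2v.Pra7t." for its own change pattern.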

drdhaval2785 commented 8 years ago

@funderburkjim For corrections made in the last 5 days, there has been no update to manualByLine4.txt for PW syncing, and corrections closed in the last 2 days don't have corresponding commits on GitHub.

Please finish these two tasks ASAP.

funderburkjim commented 8 years ago

@drdhaval2785

Not sure why you are seeing a problem.

I AM regenerating 'pwsync.zip' as part of the PW update process. The last date of that file, at Cologne, is 1/7/2016. I just downloaded pwsync.zip and looked at manualByLine04 from there. Its last transaction is:

; pw, Issue 236, Case 327, user=dhavel_ejf,
; 01/08/2016, L=135775, key1=hvala
; ¯PA7R.¯Gr2hj -> ¯PA7R.GR2HJ :: typo 
281043 old <H1>100{hvala}1{hvala}¦ ²1) •Adj. {%strauchelnd , taumelnd%} ¯PA7R.¯Gr2hj.3,7,3. ²2) •f. #{A/} {%das Irren , Verfehlen , Verunglücken.%} PW135773
; new
281043 new <H1>100{hvala}1{hvala}¦ ²1) •Adj. {%strauchelnd , taumelnd%} ¯PA7R.GR2HJ.3,7,3. ²2) •f. #{A/} {%das Irren , Verfehlen , Verunglücken.%} PW135773

which looks right.
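A check like this can also be done mechanically. The sketch below assumes the layout shown in the transaction above: a comment line of the form "; old -> new :: reason", followed by the old and new record texts. `check_transaction` is a hypothetical helper, and it expects the record text without the leading line number and old/new marker.

```python
def check_transaction(rule, old_line, new_line):
    """True if applying `rule` ("; old -> new :: reason") to old_line gives new_line."""
    # Drop the leading "; " and the trailing ":: reason" part.
    change = rule.lstrip("; ").split("::", 1)[0]
    old, new = (part.strip() for part in change.split("->", 1))
    return old in old_line and old_line.replace(old, new) == new_line
```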

What is your process for updating using pwsync.zip? There may be some detail in this process that is not quite right.

gasyoun commented 8 years ago

@drdhaval2785 have you got the 1/7/2016 file?