Closed drdhaval2785 closed 8 years ago
Spr.[uche], right, no need to capitalise. Far too many for capital work - and I should admit it's rather formal work as well.
Wrote a 'capmatch' program. This generated 'change' records from sortedcrefs.txt for those cases where
173 cases fit these criteria.
Then the change, from uncapitalized to capitalized, was made wherever it occurred in a literary source segment of pw.txt.
1372 changes to pw.txt.
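A minimal sketch of the idea behind such a 'capmatch' pass, not the actual program. It assumes that a reference qualifies when it is not fully capitalized and its fully capitalized form also occurs among the known references; the real selection criteria may differ. The function names `cap_changes` and `apply_changes` and the sample data are hypothetical.

```python
from collections import Counter

def cap_changes(crefs):
    """Map each not-fully-capitalized ref to its capitalized form,
    but only when that capitalized form also occurs in crefs (assumed criterion)."""
    known = set(crefs)
    changes = {}
    for ref in known:
        upper = ref.upper()
        if ref != upper and upper in known:
            changes[ref] = upper
    return changes

def apply_changes(text, changes):
    """Replace every old form in text; return the new text and per-pattern counts."""
    counts = Counter()
    # Longest patterns first, so a compound reference is handled before its prefix.
    for old in sorted(changes, key=len, reverse=True):
        n = text.count(old)
        if n:
            text = text.replace(old, changes[old])
            counts[old] = n
    return text, counts

# Hypothetical sample data standing in for sortedcrefs.txt and pw.txt.
sample_refs = ["¯R2V.", "¯R2v.", "¯HEM.PAR.", "¯HEM.Par.", "¯Spr."]
changes = cap_changes(sample_refs)
new_text, counts = apply_changes("a ¯R2v.1,2,3. b ¯HEM.Par.2,4. c ¯R2v.5,6.", changes)
for old, n in counts.items():
    print(f"{n} changes for {old} => {changes[old]}")
```

Plain string replacement can touch a short pattern inside a longer reference that is not itself in the change list, so a real run would want a boundary check as well.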
Here are the 173 change patterns:
4 changes for ¯C2at.Br. => ¯C2AT.BR.
1 changes for ¯AIT.Br. => ¯AIT.BR.
8 changes for ¯Katha7s. => ¯KATHA7S.
11 changes for ¯C2A7N5KH.Br. => ¯C2A7N5KH.BR.
1 changes for ¯La. => ¯LA.
1 changes for ¯Divja7vad. => ¯DIVJA7VAD.
1 changes for ¯C2kdr. => ¯C2KDR.
2 changes for ¯Gobh. => ¯GOBH.
1 changes for ¯Bha7m. => ¯BHA7M.
3 changes for ¯Ja7g4n4. => ¯JA7G4N4.
5 changes for ¯La7t2j. => ¯LA7T2J.
1 changes for ¯AV.G4jot. => ¯AV.G4JOT.
3 changes for ¯C2A7N5KH.Gr2hj. => ¯C2A7N5KH.GR2HJ.
2 changes for ¯Nja7jas. => ¯NJA7JAS.
1 changes for ¯maha7virak4. => ¯MAHA7VIRAK4.
2 changes for ¯Kalpas. => ¯KALPAS.
1 changes for ¯Gop.BR. => ¯GOP.BR.
1 changes for ¯SV.A7r. => ¯SV.A7R.
2 changes for ¯Gop.Br. => ¯GOP.BR.
1 changes for ¯C2i7la7n5ka. => ¯C2I7LA7N5KA.
2 changes for ¯Gr2hja7s. => ¯GR2HJA7S.
7 changes for ¯Ka7t2h. => ¯KA7T2H.
1 changes for ¯Dac2ar. => ¯DAC2AR.
9 changes for ¯Dac2ak. => ¯DAC2AK.
1 changes for ¯Ka7lak4. => ¯KA7LAK4.
1 changes for ¯HEM.Jog. => ¯HEM.JOG.
1 changes for ¯BR2H.A7r.Up. => ¯BR2H.A7R.UP.
1 changes for ¯HEM.JoG. => ¯HEM.JOG.
1 changes for ¯BR2H.A7r.UP. => ¯BR2H.A7R.UP.
1 changes for ¯R2v.Pra7t. => ¯R2V.PRA7T.
4 changes for ¯Bha7g.P. => ¯BHA7G.P.
4 changes for ¯A7c2v.C2R. => ¯A7C2V.C2R.
1 changes for ¯C2AT.Br. => ¯C2AT.BR.
2 changes for ¯Megh. => ¯MEGH.
76 changes for ¯Vp. => ¯VP.
2 changes for ¯G4a7takam. => ¯G4A7TAKAM.
2 changes for ¯Vara7h.Jogaj. => ¯VARA7H.JOGAJ.
9 changes for ¯Gal. => ¯GAL.
6 changes for ¯Pan4k4ad. => ¯PAN4K4AD.
5 changes for ¯K4HA7ND.Up. => ¯K4HA7ND.UP.
4 changes for ¯Vop. => ¯VOP.
3 changes for ¯Pan4k4at. => ¯PAN4K4AT.
6 changes for ¯G4aim. => ¯G4AIM.
2 changes for ¯A7c2v.Gr2hj. => ¯A7C2V.GR2HJ.
1 changes for ¯Ra7G4AT. => ¯RA7G4AT.
1 changes for ¯Ra7G4AN. => ¯RA7G4AN.
3 changes for ¯Va7stuv. => ¯VA7STUV.
1 changes for ¯Pr.P. => ¯PR.P.
3 changes for ¯Sa7j. => ¯SA7J.
4 changes for ¯Gola7dhj. => ¯GOLA7DHJ.
86 changes for ¯HEM.Par. => ¯HEM.PAR.
1 changes for ¯C2a7c2vata. => ¯C2A7C2VATA.
14 changes for ¯Lalit. => ¯LALIT.
11 changes for ¯Suc2r. => ¯SUC2R.
1 changes for ¯Trik. => ¯TRIK.
4 changes for ¯Kap. => ¯KAP.
2 changes for ¯Va7sav. => ¯VA7SAV.
3 changes for ¯A7rjabh. => ¯A7RJABH.
1 changes for ¯A7rjav. => ¯A7RJAV.
41 changes for ¯K4araka. => ¯K4ARAKA.
1 changes for ¯Ka7vjapr. => ¯KA7VJAPR.
2 changes for ¯Ma7lav. => ¯MA7LAV.
4 changes for ¯AV.Paric2. => ¯AV.PARIC2.
2 changes for ¯Govinda7n. => ¯GOVINDA7N.
1 changes for ¯B.a.J. => ¯B.A.J.
6 changes for ¯Ka7ran2d2. => ¯KA7RAN2D2.
3 changes for ¯Sarvad. => ¯SARVAD.
1 changes for ¯Mantrabr. => ¯MANTRABR.
1 changes for ¯K4ha7nd.UP. => ¯K4HA7ND.UP.
4 changes for ¯Vikrama7n5kak4. => ¯VIKRAMA7N5KAK4.
31 changes for ¯Bha7vapr. => ¯BHA7VAPR.
1 changes for ¯Ha7sj. => ¯HA7SJ.
2 changes for ¯Dhu7rtan. => ¯DHU7RTAN.
2 changes for ¯BüHL.GUZ. => ¯BÜHL.GUZ.
1 changes for ¯Taitt.UP. => ¯TAITT.UP.
15 changes for ¯Ts. => ¯TS.
2 changes for ¯TS.Pra7t. => ¯TS.PRA7T.
1 changes for ¯PA7R.Gr2hj. => ¯PA7R.GR2HJ.
145 changes for ¯R2v. => ¯R2V.
1 changes for ¯Saddh.P. => ¯SADDH.P.
1 changes for ¯Med. => ¯MED.
5 changes for ¯Madanav. => ¯MADANAV.
3 changes for ¯GAN2IT.Bhagan2. => ¯GAN2IT.BHAGAN2.
4 changes for ¯AV.Paipp. => ¯AV.PAIPP.
5 changes for ¯Prasannar. => ¯PRASANNAR.
2 changes for ¯C2a7n5kh.C2R. => ¯C2A7N5KH.C2R.
54 changes for ¯Ba7dar. => ¯BA7DAR.
1 changes for ¯Gan2it.ADHIM. => ¯GAN2IT.ADHIM.
5 changes for ¯C2A7RN5G.Sam5h. => ¯C2A7RN5G.SAM5H.
2 changes for ¯Mahi7dh. => ¯MAHI7DH.
1 changes for ¯Kaush.Up. => ¯KAUSH.UP.
15 changes for ¯Ak. => ¯AK.
1 changes for ¯Su7rjas. => ¯SU7RJAS.
5 changes for ¯C2a7k. => ¯C2A7K.
2 changes for ¯Prata7par. => ¯PRATA7PAR.
66 changes for ¯Av. => ¯AV.
3 changes for ¯Kull. => ¯KULL.
1 changes for ¯Müller,SL. => ¯MÜLLER,SL.
34 changes for ¯Ba7lar. => ¯BA7LAR.
1 changes for ¯Ka7c2i7kh. => ¯KA7C2I7KH.
2 changes for ¯C2a7rn5g.Sam5h. => ¯C2A7RN5G.SAM5H.
32 changes for ¯Ka7d. => ¯KA7D.
1 changes for ¯Ka7t. => ¯KA7T.
7 changes for ¯ba7dar. => ¯BA7DAR.
1 changes for ¯C2iva-P. => ¯C2IVA-P.
7 changes for ¯Ni7lak. => ¯NI7LAK.
9 changes for ¯Ka7c2. => ¯KA7C2.
22 changes for ¯A7past. => ¯A7PAST.
8 changes for ¯Vaita7n. => ¯VAITA7N.
1 changes for ¯HARSHAk4. => ¯HARSHAK4.
2 changes for ¯Mudra7r. => ¯MUDRA7R.
1 changes for ¯KAN2T2HAC2R.Up. => ¯KAN2T2HAC2R.UP.
1 changes for ¯Nir. => ¯NIR.
3 changes for ¯C2am5k. => ¯C2AM5K.
2 changes for ¯Ratnam. => ¯RATNAM.
1 changes for ¯Harshak4. => ¯HARSHAK4.
1 changes for ¯Un2a7dis. => ¯UN2A7DIS.
2 changes for ¯Suparn2. => ¯SUPARN2.
1 changes for ¯K4an2d2ak. => ¯K4AN2D2AK.
3 changes for ¯AIT.Up. => ¯AIT.UP.
44 changes for ¯Mbh. => ¯MBH.
1 changes for ¯Vag4rak4k4h. => ¯VAG4RAK4K4H.
2 changes for ¯Kir. => ¯KIR.
8 changes for ¯Nja7jam. => ¯NJA7JAM.
2 changes for ¯Ait.A7r. => ¯AIT.A7R.
1 changes for ¯Prab. => ¯PRAB.
1 changes for ¯SUc2r. => ¯SUC2R.
2 changes for ¯Ragh. => ¯RAGH.
2 changes for ¯A7c2v.C2r. => ¯A7C2V.C2R.
7 changes for ¯C2ic2. => ¯C2IC2.
1 changes for ¯PA7R.GR2hj. => ¯PA7R.GR2HJ.
40 changes for ¯Hema7dri. => ¯HEMA7DRI.
1 changes for ¯Prij. => ¯PRIJ.
1 changes for ¯Bhag. => ¯BHAG.
5 changes for ¯Bhat2t2. => ¯BHAT2T2.
2 changes for ¯C2Ic2. => ¯C2IC2.
1 changes for ¯Jogas. => ¯JOGAS.
2 changes for ¯Suc2R. => ¯SUC2R.
3 changes for ¯Agni-P. => ¯AGNI-P.
2 changes for ¯C2kDr. => ¯C2KDR.
1 changes for ¯RA7g4an. => ¯RA7G4AN.
8 changes for ¯Maitr.S. => ¯MAITR.S.
2 changes for ¯Ka7m.Ni7tis. => ¯KA7M.NI7TIS.
2 changes for ¯Viddh. => ¯VIDDH.
1 changes for ¯BHA7vapr. => ¯BHA7VAPR.
13 changes for ¯Hariv. => ¯HARIV.
1 changes for ¯Kauc2. => ¯KAUC2.
1 changes for ¯Sa7mav.Br. => ¯SA7MAV.BR.
4 changes for ¯Nj.K. => ¯NJ.K.
1 changes for ¯Sa7mav.BR. => ¯SA7MAV.BR.
6 changes for ¯Ta7n2d2ja-br. => ¯TA7N2D2JA-BR.
6 changes for ¯Ma7n.C2R. => ¯MA7N.C2R.
5 changes for ¯Gaut. => ¯GAUT.
1 changes for ¯Bhog4a-K4ar. => ¯BHOG4A-K4AR.
1 changes for ¯Kuma7rasv. => ¯KUMA7RASV.
1 changes for ¯Lokapr. => ¯LOKAPR.
8 changes for ¯Vs. => ¯VS.
1 changes for ¯Vjutp. => ¯VJUTP.
1 changes for ¯MBh. => ¯MBH.
1 changes for ¯Sam5nj.Up. => ¯SAM5NJ.UP.
3 changes for ¯Ven2i7s. => ¯VEN2I7S.
1 changes for ¯Tattvas. => ¯TATTVAS.
5 changes for ¯Ra7g4at. => ¯RA7G4AT.
221 changes for ¯Ra7g4an. => ¯RA7G4AN.
1 changes for ¯GoBH. => ¯GOBH.
1 changes for ¯Shad2v.Br. => ¯SHAD2V.BR.
2 changes for ¯Sam5hitopan. => ¯SAM5HITOPAN.
4 changes for ¯Mr2k4k4h. => ¯MR2K4K4H.
5 changes for ¯Hem.Par. => ¯HEM.PAR.
3 changes for ¯C2ulbas. => ¯C2ULBAS.
1 changes for ¯AMR2T.Up. => ¯AMR2T.UP.
3 changes for ¯A7past.C2R. => ¯A7PAST.C2R.
9 changes for ¯A7PAST.C2r. => ¯A7PAST.C2R.
PWK programs rerun.
13 remain in bibminuscref, two fewer than at #26
Some progress in abbrvlist matching.
Previously (#26) 64092 out of 73111 cases (87.6%)
Now, 65479 out of 73117 cases (89.6%).
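The two match rates can be recomputed directly from the counts quoted above. A small sketch; the function name `match_rate` is only illustrative.

```python
def match_rate(matched, total):
    """Percentage of cases for which an abbreviation match was found."""
    return 100.0 * matched / total

before = match_rate(64092, 73111)   # counts reported at #26
after = match_rate(65479, 73117)    # counts after this batch of corrections
print(f"before: {before:.2f}%  after: {after:.2f}%  gain: {after - before:.2f} points")
```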
Needless to say, the print was not checked individually in these cases. However, from previous work, I think it highly probable that all of these were indeed typo (OCR-type) errors in pw.txt.
It is a good presumption to treat them as typos, because their capitalized counterparts are already present in pw.txt.
So one of the two forms must be wrong, or at least inconsistent.
It seems that these literary-resource corrections are turning out to be the largest set of corrections inserted into the Cologne dictionaries in the recent past.
Good work @funderburkjim.
The next step will be to try something similar, but this time focusing first on cases where there are errors in the AS-numbers of pw citations. Then, it might be appropriate to try a 'fuzzy match' approach to see if that can identify some more likely corrections to pw.
The files in https://github.com/sanskrit-lexicon/PWK/tree/master/pw_ls/pwbib/diffstudy/correctionsubmission have been updated based on corrections. Also http://sanskrit-lexicon.github.io/PWK/cmbsub.html (Decreased from 2245 to 2044 lines) and http://sanskrit-lexicon.github.io/PWK/cbisub.html have been regenerated.
The script to run is https://github.com/sanskrit-lexicon/PWK/blob/master/pw_ls/pw_dhaval/abbrvwork/stdabbrv.sh
@funderburkjim Can you incorporate this shell file into your process of regenerating sortedcrefs.txt and the related files? That way the whole process, up to deriving the standard format for correction submission, can be handled by you alone, and I don't have to regenerate it. The actual submission file is a manual copy-paste of the cmbsub.txt file, copied to crefminusbibsubmission.txt by hand. Therefore, machine regeneration of cmbsub.txt would not alter our precious crefminusbibsubmission.txt file. Once you do this, I will close this issue.
@funderburkjim It seems that 5 records have been missed.
After correction installations, I reran the script capredo.sh.
The output is
¯Bi7g4ag.@¯BI7G4AG.@t@capitalization
¯Ka7c2.@¯KA7C2.@t@capitalization
¯C2am5k.@¯C2AM5K.@t@capitalization
¯Maha7vi7rak4.@¯MAHA7VI7RAK4.@t@capitalization
¯R2v.@¯R2V.@t@capitalization
A total of 5 entries still need correction. Please complete this correction.
Glad you recomputed. Here are the changes generated. Not yet installed.
¯Bi7g4ag.@¯BI7G4AG.@t@capitalization
; ¯Ka7c2.@¯KA7C2.@t@capitalization PREVIOUSLY CORRECTED
;¯C2am5k.@¯C2AM5K.@t@capitalization PREVIOUSLY CORRECTED
¯Maha7vi7rak4.@¯MAHA7VI7RAK4.@t@capitalization
;¯R2v.@¯R2V.@t@capitalization PREVIOUSLY CORRECTED
¯CAUS.R2v.@‹Caus.› ¯R2V.@t@ 'Caus.' = Causative, not a reference
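For reference, records in this shape can be read mechanically. A minimal sketch, assuming four `@`-separated fields (old form, new form, a type code, a note) and that lines starting with `;` are comments to skip; the meaning of the `t` code is not spelled out in this thread, and `parse_change_records` is a hypothetical name.

```python
def parse_change_records(lines):
    """Return (old, new, kind, note) tuples, skipping blank and ';' comment lines."""
    records = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";"):
            continue
        # Split on the first three '@' only, so the note may contain anything.
        old, new, kind, note = line.split("@", 3)
        records.append((old, new, kind, note.strip()))
    return records

sample_lines = [
    "¯Bi7g4ag.@¯BI7G4AG.@t@capitalization",
    "; ¯Ka7c2.@¯KA7C2.@t@capitalization PREVIOUSLY CORRECTED",
    "¯Maha7vi7rak4.@¯MAHA7VI7RAK4.@t@capitalization",
    "¯CAUS.R2v.@‹Caus.› ¯R2V.@t@ 'Caus.' = Causative, not a reference",
]
records = parse_change_records(sample_lines)
for old, new, kind, note in records:
    print(f"{old} -> {new}  [{kind}] {note}")
```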
Corrections now installed.
Rechecked Dhaval's list of 5 vs. current pw.txt. Found two that I missed:
¯C2am5k
¯R2v
These were missed since they do not end in a period in pw.txt. Will correct as part of next batch.
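One way to catch such period-less occurrences is to match on the stem and treat the trailing period as optional. This is only a sketch of that idea, not the method used for the corrections; `capitalize_abbrev` is a hypothetical helper, and the second lookahead is an assumption meant to leave compound references such as ¯R2v.Pra7t. alone.

```python
import re

def capitalize_abbrev(text, abbrev):
    """Capitalize abbrev in text whether or not it is followed by a period.

    Returns (new_text, number_of_replacements).
    """
    stem = abbrev.rstrip(".")
    pattern = re.compile(
        re.escape(stem)
        + r"(?![A-Za-z0-9])"   # the stem must end here, not continue as a longer name
        + r"(?!\.[A-Za-z])"    # skip compounds where another name follows the period
    )
    # Only the stem is replaced; any period after it is left untouched.
    return pattern.subn(stem.upper(), text)

fixed, n = capitalize_abbrev("x ¯R2v 1,2 y ¯R2v.3,4 z ¯R2v.Pra7t.5", "¯R2v.")
print(n, fixed)
```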
@funderburkjim For corrections made in the last 5 days, there has been no update to manualByLine4.txt for PW syncing, and corrections closed in the last 2 days have no corresponding commits on GitHub.
Please finish these two tasks ASAP.
@drdhaval2785
Not sure why you are seeing a problem.
I AM regenerating 'pwsync.zip' as part of the PW update process. The last date of that file, at Cologne, is 1/7/2016. I just downloaded pwsync.zip and looked at manualyByLine04 from there. Its last transaction is:
; pw, Issue 236, Case 327, user=dhavel_ejf,
; 01/08/2016, L=135775, key1=hvala
; ¯PA7R.¯Gr2hj -> ¯PA7R.GR2HJ :: typo
281043 old <H1>100{hvala}1{hvala}¦ ²1) •Adj. {%strauchelnd , taumelnd%} ¯PA7R.¯Gr2hj.3,7,3. ²2) •f. #{A/} {%das Irren , Verfehlen , Verunglücken.%} PW135773
; new
281043 new <H1>100{hvala}1{hvala}¦ ²1) •Adj. {%strauchelnd , taumelnd%} ¯PA7R.GR2HJ.3,7,3. ²2) •f. #{A/} {%das Irren , Verfehlen , Verunglücken.%} PW135773
which looks right.
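To check a transaction like this against a local copy, something along the following lines could be used. It is a sketch under assumptions read off the excerpt above: each change is a pair of lines `N old <text>` and `N new <text>`, `;` lines are comments, and N is a 1-based line number in pw.txt. The helper `apply_transactions` is hypothetical and is not part of the PW update process.

```python
def apply_transactions(lines, transaction_lines):
    """Apply 'N old ...' / 'N new ...' pairs to a list of file lines, in place.

    Raises ValueError if the current line does not match the 'old' text.
    Returns the number of lines changed.
    """
    pending = {}
    applied = 0
    for raw in transaction_lines:
        if not raw.strip() or raw.startswith(";"):
            continue
        num, kind, body = raw.rstrip("\n").split(" ", 2)
        n = int(num)
        if kind == "old":
            pending[n] = body
        elif kind == "new":
            if lines[n - 1] != pending.pop(n):
                raise ValueError(f"line {n} does not match the 'old' text")
            lines[n - 1] = body
            applied += 1
    return applied

# Tiny stand-in for pw.txt and a transaction file.
pw_lines = ["first", "x ¯PA7R.¯Gr2hj.3,7,3. y", "third"]
tx = [
    "; ¯PA7R.¯Gr2hj -> ¯PA7R.GR2HJ :: typo",
    "2 old x ¯PA7R.¯Gr2hj.3,7,3. y",
    "; new",
    "2 new x ¯PA7R.GR2HJ.3,7,3. y",
]
changed = apply_transactions(pw_lines, tx)
print(changed, pw_lines[1])
```

A mismatch on the 'old' text is a quick sign that the local pw.txt is out of step with the transaction file.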
What is your process for updating using pwsync.zip? There may be some detail in this process that is not quite right.
@drdhaval2785 Have you got the 1/7/2016 file?
This correction submission at https://github.com/sanskrit-lexicon/PWK/issues/34#issuecomment-165970613 has raised the possibility that there are non-capital letters in the literary resources.
If they can be corrected in one go, we can save a lot of time that would otherwise go into such submissions. Maybe some regex like
¯[A-Z0-9]*[a-z]+[^ ]*
should work (7811 entries). Code is https://github.com/sanskrit-lexicon/PWK/blob/master/pw_ls/pw_dhaval/abbrvwork/capitalize/capital.py and output is https://github.com/sanskrit-lexicon/PWK/blob/master/pw_ls/pw_dhaval/abbrvwork/capitalize/cap0.txt
There is one danger point too. There ARE entries which legitimately have small letters in their references, e.g.
etc. But they are sparse and few. If the regex can be modified to exclude them, or some tiny script written to discard them, we would get a ready-made list for capitalization.
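A minimal sketch of the "tiny script" option, not the contents of capital.py: it applies the regex above and drops any hit that starts with a reference known to keep its lowercase letters. The exclusion tuple `KEEP_LOWERCASE` is a hypothetical placeholder (¯Spr. is used only as an illustration) and would have to be filled in from the actual data.

```python
import re

# The regex proposed above: a reference marker, optional capitals/digits,
# at least one lowercase letter, then everything up to the next space.
CANDIDATE = re.compile(r"¯[A-Z0-9]*[a-z]+[^ ]*")

# Hypothetical exclusion list: references that legitimately contain lowercase letters.
KEEP_LOWERCASE = ("¯Spr.",)

def capitalization_candidates(text, keep=KEEP_LOWERCASE):
    """Return regex hits that are not covered by the exclusion list."""
    found = []
    for m in CANDIDATE.finditer(text):
        ref = m.group(0)
        if ref.startswith(keep):   # str.startswith accepts a tuple of prefixes
            continue
        found.append(ref)
    return found

hits = capitalization_candidates("¯R2v.1,2 ¯Spr.123 ¯MBH.1,2 ¯Ka7d.2")
print(hits)
```

Note that each hit includes the citation numbers that follow the abbreviation, since the regex runs up to the next space; trimming those off would be a further step before building the change list.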