sanskrit-lexicon / PWK

Sanskrit-Wörterbuch in kürzerer Fassung (Sanskrit dictionary in shorter form), 7 volumes, Petersburg 1879-1889

Change in the submission file #42

Closed drdhaval2785 closed 8 years ago

drdhaval2785 commented 8 years ago

Until now, the submission file was created in the same order as the base file. We want to tackle the frequently occurring errors / corrections / matches first and the rarer ones later. The logic of stdabbrv.py, which generates the cmbsub.txt / cbisub.txt / cmbsub.html / cbisub.htm files from crefminusbib.txt and crefbibintersect.txt, was therefore modified. This was done in https://github.com/sanskrit-lexicon/PWK/commit/0a7ce9d81f7d7c2677ca383d0e8ae301f02291f2.

The logic is as follows:

  1. The sortedcrefs.txt file has entries in the format abbrv@key1@key2@lnumber@count, e.g. BHAT2T2@apariskandam@*apariskandam@6169@368 (the abbreviation BHAT2T2 appears 368 times in pw.txt).
  2. This occurrence count was not used until now.
  3. The current commit uses this count to sort the data, i.e. the 'potential errors' entries are sorted in descending order of occurrence in pw.txt.

This increases the chances of finding the common ones first and pushes the rarer ones to the end of the submission file.
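The sorting step described above can be sketched roughly as follows. This is not the code from stdabbrv.py; the function names parse_cref and sort_by_frequency are hypothetical, and the sample lines are illustrative, assembled from values quoted in this issue rather than copied from sortedcrefs.txt. The one detail worth noting is that the count is converted to an integer before sorting, since comparing the counts as strings would place 90 above 368.

```python
def parse_cref(line):
    """Split an 'abbrv@key1@key2@lnumber@count' line into its five fields."""
    abbrv, key1, key2, lnumber, count = line.strip().split("@")
    # Convert count to int so that 368 sorts above 90 (string comparison would not).
    return abbrv, key1, key2, lnumber, int(count)


def sort_by_frequency(lines):
    """Return parsed entries, most frequent abbreviation first."""
    entries = [parse_cref(line) for line in lines if line.strip()]
    # sorted() is stable, so entries with equal counts keep their input order.
    return sorted(entries, key=lambda entry: entry[4], reverse=True)


# Illustrative input only (assumption: not real lines from sortedcrefs.txt).
sample = [
    "HEM@cItkfta@cItkfta@40339@90",
    "BHAT2T2@apariskandam@*apariskandam@6169@368",
    "K4AMAPAKA@rahitatva@°rahitatva@93133@9",
]

for entry in sort_by_frequency(sample):
    print(entry[0], entry[4])
```

Because the sort is stable, abbreviations with the same count stay in base-file order relative to each other.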

Statistics - After this correction, the files cmbsub.txt and cbisub.txt were regenerated. The first five entries in cmbsub.txt are now:

¯BURNELL.T@maDvaBAzya@maDvaBAzya@82746:¯BURNELL.T:n:
¯C2A7N5KH@aGAhan@aGAhan@849:¯C2A7N5KH:n:
¯HEM@cItkfta@cItkfta@40339:¯HEM:n:
¯K4AMAPAKA@rahitatva@°rahitatva@93133:¯K4AMAPAKA:n:
¯LI7LA7V.S@krAkacya@krAkacya@31805:¯LI7LA7V.S:n:

When these abbreviations were looked up in sortedcrefs.txt, their occurrence counts were:

BURNELL.T-965
C2A7N5KH-92
HEM-90
K4AMAPAKA-9
LI7LA7V.S-9

Thus correcting / matching one of these resolves many occurrences at once, rather than a single entry.
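As a rough check of the statistics above, the five cmbsub.txt entries can be paired with their quoted counts. This is only a sketch: the helper abbrv_of is hypothetical, and it assumes the abbreviation is the first '@'-separated field of a cmbsub.txt entry with the leading '¯' removed, which is how the names appear in the count list.

```python
# The five cmbsub.txt entries and the counts quoted in this issue.
cmbsub_entries = [
    "¯BURNELL.T@maDvaBAzya@maDvaBAzya@82746:¯BURNELL.T:n:",
    "¯C2A7N5KH@aGAhan@aGAhan@849:¯C2A7N5KH:n:",
    "¯HEM@cItkfta@cItkfta@40339:¯HEM:n:",
    "¯K4AMAPAKA@rahitatva@°rahitatva@93133:¯K4AMAPAKA:n:",
    "¯LI7LA7V.S@krAkacya@krAkacya@31805:¯LI7LA7V.S:n:",
]
quoted_counts = ["BURNELL.T-965", "C2A7N5KH-92", "HEM-90", "K4AMAPAKA-9", "LI7LA7V.S-9"]


def abbrv_of(entry):
    """Abbreviation of an entry: first '@' field without the leading '¯' (assumed layout)."""
    return entry.split("@", 1)[0].lstrip("¯")


# Split on the last '-' so a hyphen inside an abbreviation would be preserved.
counts = {}
for item in quoted_counts:
    abbrv, count = item.rsplit("-", 1)
    counts[abbrv] = int(count)

for entry in cmbsub_entries:
    print(abbrv_of(entry), counts[abbrv_of(entry)])

# Total occurrences covered by resolving just these five abbreviations.
total = sum(counts[abbrv_of(entry)] for entry in cmbsub_entries)
print(total)
```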

This is a documentation issue.

drdhaval2785 commented 8 years ago

The crefminusbibsubmission.txt file was also modified (manually). It currently has 99 corrections (submitted earlier in 6 parts in this repository). From line 100 onwards, the contents of cmbsub.txt (regenerated by the above logic) were pasted in.
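The manual step could in principle be scripted along these lines. This is a sketch under assumptions, not what was actually run: the function name rebuild_submission is hypothetical, the lines used here are placeholders rather than real file contents, and it assumes the 99 corrections occupy exactly the first 99 lines of the file.

```python
def rebuild_submission(existing_lines, regenerated_lines, keep=99):
    """Keep the first `keep` lines (the submitted corrections) and append the regenerated entries."""
    return list(existing_lines[:keep]) + list(regenerated_lines)


# Placeholder data only (assumption: stands in for the real file contents).
existing = ["correction %d" % i for i in range(1, 121)]
regenerated = ["regenerated entry 1", "regenerated entry 2", "regenerated entry 3"]

merged = rebuild_submission(existing, regenerated)
print(len(merged))   # 99 kept lines plus the regenerated entries
print(merged[99])    # line 100 is the first regenerated entry
```

With real files, existing and regenerated would be read from crefminusbibsubmission.txt and cmbsub.txt respectively, and the result written back to crefminusbibsubmission.txt.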

I know this will make the earlier references made by marcis and me harder to locate, but the enhancement has practical benefits, so I went for it.

gasyoun commented 8 years ago

I'm always in favour of stats. I will finish the first round of German words without them, but still.