Open funderburkjim opened 8 years ago
The latest run of the whitelist program shows these statistics regarding the number of cases whitelisted by the various rules:
$ sh redo.sh
Recreating auxiliary/special.txt
regenerating graylist.txt and all whitelistX.txt files
385011 records from ../hwnorm1c.txt
187992 headwords coded as 0: In two or more dictionaries
6357 headwords coded as 1: key1=X+am and X+a is found
1773 headwords coded as 2: SKD nouns shown in nominative singular
19238 headwords coded as 3a: prefix of known word
11095 headwords coded as 3b: suffix of known word
8823 headwords coded as 0a: special words (icf, foreign, etc.)
2589 headwords coded as 4a: probable f. nouns ending in 'A'
618 headwords coded as 4b: inflected form
74009 headwords coded as cpd1: simple compound of 2 parts, first ending in 'aiufeoAIOxs'
8749 headwords coded as cpd1a: simple compound of 2 parts, first ending at 'A,I'
698 headwords coded as cpd1b: simple compound of 2 parts, first ending at 't'
9712 headwords coded as cpd2: simple compound of 3 or more parts, each part ending in 'aiufeo'
4441 headwords coded as cpd2a: simple compound of 3 or more parts, each part ending in 'aiufeoAI'
17696 headwords coded as cpdsrs1: Simple compound with vowel sandhi at 'AIUoeEOvy'
3319 headwords coded as cpdsrs1a: non-Simple compound with vowel sandhi at 'AIUoeEOvy'
16 headwords coded as 5a: kar<->kf
57 headwords coded as 5b: Ikf, IBU, Ikfta, IBUta
810 headwords coded as 3bcpd: Compound word + suffix
629 headwords coded as 3acpd: prefix + Compound word
1103 headwords coded as cpdsandhi1: Compound word with sandhi
819 headwords coded as 3a1: prefix of known compound
95 headwords coded as cpdsandhi2: Compound word with sandhi
2555 headwords coded as cpdsrs1b: non-Simple compound with vowel sandhi at 'AIUoeEOvy'
21818 headwords coded as gray: Not yet whitelisted
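The counts in this summary can be cross-checked against the record total. The following is a minimal sketch, not part of redo.sh: it parses the summary text as pasted above and confirms that the per-code counts add up to the number of records. The function name `parse_summary` is an assumption made for illustration.

```python
import re

# Summary lines copied from the redo.sh output shown above.
SUMMARY = """\
385011 records from ../hwnorm1c.txt
187992 headwords coded as 0: In two or more dictionaries
6357 headwords coded as 1: key1=X+am and X+a is found
1773 headwords coded as 2: SKD nouns shown in nominative singular
19238 headwords coded as 3a: prefix of known word
11095 headwords coded as 3b: suffix of known word
8823 headwords coded as 0a: special words (icf, foreign, etc.)
2589 headwords coded as 4a: probable f. nouns ending in 'A'
618 headwords coded as 4b: inflected form
74009 headwords coded as cpd1: simple compound of 2 parts, first ending in 'aiufeoAIOxs'
8749 headwords coded as cpd1a: simple compound of 2 parts, first ending at 'A,I'
698 headwords coded as cpd1b: simple compound of 2 parts, first ending at 't'
9712 headwords coded as cpd2: simple compound of 3 or more parts, each part ending in 'aiufeo'
4441 headwords coded as cpd2a: simple compound of 3 or more parts, each part ending in 'aiufeoAI'
17696 headwords coded as cpdsrs1: Simple compound with vowel sandhi at 'AIUoeEOvy'
3319 headwords coded as cpdsrs1a: non-Simple compound with vowel sandhi at 'AIUoeEOvy'
16 headwords coded as 5a: kar<->kf
57 headwords coded as 5b: Ikf, IBU, Ikfta, IBUta
810 headwords coded as 3bcpd: Compound word + suffix
629 headwords coded as 3acpd: prefix + Compound word
1103 headwords coded as cpdsandhi1: Compound word with sandhi
819 headwords coded as 3a1: prefix of known compound
95 headwords coded as cpdsandhi2: Compound word with sandhi
2555 headwords coded as cpdsrs1b: non-Simple compound with vowel sandhi at 'AIUoeEOvy'
21818 headwords coded as gray: Not yet whitelisted
"""


def parse_summary(text):
    """Return (record_total, {code: count}) parsed from the summary text."""
    total = None
    counts = {}
    for line in text.splitlines():
        m = re.match(r"(\d+) records from", line)
        if m:
            total = int(m.group(1))
            continue
        m = re.match(r"(\d+) headwords coded as (\w+):", line)
        if m:
            counts[m.group(2)] = int(m.group(1))
    return total, counts


total, counts = parse_summary(SUMMARY)
# True when every record received exactly one code (whitelist or gray).
print(sum(counts.values()) == total)
```

The counts do sum to the record total, which suggests each headword ends up under exactly one code.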
Wow!
If we exclude PD for now, it's only 5547 words. I hoped it would be more.
1326 ACC - only someone who knows the valid names of manuscripts can approve these.
I have checked a few
paSU PW
plIyA PW
buDi PW
BAgI PW
mUrKI PW
Why plIyA and not pliyA? Is it an SLP1 converter issue (PalIkartavE has the same I issue, though livI is ok)?
Why is s. bolded in s.u. [see under = siehe unter], while u. is not? That makes no sense; let's check whether it's the same in other entries.
My views regarding a few of the observations on whitelist approach
try to ignore the historically similar dictionary pairs from this approach
yes, that was my main concern. Otherwise it's no real whitelisting, only a ghost list.
The 'paired dictionary' observation makes sense as a good refinement.
The graylist file is still a good candidate error list. If we excluded the paired words, graylist would be INCREASED, not decreased.
There is a tool which may be used to examine paired-dictionary lists.
For example, to generate a list of all headwords which appear ONLY in WIL and YAT,
# change to the whitelist directory
python filterdict.py output/all/whitelist0.txt old/wilyat.txt wil yat
# The output is old/wilyat.txt. I put the output in the 'old' subdirectory of whitelist, since
# files in that directory are excluded, due to the way .gitignore for hwnorm1 repository is set up.
There turn out to be 185 such words (out of 187,992 words in whitelist0).
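For readers who want to see the idea without running the tool, here is a rough sketch of this kind of filter. It is not the code of filterdict.py; it assumes each input line has the form `headword:DICT1,DICT2,...`, as in the sample lines quoted elsewhere in this thread, and the function name `only_in` is made up for illustration.

```python
def only_in(lines, wanted):
    """Return headwords whose dictionary list is exactly the `wanted` set.

    Assumes lines look like 'headword:DICT1,DICT2' (an assumption about
    the whitelist0.txt format, based on samples quoted in this thread).
    """
    wanted = {d.upper() for d in wanted}
    hits = []
    for line in lines:
        line = line.strip()
        if not line or ":" not in line:
            continue  # skip blank or malformed lines
        headword, dicts = line.split(":", 1)
        found = {d.strip().upper() for d in dicts.split(",")}
        if found == wanted:
            hits.append(headword)
    return hits


# Sample lines taken from this thread.
sample = [
    "aMSagaRa:IEG,PD",
    "paryantIkfta:BHS,MW",
    "paryavadApayitar:BHS,SCH",
]
print(only_in(sample, ["bhs", "mw"]))
```

Comparing the sets for equality, rather than checking for a subset, is what makes this an "ONLY in these two dictionaries" test.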
Why is s. bolded in s.u. [see under = siehe unter], while u. is not?
Here's the reason:
In pw.txt, the entry is coded as:
<H1>100{plIyA}1{plIyA/}¦ •f. •»s.u. #{plI/TA}. PW75184
Note the two instances of the special symbol: •
The program which converts pw.txt to pw.xml interprets that special symbol as the beginning of grammatical information; and, further, it makes some guess as to the scope of this grammatical information. This results in the following coding of this record in pw.xml:
<gram n="f">f.</gram> »<gram n="s">s.</gram>u. <s>plI/TA</s>.
The display then renders the <gram> element as an html <span class='gram'> element, and css renders the gram class as bold. So, that explains what is going on.
There are 1165 instances in pw.txt of •»s.u., all of which are presumably rendered as just described.
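A count like this can be reproduced with a few lines of Python. The sketch below is only an illustration: the helper name `count_marker` is an assumption, and the demo runs on the single pw.txt line quoted above rather than on the full file.

```python
MARKER = "•»s.u."


def count_marker(lines, marker=MARKER):
    """Count non-overlapping occurrences of `marker` across all lines."""
    return sum(line.count(marker) for line in lines)


# The pw.txt record quoted above.
sample = ["<H1>100{plIyA}1{plIyA/}¦ •f. •»s.u. #{plI/TA}. PW75184"]

print(count_marker(sample))   # occurrences of the full marker
print(sample[0].count("•"))   # occurrences of the special symbol alone
```

To run it over the whole file, one could pass an open file object, for example `open("pw.txt", encoding="utf-8")`, assuming the file is UTF-8 encoded.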
Let's continue the discussion of markup •»s.u. in this issue under the PWK repository.
If we excluded the paired words, graylist would be INCREASED, not decreased.
Indeed, but that would make the logic usable. Right now we exclude from the graylist words that are still fishy. A word that YAT took from WIL does not become less fishy. So pairs:
Should be counted as one for whitelisting needs or at least marked.
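One way to express "counted as one" is to map each member of a pair to a shared group label before applying the two-or-more-dictionaries rule. This is only a sketch of the proposal, not existing whitelist code; the names `PAIRED`, `independent_sources`, and `passes_rule0` are hypothetical, and only the WIL/YAT pair named in this thread is listed.

```python
# Hypothetical grouping table: members of a historically dependent pair
# share one label. Only the WIL/YAT pair from this thread is included.
PAIRED = {"WIL": "WIL/YAT", "YAT": "WIL/YAT"}


def independent_sources(dicts, paired=PAIRED):
    """Collapse paired dictionaries so that each pair counts once."""
    return {paired.get(d, d) for d in dicts}


def passes_rule0(dicts):
    """Rule 0 variant: at least two independent sources attest the headword."""
    return len(independent_sources(dicts)) >= 2


print(passes_rule0(["WIL", "YAT"]))        # pair counts as one source
print(passes_rule0(["BHS", "MW"]))         # two unrelated dictionaries
print(passes_rule0(["WIL", "YAT", "MW"]))  # pair plus one more
```

With this variant, a word found only in WIL and YAT would fall back to the other rules or to the graylist, which matches the observation that the graylist would grow.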
What I really lack is sample words for each case. Otherwise some seem equal or I do not understand at all what should go there. Maybe the statement rules would help?
187992 headwords coded as 0: In two or more dictionaries
aMSagaRa:IEG,PD
paryantIkfta:BHS,MW
paryavadApayitar:BHS,SCH
6357 headwords coded as 1: key1=X+am and X+a is found
1773 headwords coded as 2: SKD nouns shown in nominative singular
19238 headwords coded as 3a: prefix of known word
11095 headwords coded as 3b: suffix of known word
8823 headwords coded as 0a: special words (icf, foreign, etc.)
2589 headwords coded as 4a: probable f. nouns ending in 'A'
paryavadAtaSrutatA:MW
paryavasA:STC
SatrUccAwanakriyA:ACC
618 headwords coded as 4b: inflected form
74009 headwords coded as cpd1: simple compound of 2 parts, first ending in 'aiufeoAIOxs'
8749 headwords coded as cpd1a: simple compound of 2 parts, first ending at 'A,I'
698 headwords coded as cpd1b: simple compound of 2 parts, first ending at 't'
9712 headwords coded as cpd2: simple compound of 3 or more parts, each part ending in 'aiufeo'
4441 headwords coded as cpd2a: simple compound of 3 or more parts, each part ending in 'aiufeoAI'
17696 headwords coded as cpdsrs1: Simple compound with vowel sandhi at 'AIUoeEOvy'
3319 headwords coded as cpdsrs1a: non-Simple compound with vowel sandhi at 'AIUoeEOvy'
16 headwords coded as 5a: kar<->kf
Should we make it not only kar<->kf, but ar<->f?
57 headwords coded as 5b: Ikf, IBU, Ikfta, IBUta
810 headwords coded as 3bcpd: Compound word + suffix
629 headwords coded as 3acpd: prefix + Compound word
1103 headwords coded as cpdsandhi1: Compound word with sandhi
819 headwords coded as 3a1: prefix of known compound
95 headwords coded as cpdsandhi2: Compound word with sandhi
2555 headwords coded as cpdsrs1b: non-Simple compound with vowel sandhi at 'AIUoeEOvy'
21818 headwords coded as gray: Not yet whitelisted
vitaritar PW
vituzI PW
viduzIbruvA PW
viDanI PW
viDurI PW
vinAyikI PW
vinikze PW
vinigaqI PW
vinDyAy PW
vipaYcay PW
vipaTay PW
vipay PW
I would love to tag the graylisted ones. In vipay I see a prefix, and it's a verb. Where should it go? Without samples I do not understand the details of the above classification.
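On the question of widening kar<->kf to ar<->f, the difference can be sketched as a string substitution in SLP1. The code below is only an illustration of the two variants, not the actual rule 5a implementation; `swap_variants` is a hypothetical helper, and the strings used are toy inputs chosen to show the mechanics, not entries checked against any dictionary.

```python
def swap_variants(word, a="ar", b="f"):
    """Return spellings obtained by swapping one occurrence of a<->b."""
    out = set()
    for src, dst in ((a, b), (b, a)):
        start = word.find(src)
        while start != -1:
            out.add(word[:start] + dst + word[start + len(src):])
            start = word.find(src, start + 1)
    out.discard(word)
    return out


# Toy inputs only (illustrative, not verified dictionary data).
print(swap_variants("karta"))               # general ar<->f
print(swap_variants("karta", "kar", "kf"))  # narrower kar<->kf
print(swap_variants("Darta", "kar", "kf"))  # narrower rule finds nothing
print(swap_variants("Darta"))               # general rule still applies
```

A rule built on this would whitelist a headword when one of its variants is an already known word. The general form matches more candidates, so it would also need more checking for false positives.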
What I really lack is sample words for each case.
For each of the categories in that summary, there is a corresponding file of examples.
For instance '11095 headwords coded as 3b: suffix of known word' . The corresponding file is
Here are 4 files, generated using the 'tool' mentioned above, that contain the words that appear ONLY in the particular pairs of dictionaries. This is done in response to above requests. I hope someone learns something useful from these.
I guess @drdhaval2785 would agree: I would exclude these from the whitelist, because the paired dictionaries are almost as one.
Are there headwords misspelled in both Wilson and Yates? That should be the focus of attention, it seems to me. Finding such misspellings would be something useful.
Similarly for the other paired dictionaries.
The whitelist directory contains work aimed at identifying headwords of the various Sanskrit dictionaries that may have spelling errors.
The underlying set of headwords is hwnorm1c.txt, which currently has 385,011 headwords.
The idea of the whitelist approach is to identify words which, on the basis of rules, are probably NOT misspelled. For instance, one such rule for a given spelling is that the word with that spelling appears as a headword in two or more dictionaries. All words satisfying a particular such rule are put into a whitelist file (whitelist0.txt for the rule just described). The latest batches of these whitelist files are in the output/all directory. There are currently 24 such whitelist files.
Then, the headwords whose spellings have as yet no rule to justify the correctness of their spelling are gathered into a graylist.txt file.
Currently, there are 21818 graylisted headwords.
According to the logic of this whitelist approach, the graylisted words are the most fertile ground for remaining headword spelling errors.
Of course, many of the graylisted words are surely spelled correctly. But, as of yet, we don't have any programmatic (or other) way to distinguish these as correctly spelled.
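The overall flow described above can be sketched in a few lines. This is a simplified illustration, not the repository's program: it implements only two of the rules (code 0 and code 1), assumes rules are tried in order with the first match winning, and assumes that "X+a is found" means X+a is any headword in the input. The toy data pairs a few spellings with dictionary lists; apart from the aMSagaRa and vipay entries taken from this thread, the dictionary assignments are hypothetical.

```python
def rule0(hw, sources, known):
    """Code 0: the headword appears in two or more dictionaries."""
    return len(set(sources)) >= 2


def rule1(hw, sources, known):
    """Code 1: headword is X+am and X+a is also a headword (assumed reading)."""
    return hw.endswith("am") and hw[:-1] in known


# Assumed ordering: first matching rule assigns the code.
RULES = [("0", rule0), ("1", rule1)]


def classify(headwords):
    """Map each headword to the code of its first matching rule, else 'gray'."""
    known = set(headwords)
    result = {}
    for hw, sources in headwords.items():
        result[hw] = next(
            (code for code, rule in RULES if rule(hw, sources, known)),
            "gray",
        )
    return result


toy = {
    "aMSagaRa": ["IEG", "PD"],  # from this thread
    "Pala": ["MW", "PW"],       # hypothetical dictionary list
    "Palam": ["SKD"],           # hypothetical dictionary list
    "vipay": ["PW"],            # from this thread
}
print(classify(toy))
```

Headwords that no rule accepts are what end up in graylist.txt; adding a rule means appending another `(code, function)` pair to the list.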
I hope others will examine these lists, especially the graylist, with an eye to: