whitelist - Githubissues

funderburkjim commented 8 years ago

The whitelist directory contains work aimed at identifying headwords of the various Sanskrit dictionaries that may have spelling errors.

The underlying set of headwords is hwnorm1c.txt, which currently has 385,011 headwords.

The idea of the whitelist approach is to identify words which, on the basis of rules, are probably NOT misspelled. For instance, one such rule for a given spelling is that the word with that spelling appears as a headword in two or more dictionaries. All words satisfying a particular such rule are put into a whitelist file (whitelist0.txt for the rule just described). The latest batches of these whitelist files are in the output/all directory. There are currently 24 such whitelist files.
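As a rough illustration of the "two or more dictionaries" rule, here is a minimal sketch. It assumes each record has the form `headword:DICT1,DICT2`, as the sample lines quoted later in this thread suggest; the actual layout of hwnorm1c.txt may differ, and `parse_record` and `rule0` are hypothetical names, not functions from the whitelist program.

```python
def parse_record(line):
    """Split 'headword:DICT1,DICT2' into (headword, [dictionary codes]).

    The record format is an assumption based on samples in this thread.
    """
    headword, _, dicts = line.strip().partition(":")
    return headword, [d for d in dicts.split(",") if d]


def rule0(records):
    """Return the headwords that occur in two or more dictionaries."""
    return [hw for hw, dicts in records if len(set(dicts)) >= 2]


lines = [
    "aMSagaRa:IEG,PD",
    "paryantIkfta:BHS,MW",
    "plIyA:PW",
]
records = [parse_record(line) for line in lines]
whitelist0 = rule0(records)
print(whitelist0)
```

Headwords that no rule accepts (here `plIyA`) would be the ones left for the graylist.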

Then, the headwords whose spellings have as yet no rule to justify the correctness of their spelling are gathered into a graylist.txt file.

Currently, there are 21818 graylisted headwords.

According to the logic of this whitelist approach, the graylisted words are the most fertile ground for remaining headword spelling errors.

Of course, many of the graylisted words are surely spelled correctly. But, as of yet, we don't have any programmatic (or other) way to distinguish these as correctly spelled.

I hope others will examine these lists, especially the graylist, with an eye to:

developing additional rules that would whitelist chunks of these words
identifying remaining errors by hand, perhaps by focusing on the graylisted words in a particular dictionary.

funderburkjim commented 8 years ago

The latest run of the whitelist program shows these statistics regarding the number of cases whitelisted by the various rules:

$ sh redo.sh
Recreating auxiliary/special.txt
regenerating graylist.txt and all whitelistX.txt files
385011 records from ../hwnorm1c.txt
187992 headwords coded as 0: In two or more dictionaries
  6357 headwords coded as 1: key1=X+am and X+a is found
  1773 headwords coded as 2: SKD nouns shown in nominative singular
 19238 headwords coded as 3a: prefix of known word
 11095 headwords coded as 3b: suffix of known word
  8823 headwords coded as 0a: special words (icf, foreign, etc.)
  2589 headwords coded as 4a: probable f. nouns ending in 'A'
   618 headwords coded as 4b: inflected form
 74009 headwords coded as cpd1: simple compound of 2 parts, first ending in 'aiufeoAIOxs'
  8749 headwords coded as cpd1a: simple compound of 2 parts, first ending at 'A,I'
   698 headwords coded as cpd1b: simple compound of 2 parts, first ending at 't'
  9712 headwords coded as cpd2: simple compound of 3 or more parts, each part ending in 'aiufeo'
  4441 headwords coded as cpd2a: simple compound of 3 or more parts, each part ending in 'aiufeoAI'
 17696 headwords coded as cpdsrs1: Simple compound with vowel sandhi at 'AIUoeEOvy'
  3319 headwords coded as cpdsrs1a: non-Simple compound with vowel sandhi at 'AIUoeEOvy'
    16 headwords coded as 5a: kar<->kf
    57 headwords coded as 5b: Ikf, IBU, Ikfta, IBUta
   810 headwords coded as 3bcpd: Compound word + suffix
   629 headwords coded as 3acpd: prefix + Compound word
  1103 headwords coded as cpdsandhi1: Compound word with sandhi
   819 headwords coded as 3a1: prefix of known compound
    95 headwords coded as cpdsandhi2: Compound word with sandhi
  2555 headwords coded as cpdsrs1b: non-Simple compound with vowel sandhi at 'AIUoeEOvy'
 21818 headwords coded as gray: Not yet whitelisted
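To make the cascade of rules concrete, here is a minimal sketch of two of them, under one possible reading of their descriptions: rule 1 (the headword is X+am and X+a is known) and cpd1 (the headword splits into two known parts, the first ending in one of the listed letters). The function names, the `known` set, and the example words are assumptions for illustration; the real program may define "known" and the split logic differently.

```python
CPD1_ENDINGS = set("aiufeoAIOxs")  # letters quoted in the cpd1 rule description


def rule1(headword, known):
    """Rule 1: headword is X+'am' and X+'a' is a known headword."""
    return headword.endswith("am") and headword[:-1] in known


def cpd1(headword, known):
    """cpd1 (as read here): two known parts, first ending in an allowed letter."""
    for i in range(1, len(headword)):
        first, second = headword[:i], headword[i:]
        if first in known and second in known and first[-1] in CPD1_ENDINGS:
            return True
    return False


def classify(headword, known):
    """Return the code of the first rule that accepts the headword, else 'gray'."""
    if rule1(headword, known):
        return "1"
    if cpd1(headword, known):
        return "cpd1"
    return "gray"


# Illustrative SLP1 words only; not taken from the whitelist output.
known = {"deva", "rAja", "Pala"}
for word in ["Palam", "devarAja", "plIyA"]:
    print(word, classify(word, known))
```

The order of the checks matters in a cascade like this: a headword is counted under the first rule that accepts it, which is presumably why the per-rule counts above add up to the total number of records.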

gasyoun commented 8 years ago

Wow!

If we exclude PD for now, only 5547 words remain. I hoped it would be more. 1326 are from ACC; only someone who knows the valid names of manuscripts can approve those.
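Counts like these can be reproduced with a short script. The sketch below assumes each graylist line is a headword followed by whitespace and a dictionary code, as in the samples in this comment; `count_by_dictionary` is a hypothetical helper, and the last two sample lines are made up for illustration.

```python
from collections import Counter


def count_by_dictionary(lines, exclude=()):
    """Count graylisted headwords per dictionary code, skipping excluded codes."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) >= 2 and parts[1] not in exclude:
            counts[parts[1]] += 1
    return counts


# First two lines are from this thread; the last two are hypothetical.
sample = ["paSU\tPW", "plIyA\tPW", "hypotheticalword\tACC", "anotherword\tPD"]
counts = count_by_dictionary(sample, exclude={"PD"})
print(counts.most_common())
print(sum(counts.values()))
```

Sorting by count makes it easy to pick one dictionary's graylisted words for review by hand.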

I have checked a few

paSU    PW
plIyA   PW
buDi    PW
BAgI    PW
mUrKI   PW

Why plIyA and not pliyA? Is this an SLP1 converter issue (PalIkartavE shows the same I issue, but livI is fine)?

susu

Why, in s.u. [see under = siehe unter], is s. bolded but u. not? That makes no sense; let's check whether it is the same in other entries.

drdhaval2785 commented 8 years ago

My views regarding a few of the observations on the whitelist approach:

whitelist0 - We should try to ignore the historically similar dictionary pairs from this approach. If SKD and VCP show the same word / YAT and WIL show the same word / PW and PWG show the same word, but no other dictionary shows the same word, they should not be put in whitelist0. They tend to repeat the same mistakes as their predecessors.

gasyoun commented 8 years ago

try to ignore the historically similar dictionary pairs from this approach

Yes, that was my main concern. Otherwise it's no real whitelisting, only a ghost list.

funderburkjim commented 8 years ago

The 'paired dictionary' observation makes sense as a good refinement.

The graylist file is still a good candidate error list. If we excluded the paired words, graylist would be INCREASED, not decreased.

There is a tool which may be used to examine paired-dictionary lists.

For example, to generate a list of all headwords which appear ONLY in WIL and YAT,

# change to the whitelist directory
python filterdict.py output/all/whitelist0.txt old/wilyat.txt wil yat
# The output is old/wilyat.txt. I put it in the 'old' subdirectory of whitelist because
# files in that directory are excluded by the .gitignore of the hwnorm1 repository.

There turn out to be 185 such words (out of 187,992 words in whitelist0).
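For readers without the repository at hand, the selection can be pictured as follows. This is only a sketch of the idea (keep headwords whose dictionary set is exactly the given pair), not the actual code of filterdict.py; `only_in_pair` and the sample records are hypothetical.

```python
def only_in_pair(records, pair):
    """Return headwords whose set of dictionaries is exactly the given pair.

    records: iterable of (headword, [dictionary codes]).
    pair: two dictionary codes, in any letter case (e.g. 'wil', 'yat').
    """
    wanted = {code.upper() for code in pair}
    return [hw for hw, dicts in records if {d.upper() for d in dicts} == wanted]


# Hypothetical records for illustration.
records = [
    ("hw1", ["WIL", "YAT"]),
    ("hw2", ["WIL", "YAT", "MW"]),
    ("hw3", ["PW", "PWG"]),
]
print(only_in_pair(records, ("wil", "yat")))
```

A headword that also appears in a third dictionary (`hw2` here) is left out, since the point is to isolate words attested only by the pair.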

funderburkjim commented 8 years ago

Why, in s.u. [see under = siehe unter], is s. bolded but u. not?

Here's the reason:

In pw.txt, the entry is coded as:
```
<H1>100{plIyA}1{plIyA/}¦ •f. •»s.u. #{plI/TA}. PW75184
```
Note the two instances of the special symbol: •
The program which converts pw.txt to pw.xml interprets that special symbol as the beginning of grammatical information; and, further, it makes some guess as to the scope of this grammatical information. This results in the following coding of this record in pw.xml:
```
<gram n="f">f.</gram> »<gram n="s">s.</gram>u. <s>plI/TA</s>.
```
Finally, the display program (web/webtc/disp.php) marks the text within the <gram> element as being in an html <span class='gram'> element, and css renders the gram class as bold.

So, that explains what is going on.

There are 1165 instances in pw.txt of •»s.u. , all of which are presumably rendered as just described.
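The effect can be imitated with two regular expressions. This is a sketch that reproduces the conversion of this one record, assuming the scope guess is "skip non-letters after the symbol, then take letters up to the first period"; the real pw.txt to pw.xml converter is not shown in this thread and may work differently. The names `GRAM`, `SANSKRIT` and `convert` are hypothetical.

```python
import re

# After the special symbol: optional non-letters, then letters up to a period.
GRAM = re.compile(r"•([^A-Za-z]*)([A-Za-z]+)\.")
# Sanskrit text coded as #{...}.
SANSKRIT = re.compile(r"#\{([^}]*)\}")


def convert(text):
    """Mark guessed grammatical abbreviations and Sanskrit text with XML tags."""
    text = GRAM.sub(r'\1<gram n="\2">\2.</gram>', text)
    return SANSKRIT.sub(r"<s>\1</s>", text)


body = "•f. •»s.u. #{plI/TA}."
print(convert(body))
# <gram n="f">f.</gram> »<gram n="s">s.</gram>u. <s>plI/TA</s>.
```

Under this guess the scope ends at the first period, so only `s.` lands inside the gram element and `u.` stays outside, which matches the bolding that was observed.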

funderburkjim commented 8 years ago

Let's continue the discussion of markup •»s.u. in this issue under the PWK repository.

gasyoun commented 8 years ago

If we excluded the paired words, graylist would be INCREASED, not decreased.

Indeed, but that would make the logic usable. Right now we exclude from the graylist words that are still fishy. A word that YAT took from WIL does not become less fishy. So the pairs:

SKD, VCP
YAT, WIL
PW, PWG
MW72, MW

should each be counted as one dictionary for whitelisting purposes, or at least marked.
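A minimal sketch of this refinement, assuming the four pairs listed above and a rule that asks for at least two independent sources. `PAIR_GROUPS`, `independent_sources` and `passes_refined_rule0` are hypothetical names, not part of the existing whitelist program.

```python
# Dictionary pairs treated as a single source (from the list above).
PAIR_GROUPS = [{"SKD", "VCP"}, {"YAT", "WIL"}, {"PW", "PWG"}, {"MW72", "MW"}]

# Map each paired code to one representative code for its group.
GROUP_OF = {code: min(group) for group in PAIR_GROUPS for code in group}


def independent_sources(dicts):
    """Collapse paired dictionaries so that each pair counts only once."""
    return {GROUP_OF.get(d, d) for d in dicts}


def passes_refined_rule0(dicts):
    """True if the headword is attested by at least two independent sources."""
    return len(independent_sources(dicts)) >= 2


print(passes_refined_rule0(["WIL", "YAT"]))        # pair only
print(passes_refined_rule0(["WIL", "YAT", "MW"]))  # pair plus another source
```

Headwords that fail the refined rule could either drop back into the graylist or be written to a separate marked file, depending on which of the two options is preferred.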

gasyoun commented 8 years ago

What I really lack is sample words for each case. Without them, some cases look identical to me, or I do not understand at all what should go there. Maybe stating the rules would help?

187992 headwords coded as 0: In two or more dictionaries

aMSagaRa:IEG,PD
paryantIkfta:BHS,MW
paryavadApayitar:BHS,SCH

6357 headwords coded as 1: key1=X+am and X+a is found
1773 headwords coded as 2: SKD nouns shown in nominative singular
19238 headwords coded as 3a: prefix of known word
11095 headwords coded as 3b: suffix of known word
8823 headwords coded as 0a: special words (icf, foreign, etc.)
2589 headwords coded as 4a: probable f. nouns ending in 'A'

paryavadAtaSrutatA:MW
paryavasA:STC
SatrUccAwanakriyA:ACC

618 headwords coded as 4b: inflected form
74009 headwords coded as cpd1: simple compound of 2 parts, first ending in 'aiufeoAIOxs'
8749 headwords coded as cpd1a: simple compound of 2 parts, first ending at 'A,I'
698 headwords coded as cpd1b: simple compound of 2 parts, first ending at 't'
9712 headwords coded as cpd2: simple compound of 3 or more parts, each part ending in 'aiufeo'
4441 headwords coded as cpd2a: simple compound of 3 or more parts, each part ending in 'aiufeoAI'
17696 headwords coded as cpdsrs1: Simple compound with vowel sandhi at 'AIUoeEOvy'
3319 headwords coded as cpdsrs1a: non-Simple compound with vowel sandhi at 'AIUoeEOvy'
16 headwords coded as 5a: kar<->kf

Should we make it not only kar<->kf, but also ar<->f?

57 headwords coded as 5b: Ikf, IBU, Ikfta, IBUta

810 headwords coded as 3bcpd: Compound word + suffix
629 headwords coded as 3acpd: prefix + Compound word
1103 headwords coded as cpdsandhi1: Compound word with sandhi
819 headwords coded as 3a1: prefix of known compound
95 headwords coded as cpdsandhi2: Compound word with sandhi
2555 headwords coded as cpdsrs1b: non-Simple compound with vowel sandhi at 'AIUoeEOvy'
21818 headwords coded as gray: Not yet whitelisted

vitaritar   PW
vituzI  PW
viduzIbruvA PW
viDanI  PW
viDurI  PW
vinAyikI    PW
vinikze PW
vinigaqI    PW
vinDyAy PW
vipaYcay    PW
vipaTay PW
vipay   PW

I would love to tag the graylisted ones. In vipay I see a prefix; it's a verb. Where should it go? Without samples I do not understand the details of the above classification.

funderburkjim commented 8 years ago

What I really lack is sample words for each case.

For each of the categories in that summary, there is a corresponding file of examples.

For instance, for '11095 headwords coded as 3b: suffix of known word', the corresponding file is

output/all/whitelist3b.txt
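To pull a few sample words from every category at once, something like the following could be used. It is a sketch that assumes the files are UTF-8 text named whitelist*.txt in output/all, as described earlier in this thread; `sample_lines` is a hypothetical helper, not part of the repository.

```python
import glob
import os
from itertools import islice


def sample_lines(directory, n=3):
    """Map each whitelist*.txt file name in directory to its first n lines."""
    samples = {}
    for path in sorted(glob.glob(os.path.join(directory, "whitelist*.txt"))):
        with open(path, encoding="utf-8") as f:  # encoding is an assumption
            samples[os.path.basename(path)] = [
                line.rstrip("\n") for line in islice(f, n)
            ]
    return samples


if __name__ == "__main__":
    # Run from the whitelist directory; prints nothing if no files are found.
    for name, lines in sample_lines("output/all").items():
        print(name, lines)
```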

funderburkjim commented 8 years ago

Here are 4 files, generated using the 'tool' mentioned above, that contain the words that appear ONLY in the particular pairs of dictionaries. This is done in response to above requests. I hope someone learns something useful from these.

185 lines from output/all/whitelist0.txt written to output/all/wil_yat.txt
310 lines from output/all/whitelist0.txt written to output/all/skd_vcp.txt
2752 lines from output/all/whitelist0.txt written to output/all/pw_pwg.txt
2417 lines from output/all/whitelist0.txt written to output/all/mw72_mw.txt

gasyoun commented 8 years ago

someone learns something useful from these

I guess @drdhaval2785 would agree: I would exclude these from the whitelist, because the paired dictionaries are almost as one.

funderburkjim commented 8 years ago

Are there headwords misspelled in both Wilson and Yates? That should be the focus of attention, it seems to me. Finding such misspellings would be something useful.

Similarly for the other paired dictionaries.

sanskrit-lexicon / hwnorm1

whitelist #5