sanskrit-lexicon / CORRECTIONS

Correction history for Cologne Sanskrit Lexicon

stats on fishy endings #178

Open drdhaval2785 opened 8 years ago

drdhaval2785 commented 8 years ago

As requested by @gasyoun, the following is a list of headword endings that occur in fewer than 50 headwords in sanhw1.txt. It is not sorted in any way; it is just a rough list to play with.

How to use?

  1. Open sanhw1.txt in Notepad++.
  2. Run a regex search over the whole document (Find All in Current Document) for an ending followed by a colon, e.g. 'MS:', to list every headword with that ending.
  3. Examine the matches for errors.

#177 was found by the same method.
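The Notepad++ search above can also be scripted. What follows is a minimal sketch, not code from the repository. It assumes each line of sanhw1.txt has the form `headword:DICT1,DICT2,...`, which is inferred from the 'MS:' search pattern. The function name `headwords_with_ending` and the sample lines are hypothetical.

```python
import re

def headwords_with_ending(lines, ending):
    """Return the lines whose headword (the text before ':') ends with `ending`."""
    # Assumed line format: headword:DICT1,DICT2,...
    # The ending is escaped so that characters such as '~' or '.' match literally.
    pattern = re.compile(re.escape(ending) + ":")
    return [line.rstrip("\n") for line in lines if pattern.search(line)]

# Hypothetical sample lines, for illustration only (not taken from sanhw1.txt).
sample = ["aMS:MW\n", "aMSa:MW,PW\n", "puMS:PWG\n"]
print(headwords_with_ending(sample, "MS"))
```

To run it on the real file, pass an open file handle instead of `sample`, for example `open("sanhw1.txt", encoding="utf-8")`; the encoding is also an assumption.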

[(32, u'MS'), (49, u'Ak'), (31, u'Mh'), (27, u'aN'), (7, u'aY'), (6, u'ot'), (31
, u'fc'), (37, u'qe'), (32, u'Re'), (23, u'go'), (21, u'Im'), (35, u'Mc'), (2, u
'Mn'), (46, u'Qi'), (27, u'yU'), (13, u'qU'), (25, u'zU'), (11, u'mU'), (7, u'nf
'), (12, u'Ku'), (7, u'Ag'), (37, u'ag'), (10, u'dy'), (17, u'tO'), (7, u'RU'),
(27, u'nU'), (6, u'mO'), (1, u'ID'), (32, u'rO'), (25, u'yO'), (10, u'RO'), (10,
 u'eH'), (22, u'nO'), (35, u'im'), (9, u'It'), (19, u'rc'), (19, u'gU'), (3, u'o
i'), (15, u'aG'), (13, u'Gi'), (28, u'Gu'), (3, u'Go'), (13, u'oH'), (2, u'Ni'),
 (5, u'NI'), (14, u'Mk'), (30, u'Nk'), (31, u'Uy'), (4, u'MK'), (21, u'NK'), (17
, u'Mg'), (33, u'Ng'), (10, u'MG'), (21, u'NG'), (39, u'iw'), (11, u'iR'), (18,
u'ia'), (40, u'Rq'), (1, u'ai'), (6, u'Ai'), (42, u'iv'), (33, u'nv'), (45, u'rE
'), (7, u'Ce'), (2, u'Ha'), (39, u'Iv'), (34, u'jU'), (20, u'Ja'), (2, u'Ji'), (
20, u'YI'), (8, u'cU'), (1, u'Ug'), (14, u'YC'), (43, u'Mj'), (10, u'So'), (13,
u'Wu'), (9, u'La'), (1, u'OW'), (25, u'ww'), (26, u'Al'), (22, u'aW'), (4, u'uA'
), (45, u'Wi'), (32, u'aq'), (4, u'Om'), (44, u'bi'), (4, u'qq'), (11, u'ig'), (
7, u'iY'), (21, u'RW'), (11, u'MW'), (1, u'Ea'), (46, u'ik'), (4, u'tT'), (39, u
'mp'), (49, u'rd'), (43, u'me'), (38, u'uS'), (17, u'rh'), (37, u'gE'), (11, u'i
N'), (33, u'zw'), (19, u'we'), (8, u'jf'), (4, u'fH'), (39, u'tF'), (20, u'Ip'),
 (44, u'Av'), (3, u'cf'), (47, u'nD'), (33, u'ev'), (49, u'Iq'), (36, u'Es'), (1
3, u'ed'), (8, u'iq'), (28, u'pf'), (22, u'pF'), (6, u'lB'), (20, u'ge'), (35, u
'cC'), (19, u'aC'), (38, u'ez'), (49, u'AD'), (41, u'fh'), (26, u'Az'), (44, u'U
z'), (28, u'Bf'), (8, u'sj'), (34, u'nT'), (31, u'jj'), (21, u'rz'), (10, u'lg')
, (24, u'ts'), (44, u'De'), (41, u'uq'), (5, u'vF'), (18, u'ep'), (19, u'Qf'), (
28, u'ro'), (29, u'Il'), (21, u'el'), (3, u'Id'), (38, u'pe'), (20, u'rD'), (23,
 u'mf'), (2, u'IR'), (20, u'Iz'), (10, u'fn'), (38, u'to'), (6, u'Na'), (22, u'd
f'), (31, u'Cu'), (29, u'Uh'), (11, u'oh'), (23, u'Te'), (16, u'et'), (13, u'iM'
), (37, u'mu'), (32, u'ry'), (3, u'To'), (21, u'so'), (48, u'sy'), (42, u'Ap'),
(18, u'uN'), (14, u'dU'), (4, u'eN'), (23, u'do'), (1, u'dw'), (1, u'dq'), (2, u
'Ot'), (43, u'uw'), (8, u'nE'), (11, u'gf'), (25, u'yo'), (43, u'je'), (8, u'Ke'
), (12, u'PA'), (14, u'uW'), (37, u'rR'), (7, u'Ik'), (23, u'Ir'), (5, u'IS'), (
4, u'Do'), (19, u'ok'), (36, u'kF'), (6, u'eD'), (1, u'En'), (15, u'GI'), (3, u'
hy'), (20, u'ce'), (45, u'ke'), (14, u'In'), (11, u'AN'), (18, u'Aq'), (34, u'ne
'), (1, u'tk'), (12, u'EH'), (15, u'Mp'), (19, u'rt'), (30, u'Uj'), (11, u'Ft'),
 (18, u'xp'), (38, u'Md'), (20, u'gF'), (7, u'Mt'), (6, u'Co'), (12, u'lp'), (17
, u'ns'), (8, u'jF'), (7, u'uY'), (14, u'rk'), (12, u'fB'), (14, u'dF'), (4, u'A
T'), (1, u'Lp'), (26, u'iK'), (12, u'zo'), (2, u'Sy'), (9, u'AR'), (11, u'rT'),
(20, u'MD'), (4, u'rg'), (11, u'Ij'), (15, u'Sc'), (15, u'en'), (23, u'll'), (21
, u'yf'), (9, u'Sf'), (8, u'ug'), (7, u'ub'), (16, u'tv'), (28, u'rC'), (1, u'UK
'), (6, u'UN'), (5, u'jJ'), (12, u'Ut'), (10, u'Ud'), (16, u'Yu'), (22, u'no'),
(4, u'nc'), (2, u'nj'), (2, u'ec'), (10, u'iy'), (28, u'av'), (27, u'Be'), (9, u
'oc'), (1, u'nh'), (3, u'af'), (12, u'ej'), (31, u'az'), (8, u'zE'), (17, u'uT')
, (14, u'Is'), (6, u'Un'), (1, u'vr'), (1, u'UB'), (14, u'SF'), (2, u'oB'), (8,
u'oz'), (8, u'Yi'), (7, u'ab'), (9, u'bj'), (3, u'MH'), (10, u'jO'), (10, u'Gf')
, (6, u'JA'), (2, u'Ju'), (6, u'fs'), (8, u'UR'), (11, u'We'), (15, u'er'), (4,
u'iT'), (6, u'fq'), (28, u'Se'), (7, u'iB'), (3, u'DO'), (20, u'kE'), (13, u'Us'
), (5, u'Uc'), (13, u'ps'), (4, u'oj'), (5, u'pE'), (9, u'fC'), (5, u'Br'), (1,
u'rw'), (19, u'Mb'), (23, u'MB'), (6, u'Bo'), (8, u'lo'), (5, u'dE'), (5, u'ko')
, (4, u'rG'), (10, u'Ro'), (4, u'dO'), (2, u'dd'), (10, u'rp'), (5, u'rP'), (16,
 u'rb'), (35, u'rv'), (7, u'rS'), (32, u'bU'), (25, u'il'), (12, u'Ul'), (2, u'M
Q'), (28, u'Mq'), (8, u'es'), (1, u'Ls'), (3, u'iP'), (2, u'sE'), (3, u'rm'), (1
3, u'zy'), (13, u'SU'), (4, u'hO'), (4, u'Sv'), (16, u'CI'), (7, u'vO'), (18, u'
QI'), (3, u'Et'), (5, u'kO'), (21, u'lU'), (11, u'fg'), (10, u'fN'), (10, u'sO')
, (7, u'ho'), (4, u'AI'), (2, u'Au'), (1, u'Af'), (10, u'kU'), (7, u'IM'), (4, u
'tt'), (21, u'Ur'), (4, u'AC'), (13, u'Ci'), (3, u'MC'), (3, u'cE'), (31, u'ul')
, (1, u'Ge'), (8, u'be'), (11, u'ol'), (4, u'bd'), (3, u'px'), (5, u'po'), (10,
u'eq'), (15, u'tU'), (1, u'Ek'), (17, u'lE'), (9, u'co'), (1, u'nM'), (8, u'lO')
, (2, u'LA'), (1, u'Mv'), (5, u'iC'), (2, u'op'), (1, u'mv'), (3, u'jy'), (35, u
'vu'), (2, u'jE'), (19, u'Ry'), (9, u'Dy'), (1, u'IL'), (1, u'IK'), (14, u'IN'),
 (3, u'Ih'), (11, u'vo'), (2, u'uK'), (11, u'on'), (6, u'Er'), (5, u'uC'), (8, u
'CU'), (1, u'YJ'), (5, u'pO'), (12, u'ny'), (3, u'Ic'), (1, u'dg'), (10, u'MT'),
 (1, u'dj'), (1, u'dJ'), (1, u'wF'), (1, u'yM'), (1, u'aF'), (7, u'Ok'), (4, u'd
v'), (1, u'ds'), (4, u'zO'), (14, u'or'), (4, u'mF'), (5, u'lh'), (3, u'lP'), (3
, u'ED'), (2, u'od'), (4, u'BO'), (2, u'UM'), (1, u'M~'), (1, u'UW'), (4, u'Um')
, (31, u'mE'), (2, u'U~'), (11, u'fR'), (3, u'fP'), (9, u'mP'), (1, u'FH'), (27,
 u'x'), (1, u'xN'), (1, u'xw'), (1, u'X'), (1, u'XH'), (5, u'Ew'), (3, u'gO'), (
4, u'eW'), (3, u'wE'), (4, u'ey'), (1, u'em'), (13, u'zf'), (5, u'eh'), (1, u'oM
'), (1, u'oK'), (7, u'Kf'), (8, u'jo'), (5, u'oR'), (7, u'Rf'), (5, u'om'), (1,
u'lj'), (1, u'o~'), (12, u'OH'), (1, u'O~'), (12, u'un'), (15, u'kk'), (2, u'kK'
), (9, u'aK'), (17, u'Rw'), (11, u'Mw'), (5, u'uy'), (2, u'fT'), (2, u'PI'), (4,
 u'bf'), (5, u'Pu'), (9, u'vy'), (3, u'TO'), (1, u'Va'), (19, u'my'), (7, u'fw')
, (11, u'wy'), (8, u'zk'), (2, u'Iw'), (9, u'uR'), (6, u'dr'), (2, u'Nu'), (4, u
'ib'), (13, u'qE'), (2, u'By'), (2, u'sm'), (2, u'qo'), (1, u'Uw'), (8, u'Uq'),
(11, u'Up'), (1, u'Sm'), (6, u'fY'), (5, u'Rv'), (1, u'fb'), (1, u'fv'), (1, u'z
R'), (3, u'FY'), (2, u'kx'), (1, u'xb'), (10, u'ex'), (6, u'UY'), (1, u'z2'), (3
, u'Li'), (2, u'Lu'), (16, u'qf'), (1, u'cO'), (4, u'dD'), (3, u'Ib'), (7, u'eS'
), (32, u'ty'), (1, u'rH'), (4, u'ow'), (3, u'wU'), (1, u'Mr'), (1, u'qM'), (14,
 u'ew'), (3, u'rf'), (3, u'KE'), (2, u'Ko'), (1, u'ox'), (6, u'oq'), (2, u'gG'),
 (1, u'mx'), (5, u'bb'), (2, u'lv'), (7, u'uM'), (3, u'uP'), (1, u'Qe'), (1, u'z
W'), (1, u'OT'), (12, u'ek'), (4, u'Mz'), (1, u'sx'), (4, u'RR'), (1, u'MR'), (5
, u'iA'), (1, u'rn'), (3, u'A~'), (1, u'NO'), (1, u'KO'), (2, u'CO'), (2, u'aQ')
, (9, u'st'), (1, u'sh'), (2, u'hn'), (5, u'IB'), (1, u'fI'), (6, u'cy'), (2, u'
uQ'), (1, u'AU'), (1, u'Ye'), (1, u'YO'), (1, u'JJ'), (3, u'Qu'), (2, u'rJ'), (2
, u'Fz'), (1, u'og'), (11, u'J'), (2, u'vU'), (2, u'JI'), (1, u'Jf'), (1, u'JF')
, (3, u'IY'), (1, u'RQ'), (6, u'mo'), (3, u'rq'), (5, u'iG'), (1, u'au'), (2, u'
Ty'), (1, u'nk'), (4, u'PU'), (3, u'EN'), (3, u'TU'), (1, u'TE'), (3, u'rB'), (2
, u'AY'), (1, u'ck'), (2, u'Hk'), (2, u'HK'), (3, u'Ky'), (2, u'Hp'), (1, u'uG')
, (1, u'HI'), (1, u'zK'), (2, u'py'), (1, u'Ep'), (1, u'Or'), (1, u'OS'), (1, u'
Oz'), (1, u'Os'), (6, u'AK'), (8, u'AG'), (3, u'mm'), (1, u'ii'), (1, u'US'), (1
, u'DF'), (2, u'oI'), (4, u'sv'), (2, u'uv'), (2, u'RE'), (7, u'Tf'), (2, u'AB')
, (1, u'nF'), (2, u'ES'), (1, u'Ez'), (1, u'fA'), (1, u'tx'), (1, u'iu'), (1, u'
fM'), (2, u'wo'), (1, u'wr'), (1, u'Ab'), (1, u'qv'), (2, u'cc'), (1, u'iW'), (3
, u'zx'), (1, u'zp'), (4, u'eR'), (4, u'eb'), (3, u'ER'), (3, u'pS'), (1, u'Le')
, (1, u'oT'), (2, u'Pi'), (1, u'Pe'), (3, u'hl'), (2, u'hv'), (2, u'bF'), (1, u'
bo'), (1, u'hm'), (1, u'4n'), (1, u'WO'), (1, u'gv'), (1, u'BF'), (1, u'nS'), (2
, u'aU'), (1, u'aI'), (2, u'Dv'), (5, u'sk'), (1, u'eM'), (4, u'Sr'), (4, u'sr')
, (1, u'Ig'), (1, u'cx'), (1, u'fL'), (1, u'gy'), (5, u'fl'), (1, u'eG'), (2, u'
eT'), (1, u'eC'), (3, u'Ow'), (6, u'Oq'), (1, u'aP'), (3, u'tE'), (1, u'MP'), (2
, u'ee'), (1, u'Ia'), (1, u'ui'), (2, u'eL'), (1, u'eB'), (1, u'Ey'), (1, u'GU')
, (1, u'Gv'), (3, u'vv'), (2, u'Wy'), (1, u'RT'), (1, u'lf'), (1, u'lF'), (1, u'
eK'), (1, u'qQ'), (1, u'Je'), (2, u'SE'), (4, u'lk'), (2, u'yv'), (3, u'dx'), (3
, u'eY'), (1, u'Ml'), (1, u'Em'), (2, u'SO'), (1, u'ly'), (1, u'lb'), (1, u'aL')
, (2, u'Fh'), (1, u'2a'), (1, u'ks'), (3, u'ng'), (2, u'Wf'), (1, u'wO'), (1, u'
HU'), (1, u'Uk'), (1, u'sF'), (1, u'Mm'), (1, u'ao'), (1, u'eu'), (1, u'yy'), (1
, u'iL'), (1, u'ss'), (1, u'fW'), (1, u'hF'), (1, u'eQ'), (2, u'hE'), (1, u'oQ')
, (1, u'IC')]
drdhaval2785 commented 8 years ago

http://sanskrit-lexicon.github.io/CORRECTIONS/abnormending/abnorm.html is the output.

https://github.com/sanskrit-lexicon/CORRECTIONS/tree/master/abnormending is the code.

Execution

Run this shell file to regenerate the results.

Logic

  1. The last two letters of each headword in sanhw1.txt are stored as its 'ending'.
  2. The headwords in sanhw1.txt are counted per ending, e.g. (2, u'hE') means that only 2 headwords end in 'hE'. See this for full list.
  3. Only endings with fewer than 50 headwords (i.e. the less frequent ones) are kept.
  4. This list is sorted by count in ascending order (1, 2, 3, ..., 49).
  5. sanhw1.txt is then checked against three criteria: (1) the headword has one of the endings in the sorted list from step 4, (2) the headword occurs in only one dictionary, and (3) the headword is not listed in nochange. In regex terms: if re.search(end+':[^,]*$',datum) and datum not in noc.
  6. Headwords that meet these criteria are stored in abnorm.txt.
  7. Links to the webpage and PDF are added by link.php, and the result is stored in abnorm.html.
  8. abnorm.html is published on github.io so that potential errors can be reviewed and corrections submitted.
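
Steps 1 to 5 above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the script in the abnormending folder: the line format `headword:DICT1,DICT2,...` is inferred from the regex in step 5, `nochange` is treated as a set of whole lines, and the function names and sample data are hypothetical.

```python
import re
from collections import Counter

def rare_endings(headwords, threshold=50):
    """Steps 1-4: count two-letter endings, keep those below `threshold`, sort ascending."""
    # hw[-2:] returns the whole headword when it is shorter than two characters.
    counts = Counter(hw[-2:] for hw in headwords)
    rare = [(n, end) for end, n in counts.items() if n < threshold]
    return sorted(rare)

def suspicious_lines(lines, rare, nochange):
    """Step 5: rare ending, listed in exactly one dictionary, and not in nochange."""
    hits = []
    for _, end in rare:
        # ':[^,]*$' means no comma after the colon, i.e. a single dictionary.
        pattern = re.compile(re.escape(end) + r":[^,]*$")
        for datum in lines:
            if pattern.search(datum) and datum not in nochange:
                hits.append(datum)
    return hits

# Hypothetical data, for illustration only (not taken from sanhw1.txt).
lines = ["rAma:MW,PW", "kAma:MW", "vAma:PWG,MW", "puMS:PWG", "aMS:MW,AP", "fakeX:MW"]
headwords = [line.split(":")[0] for line in lines]
rare = rare_endings(headwords, threshold=3)  # small threshold to suit the tiny sample
print(rare)
print(suspicious_lines(lines, rare, nochange={"fakeX:MW"}))
```

Slicing with `[-2:]` on a one-character headword yields a one-character ending, which would be consistent with entries such as (27, u'x') in the list above, though whether the original script does exactly this is not confirmed here.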
gasyoun commented 8 years ago

@drdhaval2785 Thank you. I can see dozens of mistakes at a glance. I will start documenting them. I hope @zaaf2 is not lost from PWK and PWG.