Flag non-English words which are not headwords for examination

drdhaval2785 commented 7 years ago

There are many cases in acc6.txt where a word is Sanskrit but is not included in the dictionary headwords. For example, see Darśapaurṇamāsapaddhati and Darśapaurṇamāsaprayoga in the following entry. This may hint at missed headwords. I am sure someone might actually be interested in finding an author based on the work. It would be better if we could scrape out such missed cases properly and tag them as missing headwords or something like that. The modality can be decided later. For now I am only interested in identifying the Lnum and the potential missed headword.

<L>550<pc>1-014,1<k1>anantadeva<k2>anantadeva
{#anantadeva#}¦
<HI1><ab type="hw" value="agnihotraprayoga">Agnihotraprayoga</ab>. <ls>L</ls>. 1390.
<HI1><ab type="hw" value="antyezwipadDati">Antyeṣṭipaddhati</ab>. <ls>L</ls>. 830.
<HI1><ab type="hw" value="ADAna">Ādhāna</ab>. <ls>K</ls>. 4. <ls>B</ls>. 1, 182 (Baudh.).
<HI1><ab type="hw" value="utsargapadDati">Utsargapaddhati</ab>. <ls>B</ls>. 1, 216.
<HI1><ab type="hw" value="ftvigvaraRanirRaya">Ṛtvigvaraṇanirṇaya</ab>. <ls>Bhk</ls>. 12.
<HI1><ab type="hw" value="gAyatrIpuraScaraRaviDi">Gāyatrīpuraścaraṇavidhi</ab>. <ls>NP</ls>. VII, 8.
<HI1>Darśapaurṇamāsapaddhati. <ls>K</ls>. 8.
<HI1>Darśapaurṇamāsaprayoga. <ls>NP</ls>. VII, 14.
<HI1><ab type="hw" value="punarADeyaprayoga">Punarādheyaprayoga</ab>. <ls>B</ls>. 1, 230.
<LEND>
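
A minimal sketch of how such cases might be located, assuming the entry layout seen in the sample above (an `<L>…<pc>` header line, one `<HI1>` line per work, and an `<ab type="hw">` wrapper on tagged titles). The function name `find_untagged_works` and both regular expressions are illustrative assumptions, not code from this repository.

```python
import re

# Assumed layout, inferred from the sample entry only:
#   "<L>550<pc>..." opens an entry, "<HI1>..." lines list works.
LNUM_RE = re.compile(r"^<L>([^<]+)<pc>")
# An <HI1> line whose title starts with plain text (no <ab ...> wrapper)
# and is followed by ". <ls>".
UNTAGGED_RE = re.compile(r"^<HI1>([^<\s][^<]*?)\.\s*<ls>")


def find_untagged_works(lines):
    """Yield (lnum, title) for <HI1> lines whose title lacks <ab type="hw">."""
    lnum = None
    for line in lines:
        header = LNUM_RE.match(line)
        if header:
            lnum = header.group(1)
            continue
        hit = UNTAGGED_RE.match(line)
        if hit and lnum is not None:
            yield lnum, hit.group(1).strip()
```

Run over the entry above, this would report Lnum 550 with the two untagged titles. It does not yet decide whether a title is Sanskrit or English.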

gasyoun commented 7 years ago

Darśapaurṇamāsapaddhati and Darśapaurṇamāsaprayoga

Tracking words just because they are longer than usual English words is not a good idea, I guess. After them comes an abbreviation and letters (including Roman). That's a pattern, I would say.

drdhaval2785 commented 7 years ago

I am filtering out the English words with the pyenchant library, so only the non-English ones will be highlighted.
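
A rough sketch of what that filter could look like. `enchant.Dict("en_US")` and its `check()` method are part of pyenchant; everything else here (the function names, the fallback word set used when pyenchant or its dictionary is unavailable) is a hypothetical illustration rather than the script actually used.

```python
def make_english_checker(fallback_words=frozenset()):
    """Return a callable word -> bool that says whether a word is English.

    Uses pyenchant's en_US dictionary when it can be loaded; otherwise falls
    back to a caller-supplied word set (an assumption for this sketch).
    """
    try:
        import enchant
        return enchant.Dict("en_US").check
    except Exception:
        # pyenchant or the en_US dictionary is not installed.
        return lambda word: word.lower() in fallback_words


def non_english_words(words, is_english):
    """Keep only the non-empty words that the checker does not recognise."""
    return [w for w in words if w and not is_english(w)]
```

The checker is passed in as an argument so that the filtering step can be tried with any predicate, not only pyenchant.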

drdhaval2785 commented 7 years ago

I have started generating log files. On the dev server, pywork/issue-acc-5/descFreq.txt lists the non-English words that were missed. They may be subject, catalogue or headword tags.
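
For illustration, a frequency list of this kind could be built roughly as follows. This is a sketch only: the `word:count` line format and the function name `format_desc_freq` are assumptions, and the real descFreq.txt may be produced differently.

```python
from collections import Counter


def format_desc_freq(words):
    """Return 'word:count' strings, most frequent first (ties sorted by word)."""
    counts = Counter(words)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [f"{word}:{count}" for word, count in ranked]
```

Sorting by descending count puts the most frequent candidates at the top of the file, which is where a quick manual review would start.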

drdhaval2785 commented 7 years ago

UPDATE descFreq.txt

This file gives detail about the missed subject / catalogue / headword tags.

A superficial reading suggests that the list is quite useful, e.g. Extr:635 Dīkṣita:567 Libr:309 Gov:304 Paṇḍita:289 Av:257

Extr - May mean Extra / Extract; no idea.
Dīkṣita - A common surname.
Libr, Gov - Missed cases of Gov. Or. Libr. Madras due to line breaks.
Av - aTarvaveda related treatises.

gasyoun commented 7 years ago

Indeed, most are real Sanskrit words, good catch!

Extract

Makes more sense than Extra :)

sanskrit-lexicon / ACC

Flag non-English words which are not headwords for examination #5