sanskrit-kosha / kosha

Repository to store Sanskrit koshas and scripts to process them.

Semi-automatic identification of headwords #38

Open drdhaval2785 opened 3 years ago

drdhaval2785 commented 3 years ago

There are currently two dictionaries for which headword identification has been done manually by researchers: Amarakosha and Vaijayanti.

Proposed workflow

  1. Read the annotated Amarakosha (call it a1.txt).
  2. Write code (c.py) that approximates the annotated headwords by identifying patterns. (DO NOT use v1.txt, the annotated Vaijayanti.)
  3. Apply c.py to the unannotated Vaijayanti (v0.txt).
  4. Compare c.py's output on v0.txt against v1.txt to calculate a matching percentage (see the sketch after this list).
  5. Improve c.py based on a1.txt alone to improve the score.
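
A minimal sketch of step 4, assuming both files end up with one headword per line; the file names and format here are placeholders, not anything fixed in this repository:

# Minimal sketch of step 4 (assumed format: one headword per line in each file).
def read_headwords(path):
    with open(path, encoding='utf-8') as f:
        return {line.strip() for line in f if line.strip()}

def matching_percentage(predicted_path, annotated_path):
    predicted = read_headwords(predicted_path)   # c.py applied to v0.txt
    annotated = read_headwords(annotated_path)   # researcher-annotated v1.txt
    return 100.0 * len(predicted & annotated) / len(annotated) if annotated else 0.0

print(matching_percentage('v0_predicted.txt', 'v1.txt'))   # hypothetical file names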
gasyoun commented 3 years ago

If Amara has 10k words, how many headwords are spread around the 30 other kośas?

drdhaval2785 commented 3 years ago

I have no idea. But koshas such as koshakalpataru definitely have many more headwords.

vvasuki commented 3 years ago

Might be a good idea to try out @kmadathil's sanskrit_parser here.

kmadathil commented 3 years ago

We could try that - we might need some extra logic wrapped around it though.

$ python ../../scripts/sanskrit_parser sandhi 'स्वरव्ययं स्वर्गनाकत्रिदिवत्रिदशालयः सुरलोको द्योदिवौ द्वे स्त्रियां क्लीबे त्रिविष्टपम्' --strict
unable to import 'smart_open.gcs', disabling that module
Interpreting input strictly
INFO     Input String: स्वरव्ययं स्वर्गनाकत्रिदिवत्रिदशालयः सुरलोको द्योदिवौ द्वे स्त्रियां क्लीबे त्रिविष्टपम्
INFO     Input String in SLP1: svaravyayaM svarganAkatridivatridaSAlayaH suraloko dyodivO dve striyAM klIbe trivizwapam
Splits:
INFO     Split: ['svar', 'avyayam', 'svarga', 'nAka', 'tridiva', 'tridaSAlayas', 'sura', 'lokas', 'dyo', 'divO', 'dve', 'striyAm', 'klIbe', 'trivizwapam']

We would end up with splits like this. How do we get headwords from this?

vvasuki commented 3 years ago

> We would end up with splits like this. How do we get headwords from this?

Next step: for each word in the split, figure out and print the prAtipadika.

kmadathil commented 3 years ago

You could do that using the morphological tags feature for each word. When there are multiple options in our database (such as nAka, which is, rightly I think, in our DB in both pum/napum variants), we may need to make a choice.

$ time python ../../scripts/sanskrit_parser tags lokas --strict
unable to import 'smart_open.gcs', disabling that module
Interpreting input strictly
INFO     Input String: lokas
Input String in SLP1: lokas
Morphological tags:
(loka, {ekavacanam, puMlliNgam, praTamAviBaktiH})

real    0m5.324s
user    0m4.980s
sys     0m0.340s

$ time python ../../scripts/sanskrit_parser tags nAka --strict 
unable to import 'smart_open.gcs', disabling that module
Interpreting input strictly
INFO     Input String: nAka
Input String in SLP1: nAka
Morphological tags:
(nAka, {puMlliNgam, saMboDanaviBaktiH, ekavacanam})
(nAka, {napuMsakaliNgam, samAsapUrvapadanAmapadam})
(nAka, {puMlliNgam, samAsapUrvapadanAmapadam})
(nAka, {saMboDanaviBaktiH, napuMsakaliNgam, ekavacanam})

real    0m5.296s
user    0m4.924s
sys     0m0.368s
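
For automation, the same CLI invocation shown above could be wrapped and its output parsed. The sketch below is only an illustration of that idea: the script path, the `tags ... --strict` command, and the output format are taken from the runs above, everything else is an assumption.

import re
import subprocess

# Lines like "(loka, {ekavacanam, puMlliNgam, praTamAviBaktiH})"
TAG_LINE = re.compile(r'^\((?P<stem>[^,]+), \{(?P<tags>[^}]*)\}\)$')

def pratipadikas(word, parser_script='../../scripts/sanskrit_parser'):
    # Run the same command shown above and collect (stem, tag-set) pairs.
    out = subprocess.run(['python', parser_script, 'tags', word, '--strict'],
                         capture_output=True, text=True).stdout
    stems = []
    for line in out.splitlines():
        m = TAG_LINE.match(line.strip())
        if m:
            tags = {t.strip() for t in m.group('tags').split(',')}
            stems.append((m.group('stem'), tags))
    return stems

# e.g. pratipadikas('lokas') -> [('loka', {'ekavacanam', 'puMlliNgam', 'praTamAviBaktiH'})]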
kmadathil commented 3 years ago

Here's a more interesting example:

$ time python ../../scripts/sanskrit_parser tags divO
unable to import 'smart_open.gcs', disabling that module
Interpreting input loosely (strict_io set to false)
INFO     Input String: divO
Input String in SLP1: divO
Morphological tags:
(div, {napuMsakaliNgam, dvitIyAviBaktiH, dvivacanam})
(div, {praTamAviBaktiH, dvivacanam, puMlliNgam})
(div, {praTamAviBaktiH, napuMsakaliNgam, dvivacanam})
(div, {puMlliNgam, dvivacanam, saMboDanaviBaktiH})
(div, {praTamAviBaktiH, strIliNgam, dvivacanam})
(div, {dvitIyAviBaktiH, dvivacanam, puMlliNgam})
(div, {dvitIyAviBaktiH, dvivacanam, strIliNgam})
(div, {strIliNgam, dvivacanam, saMboDanaviBaktiH})
(divi, {ekavacanam, puMlliNgam, saptamIviBaktiH})

real    0m5.249s
user    0m4.888s
sys     0m0.356s

Saptami of दिवि (खगभेदः) is not what we want here, so we will need rules to pick the right option. However, that should be workable for content experts.
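
As a purely illustrative example of such a rule (an assumption on my part, not anything agreed in this thread), one could prefer a prathamā-vibhakti reading and fall back to any analysis that is not a compound first member:

def pick_stem(options):
    # options: list of (stem, tag-set) pairs, e.g. as parsed from the CLI output above.
    for stem, tags in options:
        if 'praTamAviBaktiH' in tags:                  # prefer a nominative reading
            return stem
    for stem, tags in options:
        if 'samAsapUrvapadanAmapadam' not in tags:     # avoid compound-first-member readings
            return stem
    return options[0][0] if options else None

On the divO example above this would settle on div rather than divi, but the actual rules would need to come from the content experts.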

drdhaval2785 commented 3 years ago

@kmadathil Rather than doing this tag analysis, can we just find the headword (from the headwords of the Cologne Sanskrit lexica) with the least edit distance from such words? Would that be faster / more efficient?

We will also have to remove stopwords such as 'dve', 'striyAm', 'klIbe', etc.
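
A minimal sketch of that idea, using difflib's similarity ratio from the standard library as a rough stand-in for edit distance; the Cologne headword file name and the cutoff value are placeholders:

import difflib

STOPWORDS = {'dve', 'striyAm', 'klIbe'}        # gender/number markers, to be extended

def load_cologne_headwords(path='cologne_headwords.txt'):   # placeholder file name
    with open(path, encoding='utf-8') as f:
        return [line.strip() for line in f if line.strip()]

def nearest_headword(word, headwords):
    # Return the closest Cologne headword, or None for stopwords / poor matches.
    if word in STOPWORDS:
        return None
    matches = difflib.get_close_matches(word, headwords, n=1, cutoff=0.6)
    return matches[0] if matches else None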

vvasuki commented 3 years ago

@drdhaval2785 Is speed / efficiency really important? Or is accuracy more important? I'd imagine the latter (supposing that you can afford to let the computer run for a couple of days). That may make your decision simpler.

drdhaval2785 commented 3 years ago

Speed is not of paramount importance.