Open drdhaval2785 opened 3 years ago
If Amara has 10k words, how many headwords are spread around the 30 other kośas?
I have no idea. But koshas such as koshakalpataru definitely have many more headwords.
Might be a good idea to try out @kmadathil's sanskrit_parser here.
We could try that - we might need some extra logic wrapped around it though.
$ python ../../scripts/sanskrit_parser sandhi 'स्वरव्ययं स्वर्गनाकत्रिदिवत्रिदशालयः सुरलोको द्योदिवौ द्वे स्त्रियां क्लीबे त्रिविष्टपम्' --strict
unable to import 'smart_open.gcs', disabling that module
Interpreting input strictly
INFO Input String: स्वरव्ययं स्वर्गनाकत्रिदिवत्रिदशालयः सुरलोको द्योदिवौ द्वे स्त्रियां क्लीबे त्रिविष्टपम्
INFO Input String in SLP1: svaravyayaM svarganAkatridivatridaSAlayaH suraloko dyodivO dve striyAM klIbe trivizwapam
Splits:
INFO Split: ['svar', 'avyayam', 'svarga', 'nAka', 'tridiva', 'tridaSAlayas', 'sura', 'lokas', 'dyo', 'divO', 'dve', 'striyAm', 'klIbe', 'trivizwapam']
We would end up with splits like this. How do we get headwords from this?
Next step: for each word in the split, figure out and print the prAtipadika.
You could do that using the morphological tags feature for each word. When there are multiple options in our database (such as nAka, which, rightly I think, is in our DB in both pum/napum variants), we may need to make a choice.
$ time python ../../scripts/sanskrit_parser tags lokas --strict
unable to import 'smart_open.gcs', disabling that module
Interpreting input strictly
INFO Input String: lokas
Input String in SLP1: lokas
Morphological tags:
(loka, {ekavacanam, puMlliNgam, praTamAviBaktiH})
real 0m5.324s
user 0m4.980s
sys 0m0.340s
$ time python ../../scripts/sanskrit_parser tags nAka --strict
unable to import 'smart_open.gcs', disabling that module
Interpreting input strictly
INFO Input String: nAka
Input String in SLP1: nAka
Morphological tags:
(nAka, {puMlliNgam, saMboDanaviBaktiH, ekavacanam})
(nAka, {napuMsakaliNgam, samAsapUrvapadanAmapadam})
(nAka, {puMlliNgam, samAsapUrvapadanAmapadam})
(nAka, {saMboDanaviBaktiH, napuMsakaliNgam, ekavacanam})
real 0m5.296s
user 0m4.924s
sys 0m0.368s
Here's a more interesting example:
$ time python ../../scripts/sanskrit_parser tags divO
unable to import 'smart_open.gcs', disabling that module
Interpreting input loosely (strict_io set to false)
INFO Input String: divO
Input String in SLP1: divO
Morphological tags:
(div, {napuMsakaliNgam, dvitIyAviBaktiH, dvivacanam})
(div, {praTamAviBaktiH, dvivacanam, puMlliNgam})
(div, {praTamAviBaktiH, napuMsakaliNgam, dvivacanam})
(div, {puMlliNgam, dvivacanam, saMboDanaviBaktiH})
(div, {praTamAviBaktiH, strIliNgam, dvivacanam})
(div, {dvitIyAviBaktiH, dvivacanam, puMlliNgam})
(div, {dvitIyAviBaktiH, dvivacanam, strIliNgam})
(div, {strIliNgam, dvivacanam, saMboDanaviBaktiH})
(divi, {ekavacanam, puMlliNgam, saptamIviBaktiH})
real 0m5.249s
user 0m4.888s
sys 0m0.356s
Saptami of दिवि (खगभेदः) is not what we want here, so we will need rules to pick the right option. However, that should be workable by content experts.
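A rough sketch of what such a rule could look like, working only on the text that the `tags` command prints above. The function names and the "prefer a praTamAviBaktiH reading" rule are assumptions for illustration, not part of sanskrit_parser; the real rules would come from the content experts.

```python
import re

# Matches one analysis line as printed by the `tags` command in this thread,
# e.g. "(div, {praTamAviBaktiH, dvivacanam, puMlliNgam})".
TAG_LINE = re.compile(r"^\((\S+), \{(.*)\}\)$")

# A few lines copied from the divO output above.
SAMPLE = """\
Morphological tags:
(div, {napuMsakaliNgam, dvitIyAviBaktiH, dvivacanam})
(div, {praTamAviBaktiH, dvivacanam, puMlliNgam})
(divi, {ekavacanam, puMlliNgam, saptamIviBaktiH})
"""


def parse_tag_lines(text):
    """Return a list of (stem, set_of_tags) parsed from the CLI output text."""
    analyses = []
    for line in text.splitlines():
        match = TAG_LINE.match(line.strip())
        if match:
            stem = match.group(1)
            tags = {tag.strip() for tag in match.group(2).split(",")}
            analyses.append((stem, tags))
    return analyses


def nominative_stems(analyses):
    """Hypothetical rule: keep stems that have a praTamAviBaktiH reading.

    Falls back to every stem when no such reading exists (as with nAka).
    """
    preferred = [stem for stem, tags in analyses if "praTamAviBaktiH" in tags]
    chosen = preferred or [stem for stem, _ in analyses]
    # Drop duplicates while keeping the original order.
    return list(dict.fromkeys(chosen))


print(nominative_stems(parse_tag_lines(SAMPLE)))  # ['div']
```

With this rule the saptamI reading of divi is dropped for divO, while a word like nAka, which has no praTamA reading in the output above, still keeps its stem through the fallback.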
@kmadathil Rather than doing this tag analysis, can we just find the headword (from the headwords of the Cologne Sanskrit lexica) with the least edit distance from each such word? Would it be faster / more efficient?
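For comparison, a minimal sketch of the edit-distance idea in plain Python. The headword list here is a made-up toy list, not the actual Cologne headword data, and `nearest_headwords` is a hypothetical helper name.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def nearest_headwords(word, headwords):
    """Return all headwords at the minimum edit distance from `word`."""
    if not headwords:
        return []
    scored = [(levenshtein(word, hw), hw) for hw in headwords]
    best = min(distance for distance, _ in scored)
    return sorted(hw for distance, hw in scored if distance == best)


# Toy headword list in SLP1, assumed purely for illustration.
HEADWORDS = ["loka", "lola", "nAka", "div", "diva", "divi"]

print(nearest_headwords("lokas", HEADWORDS))  # ['loka']
print(nearest_headwords("divO", HEADWORDS))   # ['div', 'diva', 'divi']
```

Note that on this toy list divO ties between div, diva and divi, so edit distance alone would not settle the case that the tag analysis also found ambiguous.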
We will also have to remove the stopwords 'dve', 'striyAm', 'klIbe' etc.
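A small sketch of that filtering step, applied to the split shown earlier. The stopword set contains only the three words named here; the full list would need to be compiled by someone who knows the kośa conventions.

```python
# Only the three stopwords mentioned in this thread; the real list is an open question.
STOPWORDS = {"dve", "striyAm", "klIbe"}

# The split printed by sanskrit_parser earlier in this thread (SLP1).
SPLIT = ['svar', 'avyayam', 'svarga', 'nAka', 'tridiva', 'tridaSAlayas',
         'sura', 'lokas', 'dyo', 'divO', 'dve', 'striyAm', 'klIbe', 'trivizwapam']


def remove_stopwords(words, stopwords=STOPWORDS):
    """Drop grammatical annotation words, keeping the remaining order."""
    return [word for word in words if word not in stopwords]


candidates = remove_stopwords(SPLIT)
print(candidates)
```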
@drdhaval2785 Is speed / efficiency really important? Or is accuracy more important? I'd imagine the latter (supposing that you can afford to let the computer run for a couple of days). That may make your decision simpler.
Speed is not of paramount importance.
Headword identification has so far been done by researchers for two dictionaries, namely Amarakosha and Vaijayanti.
Proposed workflow