Parsers for Sanskrit / संस्कृतम्
NOTE: This project is still under development. Both over-generation (invalid forms/splits) and under-generation (missing valid forms/splits) are quite likely. Please see the Sanskrit Parser Stack section below for detailed status. Report any issues here.
Please feel free to ping us if you would like to collaborate on this project.
This project has been tested and developed using Python 3.7 - 3.9. To install the package:
pip install sanskrit_parser
To enable statistical scoring based on DCS, please also install gensim and sentencepiece:
pip install gensim sentencepiece
See the next section for some options if gensim installation fails and you need the scoring feature.
If `pip install` fails

The scoring implementation in sanskrit_parser depends on gensim for scoring, which requires the capability to build C extensions for Python. If you have an appropriate C compiler for your system, gensim should be installed automatically during `pip install`. We have seen some cases where `pip install` is unable to install gensim on Windows, and the following instructions are for those situations.

On Windows, gensim typically requires the installation of Microsoft build tools for Visual Studio 2019, as documented here. If you cannot, or do not want to, install MS build tools to compile extensions, some alternate options are:
Try it out

Run:
sudo mkdir /var/www/.sanskrit_parser
sudo chmod a+rwx /var/www/.sanskrit_parser
cd docs; make html
Stack of parsing tools
Sandhi splitting subroutine

- Input: Phoneme sequence and phoneme number to split at
- Action: Perform a sandhi split at the given phoneme number
- Output: Left and right sequences (multiple options will be output). No semantic validation is performed (that is left to higher levels).
Module that performs sandhi split/join and convenient rule definition is at `parser/sandhi.py`.
Rule definitions (human readable!) are at `lexical_analyzer/sandhi_rules/*.txt`
This is not accessed standalone from the command line.
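The subroutine's contract can be illustrated with a minimal self-contained sketch. The rule table and function below are hypothetical toys for illustration only, not the actual `parser/sandhi.py` API or the real rule files:

```python
# Toy illustration of the sandhi-split contract: split a phoneme string at
# a given index, returning every (left, right) candidate that some sandhi
# rule could have produced. The rules here are a tiny hypothetical sample,
# not the real tables in lexical_analyzer/sandhi_rules/*.txt.
SAMPLE_RULES = {
    # joined phoneme -> possible (left-final, right-initial) pairs
    "o": [("aH", ""), ("a", "u")],
    "A": [("a", "a"), ("A", "a"), ("a", "A")],
}

def sandhi_split(phonemes: str, idx: int):
    """Return all (left, right) splits of `phonemes` at position `idx`.

    No semantic validation is done -- every rule match is reported,
    plus the trivial 'no sandhi applied' split.
    """
    results = [(phonemes[:idx], phonemes[idx:])]  # plain split, no rule
    joined = phonemes[idx - 1] if idx > 0 else ""
    for left_tail, right_head in SAMPLE_RULES.get(joined, []):
        results.append((phonemes[: idx - 1] + left_tail,
                        right_head + phonemes[idx:]))
    return results
```

For example, `sandhi_split("rAmogacCati", 4)` yields the plain split `("rAmo", "gacCati")` plus the rule-derived candidates `("rAmaH", "gacCati")` and `("rAma", "ugacCati")`; a higher level must decide which of these are valid padas.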
Bootstrapped using a lexical lookup module built from
(Either or both of these can be enabled at runtime)
That gives us the minimum we need from Level 1 so that Level 2 can work. As the generator sub-project matures, it will take over the role of this level.
Use `sanskrit_parser tags` on the command line to access this.
- Input: Sanskrit sentence
- Action: Using dynamic programming, assemble the results of all choices:
  - To split or not to split at each phoneme
  - If split, all possible left/right combinations of phonemes that can result
  - Once split, check if the left section is a valid pada (using Level 1 tools to pick the pada type and tag it morphologically)
  - If the left section is valid, proceed to split the right section
- Output: All semantically valid sandhi split sequences
Module at `parser/sandhi_analyzer.py`
Use `sanskrit_parser sandhi` on the command line.
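The dynamic-programming scheme above can be sketched in miniature. Here a tiny hard-coded lexicon stands in for the Level 1 tools and a toy candidate generator stands in for the Level 0 subroutine; all names and rules are hypothetical illustrations:

```python
from functools import lru_cache

# Hypothetical stand-in for Level 1: a set of known valid padas.
LEXICON = {"rAmaH", "gacCati", "vanam", "vana", "m"}

def candidates(text: str, idx: int):
    """Stand-in for the Level 0 subroutine: the plain split at idx,
    plus one sample visarga rule (final 'o' may come from 'aH')."""
    cands = [(text[:idx], text[idx:])]
    if idx > 0 and text[idx - 1] == "o":
        cands.append((text[: idx - 1] + "aH", text[idx:]))
    return cands

@lru_cache(maxsize=None)
def split_sentence(text: str):
    """Return every sequence of valid padas covering `text`.

    At each position we either split or not; for each split we take all
    rule-derived left/right candidates, keep only those whose left piece
    is a valid pada (the Level 1 check), and recurse on the right piece.
    Memoization via lru_cache gives the dynamic-programming reuse.
    """
    if not text:
        return ((),)  # one valid (empty) continuation
    results = []
    for idx in range(1, len(text) + 1):
        for left, right in candidates(text, idx):
            if left in LEXICON:
                for rest in split_sentence(right):
                    results.append((left,) + rest)
    return tuple(results)
```

With this toy setup, `split_sentence("rAmogacCati")` recovers `("rAmaH", "gacCati")`, and `split_sentence("vanam")` returns both `("vanam",)` and `("vana", "m")`, illustrating why semantic validation at higher levels is still needed.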
- Input: Semantically valid sequence of tagged padas (output of Level 1)
- Action: Assemble graphs of morphological constraints:
  - viseShaNa - viseShya
  - karaka/vibhakti
  - vachana/puruSha constraints on tiGantas and subantas
Module at `parser/vakya_analyzer.py`
Use `sanskrit_parser vakya` on the command line.
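A minimal sketch of such constraint edges, assuming a simplified tag format (the tag fields and the two constraint checks below are illustrative only, not the actual `vakya_analyzer` data structures):

```python
from itertools import combinations

# Hypothetical tagged padas (the kind of output Level 1 produces),
# with deliberately simplified tags.
padas = [
    {"pada": "rAmaH",   "type": "subanta", "vibhakti": 1, "vachana": 1, "linga": "pum"},
    {"pada": "SrImAn",  "type": "subanta", "vibhakti": 1, "vachana": 1, "linga": "pum"},
    {"pada": "gacCati", "type": "tiGanta", "puruSha": 3, "vachana": 1},
]

def agreement_edges(padas):
    """Add a graph edge for each pair of padas satisfying a sample
    constraint: viseShaNa-viseShya (two subantas agreeing in
    vibhakti/vachana/linga), or a prathamA subanta agreeing in
    vachana with a tiGanta (a kartR-kriyA candidate)."""
    edges = []
    for a, b in combinations(padas, 2):
        if (a["type"] == b["type"] == "subanta"
                and all(a[k] == b[k] for k in ("vibhakti", "vachana", "linga"))):
            edges.append((a["pada"], b["pada"], "viseShaNa-viseShya"))
        elif {a["type"], b["type"]} == {"subanta", "tiGanta"}:
            sub, tig = (a, b) if a["type"] == "subanta" else (b, a)
            if sub["vibhakti"] == 1 and sub["vachana"] == tig["vachana"]:
                edges.append((sub["pada"], tig["pada"], "kartR-kriyA"))
    return edges
```

On the sample padas this yields one viseShaNa-viseShya edge (rAmaH - SrImAn) and two kartR-kriyA candidate edges into gacCati; the real analyzer then searches such a graph for globally consistent interpretations.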
Generate any valid Sanskrit pada using Ashtadhyayi rules, plus vartikas where necessary.
Rules are input in a high-level meta-language (currently yaml with imposed semantics; this may change), and the internal rule engine executes rules until a valid pada form is output. Input may be
Subantas of ajanta prAtipadikas are currently implemented; other features are being rolled in.
Use `sanskrit_generator` on the command line.
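The rule-engine control flow can be illustrated with a toy sketch. The two rules below are simplified inventions for a-anta stems, not the actual Ashtadhyayi rule set or the yaml meta-language:

```python
# Toy rule engine: each rule is (name, predicate, transform). Rules are
# tried in order on the current derivation state until one fires and
# produces a pada form. This mimics only the control flow; the real
# engine reads its rules from yaml and iterates until derivation ends.
RULES = [
    # su (prathamA ekavachana) after an a-anta stem -> visarga form
    ("a-anta + su -> aH",
     lambda s: s["stem"].endswith("a") and s["suffix"] == "su",
     lambda s: s["stem"][:-1] + "aH"),
    # Gas (ShaShThI ekavachana) after an a-anta stem -> asya
    ("a-anta + Gas -> asya",
     lambda s: s["stem"].endswith("a") and s["suffix"] == "Gas",
     lambda s: s["stem"][:-1] + "asya"),
]

def generate(stem: str, suffix: str) -> str:
    """Apply the first rule whose predicate matches the (stem, suffix) state."""
    state = {"stem": stem, "suffix": suffix}
    for name, pred, transform in RULES:
        if pred(state):
            return transform(state)
    raise ValueError("no rule applies")
```

For example, `generate("rAma", "su")` produces `rAmaH` and `generate("rAma", "Gas")` produces `rAmasya`; the real engine chains many such rules, with conflict resolution, before a final form emerges.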
See: Grammar as a Foreign Language, Vinyals, Kaiser et al., Google: http://arxiv.org/abs/1412.7449
- Input: Sanskrit sentence
- Output: Sentence split into padas with tags
- Training data: DCS corpus, converted by Vishvas Vasuki
- Status: Not begun