NOTE (July 2018): The description below is out-of-date at the moment. It can only be used to get an approximate idea.
Table of contents :TOC:
What
Code to identify the metre of a Sanskrit verse.
Web version currently serving at http://sanskritmetres.appspot.com/
Can also be used as a Python library.
** Examples
In the web version, try the following inputs.
kāṣṭhād agnir jāyate mathya-mānād- bhūmis toyaṃ khanya-mānā dadāti| sotsāhānāṃ nāsty asādhyaṃ narāṇāṃ mārgārabdhāḥ sarva-yatnāḥ phalanti||
or (note that this one intentionally has many typos):
काष्ठाद् अग्नि जायते मथ्यमानाद्भूमिस्तोय खन्यमाना ददाति। सोत्साहानां नास्त्यसाध्यं नराणां मार्गारब्धाः सवयत्नाः फलन्ति॥
If using as a library (TODO: document this better):
import identifier_pipeline
verse = r'''kāṣṭhād agnir jāyate mathya-mānād- bhūmis toyaṃ khanya-mānā dadāti| sotsāhānāṃ nāsty asādhyaṃ narāṇāṃ mārgārabdhāḥ sarva-yatnāḥ phalanti||'''
identifier = identifier_pipeline.IdentifierPipeline() match_results = identifier.IdentifyFromText(verse)
The design of the program is as follows.
** Transform the input (Read, Scan)
The input passes through the following representations.
*** The raw input
This is whatever is typed into the textarea (for the web interface) or given as input to `IdentifierPipeline`.
Consider the examples above.
*** The input in slp1
Whatever the input script (transliteration scheme) used,
the input is cleaned up and "read" into a limited Sanskrit alphabet (slp1).
For instance, the examples above are read as the following:
#+BEGIN_EXAMPLE
kAzWAdagnirjAyatemaTyamAnAd
BUmistoyaMKanyamAnAdadAti
sotsAhAnAMnAstyasADyaMnarARAM
mArgArabDAHsarvayatnAHPalanti
#+END_EXAMPLE
and
#+BEGIN_EXAMPLE
kAzWAdagnijAyate
maTyamAnAdBUmistoyaKanyamAnAdadAti
sotsAhAnAMnAstyasADyaM
narARAMmArgArabDAHsavayatnAHPalanti
#+END_EXAMPLE
respectively.
*** The metrical signature of the input
We next [[https://en.wikipedia.org/wiki/Scansion][scan]] the input, to reduce it to a pattern of /laghu/ (denoted L) and /guru/ (denoted G) syllables.
Our two examples above are scanned into the lists:
#+BEGIN_EXAMPLE
['GGGGGLGGLGG',
'GGGGGLGGLGL',
'GGGGGLGGLGG',
'GGGGGLGGLGL']
#+END_EXAMPLE
and
#+BEGIN_EXAMPLE
['GGGLGLG',
'GLGGGGGLGLGGLGL',
'GGGGGLGG',
'LGGGGGGLLGGLGL']
#+END_EXAMPLE
respectively.
** Identify
Finally, we compare this metrical signature (or "pattern lines") against a database of known patterns.
For example, in our database we have the information that Śālinī is a /sama-vṛtta/ metre consisting of 4 lines (/pāda/-s / quarters) each having the pattern
GGGG—GLGGLGG
Thus Śālinī is recognized as the (probable, best-guess) metre of the input verse.
Note that in the second example, even though no line matches a line of Śālinī, the program is still clever enough to detect a match.
Look at the README inside the ~identify~ directory for more details on the matching heuristics used.
Thus the code can detect partial matches: if there are metrical errors in the verse, but some parts of it are in some metre, then that metre still has a chance of being recognized.
We might also have multiple results when we have multiple metres guessed, such as when different lines are in different metres.
** Display
The detected metre is displayed, along with how the verse fits the metre, and information about the metre.
TODO: Describe this.
(Everything below this line needs even more rewriting.)
See deps.png for the dependency graph.
** Read
Covered by the files in ~read~ and their dependencies.
Detecting the transliteration format of the input, removing junk characters that are not part of the verse, and transliterating the input to SLP1 (the encoding we use internally).
** Scan
Determining the pattern of gurus and laghus.
The functions in scan.py take this cleaned-up verse, and convert it to a pattern of laghus and gurus. A "pattern" means a sequence over the alphabet {'L', 'G'}.
** Identify
Identification algorithm: Given a verse,
1. Look for the full verse's pattern in ~known_metre_patterns~.
2. Loop through ~known_metre_regexes~ and see if any match the full
verses's pattern.
3. Look in ~known_partial_patterns~ (then ~known_partial_regexes~) for:
-- whole verse,
-- each line,
-- each half,
-- each quarter.
4. [TODO/Maybe] Look for substrings, find closest match, etc.?
Might have to restrict to the popular metres for efficiency.
** Metrical data
* A "pattern" means a sequence over the alphabet {'L', 'G'}.
* A "regex" (for us) is a regular expression that matches some patterns.
(TODO: This is obsolete.)
We use the following data structures:
* ~known_metre_patterns~, a dict mapping a pattern to a MatchResult.
* ~known_metre_regexes~, a list of pairs (regex, MatchResult).
* ~known_partial_patterns~, a dict mapping a pattern to ~MatchResult~-s.
* ~known_partial_regexes~, a list of pairs (regex, MatchResult).
A MatchResult is usually arrived at by looking at a pattern (or list of
patterns), and can be seen as a tuple (metre_name, match_type):
metre_name - name of the metre,
match_type - used to distinguish between matching one pāda (quarter) or one
ardha (half) of a metre. Or, in ardha-sama metres, it can
distinguish between odd and even pādas.
** Display
Display the list of metres found as possible guesses. For vrtta metres, we also try to "align" the input verse to the metre, so that it's more clear where to break it, etc. (And when the input verse has metrical errors, it's clear what they are.)