kmadathil / sanskrit_parser

Parsers for Sanskrit / संस्कृतम्
MIT License
69 stars 21 forks source link

L3: Morphological Analyzer : Current state and way forward #65

Closed kmadathil closed 4 years ago

kmadathil commented 6 years ago

Current State

SanskritMorphologicalAnalyzer is implemented as a constraint solver (using python-constraints) that does the following

  1. Get all valid splits from SanskritLexicalAnalyzer
  2. Apply a bunch of rules (modelled as finite domain constraints) on each split, with lexical tags on each word
  3. Output tagged splits that satisfy constraints.

The following sets of constraints are implemented

  1. (Optional) Need exactly one lakara in the sentence
  2. Last word cannot be an upasarga or samAsapUrvapada
  3. If a lakara is found, all words in prathamA vibhakti must match it in gender and number. If words in sambodhana exist, then the lakara must be in madhyamapuruSha
  4. All words in the same vibhakti must match in gender and number
  5. Only sakarmaka dhatus are allowed to have objects (in dvitIya)
  6. All samAsapUrvapadas must be followed by another samAsapUrvapada or a subanta

This can handle phrases, or simple sentences

Way forward

The following rules are concievable in the same context

  1. Upasargas must be followed by a dhAtu form
  2. karmapravacanIyas must be preceded or succeeded by a term in dvItiya (except for the exceptions defined in ashTAdhyAyi 2.3)
  3. Where avyayas have a defined vibhakti for associated terms, such a term must precede or succeed it (as defined in ashTAdhyAyi 2.3. Eg: vina takes dvitIya/tritIya/pa~ncami)

Basic extensions

  1. Handle (kta, ktavatu) instead/additional to a lakara (Rules 1,3,5)
  2. Handle krts appearing as noun phrases (सा तेन कर्तव्यं करोति must not be flagged invalid etc.)
  3. Handle शतृ शानच् appearing as a) noun phrase, b) verb लक्षण

Beyond this, we need to handle non-simple sentences (ie - more than one verb form). This will require recognizing various types of non-simple sentences (resisting the temptation to call them complex/compound as Apte does, but maybe thats the way forward?). a. Recognize dhatu forms that cause non-simple sentence structures. b. Partition into simple sentences and apply constraints as above on each sentence. c. Partitioning might not be trivial or non-ambiguous - it may require some graph algorithms

Performance: Is the current solver (or the finite domain constraint model itself) optimal for this? Can we recast the entire problem in graph form in some way?

Thoughts?

avinashvarna commented 6 years ago

You've made very good progress so far, and I think the next steps would be quite valuable.

We should also start thinking about alternative approaches such as this. Even though the paper does not have all the details to easily reproduce, it should be enough to get us started with some experiments.

kmadathil commented 4 years ago

I have been very impressed with the work of Prof. Amba Kulkarni and associates. Her papers are quite interesting.

I have begun implementing a graph based L3 based on those ideas. Please see branch l3new.

kmadathil commented 4 years ago

@avinashvarna @vvasuki @codito If you guys are still watching this, please look at the l3new branch. We have simple sentences parsable now.

I have made the following structural changes. Both the lexical_analyzer and morphological_analyzer packages have been merged into sanskrit_parser.parser. I have added scripts/sanskrit_parser which works as the command line engine.

(l3new)$ scripts/sanskrit_parser vakya devadattogrAmaMgacCati --dot graph.dot
INFO     Input String: devadattogrAmaMgacCati
INFO     Input String in SLP1: devadattogrAmaNgacCati
...
INFO     Parses after validity check 4
INFO     Time for parse: 0.041228s
Parse 0
devadattaH=>['devadatta', {ekavacanam, praTamAviBaktiH, puMlliNgam}]
grAmam=>['grAma', {ekavacanam, napuMsakaliNgam, dvitIyAviBaktiH}]
gacCati=>['gam', {ekavacanam, kartari, praTamapuruzaH, law, prATamikaH}]
Parse 1
devadattaH=>['devadatta', {ekavacanam, praTamAviBaktiH, puMlliNgam}]
grAmam=>['grAma', {ekavacanam, napuMsakaliNgam, dvitIyAviBaktiH}]
gacCati=>['gam', {ekavacanam, parasmEpadam, praTamapuruzaH, law}]
Parse 2
devadattaH=>['devadatta', {ekavacanam, praTamAviBaktiH, puMlliNgam}]
grAmam=>['grAma', {ekavacanam, puMlliNgam, dvitIyAviBaktiH}]
gacCati=>['gam', {ekavacanam, parasmEpadam, praTamapuruzaH, law}]
Parse 3
devadattaH=>['devadatta', {ekavacanam, praTamAviBaktiH, puMlliNgam}]
grAmam=>['grAma', {ekavacanam, puMlliNgam, dvitIyAviBaktiH}]
gacCati=>['gam', {ekavacanam, kartari, praTamapuruzaH, law, prATamikaH}]
INFO     End Morphological Analysis
INFO     -----------
INFO     Performance
INFO     Time taken for split: 1.017759s
INFO     Time taken for path: 0.000128s

I've attached the parse graphs:

THis is the graph with all possible relationships: Graph with all possible relationships

This is the subgraph for the first consistent parse: Parse Graph for First Parse

avinashvarna commented 4 years ago

That is pretty impressive. I will take a look at the new branch over the weekend

vvasuki commented 4 years ago

Why not the following split - "देवदत्तः अग्रामम् गच्छति"? Also, will "देवदत्तोग्रामङ्गच्छति" yield similar results?

kmadathil commented 4 years ago

Second question - Yes:

$ scripts/sanskrit_parser vakya देवदत्तोग्रामङ्गच्छति
INFO     Input String: देवदत्तोग्रामङ्गच्छति
INFO     Input String in SLP1: devadattogrAmaNgacCati
...
Splits:
Lexical Split: [devadattaH, grAmam, gacCati]
INFO     Lexical Split: [devadattaH, grAmam, gacCati]
...
Parse 0
devadattaH=>['devadatta', {puMlliNgam, ekavacanam, praTamAviBaktiH}]
grAmam=>['grAma', {ekavacanam, dvitIyAviBaktiH, puMlliNgam}]
gacCati=>['gam', {prATamikaH, law, kartari, parasmEpadI, ekavacanam, praTamapuruzaH}]
Parse 1
devadattaH=>['devadatta', {puMlliNgam, ekavacanam, praTamAviBaktiH}]
grAmam=>['grAma', {ekavacanam, dvitIyAviBaktiH, puMlliNgam}]
gacCati=>['gam', {ekavacanam, law, praTamapuruzaH, parasmEpadam}]
Parse 2
devadattaH=>['devadatta', {puMlliNgam, ekavacanam, praTamAviBaktiH}]
grAmam=>['grAma', {ekavacanam, napuMsakaliNgam, dvitIyAviBaktiH}]
gacCati=>['gam', {prATamikaH, law, kartari, parasmEpadI, ekavacanam, praTamapuruzaH}]
Parse 3
devadattaH=>['devadatta', {puMlliNgam, ekavacanam, praTamAviBaktiH}]
grAmam=>['grAma', {ekavacanam, napuMsakaliNgam, dvitIyAviBaktiH}]
gacCati=>['gam', {ekavacanam, law, praTamapuruzaH, parasmEpadam}]
kmadathil commented 4 years ago

"Why not the following split - "देवदत्तः अग्रामम् गच्छति"?" - It does appear in the splits (, so a high enough --max-paths should produce that too in the parses. The default for vakya analysis is --max-paths 1

(l3new)$ scripts/sanskrit_parser sandhi देवदत्तोग्रामङ्गच्छति
Interpreting input loosely (strict_io set to false)
Input String: देवदत्तोग्रामङ्गच्छति
Input String in SLP1: devadattogrAmaNgacCati
Start Split
End DAG generation
End pathfinding 1579194977.2579806
Splits:
[devadattaH, grAmam, gacCati]
[deva, dattaH, grAmam, gacCati]
[devadattaH, a, grAmam, gacCati]
[devadattaH, agra, Amam, gacCati]
[deva, datta, U, grAmam, gacCati]
[devadatta, U, grAmam, gacCati]
[devadatta, u, grAmam, gacCati]
[deva, dattA, U, grAmam, gacCati]
[devadattA, u, grAmam, gacCati]
[devadattA, U, grAmam, gacCati]
-----------
Performance
Time for graph generation = 0.830699s
Total time for graph generation + find paths = 0.875322s

With the score option added to vakya_analyzer, I do see parses for this split with --max-paths 10

Lexical Split: [devadattaH, a, grAmam, gacCati]
Parse 0
devadattaH=>['devadatta', {ekavacanam, puMlliNgam, praTamAviBaktiH}]
a=>['a', {puMlliNgam, samAsapUrvapadanAmapadam}]
grAmam=>['grAma', {ekavacanam, dvitIyAviBaktiH, napuMsakaliNgam}]
gacCati=>['gam', {ekavacanam, kartari, praTamapuruzaH, prATamikaH, parasmEpadI, law}]
Parse 1
devadattaH=>['devadatta', {ekavacanam, puMlliNgam, praTamAviBaktiH}]
a=>['a', {puMlliNgam, samAsapUrvapadanAmapadam}]
grAmam=>['grAma', {ekavacanam, dvitIyAviBaktiH, napuMsakaliNgam}]
gacCati=>['gam', {praTamapuruzaH, parasmEpadam, ekavacanam, law}]
Parse 2
devadattaH=>['devadatta', {ekavacanam, puMlliNgam, praTamAviBaktiH}]
a=>['a', {puMlliNgam, samAsapUrvapadanAmapadam}]
grAmam=>['grAma', {ekavacanam, dvitIyAviBaktiH, puMlliNgam}]
gacCati=>['gam', {ekavacanam, kartari, praTamapuruzaH, prATamikaH, parasmEpadI, law}]
Parse 3
devadattaH=>['devadatta', {ekavacanam, puMlliNgam, praTamAviBaktiH}]
a=>['a', {puMlliNgam, samAsapUrvapadanAmapadam}]
grAmam=>['grAma', {ekavacanam, dvitIyAviBaktiH, puMlliNgam}]
gacCati=>['gam', {praTamapuruzaH, parasmEpadam, ekavacanam, law}]
kmadathil commented 4 years ago

As of now, we find paths through the SandhiGraph and create a VakyaGraph for each path. Once we're happy with these, we can go on to creating a VakyaGraph directly from a SandhiGraph instead. This will require algo tweaks. The basic algorithm to come up with a valid parse is a generalized spanning tree, which is known to be NP. We need to tread carefully.

kmadathil commented 4 years ago

Closing this - any particular issues with this can be discussed in separate threads