You've made very good progress so far, and I think the next steps would be quite valuable.
We should also start thinking about alternative approaches such as this one. Even though the paper does not include all the details needed to reproduce it easily, it should be enough to get us started with some experiments.
I have been very impressed with the work of Prof. Amba Kulkarni and associates. Her papers are quite interesting.
I have begun implementing a graph-based L3 along those lines. Please see branch l3new.
@avinashvarna @vvasuki @codito If you guys are still watching this, please look at the l3new branch. Simple sentences are now parsable.
I have made the following structural changes: both the lexical_analyzer and morphological_analyzer packages have been merged into sanskrit_parser.parser, and I have added scripts/sanskrit_parser, which works as the command-line engine.
(l3new)$ scripts/sanskrit_parser vakya devadattogrAmaMgacCati --dot graph.dot
INFO Input String: devadattogrAmaMgacCati
INFO Input String in SLP1: devadattogrAmaNgacCati
...
INFO Parses after validity check 4
INFO Time for parse: 0.041228s
Parse 0
devadattaH=>['devadatta', {ekavacanam, praTamAviBaktiH, puMlliNgam}]
grAmam=>['grAma', {ekavacanam, napuMsakaliNgam, dvitIyAviBaktiH}]
gacCati=>['gam', {ekavacanam, kartari, praTamapuruzaH, law, prATamikaH}]
Parse 1
devadattaH=>['devadatta', {ekavacanam, praTamAviBaktiH, puMlliNgam}]
grAmam=>['grAma', {ekavacanam, napuMsakaliNgam, dvitIyAviBaktiH}]
gacCati=>['gam', {ekavacanam, parasmEpadam, praTamapuruzaH, law}]
Parse 2
devadattaH=>['devadatta', {ekavacanam, praTamAviBaktiH, puMlliNgam}]
grAmam=>['grAma', {ekavacanam, puMlliNgam, dvitIyAviBaktiH}]
gacCati=>['gam', {ekavacanam, parasmEpadam, praTamapuruzaH, law}]
Parse 3
devadattaH=>['devadatta', {ekavacanam, praTamAviBaktiH, puMlliNgam}]
grAmam=>['grAma', {ekavacanam, puMlliNgam, dvitIyAviBaktiH}]
gacCati=>['gam', {ekavacanam, kartari, praTamapuruzaH, law, prATamikaH}]
INFO End Morphological Analysis
INFO -----------
INFO Performance
INFO Time taken for split: 1.017759s
INFO Time taken for path: 0.000128s
I've attached the parse graphs:
This is the graph with all possible relationships:
This is the subgraph for the first consistent parse:
That is pretty impressive. I will take a look at the new branch over the weekend.
Why not the following split - "देवदत्तः अग्रामम् गच्छति"? Also, will "देवदत्तोग्रामङ्गच्छति" yield similar results?
Second question - Yes:
$ scripts/sanskrit_parser vakya देवदत्तोग्रामङ्गच्छति
INFO Input String: देवदत्तोग्रामङ्गच्छति
INFO Input String in SLP1: devadattogrAmaNgacCati
...
Splits:
Lexical Split: [devadattaH, grAmam, gacCati]
INFO Lexical Split: [devadattaH, grAmam, gacCati]
...
Parse 0
devadattaH=>['devadatta', {puMlliNgam, ekavacanam, praTamAviBaktiH}]
grAmam=>['grAma', {ekavacanam, dvitIyAviBaktiH, puMlliNgam}]
gacCati=>['gam', {prATamikaH, law, kartari, parasmEpadI, ekavacanam, praTamapuruzaH}]
Parse 1
devadattaH=>['devadatta', {puMlliNgam, ekavacanam, praTamAviBaktiH}]
grAmam=>['grAma', {ekavacanam, dvitIyAviBaktiH, puMlliNgam}]
gacCati=>['gam', {ekavacanam, law, praTamapuruzaH, parasmEpadam}]
Parse 2
devadattaH=>['devadatta', {puMlliNgam, ekavacanam, praTamAviBaktiH}]
grAmam=>['grAma', {ekavacanam, napuMsakaliNgam, dvitIyAviBaktiH}]
gacCati=>['gam', {prATamikaH, law, kartari, parasmEpadI, ekavacanam, praTamapuruzaH}]
Parse 3
devadattaH=>['devadatta', {puMlliNgam, ekavacanam, praTamAviBaktiH}]
grAmam=>['grAma', {ekavacanam, napuMsakaliNgam, dvitIyAviBaktiH}]
gacCati=>['gam', {ekavacanam, law, praTamapuruzaH, parasmEpadam}]
"Why not the following split - "देवदत्तः अग्रामम् गच्छति"?" - It does appear in the splits (, so a high enough --max-paths should produce that too in the parses. The default for vakya analysis is --max-paths 1
(l3new)$ scripts/sanskrit_parser sandhi देवदत्तोग्रामङ्गच्छति
Interpreting input loosely (strict_io set to false)
Input String: देवदत्तोग्रामङ्गच्छति
Input String in SLP1: devadattogrAmaNgacCati
Start Split
End DAG generation
End pathfinding 1579194977.2579806
Splits:
[devadattaH, grAmam, gacCati]
[deva, dattaH, grAmam, gacCati]
[devadattaH, a, grAmam, gacCati]
[devadattaH, agra, Amam, gacCati]
[deva, datta, U, grAmam, gacCati]
[devadatta, U, grAmam, gacCati]
[devadatta, u, grAmam, gacCati]
[deva, dattA, U, grAmam, gacCati]
[devadattA, u, grAmam, gacCati]
[devadattA, U, grAmam, gacCati]
-----------
Performance
Time for graph generation = 0.830699s
Total time for graph generation + find paths = 0.875322s
With the score option added to vakya_analyzer, I do see parses for this split with --max-paths 10:
Lexical Split: [devadattaH, a, grAmam, gacCati]
Parse 0
devadattaH=>['devadatta', {ekavacanam, puMlliNgam, praTamAviBaktiH}]
a=>['a', {puMlliNgam, samAsapUrvapadanAmapadam}]
grAmam=>['grAma', {ekavacanam, dvitIyAviBaktiH, napuMsakaliNgam}]
gacCati=>['gam', {ekavacanam, kartari, praTamapuruzaH, prATamikaH, parasmEpadI, law}]
Parse 1
devadattaH=>['devadatta', {ekavacanam, puMlliNgam, praTamAviBaktiH}]
a=>['a', {puMlliNgam, samAsapUrvapadanAmapadam}]
grAmam=>['grAma', {ekavacanam, dvitIyAviBaktiH, napuMsakaliNgam}]
gacCati=>['gam', {praTamapuruzaH, parasmEpadam, ekavacanam, law}]
Parse 2
devadattaH=>['devadatta', {ekavacanam, puMlliNgam, praTamAviBaktiH}]
a=>['a', {puMlliNgam, samAsapUrvapadanAmapadam}]
grAmam=>['grAma', {ekavacanam, dvitIyAviBaktiH, puMlliNgam}]
gacCati=>['gam', {ekavacanam, kartari, praTamapuruzaH, prATamikaH, parasmEpadI, law}]
Parse 3
devadattaH=>['devadatta', {ekavacanam, puMlliNgam, praTamAviBaktiH}]
a=>['a', {puMlliNgam, samAsapUrvapadanAmapadam}]
grAmam=>['grAma', {ekavacanam, dvitIyAviBaktiH, puMlliNgam}]
gacCati=>['gam', {praTamapuruzaH, parasmEpadam, ekavacanam, law}]
As of now, we find paths through the SandhiGraph and create a VakyaGraph for each path. Once we're happy with these, we can go on to creating a VakyaGraph directly from a SandhiGraph instead. This will require algorithmic tweaks.
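As a rough illustration of that flow (made-up data and plain networkx rather than the actual SandhiGraph/VakyaGraph classes, so treat every name here as a stand-in):

# Hypothetical sketch: enumerate paths through a split DAG (each path is a
# candidate split into padas), then build one interpretation graph per path.
import networkx as nx

# Toy split DAG for devadattogrAmaNgacCati; edges follow possible sandhi splits.
dag = nx.DiGraph()
dag.add_edges_from([
    ("START", "devadattaH"), ("devadattaH", "grAmam"),
    ("devadattaH", "a"), ("a", "grAmam"),
    ("grAmam", "gacCati"), ("gacCati", "END"),
])

for path in nx.all_simple_paths(dag, "START", "END"):
    padas = path[1:-1]                 # drop the START/END sentinel nodes
    vakya_graph = nx.Graph()           # one graph per candidate split
    vakya_graph.add_nodes_from(padas)  # the real code would add one node per
                                       # morphological analysis of each pada
    print(padas, vakya_graph.number_of_nodes())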
The basic algorithm for coming up with a valid parse is a generalized spanning tree search, which is known to be NP-hard. We need to tread carefully.
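For intuition, here is a brute-force sketch of that "one node per pada, tree over the chosen nodes" selection on a toy VakyaGraph-like structure, again with plain networkx. The candidate analyses and relation edges are made up for illustration, and the real search will need to be much smarter than this exhaustive product.

# Hypothetical sketch of the generalized-spanning-tree style parse search.
# Each group holds the alternative analyses of one pada; a valid parse picks
# exactly one node per group such that the relation edges among the picked
# nodes form a tree. Brute force is exponential in the number of padas.
from itertools import product
import networkx as nx

groups = [
    [("devadattaH", "prathamA")],                                   # pada 0
    [("grAmam", "dvitIyA-napuMsaka"), ("grAmam", "dvitIyA-puM")],   # pada 1
    [("gacCati", "kartari-law")],                                   # pada 2
]

# Toy relation edges between alternative nodes (kartA, karma, ...).
relations = nx.Graph()
relations.add_edge(("devadattaH", "prathamA"), ("gacCati", "kartari-law"))
relations.add_edge(("grAmam", "dvitIyA-napuMsaka"), ("gacCati", "kartari-law"))
relations.add_edge(("grAmam", "dvitIyA-puM"), ("gacCati", "kartari-law"))

parses = []
for choice in product(*groups):        # one node per pada
    sub = relations.subgraph(choice)
    if sub.number_of_nodes() == len(groups) and nx.is_tree(sub):
        parses.append(choice)

for p in parses:
    print(p)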
Closing this - any particular issues can be discussed in separate threads.
Current State
SanskritMorphologicalAnalyzer is implemented as a constraint solver (using python-constraint) that does the following
The following sets of constraints are implemented
This can handle phrases or simple sentences
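For readers new to the approach, here is a toy sketch of the finite-domain formulation with the python-constraint package. The candidate tag tuples and the single agreement rule below are simplified assumptions for illustration, not the actual constraint sets used by SanskritMorphologicalAnalyzer.

# Toy illustration of parse selection as a finite-domain CSP: each pada is a
# variable, its domain is the set of candidate morphological analyses, and
# constraints encode agreement/karaka rules.
from constraint import Problem

candidates = {
    "devadattaH": [("devadatta", "prathamA", "ekavacanam", "puM")],
    "grAmam": [("grAma", "dvitIyA", "ekavacanam", "napuMsaka"),
               ("grAma", "dvitIyA", "ekavacanam", "puM")],
    "gacCati": [("gam", "kartari", "ekavacanam")],
}

problem = Problem()
for pada, tags in candidates.items():
    problem.addVariable(pada, tags)

def vacana_agreement(noun, verb):
    # A prathamA (nominative) pada acting as kartA must match the verb's vacana.
    return noun[1] != "prathamA" or noun[2] == verb[2]

problem.addConstraint(vacana_agreement, ("devadattaH", "gacCati"))

# Each solution assigns one analysis per pada; here two survive, differing
# only in the gender assumed for grAma, much like the parses shown above.
for solution in problem.getSolutions():
    print(solution)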
Way forward
The following rules are conceivable in the same context
Basic extensions
Beyond this, we need to handle non-simple sentences (i.e., more than one verb form). This will require recognizing various types of non-simple sentences (resisting the temptation to call them complex/compound as Apte does, but maybe that's the way forward?):
a. Recognize dhatu forms that cause non-simple sentence structures.
b. Partition into simple sentences and apply constraints as above on each sentence.
c. Partitioning might not be trivial or unambiguous - it may require some graph algorithms (see the sketch below).
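To illustrate what (b) and (c) could look like, here is a minimal sketch using plain networkx. The two-clause sentence, the pada-relation edges, and the choice of which edge to cut are all made-up assumptions; this is not an implementation of the partitioning step.

# Hypothetical partitioning sketch: link padas that can be related, cut the
# inter-clause connective edge, and treat each connected component as a
# candidate simple sentence. Illustrative only.
import networkx as nx

# Toy two-clause sentence: yadA devadattaH grAmam gacCati tadA rAmaH paWati
g = nx.Graph()
g.add_nodes_from(["yadA", "devadattaH", "grAmam", "gacCati",
                  "tadA", "rAmaH", "paWati"])
g.add_edges_from([
    ("devadattaH", "gacCati"), ("grAmam", "gacCati"), ("yadA", "gacCati"),
    ("rAmaH", "paWati"), ("tadA", "paWati"),
    ("yadA", "tadA"),   # inter-clause connective link
])

h = g.copy()
h.remove_edges_from([("yadA", "tadA")])   # cut the connective edge
for component in nx.connected_components(h):
    print(sorted(component))              # one candidate simple sentence each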
Performance: Is the current solver (or the finite domain constraint model itself) optimal for this? Can we recast the entire problem in graph form in some way?
Thoughts?