kmadathil / sanskrit_parser

Parsers for Sanskrit / संस्कृतम्
MIT License
69 stars 21 forks

Thoughts on L3: morphological constraints #28

Closed kmadathil closed 6 years ago

kmadathil commented 7 years ago

Some thoughts - not aiming to handle all possibilities right away

Input - Top n paths from Lexical Graph output from L2.

Step 1 - Karaka assignments

Step 2 - Use a constraint solver to pick paths (with tags) that satisfy all constraints

If this is possible to do directly on the Lexical Graph output of L2, that might help; it would amount to a sort of constrained path search. I fear, though, that that would make things worse in comparison to working on extracted paths. Perhaps this can be decomposed into a set of graph problems that would simplify things?

avinashvarna commented 7 years ago

I haven't looked at the PCRW algorithm mentioned in Paper 4 in your doc (Amrith Krishna, et al.), but is that simple enough to implement in this context?

Otherwise, can we split this into two stages of enforcing local constraints vs global constraints?

  1. Enforce local constraints such as samasa constituents, upasarga, etc.: start with the lexical graph G and create a morphological graph MG whose nodes are (node, tag) pairs and whose edges satisfy the local constraints:

         for node in G:
             for tag in node.tags:
                 for child in node.children:
                     for child_tag in child.tags:
                         if satisfies_local_constraints(tag, child_tag):
                             MG.add_edge((node, tag), (child, child_tag))

  2. Check whether global constraints such as visheShaNa/visheShya agreement, etc. are satisfied:

         for path in MG.shortest_simple_paths(start, end):
             if satisfies_global_constraints(path):
                 yield path

This will involve creating a larger graph and thereby potentially increase the number of paths. However, we can then use a simpler checker to filter out invalid paths. Caveat: the computational overhead of finding paths in the larger graph might be significant, and we may end up throwing away many of the paths in the filtering step.
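The two-stage scheme above can be sketched as runnable Python. This is only a hedged sketch over plain adjacency dicts, not the project's actual Lexical Graph classes; `local_ok`, `global_ok`, and the tag strings are hypothetical placeholders for the real constraint checks.

```python
def build_morph_graph(graph, tags, local_ok):
    """Stage 1: expand each lexical node into (node, tag) pairs,
    keeping only edges whose tag pair passes the local constraints."""
    mg = {}
    for node, children in graph.items():
        for tag in tags[node]:
            mg.setdefault((node, tag), [])
            for child in children:
                for child_tag in tags[child]:
                    if local_ok(tag, child_tag):
                        mg[(node, tag)].append((child, child_tag))
    return mg


def constrained_paths(mg, starts, end_nodes, global_ok):
    """Stage 2: enumerate simple paths through the morphological graph,
    yielding only those that also satisfy the global constraints."""
    def dfs(v, path):
        if v[0] in end_nodes:
            if global_ok(path):
                yield path
            return
        for w in mg.get(v, []):
            if w not in path:  # keep paths simple
                yield from dfs(w, path + [w])
    for s in starts:
        yield from dfs(s, [s])
```

As the caveat notes, MG can be much larger than G, but the stage-2 checker only has to look at whole paths, which keeps it simple.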

I will think about this and see if there are better solutions

avinashvarna commented 7 years ago

Another option I can think of is to train a neural network (such as the seq2seq models) that takes in the paths/data from the lexical graph and outputs the probability of it being a valid sentence. This would require a lot of data to train the NN though.

kmadathil commented 7 years ago

Things to look at:

  1. Finite domain constraint programming over paths already extracted from the LG
  2. Graph algorithms for constrained path search on the MG

Approach 1 is what I was thinking of. You seem to be thinking of approach 2, if I read your comment right.

Either can work, I think - it's a matter of finding the right algorithms. I think a good bit of reading is required here, as we explore these two approaches.

For finite domain constraint programming, we have two main options: Google OR-Tools and python-constraint.
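The finite-domain formulation over an already-extracted path is small enough to sketch with the standard library alone (python-constraint wraps the same idea behind a `Problem`/`addVariable`/`addConstraint` API). The domains and tag strings below are illustrative placeholders, not the analyzer's real tag set.

```python
from itertools import product

def solve(domains, constraints):
    """Each pada on a path is a variable whose domain is its candidate
    tag list; yield every full assignment satisfying all constraints."""
    for assignment in product(*domains):
        if all(check(assignment) for check in constraints):
            yield assignment

# Toy domains for a three-pada path like 'sA anu gacCati'
domains = [
    ["tad.nom.f", "tad.nom.m"],   # sA / sa
    ["upasarga", "avyaya"],       # anu
    ["gam.law.3sg"],              # gacCati
]
constraints = [
    lambda a: sum("law" in t for t in a) <= 1,  # at most one lakAra
    lambda a: a[0].endswith(".f"),              # toy agreement check
]
solutions = list(solve(domains, constraints))
```

Brute-forcing the cartesian product is fine at this scale; a real solver earns its keep once the domains and constraint networks grow.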

avinashvarna commented 7 years ago

Yes, my comment is more along the lines of 2, but trying to decompose it into simpler problems (like you suggested in your original post) at the cost of higher computation.

I did look at the Google OR-Tools documentation yesterday. After looking at how the constraints need to be specified in their employee scheduling example, and given the issues we are running into with the sandhi splitter, I am starting to think that learning constraints automatically from training data might be a better option in the long run, compared to having to write up the constraints ourselves. There are challenges with this approach as well: it may need a lot of training data and computational resources initially, but it appears to be where a lot of fields are headed.

kmadathil commented 7 years ago

I've coded up a basic python-constraint based Morphological Analyzer. Please take a look. (branch morpho - morphological_analyzer/SanskritMorphologicalAnalyzer.py)

In the example below, a couple of the splits have been eliminated (no morphology shown) by the coded rules (which are currently probably over-aggressive).

    (morpho)*$ python SanskritMorphologicalAnalyzer.py 'sAnugacCati' --need-lakara --debug
    Input String: sAnugacCati
    Input String in SLP1: sAnugacCati
    Start Split: 2017-08-03 14:58:38.243895
    End DAG generation: 2017-08-03 14:58:38.280022
    End pathfinding: 2017-08-03 14:58:38.282080
    Splits:
    Split: [sa, anu, gacCati]
    [(sa, ('tad', set([praTamAviBaktiH, puMlliNgam, ekavacanam]))), (anu, ('anu', set([upasargaH]))), (gacCati, ('gam', set([law, kartari, ekavacanam, ekavacanam, praTamapuruzaH, prATamikaH])))]
    [(sa, ('tad', set([praTamAviBaktiH, puMlliNgam, ekavacanam]))), (anu, ('anu', set([kriyAviSezaRam, avyayam]))), (gacCati, ('gam', set([law, kartari, ekavacanam, ekavacanam, praTamapuruzaH, prATamikaH])))]
    [(sa, ('tad', set([praTamAviBaktiH, puMlliNgam, ekavacanam]))), (anu, ('anu', set([cAdiH, avyayam]))), (gacCati, ('gam', set([law, kartari, ekavacanam, ekavacanam, praTamapuruzaH, prATamikaH])))]
    Split: [sA, anu, gacCati]
    [(sA, ('tad', set([praTamAviBaktiH, strIliNgam, ekavacanam]))), (anu, ('anu', set([upasargaH]))), (gacCati, ('gam', set([law, kartari, ekavacanam, ekavacanam, praTamapuruzaH, prATamikaH])))]
    [(sA, ('tad', set([praTamAviBaktiH, strIliNgam, ekavacanam]))), (anu, ('anu', set([kriyAviSezaRam, avyayam]))), (gacCati, ('gam', set([law, kartari, ekavacanam, ekavacanam, praTamapuruzaH, prATamikaH])))]
    [(sA, ('tad', set([praTamAviBaktiH, strIliNgam, ekavacanam]))), (anu, ('anu', set([cAdiH, avyayam]))), (gacCati, ('gam', set([law, kartari, ekavacanam, ekavacanam, praTamapuruzaH, prATamikaH])))]
    Split: [sA, nu, gacCati]
    [(sA, ('tad', set([praTamAviBaktiH, strIliNgam, ekavacanam]))), (nu, ('nu#2', set([avyayam, nipAtam]))), (gacCati, ('gam', set([law, kartari, ekavacanam, ekavacanam, praTamapuruzaH, prATamikaH])))]
    Split: [sa, A, nu, gacCati]
    Split: [sA, A, nu, gacCati]

kmadathil commented 7 years ago

Basic set of constraints implemented so far:

  1. Must not have more than 1 lakara (optionally, must have exactly one)
  2. Upasargas must be before lakaras
  3. Padas in prathamAvibhaktiH must match the purusha/vacana of the lakara

(These are a bit over-restrictive, but we'll relax them as we keep going.) The next rule to try is that all padas in the same vibhaktiH must match in linga/vacana.
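The first two rules above, plus the vacana half of the third, can be sketched as predicates over a path of (word, tag-set) pairs. The tag names mirror the SLP1 labels in the output dumps; the analyzer's internal representation may differ, and only a subset of lakAra tags is listed here.

```python
# Representative subset of lakAra tags in SLP1 (the analyzer knows the full set)
LAKARAS = {"law", "liw", "luw", "lfw", "low", "laN", "liN", "luN", "lfN"}
VACANAS = {"ekavacanam", "dvivacanam", "bahuvacanam"}


def at_most_one_lakara(path):
    """Rule 1: the path may carry at most one tiNanta."""
    return sum(1 for _, tags in path if tags & LAKARAS) <= 1


def upasargas_precede_lakara(path):
    """Rule 2: every upasargaH must occur before the verb."""
    seen_verb = False
    for _, tags in path:
        if tags & LAKARAS:
            seen_verb = True
        elif "upasargaH" in tags and seen_verb:
            return False
    return True


def prathama_vacana_agrees(path):
    """Rule 3 (vacana part): a pada in praTamAviBaktiH must share the
    verb's vacana; the puruSa check would need extra features."""
    verb_tags = [tags for _, tags in path if tags & LAKARAS]
    if not verb_tags:
        return True
    for _, tags in path:
        if "praTamAviBaktiH" in tags and not (tags & VACANAS & verb_tags[0]):
            return False
    return True
```

A path would then be kept only if all three predicates hold, which is exactly the pruning behaviour seen in the 'sAnugacCati' output above.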

So we could do something like this:

Step 1: Prune the lexical DAG, removing whatever we can remove (nothing done yet)
Step 2: Generate paths, and prune paths using path constraints (as above)
Step 3: Rank the remaining paths using DCS co-occurrence

For now, I've noted the idea about learning constraints. Let's see how this develops

kmadathil commented 7 years ago

See how 'ahaM kurmaH' is morphologically split. There seem to be two valid splits :-) (The last one can be removed by a constraint that samAsapUrvapadas must precede samAsapUrvapadas or subantas only.)
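That suggested constraint can be sketched as one more path predicate: a pada tagged samAsapUrvapadanAmapadam must be followed by another pUrvapada or a subanta (approximated here by the presence of a viBakti tag). The tag names follow the output dumps and are assumptions about the analyzer's representation.

```python
# Subset of viBakti tags seen in the dumps above; a full version would
# list all eight viBaktis plus saMboDana
VIBHAKTIS = {"praTamAviBaktiH", "dvitIyAviBaktiH", "saMboDanaviBaktiH"}


def samasa_purvapada_ok(path):
    """A samAsa pUrvapada may only precede another pUrvapada or a
    subanta, and cannot end the sentence."""
    for (_, tags), (_, next_tags) in zip(path, path[1:]):
        if "samAsapUrvapadanAmapadam" in tags:
            if not ("samAsapUrvapadanAmapadam" in next_tags
                    or next_tags & VIBHAKTIS):
                return False
    return "samAsapUrvapadanAmapadam" not in path[-1][1]
```

Under this predicate, the (asmad, samAsapUrvapadanAmapadam) reading of aham would be rejected because the following pada is a tiNanta, not a subanta.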

    (morpho)$ python SanskritMorphologicalAnalyzer.py 'ahaM kurmaH'
    Input String: ahaM kurmaH
    Input String in SLP1: ahaM kurmaH
    Start Split: 2017-08-09 12:22:14.779069
    End DAG generation: 2017-08-09 12:22:14.781126
    End pathfinding: 2017-08-09 12:22:14.781586
    Splits:
    Split: [aham, kurmas]
    [(aham, ('ahan', set([puMlliNgam, dvitIyAviBaktiH, ekavacanam]))), (kurmas, ('kf#1', set([bahuvacanam, uttamapuruzaH, law, prATamikaH, bahuvacanam, kartari])))]
    [(aham, ('ahan', set([napuMsakaliNgam, dvitIyAviBaktiH, ekavacanam]))), (kurmas, ('kf#1', set([bahuvacanam, uttamapuruzaH, law, prATamikaH, bahuvacanam, kartari])))]
    [(aham, ('asmad', set([samAsapUrvapadanAmapadam]))), (kurmas, ('kf#1', set([bahuvacanam, uttamapuruzaH, law, prATamikaH, bahuvacanam, kartari])))]

gasyoun commented 7 years ago

Is it stuck?

kmadathil commented 7 years ago

Not really. I've been off this for a few weeks due to travel etc. Next one to tackle is #41

kmadathil commented 7 years ago

I have uploaded version 0.0.1.dev4 to PyPI, including the current version of the morphological analyzer.

kmadathil commented 7 years ago

    (morpho)$ python -m sanskrit_parser.morphological_analyzer.SanskritMorphologicalAnalyzer 'tattvamasishvetaketo' --need-lakara
    Input String: tattvamasishvetaketo
    Input String in SLP1: tattvamasiSvetaketo
    Start Split: 2017-10-04 21:30:23.743627
    End DAG generation: 2017-10-04 21:30:23.763555
    End pathfinding: 2017-10-04 21:30:23.768574
    Splits:
    Lexical Split: [tattvam, asi, Sveta, keto]
    Valid Morphologies
    [(tattvam, ('tattva', set([napuMsakaliNgam, praTamAviBaktiH, ekavacanam]))), (asi, ('as#1', set([law, ekavacanam, kartari, prATamikaH, maDyamapuruzaH]))), (Sveta, ('Sveta', set([puMlliNgam, saMboDanaviBaktiH, ekavacanam]))), (keto, ('ketu', set([puMlliNgam, saMboDanaviBaktiH, ekavacanam])))]
    [(tattvam, ('tattva', set([napuMsakaliNgam, praTamAviBaktiH, ekavacanam]))), (asi, ('as#1', set([law, ekavacanam, kartari, prATamikaH, maDyamapuruzaH]))), (Sveta, ('Sveta', set([samAsapUrvapadanAmapadam]))), (keto, ('ketu', set([puMlliNgam, saMboDanaviBaktiH, ekavacanam])))]
    [(tattvam, ('tattva', set([napuMsakaliNgam, dvitIyAviBaktiH, ekavacanam]))), (asi, ('as#1', set([law, ekavacanam, kartari, prATamikaH, maDyamapuruzaH]))), (Sveta, ('Sveta', set([puMlliNgam, saMboDanaviBaktiH, ekavacanam]))), (keto, ('ketu', set([puMlliNgam, saMboDanaviBaktiH, ekavacanam])))]
    [(tattvam, ('tattva', set([napuMsakaliNgam, dvitIyAviBaktiH, ekavacanam]))), (asi, ('as#1', set([law, ekavacanam, kartari, prATamikaH, maDyamapuruzaH]))), (Sveta, ('Sveta', set([samAsapUrvapadanAmapadam]))), (keto, ('ketu', set([puMlliNgam, saMboDanaviBaktiH, ekavacanam])))]
    Lexical Split: [tat, tvam, asi, Sveta, keto]
    Valid Morphologies
    [(tat, ('tad', set([napuMsakaliNgam, praTamAviBaktiH, ekavacanam]))), (tvam, ('yuzmad', set([saNKyA, praTamAviBaktiH, ekavacanam]))), (asi, ('as#1', set([law, ekavacanam, kartari, prATamikaH, maDyamapuruzaH]))), (Sveta, ('Sveta', set([puMlliNgam, saMboDanaviBaktiH, ekavacanam]))), (keto, ('ketu', set([puMlliNgam, saMboDanaviBaktiH, ekavacanam])))]
    [(tat, ('tad', set([napuMsakaliNgam, dvitIyAviBaktiH, ekavacanam]))), (tvam, ('yuzmad', set([saNKyA, praTamAviBaktiH, ekavacanam]))), (asi, ('as#1', set([law, ekavacanam, kartari, prATamikaH, maDyamapuruzaH]))), (Sveta, ('Sveta', set([puMlliNgam, saMboDanaviBaktiH, ekavacanam]))), (keto, ('ketu', set([puMlliNgam, saMboDanaviBaktiH, ekavacanam])))]
    [(tat, ('tad', set([samAsapUrvapadanAmapadam]))), (tvam, ('yuzmad', set([saNKyA, praTamAviBaktiH, ekavacanam]))), (asi, ('as#1', set([law, ekavacanam, kartari, prATamikaH, maDyamapuruzaH]))), (Sveta, ('Sveta', set([puMlliNgam, saMboDanaviBaktiH, ekavacanam]))), (keto, ('ketu', set([puMlliNgam, saMboDanaviBaktiH, ekavacanam])))]
    [(tat, ('tad', set([napuMsakaliNgam, praTamAviBaktiH, ekavacanam]))), (tvam, ('yuzmad', set([saNKyA, praTamAviBaktiH, ekavacanam]))), (asi, ('as#1', set([law, ekavacanam, kartari, prATamikaH, maDyamapuruzaH]))), (Sveta, ('Sveta', set([samAsapUrvapadanAmapadam]))), (keto, ('ketu', set([puMlliNgam, saMboDanaviBaktiH, ekavacanam])))]
    [(tat, ('tad', set([napuMsakaliNgam, dvitIyAviBaktiH, ekavacanam]))), (tvam, ('yuzmad', set([saNKyA, praTamAviBaktiH, ekavacanam]))), (asi, ('as#1', set([law, ekavacanam, kartari, prATamikaH, maDyamapuruzaH]))), (Sveta, ('Sveta', set([samAsapUrvapadanAmapadam]))), (keto, ('ketu', set([puMlliNgam, saMboDanaviBaktiH, ekavacanam])))]
    [(tat, ('tad', set([samAsapUrvapadanAmapadam]))), (tvam, ('yuzmad', set([saNKyA, praTamAviBaktiH, ekavacanam]))), (asi, ('as#1', set([law, ekavacanam, kartari, prATamikaH, maDyamapuruzaH]))), (Sveta, ('Sveta', set([samAsapUrvapadanAmapadam]))), (keto, ('ketu', set([puMlliNgam, saMboDanaviBaktiH, ekavacanam])))]
    Lexical Split: [tattvam, asi, Sva, Ita, keto]
    No valid morphologies for this split
    Lexical Split: [tat, tu, amasi, Sveta, keto]
    Valid Morphologies
    [(tat, ('tad', set([napuMsakaliNgam, praTamAviBaktiH, ekavacanam]))), (tu, ('tu', set([avyayam, saMyojakaH]))), (amasi, ('am', set([law, ekavacanam, kartari, prATamikaH, maDyamapuruzaH]))), (Sveta, ('Sveta', set([puMlliNgam, saMboDanaviBaktiH, ekavacanam]))), (keto, ('ketu', set([puMlliNgam, saMboDanaviBaktiH, ekavacanam])))]
    [(tat, ('tad', set([napuMsakaliNgam, dvitIyAviBaktiH, ekavacanam]))), (tu, ('tu', set([avyayam, saMyojakaH]))), (amasi, ('am', set([law, ekavacanam, kartari, prATamikaH, maDyamapuruzaH]))), (Sveta, ('Sveta', set([puMlliNgam, saMboDanaviBaktiH, ekavacanam]))), (keto, ('ketu', set([puMlliNgam, saMboDanaviBaktiH, ekavacanam])))]
    [(tat, ('tad', set([napuMsakaliNgam, praTamAviBaktiH, ekavacanam]))), (tu, ('tu', set([avyayam, saMyojakaH]))), (amasi, ('am', set([law, ekavacanam, kartari, prATamikaH, maDyamapuruzaH]))), (Sveta, ('Sveta', set([samAsapUrvapadanAmapadam]))), (keto, ('ketu', set([puMlliNgam, saMboDanaviBaktiH, ekavacanam])))]
    [(tat, ('tad', set([napuMsakaliNgam, dvitIyAviBaktiH, ekavacanam]))), (tu, ('tu', set([avyayam, saMyojakaH]))), (amasi, ('am', set([law, ekavacanam, kartari, prATamikaH, maDyamapuruzaH]))), (Sveta, ('Sveta', set([samAsapUrvapadanAmapadam]))), (keto, ('ketu', set([puMlliNgam, saMboDanaviBaktiH, ekavacanam])))]
    Lexical Split: [tattvam, asi, SvA, Ita, keto]
    No valid morphologies for this split
    Lexical Split: [tattvam, asi, Sva, ita, keto]
    No valid morphologies for this split
    Lexical Split: [tattvam, asi, SvA, ita, keto]
    No valid morphologies for this split
    Lexical Split: [tat, tvam, asi, Sva, Ita, keto]
    No valid morphologies for this split
    Lexical Split: [tat, tu, amasi, Sva, Ita, keto]
    No valid morphologies for this split
    Lexical Split: [tat, tvam, asi, SvA, Ita, keto]
    No valid morphologies for this split

kmadathil commented 7 years ago

Getting there ... the only issue I see with the above split is the persistence of tad in dvitIyA, due to there being a sakarmaka version of अस् (which is divAdi, and hence shouldn't collide, but we aren't able to use that information yet).

kmadathil commented 6 years ago

Closing - moving discussion to #65