I haven't looked at the PCRW algorithm mentioned in Paper 4 in your doc (Amrith Krishna, et al.), but is that simple enough to implement in this context?
Otherwise, can we split this into two stages of enforcing local constraints vs global constraints?
import networkx as nx

def morphological_paths(G, start, end):
    # Stage 1: enforce local constraints while building a larger graph MG
    # whose nodes are (node, tag) pairs from the lexical graph G.
    # satisfies_local_constraints / satisfies_global_constraints are
    # placeholder predicates for the actual rule checks.
    MG = nx.DiGraph()
    for node in G:
        for tag in node.tags:
            for child in G.successors(node):
                for child_tag in child.tags:
                    if satisfies_local_constraints(tag, child_tag):
                        MG.add_edge((node, tag), (child, child_tag))
    # Stage 2: enforce global constraints on whole paths through MG.
    for path in nx.shortest_simple_paths(MG, start, end):
        if satisfies_global_constraints(path):
            yield path
It will involve creating a larger graph, thereby potentially increasing the number of paths. However, we can then use a simpler checker to filter out invalid paths. Caveat: the computational overhead of finding paths in the larger graph might be significant, and we may end up throwing away many of the paths in the filtering step
I will think about this and see if there are better solutions
Another option I can think of is to train a neural network (such as the seq2seq models) that takes in the paths/data from the lexical graph and outputs the probability of it being a valid sentence. This would require a lot of data to train the NN though.
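As a rough illustration only (not a proposal for the exact architecture), a scorer along those lines could look like the sketch below in PyTorch; it is a sequence classifier rather than a full seq2seq model, and the tag vocabulary size and dimensions are made up:

import torch
import torch.nn as nn

# Toy scorer: embeds a path's morphological tag IDs and outputs the
# probability that the tagged path forms a valid sentence.
class PathScorer(nn.Module):
    def __init__(self, num_tags=512, dim=64):
        super().__init__()
        self.embed = nn.Embedding(num_tags, dim)
        self.rnn = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, 1)

    def forward(self, tag_ids):                  # tag_ids: (batch, seq_len)
        _, (h, _) = self.rnn(self.embed(tag_ids))
        return torch.sigmoid(self.out(h[-1]))    # (batch, 1) validity score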
Things to look at:
1) Finite domain constraint programming over paths already extracted from the LG
2) Graph algorithms for constrained path search on the MG
Approach 1) is what I was thinking of. You seem to be thinking of approach 2) if I read the comment right.
Either can work, I think - it's a matter of finding the right algorithms. I think a good bit of reading is required here, as we explore these two approaches.
For finite domain constraint programming, we have two main options: Google OR-Tools and python-constraint.
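For a sense of how constraints are declared in OR-Tools, here is a tiny CP-SAT fragment (assuming a recent ortools release; the variables are toy stand-ins, not parser code). A python-constraint sketch appears further below.

from ortools.sat.python import cp_model

model = cp_model.CpModel()
# Toy variables: an index into each word's list of candidate tags.
x = model.NewIntVar(0, 5, 'tag_choice_word0')
y = model.NewIntVar(0, 5, 'tag_choice_word1')
model.Add(x != y)  # a toy pairwise constraint

solver = cp_model.CpSolver()
if solver.Solve(model) in (cp_model.OPTIMAL, cp_model.FEASIBLE):
    print(solver.Value(x), solver.Value(y))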
Yes, my comment is more along the lines of 2, but trying to decompose it into simpler problems (like you suggested in your original post) at the cost of higher computation.
I did look at the Google OR-Tools documentation yesterday. After looking at how the constraints need to be specified in their employee scheduling example, and based on the issues we are running into with the sandhi splitter, I am starting to think that learning constraints automatically from training data might be a better option in the long run, compared to having to write up the constraints ourselves. There are challenges with this approach as well - it may need a lot of training data and computational resources initially - but it appears to be where a lot of fields are headed.
I've coded up a basic python-constraint based Morphological Analyzer. Please take a look. (branch morpho - morphological_analyzer/SanskritMorphologicalAnalyzer.py)
In the example below, a couple of the splits have been eliminated (no morphology shown) by the coded rules (which are probably over-aggressive at present)
(morpho)*$ python SanskritMorphologicalAnalyzer.py 'sAnugacCati' --need-lakara --debug
Input String: sAnugacCati
Input String in SLP1: sAnugacCati
Start Split: 2017-08-03 14:58:38.243895
End DAG generation: 2017-08-03 14:58:38.280022
End pathfinding: 2017-08-03 14:58:38.282080
Splits:
Split: [sa, anu, gacCati]
[(sa, ('tad', set([praTamAviBaktiH, puMlliNgam, ekavacanam]))), (anu, ('anu', set([upasargaH]))), (gacCati, ('gam', set([law, kartari, ekavacanam, ekavacanam, praTamapuruzaH, prATamikaH])))]
[(sa, ('tad', set([praTamAviBaktiH, puMlliNgam, ekavacanam]))), (anu, ('anu', set([kriyAviSezaRam, avyayam]))), (gacCati, ('gam', set([law, kartari, ekavacanam, ekavacanam, praTamapuruzaH, prATamikaH])))]
[(sa, ('tad', set([praTamAviBaktiH, puMlliNgam, ekavacanam]))), (anu, ('anu', set([cAdiH, avyayam]))), (gacCati, ('gam', set([law, kartari, ekavacanam, ekavacanam, praTamapuruzaH, prATamikaH])))]
Split: [sA, anu, gacCati]
[(sA, ('tad', set([praTamAviBaktiH, strIliNgam, ekavacanam]))), (anu, ('anu', set([upasargaH]))), (gacCati, ('gam', set([law, kartari, ekavacanam, ekavacanam, praTamapuruzaH, prATamikaH])))]
[(sA, ('tad', set([praTamAviBaktiH, strIliNgam, ekavacanam]))), (anu, ('anu', set([kriyAviSezaRam, avyayam]))), (gacCati, ('gam', set([law, kartari, ekavacanam, ekavacanam, praTamapuruzaH, prATamikaH])))]
[(sA, ('tad', set([praTamAviBaktiH, strIliNgam, ekavacanam]))), (anu, ('anu', set([cAdiH, avyayam]))), (gacCati, ('gam', set([law, kartari, ekavacanam, ekavacanam, praTamapuruzaH, prATamikaH])))]
Split: [sA, nu, gacCati]
[(sA, ('tad', set([praTamAviBaktiH, strIliNgam, ekavacanam]))), (nu, ('nu#2', set([avyayam, nipAtam]))), (gacCati, ('gam', set([law, kartari, ekavacanam, ekavacanam, praTamapuruzaH, prATamikaH])))]
Split: [sa, A, nu, gacCati]
Split: [sA, A, nu, gacCati]
Basic set of constraints implemented so far:
1) Must not have more than 1 lakara (optionally, must have one)
2) Upasargas must be before lakaras
3) Padas in prathamAvibhaktiH must match purusha / vacana of lakara
(These are a bit over-restrictive, but we'll relax them as we keep going)
Next rule to try is that all padas in the same vibhaktiH must match in linga/vachana
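To make the flavor concrete, here is a minimal sketch (not the actual analyzer code) of how constraint 1 could be phrased with python-constraint. The path, tag sets, and the check for 'law' alone are simplified stand-ins for the real tag structures:

from constraint import Problem

# One lexical path, as (word, candidate-tag) pairs; tags are simplified.
path = [('sa', [('tad', frozenset(['praTamAviBaktiH', 'ekavacanam']))]),
        ('anu', [('anu', frozenset(['upasargaH'])),
                 ('anu', frozenset(['cAdiH', 'avyayam']))]),
        ('gacCati', [('gam', frozenset(['law', 'praTamapuruzaH', 'ekavacanam']))])]

problem = Problem()
for i, (word, candidates) in enumerate(path):
    problem.addVariable(i, candidates)  # variable i = tag choice for word i

# Constraint 1: must not have more than one lakara among the chosen tags
# (only 'law' is checked here, as a simplification).
def at_most_one_lakara(*chosen):
    return sum('law' in tags for (root, tags) in chosen) <= 1

problem.addConstraint(at_most_one_lakara, list(range(len(path))))

for solution in problem.getSolutions():
    print([solution[i] for i in range(len(path))])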
So we could do something like this:
Step 1: Prune the lexical DAG, removing stuff that we can remove (nothing done yet)
Step 2: Generate paths, and prune paths using path constraints (as above)
Step 3: Rank remaining paths using DCS co-occurrence
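In code, that pipeline might look roughly like the sketch below; every helper named here is a hypothetical placeholder for pieces that exist (or would exist) elsewhere in the project:

def analyze(sentence):
    dag = build_lexical_dag(sentence)       # placeholder for the L2 lexical DAG
    dag = prune_dag(dag)                    # Step 1: prune (not implemented yet)
    valid = [p for p in enumerate_paths(dag)
             if satisfies_path_constraints(p)]   # Step 2: path constraints
    return sorted(valid, key=dcs_cooccurrence_score,
                  reverse=True)             # Step 3: rank by DCS co-occurrence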
For now, I've noted the idea about learning constraints. Let's see how this develops
See how 'ahaM kurmaH' is morphologically split. There seem to be two valid splits :-) (The last one can be removed by a constraint that samAsapUrvapadas must precede samAsapUrvapadas or subantas only; a sketch of such a check follows the output below.)
(morpho)$ python SanskritMorphologicalAnalyzer.py 'ahaM kurmaH'
Input String: ahaM kurmaH
Input String in SLP1: ahaM kurmaH
Start Split: 2017-08-09 12:22:14.779069
End DAG generation: 2017-08-09 12:22:14.781126
End pathfinding: 2017-08-09 12:22:14.781586
Splits:
Split: [aham, kurmas]
[(aham, ('ahan', set([puMlliNgam, dvitIyAviBaktiH, ekavacanam]))), (kurmas, ('kf#1', set([bahuvacanam, uttamapuruzaH, law, prATamikaH, bahuvacanam, kartari])))]
[(aham, ('ahan', set([napuMsakaliNgam, dvitIyAviBaktiH, ekavacanam]))), (kurmas, ('kf#1', set([bahuvacanam, uttamapuruzaH, law, prATamikaH, bahuvacanam, kartari])))]
[(aham, ('asmad', set([samAsapUrvapadanAmapadam]))), (kurmas, ('kf#1', set([bahuvacanam, uttamapuruzaH, law, prATamikaH, bahuvacanam, kartari])))]
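A minimal sketch of that proposed pairwise check, assuming tag sets like those printed above (is_subanta is a hypothetical helper for detecting nominal forms):

# Hypothetical pairwise rule: a samAsapUrvapada may only be followed by
# another samAsapUrvapada or a subanta.
def samasa_purvapada_ok(tags, next_tags):
    if 'samAsapUrvapadanAmapadam' in tags:
        return ('samAsapUrvapadanAmapadam' in next_tags
                or is_subanta(next_tags))  # is_subanta: hypothetical helper
    return True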
Is it stuck?
Not really. I've been off this for a few weeks due to travel etc. Next one to tackle is #41
I have uploaded version 0.0.1.dev4 to PyPI, including the current version of the morphological analyzer
(morpho)$ python -m sanskrit_parser.morphological_analyzer.SanskritMorphologicalAnalyzer 'tattvamasishvetaketo' --need-lakara
Input String: tattvamasishvetaketo
Input String in SLP1: tattvamasiSvetaketo
Start Split: 2017-10-04 21:30:23.743627
End DAG generation: 2017-10-04 21:30:23.763555
End pathfinding: 2017-10-04 21:30:23.768574
Splits:
Lexical Split: [tattvam, asi, Sveta, keto]
Valid Morphologies
[(tattvam, ('tattva', set([napuMsakaliNgam, praTamAviBaktiH, ekavacanam]))), (asi, ('as#1', set([law, ekavacanam, kartari, prATamikaH, maDyamapuruzaH]))), (Sveta, ('Sveta', set([puMlliNgam, saMboDanaviBaktiH, ekavacanam]))), (keto, ('ketu', set([puMlliNgam, saMboDanaviBaktiH, ekavacanam])))]
[(tattvam, ('tattva', set([napuMsakaliNgam, praTamAviBaktiH, ekavacanam]))), (asi, ('as#1', set([law, ekavacanam, kartari, prATamikaH, maDyamapuruzaH]))), (Sveta, ('Sveta', set([samAsapUrvapadanAmapadam]))), (keto, ('ketu', set([puMlliNgam, saMboDanaviBaktiH, ekavacanam])))]
[(tattvam, ('tattva', set([napuMsakaliNgam, dvitIyAviBaktiH, ekavacanam]))), (asi, ('as#1', set([law, ekavacanam, kartari, prATamikaH, maDyamapuruzaH]))), (Sveta, ('Sveta', set([puMlliNgam, saMboDanaviBaktiH, ekavacanam]))), (keto, ('ketu', set([puMlliNgam, saMboDanaviBaktiH, ekavacanam])))]
[(tattvam, ('tattva', set([napuMsakaliNgam, dvitIyAviBaktiH, ekavacanam]))), (asi, ('as#1', set([law, ekavacanam, kartari, prATamikaH, maDyamapuruzaH]))), (Sveta, ('Sveta', set([samAsapUrvapadanAmapadam]))), (keto, ('ketu', set([puMlliNgam, saMboDanaviBaktiH, ekavacanam])))]
Lexical Split: [tat, tvam, asi, Sveta, keto]
Valid Morphologies
[(tat, ('tad', set([napuMsakaliNgam, praTamAviBaktiH, ekavacanam]))), (tvam, ('yuzmad', set([saNKyA, praTamAviBaktiH, ekavacanam]))), (asi, ('as#1', set([law, ekavacanam, kartari, prATamikaH, maDyamapuruzaH]))), (Sveta, ('Sveta', set([puMlliNgam, saMboDanaviBaktiH, ekavacanam]))), (keto, ('ketu', set([puMlliNgam, saMboDanaviBaktiH, ekavacanam])))]
[(tat, ('tad', set([napuMsakaliNgam, dvitIyAviBaktiH, ekavacanam]))), (tvam, ('yuzmad', set([saNKyA, praTamAviBaktiH, ekavacanam]))), (asi, ('as#1', set([law, ekavacanam, kartari, prATamikaH, maDyamapuruzaH]))), (Sveta, ('Sveta', set([puMlliNgam, saMboDanaviBaktiH, ekavacanam]))), (keto, ('ketu', set([puMlliNgam, saMboDanaviBaktiH, ekavacanam])))]
[(tat, ('tad', set([samAsapUrvapadanAmapadam]))), (tvam, ('yuzmad', set([saNKyA, praTamAviBaktiH, ekavacanam]))), (asi, ('as#1', set([law, ekavacanam, kartari, prATamikaH, maDyamapuruzaH]))), (Sveta, ('Sveta', set([puMlliNgam, saMboDanaviBaktiH, ekavacanam]))), (keto, ('ketu', set([puMlliNgam, saMboDanaviBaktiH, ekavacanam])))]
[(tat, ('tad', set([napuMsakaliNgam, praTamAviBaktiH, ekavacanam]))), (tvam, ('yuzmad', set([saNKyA, praTamAviBaktiH, ekavacanam]))), (asi, ('as#1', set([law, ekavacanam, kartari, prATamikaH, maDyamapuruzaH]))), (Sveta, ('Sveta', set([samAsapUrvapadanAmapadam]))), (keto, ('ketu', set([puMlliNgam, saMboDanaviBaktiH, ekavacanam])))]
[(tat, ('tad', set([napuMsakaliNgam, dvitIyAviBaktiH, ekavacanam]))), (tvam, ('yuzmad', set([saNKyA, praTamAviBaktiH, ekavacanam]))), (asi, ('as#1', set([law, ekavacanam, kartari, prATamikaH, maDyamapuruzaH]))), (Sveta, ('Sveta', set([samAsapUrvapadanAmapadam]))), (keto, ('ketu', set([puMlliNgam, saMboDanaviBaktiH, ekavacanam])))]
[(tat, ('tad', set([samAsapUrvapadanAmapadam]))), (tvam, ('yuzmad', set([saNKyA, praTamAviBaktiH, ekavacanam]))), (asi, ('as#1', set([law, ekavacanam, kartari, prATamikaH, maDyamapuruzaH]))), (Sveta, ('Sveta', set([samAsapUrvapadanAmapadam]))), (keto, ('ketu', set([puMlliNgam, saMboDanaviBaktiH, ekavacanam])))]
Lexical Split: [tattvam, asi, Sva, Ita, keto]
No valid morphologies for this split
Lexical Split: [tat, tu, amasi, Sveta, keto]
Valid Morphologies
[(tat, ('tad', set([napuMsakaliNgam, praTamAviBaktiH, ekavacanam]))), (tu, ('tu', set([avyayam, saMyojakaH]))), (amasi, ('am', set([law, ekavacanam, kartari, prATamikaH, maDyamapuruzaH]))), (Sveta, ('Sveta', set([puMlliNgam, saMboDanaviBaktiH, ekavacanam]))), (keto, ('ketu', set([puMlliNgam, saMboDanaviBaktiH, ekavacanam])))]
[(tat, ('tad', set([napuMsakaliNgam, dvitIyAviBaktiH, ekavacanam]))), (tu, ('tu', set([avyayam, saMyojakaH]))), (amasi, ('am', set([law, ekavacanam, kartari, prATamikaH, maDyamapuruzaH]))), (Sveta, ('Sveta', set([puMlliNgam, saMboDanaviBaktiH, ekavacanam]))), (keto, ('ketu', set([puMlliNgam, saMboDanaviBaktiH, ekavacanam])))]
[(tat, ('tad', set([napuMsakaliNgam, praTamAviBaktiH, ekavacanam]))), (tu, ('tu', set([avyayam, saMyojakaH]))), (amasi, ('am', set([law, ekavacanam, kartari, prATamikaH, maDyamapuruzaH]))), (Sveta, ('Sveta', set([samAsapUrvapadanAmapadam]))), (keto, ('ketu', set([puMlliNgam, saMboDanaviBaktiH, ekavacanam])))]
[(tat, ('tad', set([napuMsakaliNgam, dvitIyAviBaktiH, ekavacanam]))), (tu, ('tu', set([avyayam, saMyojakaH]))), (amasi, ('am', set([law, ekavacanam, kartari, prATamikaH, maDyamapuruzaH]))), (Sveta, ('Sveta', set([samAsapUrvapadanAmapadam]))), (keto, ('ketu', set([puMlliNgam, saMboDanaviBaktiH, ekavacanam])))]
Lexical Split: [tattvam, asi, SvA, Ita, keto]
No valid morphologies for this split
Lexical Split: [tattvam, asi, Sva, ita, keto]
No valid morphologies for this split
Lexical Split: [tattvam, asi, SvA, ita, keto]
No valid morphologies for this split
Lexical Split: [tat, tvam, asi, Sva, Ita, keto]
No valid morphologies for this split
Lexical Split: [tat, tu, amasi, Sva, Ita, keto]
No valid morphologies for this split
Lexical Split: [tat, tvam, asi, SvA, Ita, keto]
No valid morphologies for this split
Getting there ... the only issue I see with the above split is the persistence of tad in dvitIyA, due to there being a sakarmaka version of अस् (which is divaadi, and hence shouldn't collide, but we aren't able to use that information yet)
Closing - moving discussion to #65
Some thoughts - not aiming to handle all possibilities right away
Input - Top n paths from Lexical Graph output from L2.
Step 1 - Karaka assignments
Step 2 - Use a constraint solver to pick paths (with tags) that satisfy all constraints
If it is possible to do this using the Lexical Graph output of L2 directly, that might help - that would be a sort of constrained path search. I fear, though, that it would make things worse compared to working on paths. Perhaps this can be decomposed into a set of graph problems that would simplify things?
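For Step 2, here is a toy sketch of how a constraint solver could assign karaka roles along one path; the padas, roles, and the uniqueness constraint are all made-up placeholders for the real karaka rules, which would come from the verb's expectations and each pada's vibhakti:

from constraint import Problem, AllDifferentConstraint

padas = ['rAmaH', 'vanam', 'gacCati']   # toy path
roles = ['kartA', 'karma', 'kriyA']     # toy candidate karaka roles

problem = Problem()
for pada in padas:
    problem.addVariable(pada, roles)

# Placeholder constraint: no two padas may share a karaka role.
problem.addConstraint(AllDifferentConstraint())

for solution in problem.getSolutions():
    print(solution)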