Necessary data is available in json at https://github.com/drdhaval2785/samasasplitter/tree/master/dicts/str_counter
It is a dict with startingString as key and its occurrence count as value.
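A minimal sketch of how such a lookup could work (the filename and flat-dict layout are assumptions; the actual data sits in JSON files under dicts/str_counter):

```python
import json

# Assumed filename and layout: one flat dict mapping startingString to
# its occurrence count, as described above.
with open("str_counter.json") as f:
    prefix_count = json.load(f)

def prefix_exists(s):
    """True if some word in the db starts with s; split generation can
    stop extending a candidate (astyu, astyut, ...) once this is False."""
    return prefix_count.get(s, 0) > 0
```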
This is good, but it will help us reduce the complexity of graph creation rather than graph traversal / path search, which is O(s^n). While we will waste time searching for astyu, astyut, etc., that adds to the graph creation complexity, which doesn't seem to be a problem right now. Even for our longer strings, we finish DAG creation nearly instantaneously. But this is a good thing to keep in the back of our minds once that becomes the limiting factor.
@drdhaval2785 This is a neat idea. It could be combined with a trie structure (the return of the trie? :)) to determine when a prefix is no longer in the db. While it is not the bottleneck right now, once we integrate all the coded sandhi rules, we might generate too many splits at a given location. It would not hurt to implement it. Can we tag this issue as revisit later?
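A minimal sketch of that trie idea, independent of the actual db (word list and names are illustrative):

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_word = False

class Trie:
    """Answers 'is this still a prefix of some db word?' so that split
    generation can be cut off as early as possible."""
    def __init__(self, words):
        self.root = TrieNode()
        for w in words:
            node = self.root
            for ch in w:
                node = node.children.setdefault(ch, TrieNode())
            node.is_word = True

    def has_prefix(self, s):
        node = self.root
        for ch in s:
            node = node.children.get(ch)
            if node is None:
                return False
        return True

# trie = Trie(["asti", "ut", "uttara"])
# trie.has_prefix("utt") -> True; trie.has_prefix("utx") -> False
```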
There is one hidden possibility here. As you can see, most of the paths end before u, so the probability of a split is high there. In a way, we will have a weighted graph.
@kmadathil
Are graph creation and traversal not related? If we reduce the split places, would traversal time also not reduce?
@drdhaval2785
What you're optimizing can be called the phonological graph.
The graph we're traversing is the lexical graph - the one that's created after all legal splits have been decided. For example, for astyuttarasyAmdiSi, this is the graph
```
asti
 +-----------+---------+----------+
 |           |         |          |
uttarasyAm  uttas    uttara      ut
 |           |         |          |
 |         rasyAm      |        tara
 |           |         |          |
 |           |         +----+-----+
 |           |              |
 |           |            syAm
 |           |              |
 +-----------+---- diSi ----+
                    |
                   End
```
This graph is created from the phonological graph by the getSandhiSplit method, using the process of figuring out each possible legal left split and then recursively splitting the right part to see if there's a legal split (with memoization to speed things up). Your idea will speed up the creation of this graph by allowing us to consider fewer possibilities while creating it. Once created, every node in it is legal. Currently, this is not the complexity bottleneck: creation of the lexical graph happens very fast, though the implementation is not perfectly optimal.
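A simplified sketch of that process (not the actual getSandhiSplit; legal_left_splits is a hypothetical stand-in for the coded sandhi rules plus dictionary lookup):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def split(s):
    """Return all segmentations of s as tuples of words. Assumes
    legal_left_splits(s) yields (left_word, remainder) pairs; memoization
    means each remainder string is only ever split once."""
    results = []
    for left, rest in legal_left_splits(s):
        if not rest:
            results.append((left,))
        else:
            results.extend((left,) + tail for tail in split(rest))
    return results
```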
Traversing the lexical graph (using the findAllPaths method) produces this set of paths. This is the part that has exponential complexity, and therefore the one that limits our performance.
```
[[u'asti', u'uttarasyAm', u'diSi'],
 [u'asti', u'uttas', u'rasyAm', u'diSi'],
 [u'asti', u'uttara', u'syAm', u'diSi'],
 [u'asti', u'ut', u'tara', u'syAm', u'diSi']]
```
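For reference, a naive path enumeration over a networkx DiGraph (not the project's findAllPaths) shows where the exponential cost comes from: the number of paths itself can grow exponentially in the number of nodes.

```python
def all_paths(G, start, end, path=()):
    """Depth-first enumeration of all simple start -> end paths."""
    path = path + (start,)
    if start == end:
        return [list(path)]
    paths = []
    for nxt in G.successors(start):
        if nxt not in path:
            paths.extend(all_paths(G, nxt, end, path))
    return paths
```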
If we can extend your idea to find ways to simplify the lexical graph during or after creation, but before traversal, that would help us. In particular, constraints whose application deletes nodes or edges (of the lexical graph, not the phonological one) would help.
Can you post the nodes and edges (maybe an adjacency matrix) for a small samasa? That will help us understand what kind of constraints can be put in place from a grammatical or corpus-linguistics point of view.
Does that help? (created using graphviz)
Here's another one if it helps. Ignore the weird behavior of the A on the right. When the graph is exported to the .dot format, it only uses node names, and so the A gets a lot of edges, even though there are probably different A nodes at different points in the splits.
I think this does show that if we can avoid the weird splits like A at some points, it might help simplify the graph and reduce the number of paths. I had noticed something similar earlier and modified the sandhi rules to account for this. Will test what happens with that and report back.
There is something more worrisome about the A. Can we add a constraint that a split may not be a single letter?
The words with the 'A' prefix are in any case separately enumerated in the dataset, e.g. AgacCati, Agama, etc. There is no need to keep A as a separate node. A node should be created only if the string length is more than 1. This will eliminate many spurious splits.
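A minimal sketch of that constraint (the function name is illustrative, not the actual implementation):

```python
def node_allowed(word, min_len=2):
    """Create a graph node only for fragments longer than one letter;
    single letters like 'A' are rejected, since A-prefixed words are
    already enumerated whole in the dataset."""
    return len(word) >= min_len
```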
A second idea is to use the following as the weight of an edge:

`1 / len(string of the destination node)`

E.g. edge (asti, uttarasyAm) will have a weight of 1/10 and edge (asti, ut) a weight of 1/2. This ensures that larger splits are favoured and shorter splits are discouraged when considering paths. We can then use the various algorithms for weighted graphs. This will give us graded splits, i.e. the paths of least resistance would be the most preferred.
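A minimal networkx sketch of that weighting, built from the four paths listed above (illustrative only, not the actual implementation):

```python
import networkx as nx

paths = [["asti", "uttarasyAm", "diSi"],
         ["asti", "uttas", "rasyAm", "diSi"],
         ["asti", "uttara", "syAm", "diSi"],
         ["asti", "ut", "tara", "syAm", "diSi"]]

G = nx.DiGraph()
for p in paths:
    for src, dst in zip(p, p[1:]):
        # 1/len(destination): longer words make cheaper edges,
        # so paths with larger splits have lower total weight.
        G.add_edge(src, dst, weight=1.0 / len(dst))

print(nx.shortest_path(G, "asti", "diSi", weight="weight"))
# ['asti', 'uttarasyAm', 'diSi'] -- the largest split wins
```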
What are your takes, @avinashvarna and @kmadathil?
I guess I understand where it is going haywire. https://networkx.readthedocs.io/en/stable/tutorial/tutorial.html mentions:

> we add new nodes/edges and NetworkX quietly ignores any that are already present.

Therefore, when we added the 'A' node again, networkx ignored it and kept only one 'A'. A simple workaround may be to name the nodes with numbers and store the string as a node attribute. Please give it a try.
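A minimal sketch of that workaround (the word list is illustrative):

```python
import networkx as nx

words = ["A", "gacCati", "A"]  # the same string can occur twice

G = nx.DiGraph()
# Key nodes by number and keep the string as an attribute, so two
# occurrences of 'A' stay distinct instead of being silently merged.
for i, w in enumerate(words):
    G.add_node(i, word=w)

print(G.number_of_nodes())                 # 3, not 2
print(nx.get_node_attributes(G, "word"))   # {0: 'A', 1: 'gacCati', 2: 'A'}
```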
After the weighting and numbered nodes, let us see how it fares.
Changing the weight of the edges may help us sort the paths according to some criterion, but as far as I am aware, enumerating paths in order of increasing weight is at least as complex as finding the shortest path, so it doesn't really solve the problem.
Regarding the A node, this is an artifact of the exporting to a different format, as I mentioned earlier. This does not happen in the actual implementation. Moreover, with @kmadathil's fix for #21, there should not even be duplicate nodes.
I do agree that improving the quality of our sandhi splits is really the right way to reduce the graph traversal time. If we have to retrieve 10K paths to find our split, our retrieval precision is 10^-4, which is pretty bad. I think we need to focus on improving the splits by adding constraints if that is indeed the case. Your suggestion of preventing single letter splits for example is a good one (though I hope that the sandhi module already addresses this. I will check).
I am actually more interested in the 'diSi' node in the above graph. This is an articulation point (https://www.hackerearth.com/practice/algorithms/graphs/articulation-points-and-bridges/tutorial/). In other words, every path from asti to the end HAS to go through diSi. So we could actually split the problem of finding paths in the graph to two problems: asti --- diSi, and diSi --- end. Since the number of paths is exponential in the number of nodes, finding the paths in the smaller graphs will be faster, and then we can recombine the results to obtain the full paths.
Unfortunately, I don't think all the results we generate will have such articulation points. Might be something worth exploring.
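A sketch of how the decomposition could start, assuming a networkx DiGraph for the lexical graph (articulation points are defined on the undirected view):

```python
import networkx as nx

def cut_points(dag):
    """Articulation points of the lexical graph. Every start -> End path
    must pass through each of them, so path enumeration can be split at
    these nodes and the partial results recombined (cross product)."""
    return set(nx.articulation_points(dag.to_undirected()))
```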
I think nodes are still letter-based:

```python
def addNode(self, node, root=False):
    """ Extend dag with node inserted at root """
    assert node not in self.G
    self.G.add_node(node)
    if root:
        self.roots.append(node)
```
This shows that nodes are letter-based:

```python
if s_c_left not in node_cache:
    # Extend splits list with s_c_left appended with possible splits of s_c_right
    t = SanskritBase.SanskritObject(s_c_left, encoding=SanskritBase.SLP1)
    node_cache[s_c_left] = t
if not splits:
    splits = SanskritLexicalGraph()
if not splits.hasNode(t):
    splits.addNode(t, root=True)
```
> `t = SanskritBase.SanskritObject(s_c_left, encoding=SanskritBase.SLP1)`

It's an object with that name. Two objects with the same name will not be identical.
@avinashvarna is right. The reason I created SanskritObjects for nodes is to solve precisely this problem: a word appearing twice in a sentence has to be two different nodes.
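A toy illustration of why identity-keyed nodes behave this way (Word is a stand-in for SanskritObject):

```python
import networkx as nx

class Word:
    """Hashes by object identity, not by string, like SanskritObject."""
    def __init__(self, s):
        self.s = s
    def __repr__(self):
        return self.s

G = nx.DiGraph()
a, c, b = Word("tvam"), Word("ca"), Word("tvam")
G.add_edge(a, c)
G.add_edge(c, b)
print(G.number_of_nodes())  # 3: the two 'tvam' nodes are not merged
```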
Articulation points are a good idea. Every space is an articulation point, and that could make things easier.
I still have my reservations. We are doing caching. So when I search for 'A' in the cache, I will always retrieve the old cached node. Our new object creation happens only when that node is not in the cache.
@drdhaval2785 - We will eventually need weights, as you suggested. Right now, as @avinashvarna says, shortest-path finding is equivalent to weighted search. The node cache is keyed by string, but cleared at each sandhi split point. Nodes with the same string, at the same split point, are identical.
Proof: if both 'tvam's were the same node, there would be a single-word path 'tvam'-End, plus loops. Instead, we get a single valid path here.
```
(dag-nx)*$ python SanskritLexicalAnalyzer.py --split --input-encoding SLP1 tvaMcatvam --max 20
Parsing of XMLs started at 2017-07-13 09:24:28.927559
666994 forms cached for quick search
Parsing of XMLs completed at 2017-07-13 09:24:33.883752
Input String: tvaMcatvam
Input String in SLP1: tvaMcatvam
Start Split: 2017-07-13 09:24:39.341010
End DAG generation: 2017-07-13 09:24:39.342314
End pathfinding: 2017-07-13 09:24:39.343026
[[u'tvam', u'ca', u'tvam']]
```
@drdhaval2785 AgacCAmi is not in Prof. Huet's database.
```
$ python SanskritLexicalAnalyzer.py AgacCAmi
Parsing of XMLs started at 2017-07-13 09:26:43.422954
666994 forms cached for quick search
Parsing of XMLs completed at 2017-07-13 09:26:48.320092
Input String: AgacCAmi
Input String in SLP1: AgacCAmi
None
```
```
$ python SanskritLexicalAnalyzer.py AgacCAmi --split
Parsing of XMLs started at 2017-07-13 09:26:59.823000
666994 forms cached for quick search
Parsing of XMLs completed at 2017-07-13 09:27:04.781413
Input String: AgacCAmi
Input String in SLP1: AgacCAmi
Start Split: 2017-07-13 09:27:10.275910
End DAG generation: 2017-07-13 09:27:10.277353
End pathfinding: 2017-07-13 09:27:10.277907
[[u'A', u'gacCAmi']]
```
@drdhaval2785 Your misgivings about node_cache were not misplaced. There were issues in the caching, which I have fixed.
Any speedups after bug fix?
Speed is similar; some incorrect functionality is now corrected.
Anything further to be done here? Or time to close?
Time to close, I think.
I found it difficult to represent on screen, so I scribbled a note by hand.