At the abstract level, yes. How to make it work well in LG ... that is a harder question. So: what we call "pruning" is more or less the "forward-backward algorithm" https://en.wikipedia.org/wiki/Forward%E2%80%93backward_algorithm in signal processing. (For us, the probability of being able to connect is either 0 or 1, not a fraction; forward and backward are left and right.) There is a simplification of that algorithm which only moves forward, called Viterbi. Let me try to sketch that; it should give you a better idea of one way to solve this problem.
So: in Viterbi, there is always a set of disjoint, currently-possible alternative universes. Let's ignore tokenization for the moment; assume tokenization is unique, and there aren't any tokenization alternatives. Starting at the left wall, there are parse alternatives: LEFT-WALL can connect with Wd+ or with Wi+ or with (WV+ & Qd+) or ... etc. The total list of disjuncts on LEFT-WALL is the total set of parse alternatives.
Now, read in the first word. This word has a large set of disjuncts with - connectors and + connectors. If a given disjunct with a - connector cannot connect to some connector on the wall, it is pruned. If the LEFT-WALL has any + connectors of length 1, and these cannot connect to anything on the word, then these are pruned.
Next, we compute the new set of alternatives. Do this by looping over all pairs of LEFT-WALL & first-word disjuncts. For each pair of disjuncts, create a single new disjunct that merges the two, discarding the connectors that connected. For example, the merger of (WV+ & Wd+) with (Wd- & S+) is (WV+ & S+). If the merger is the empty set, that pair is discarded.
The set of all merged disjuncts becomes the new "current state". Now loop back, read the next word, and repeat until done.
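To make the merge step concrete, here is a toy sketch of it (these are not the real LG data structures or API; connectors match here by plain string equality, and planarity and link-length limits are ignored):

```c
/* Toy sketch of the forward merge step described above.  Not the real
 * link-grammar types: connectors are plain strings matched by equality,
 * and planarity / length limits are ignored. */
#include <stdio.h>
#include <string.h>

#define MAX_CONN 8

typedef struct {
    int n_minus, n_plus;
    const char *minus[MAX_CONN];  /* '-' connectors: must hook to the left */
    const char *plus[MAX_CONN];   /* '+' connectors: still waiting for something to the right */
} Disjunct;

/* Merge a current-state disjunct (only '+' connectors pending) with one
 * disjunct of the next word.  Every '-' connector of the word must consume
 * one pending '+' connector of the state; the merged state keeps the
 * leftovers plus the word's own '+' connectors.  Returns 1 and fills *out
 * on success, 0 if the pair cannot connect. */
static int merge(const Disjunct *state, const Disjunct *word, Disjunct *out)
{
    int used[MAX_CONN] = {0};

    if (state->n_plus + word->n_plus > MAX_CONN) return 0;  /* toy-sized only */

    for (int i = 0; i < word->n_minus; i++) {
        int hooked = 0;
        for (int j = 0; j < state->n_plus; j++) {
            if (!used[j] && strcmp(word->minus[i], state->plus[j]) == 0) {
                used[j] = 1;
                hooked = 1;
                break;
            }
        }
        if (!hooked) return 0;  /* a '-' connector found nothing to its left */
    }

    out->n_minus = 0;
    out->n_plus = 0;
    for (int j = 0; j < state->n_plus; j++)
        if (!used[j]) out->plus[out->n_plus++] = state->plus[j];
    for (int i = 0; i < word->n_plus; i++)
        out->plus[out->n_plus++] = word->plus[i];
    return 1;
}

int main(void)
{
    /* LEFT-WALL alternative (WV+ & Wd+) merged with word disjunct (Wd- & S+). */
    Disjunct wall = { .n_plus = 2, .plus = { "WV", "Wd" } };
    Disjunct word = { .n_minus = 1, .n_plus = 1, .minus = { "Wd" }, .plus = { "S" } };
    Disjunct next;

    if (merge(&wall, &word, &next)) {
        printf("merged state:");
        for (int i = 0; i < next.n_plus; i++) printf(" %s+", next.plus[i]);
        printf("\n");  /* prints: merged state: WV+ S+ */
    }
    return 0;
}
```

The real thing would of course loop over all state-disjunct × word-disjunct pairs and keep every successful merger as the next state set.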
So -- here's the good news and the bad news. The good news is that, if you have tokenization alternatives, they fit "naturally" into this scheme. So, for example, for every merged disjunct, you have a string of the words that led up to that particular merger. This is the history of words that were seen, that led up to that particular merged disjunct.
If you have a book on signal processing, the relation of this to the Viterbi described there should be fairly clear. Classical error-correcting codes are just bit-strings, where you received maybe a 0 or maybe a 1 due to radio noise, and the error-correcting code is like a disjunct of - connectors: if the current bit is 1, then it can only connect to 0110 or to 0101 or .. whatever is allowed by the error-correcting code; else it's forbidden. But if the current bit is 0, then it can only connect to 0001.. .. If no connection is possible, then that entire history is discarded. That alternative is thrown away. Viz. the set of next states will be 01101 or 01011 or 00010, and it keeps going like this forever. If the radio channel is not too noisy, then all possible histories converge and are in perfect agreement for anything more than 5 bits ago (or however long your ECC is). The state of the most recent 5 bits is still in doubt. So, at each stage, you maintain a set of alternatives. Each new bit causes about 1/2 the old alternatives to be discarded, but then doubles the new alternatives (for 0 and 1).
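As a purely illustrative toy (nothing like a real ECC decoder), the bookkeeping looks roughly like this: keep a set of surviving histories, extend each one with both possible bits (since the received bit is uncertain), and keep only the extensions a constraint allows. The `allowed()` rule below is made up just to show the shape of the loop:

```c
/* Toy bookkeeping only: a made-up constraint stands in for a real code,
 * and the received bits themselves are ignored. */
#include <stdio.h>

#define HIST_BITS  4     /* remember only the last 4 bits of each history */
#define MAX_STATES 1024

/* Made-up rule: never allow three identical bits in a row. */
static int allowed(unsigned hist, int hist_len, unsigned bit)
{
    if (hist_len < 2) return 1;
    return !(((hist & 1u) == bit) && (((hist >> 1) & 1u) == bit));
}

int main(void)
{
    unsigned cur[MAX_STATES] = { 0 };  /* start with one (empty) history */
    int ncur = 1;

    for (int step = 0; step < 8; step++) {
        unsigned next[MAX_STATES];
        int nnext = 0;
        for (int i = 0; i < ncur; i++) {
            /* The received bit is noisy, so try both possibilities ... */
            for (unsigned bit = 0; bit <= 1; bit++) {
                /* ... but keep only what the code permits. */
                if (allowed(cur[i], step, bit) && nnext < MAX_STATES)
                    next[nnext++] = ((cur[i] << 1) | bit) & ((1u << HIST_BITS) - 1);
            }
        }
        for (int i = 0; i < nnext; i++) cur[i] = next[i];
        ncur = nnext;
        printf("after bit %d: %d surviving alternatives\n", step + 1, ncur);
    }
    /* A real Viterbi decoder also merges histories that end in the same
     * recent state, which is what keeps this set from blowing up. */
    return 0;
}
```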
Now the bad news: for link-grammar, by moving only forward, the size of the current state grows explosively large. There is some pruning, but not enough. The problem is connector length: you've got these connectors that might connect to something 20 words into the future, so you have to keep them around, unconnected, for a long long time. The number of unconnected connectors gets huge.
So -- that is what I did way back when, and then I put it on hold. I hope you see why this can handle tokenization alternatives in an elegant fashion (in a holistic fashion? not as a hack). I think I know how to cut down on the total number of current alternatives. For example, at any given time, there should not be more than about 5-7 unconnected + connectors, so if there are more, that alternative is discarded. (Our diagrams are never more than 5-7 tall, and there are psychological arguments why they should not be -- human short-term memory, etc.) Maybe that is enough to trim down the state. Not sure. Using costs is another way to trim: don't keep anything that has a high cost. LG costs are kind-of-like -log(probability).
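For what it's worth, the two trimming heuristics could be as simple as the following (an illustrative struct and thresholds, not the real LG types):

```c
#include <stdbool.h>

/* Illustrative only -- not the real LG data structures. */
#define MAX_PENDING 7      /* diagrams are rarely more than 5-7 links "tall" */
#define MAX_COST    4.0    /* cost is roughly -log(probability); tune empirically */

typedef struct {
    int    n_plus;         /* unconnected '+' connectors in this merged state */
    double cost;           /* accumulated cost of the history behind it */
} State;

static bool keep_state(const State *s)
{
    if (s->n_plus > MAX_PENDING) return false;  /* parse would be implausibly "tall" */
    if (s->cost   > MAX_COST)    return false;  /* history is too unlikely */
    return true;
}
```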
If you want to implement something like this, go for it. You don't have to use the old, half-finished viterbi code.
Maybe you have some other idea. I just wanted to explain above, since it is a fairly "natural" (i.e. psychologically human-like) way of parsing, as opposed to backward-forward, which is not human-like at all.
In order to implement the Viterbi algorithm for parsing, I first need to learn it more deeply... To actually implement it, I will need more time, since I would like to finish the current sub-projects I am still working on (speed-up, tokenization and Hebrew support).
> so if there are more, that alternative is discarded
How do we decide which ones to discard?
BTW, one part of the alternative pruning is very easy to add even now: if two connectors are from subwords belonging to different alternatives (as indicated by the `hier_position` vector, a thing that already exists), then they cannot match. But, as far as I can see, the other part of the alternatives constraint (that a word cannot be linked to another word which is already linked to a word from a different alternative than the first word) can indeed be done only during actual parsing (be it chart-parsing, SAT parsing, or Viterbi parsing as you describe) or as post-processing, as done now.
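To make the first part concrete, here is a small sketch of the compatibility test, using plain integer ids instead of the real `hier_position` representation:

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-in for hier_position: the token's position in the
 * tokenization hierarchy, given as the list of alternative ids chosen at
 * each split on the way down from the original sentence word. */
typedef struct {
    const int *alt_path;
    size_t     depth;
} HierPos;

/* Two tokens can be linked only if neither path branches away from the
 * other, i.e. they agree on their common prefix. */
static bool same_alternative(const HierPos *a, const HierPos *b)
{
    size_t common = (a->depth < b->depth) ? a->depth : b->depth;
    for (size_t i = 0; i < common; i++)
        if (a->alt_path[i] != b->alt_path[i])
            return false;   /* the tokens come from mutually exclusive splits */
    return true;
}
```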
1) If you want to add an alternative constraint to the existing code, that's fine by me, as long as it makes things mostly faster.
2) "How do we decide which ones to discard?" Discard all current-state disjuncts that have more than 5 or 7 + connectors (by definition they have zero - connectors) If this does reduce the state size to a reasonable size, then discard all histories with high cost (i.e. the least likely parses)
3) Studying Viterbi per se can be somewhat challenging, because there are multiple definitions, some very different from others. I've never seen the one given in Wikipedia; the way I learned it was different. I actually worked on a GPS radio for a while. I can see how the Wikipedia definition is similar, but it's confusing because it lacks the usual lattice diagram that a textbook would use.
The lattice is usually drawn as two vertical columns of circles: one column showing the set of current states (with histories trailing behind them) and the second column showing the next set of states, with arrows connecting allowed state transitions (given that the current observed bit is a 0 or 1). For the Wikipedia article, I guess the arrows also have a probability for that particular transition. The probabilities accumulate, so some histories have very low probabilities. Again, for LG, cost = -log(probability), more or less; costs are additive; probabilities are multiplicative.
Here we are observing not 0 or 1, but some word, and a transition is allowed if the - connectors on a disjunct can connect to the current state. If you can hold this in mind, you will see how the textbook descriptions match what I'm saying.
4) Implementing Viterbi is low priority. I don't even know whether it would work very well; it might be slow...
5) Implementing Viterbi could also be "trivial" if you do it like this: suppose, instead of starting with a hard-coded left-wall, I give you a left-wall that I magically got from somewhere. I then give you a complete sentence, which just happens to be a single word long. All you have to do is attach it. After you are done attaching it, you have to promise to give me all the disjuncts with + connectors that are still unconnected (because I gave you a right-wall that can connect to anything at all). That's it. I take your right-wall and, by sleight of hand, give it back to you as your new left-wall. You are now ready for your next one-word sentence.
We could use the existing code base for this, with almost no modification: only add support for a dynamic left-wall, and an anything-goes right wall. We don't have to do word-at-a-time; we could do, say, 5-10 words at a time. Chunk things at punctuation marks, commas, quotes, etc. Do you see how this could actually be "easy"? (A rough sketch follows at the end of this list.)
6) Adding support for multiple affixes seems important. Tatiana from Novosibirsk wrote back...
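To make item 5 a bit more concrete, here is the control flow it suggests, written with entirely hypothetical helpers (`dict_left_wall()` and `parse_chunk()` do not exist; they stand in for whatever the real code would expose):

```c
/* Hypothetical sketch of chunk-at-a-time parsing with a dynamic left wall.
 * None of these functions exist in the current code base. */

typedef struct DisjunctSet DisjunctSet;   /* opaque: a set of disjuncts */

/* Would return the LEFT-WALL disjuncts from the dictionary. */
DisjunctSet *dict_left_wall(void);

/* Would parse 'nwords' words using 'left_wall' as the left wall and an
 * anything-goes right wall, returning the '+' connectors that are still
 * unconnected at the right edge of the chunk. */
DisjunctSet *parse_chunk(DisjunctSet *left_wall, const char **words, int nwords);

void parse_in_chunks(const char **words, int nwords, int chunk_size)
{
    DisjunctSet *wall = dict_left_wall();
    for (int start = 0; start < nwords; start += chunk_size) {
        int n = nwords - start;
        if (n > chunk_size) n = chunk_size;
        /* Whatever spills over the right edge of this chunk becomes the
         * dynamic left wall of the next chunk. */
        wall = parse_chunk(wall, words + start, n);
    }
    /* At the end of the sentence, 'wall' should be reducible to something
     * the ordinary RIGHT-WALL (or an empty disjunct) accepts. */
}
```

In practice one would chunk at punctuation rather than at a fixed word count, as suggested above.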
> 1) If you want to add an alternative constraint to the existing code, that's fine by me, as long as it makes things mostly faster.
This is something I indeed would like to try (because it looks so easy).
> We could use the existing code base for this, with almost no modification: only add support for a dynamic left-wall, and an anything-goes right wall. We don't have to do word-at-a-time, we could do, say, 5-10 words at a time. Chunk things at punctuation marks, commas, quotes, etc. Do you see how this could actually be "easy"?
I once started implementing what I called "wildcard walls", which can connect to anything. This is one of my many unfinished branches... The basic idea was to implement a `_` connector and then use `@_` connectors. I needed this for a dict-based tokenization scheme... From starting to implement this, I got the inspiration for the wildcard-words on which I published my experimentation in the LG discussion group. I have never returned to this branch since then...
> 6) Adding support for multiple affixes seems important. Tatiana from Novosibirsk wrote back...
I first thought that maybe I could just finish my "virtual-machine" design for tokenization. However, as this will take time, I will just add, in a timely manner, ad-hoc support for multi-suffixes to the existing code, in the style it already uses.
A wild-card right-wall and a dynamic left-wall could very sharply reduce the combinatoric explosion (and the parse time) of long sentences. E.g. split the sentence into shorter phrases, delimited by commas -- each short phrase should parse quickly... When stitching them together, there is still a combinatoric explosion, but at least the shorter bits can now be ranked according to cost, instead of "picking random from...".
I still need to better understand the concept of the "dynamic left-wall"...
By "dynamic left wall", I just mean some list of disjuncts that came from somewhere, and that list is attached to a "word" that is left of the current string of words. So, in today's code, you look up left-wall in the dict and use that. Instead, get the left-wall comes from somewhere else (in this case, it would be the wild-card right wall of the previous chunk).
I just tried pruning alternatives (because it is easy to try), and didn't observe a speedup even for Russian (`ru/corpus-basic-batch` concatenated 50 times). This is expected because `form_match_list()` must still filter alternatives (pruning is not perfect in eliminating all redundant disjuncts), and it does a pretty good job in practice (`sane_linkage_morphism()` doesn't find any invalid combination in the batch run). I don't expect a speedup even for longer sentences, because alternatives are very local.
So I will change the title of this issue to reflect the later discussion in it.
EDIT: For dynamic left-wall, see the last posts here.
Currently, alternatives are not considered in the pruning algo.
The problem with them is that a word that got broken into several alternatives, each potentially having several tokens, may introduce a vast number of different connectors, which in turn create many match opportunities for other words. So very few other connectors get discarded, while much extra work is added to the prune algo (bigger connector set tables, etc.). Still, most of the "potential" connector matches would be bogus, due to the alternative mix, and the result is a lot of parsing overhead and much filtering work to be done by `sane_morphism()`.
To solve this problem, when checking whether a candidate connector matches one that is in the set, we can reject connections to connectors from an inappropriate alternative.
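In terms of the `hier_position`-style test sketched earlier in this thread, the hook could be a thin wrapper around the existing label match (the names below are illustrative, not the real prune-code API):

```c
#include <stdbool.h>

/* Illustrative names only -- not the real prune-code API. */
typedef struct Connector Connector;
typedef struct HierPos   HierPos;      /* as sketched earlier in this thread */

bool labels_match(const Connector *a, const Connector *b);     /* existing label/subscript match */
bool same_alternative(const HierPos *a, const HierPos *b);     /* sketched above */

/* Reject a candidate match early if the two connectors come from tokens
 * that belong to mutually exclusive tokenization alternatives. */
static bool prune_match_with_alts(const Connector *a, const HierPos *apos,
                                  const Connector *b, const HierPos *bpos)
{
    if (!same_alternative(apos, bpos)) return false;
    return labels_match(a, b);
}
```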
Is this a good idea?