singnet / language-learning

OpenCog Unsupervised Language Learning
https://wiki.opencog.org/w/Language_learning
MIT License

Hybrid sequential-MST parser #217

Open akolonin opened 5 years ago

akolonin commented 5 years ago

Implement "hybrid" parser blending sequential information and MI, so the extend of blending could be made configurable, with "maximum sequential" mode producing "sequential parse" and "maximum MI" mode producing "plain MST-Parses with no account for distance".

There are two perspectives: A) As Ben Goertzel has suggested ("use both the sequential parse and some fancier hierarchical parse as inputs to clustering and grammar learning? I.e. don't throw out the information of simple before-and-after co-occurrence, but augment it with information from the statistically inferred dependency parse tree"), we could simply (I guess) have it implemented in the existing MST-Parser, given the changes that @glicerico and Claudia made a year ago. That could be tried with a "distance_vs_MI" blending parameter in the MST-Parser code which accounts for word-to-word distance, so that distance_vs_MI=1.0 would give "sequential parses", distance_vs_MI=0.0 would produce "pure MST parses", distance_vs_MI=0.7 would provide "English parses", and distance_vs_MI=0.5 would provide "Russian parses".
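For concreteness, a minimal sketch of what such an arithmetic blend might look like (the scorer shape and the 1/distance term are assumptions for illustration; only the distance_vs_MI parameter comes from the discussion above):

(define (blended-score mi dist distance-vs-mi)
  ; distance-vs-mi = 1.0 scores by adjacency alone (sequential parses);
  ; distance-vs-mi = 0.0 scores by MI alone (pure MST parses).
  (define dist-score (/ 1.0 dist))  ; favors nearby words
  (+ (* distance-vs-mi dist-score)
     (* (- 1.0 distance-vs-mi) mi)))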

B) As Ben Goertzel further wrote:

I don't think we want an arithmetic average of distance and MI, maybe more like

f(1) = C > 1
f(1) > f(2) > f(3) > f(4)
f(4) = f(5) = ... = 1

and then

f(distance) * MI

i.e. maybe we count the MI significantly more if the distance is small... but if MI is large and distance is large, we still count the MI a lot...

(of course the decreasing function f becomes the thing to tune here...)
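A minimal sketch of approach B), assuming a hypothetical C = 2 and a linear decay from f(1) = C down to f(4) = 1, which satisfies the constraints quoted above:

(define (f dist)
  ; decreasing weight: f(1) = 2.0, f(2) ~ 1.67, f(3) ~ 1.33, f(d >= 4) = 1.0
  (define C 2.0)
  (if (>= dist 4)
      1.0
      (+ 1.0 (* (- C 1.0) (/ (- 4 dist) 3.0)))))

(define (hybrid-score mi dist)
  (* (f dist) mi))  ; small distance boosts MI; large MI still counts at any distance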

The task can be broken down into subtasks:

1) Implement configurable blending of sequential and MI information, using approach A), approach B), or a combination of the two above.
2) Implement a unit test ensuring that it can provide sequential, MST, or hybrid parses on a small corpus like POC-English or POC-Turtle.
3) Study F1 of the parses on the Gutenberg Children corpus and see if we can find a configuration outperforming "sequential parses".
4) Extend study 3) using both traditional MI and DNN-MI.

glicerico commented 5 years ago

@linas, this issue (esp. part B) relates to your proposal of using a generic function to weight distance between words during pair-counting and MST-parsing. I remember you had a specific structure in mind for how to augment the atoms. Do you want to share it? I could try to implement it.

linas commented 5 years ago

specific structure

When I count, first I pick the kind of thing to count, usually having the form (List (LeftAtom) (RightAtom)), for example (List (Word "foo") (Word "bar")). (It doesn't have to be words; it can be anything.) Then, whenever I see the foo-bar pair, I increment the CountTruthValue on that atom, using the (count-one-atom ATM) function in common.scm -- you can translate that to Python and get the same effect.

With the new Value system, you can store that count anywhere, and not just CountTruthValue; for example

(define (incr-count ATO)
   (define key (Predicate "my-favorite-count-key"))
   (define old (cog-value ATO key))   ; #f if the key has never been set
   (cog-set-value! ATO key
      (FloatValue (+ 1 (if old (cog-value-ref old 0) 0)))))

Yikes! OK, so that is very verbose; maybe it is time to implement the auto-incrementing value... Anyway, you'd then just say (incr-count (List (Word "foo") (Word "bar"))) and the count would be updated. The point is, you could count six ways from Sunday and store those counts in six different places without mixing them up. So (Predicate "my-key-on-Monday"), etc.

You can pass fractions; the counts don't have to be whole numbers.

After I'm done counting, I have a whole infrastructure for getting totals, marginals, probabilities, MIs, whatever, from those counts. So, for example, if I have a count N stored on (List (LeftAtom x) (RightAtom y)) -- let me write N(x,y) for short. Think of that as a sparse matrix; for almost all (x,y) the count is zero, but for one-in-a-million, it's not. The tools in (use-modules (opencog matrix)) will compute the total N(*,*) = sum_x sum_y N(x,y), the probability-frequency P(x,y) = N(x,y)/N(*,*), the marginals P(x,*) and P(*,y), and MI(x,y) = log_2 [P(x,y) / (P(x,*) P(*,y))], and also other things: sums, differences, and products of rows and columns; the fraction of any row/column that is non-zero; filters to kill any row/column whose count is less than M; matrix products; and methods to walk over all rows/columns/pairs.
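For concreteness, a self-contained sketch of that MI formula on made-up counts -- this is just the arithmetic, not the (opencog matrix) API. With N(foo,bar) = 4, N(foo,*) = 20, N(*,bar) = 10 and N(*,*) = 1000:

(define (mutual-info n-xy n-x n-y n-total)
  ; MI(x,y) = log_2 [ P(x,y) / (P(x,*) * P(*,y)) ]
  (define p-xy (/ n-xy n-total))
  (define p-x (/ n-x n-total))
  (define p-y (/ n-y n-total))
  (/ (log (/ p-xy (* p-x p-y))) (log 2)))

(mutual-info 4 20 10 1000)  ; => log_2 20, about 4.32 bits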

The code is "generic" in that you just have to tell it how to find the count for LeftAtom x, RightAtom y and it does the rest. There's a lot of code there - about 6KLOC, so its non-trivial. It could be ported to Gnu R -- the AGI-BIO guys like @mjsduncan like to use Gnu R for data analysis. This is even "easy", because Gnu R has nice C++ interfaces exactly for this. In principle, you could port to SciPy./NumPy but when I looked python had exactly zero interfaces for this (cython kind-of "goes the wrong way" and would force copying - i.e. would force you to generate non-sparse vectors with a million zero entries in them. This would kill performance and overflow RAM)

So that's what I've got -- a generic way of counting, and a generic way of doing statistical analysis with those counts. The code is more-or-less "done" except for a port to GNU R, or an occasional brand-new feature. (Well, it could be re-written in C++ to make it faster, but that would be ... hard.)

glicerico commented 5 years ago

Great @linas, so when pair-counting one could use keys specific to the distance observed in the word pair (perhaps in addition to the normal CTV counts that don't care about distance), like (Predicate "obs-3-dist-apart"), (Predicate "obs-more-than-5-dist-apart")..., and then use them as desired in the MI-calculation phase (e.g. assign weights that depend on distance). I realize, however, that in the MST-parse phase each word-instance pair has its distance calculated in situ, so there's no need to store the distance as a value; the scorer function can just use it and throw it away immediately. So my question probably wasn't related to this issue after all :P, but it makes clear to me how to better do our distance-modified pair counting.

glicerico commented 5 years ago

Ah @linas, and I'm doing this in Scheme, so no need for porting at the moment ;)

linas commented 5 years ago

distance calculated in situ, so there's no need to store

Well, but you do need to store how many pairs were seen exactly 3-apart.

With the brand-new cog-set-value! function, you can use one key for everything, which will use a lot less space. (Way back when, the counting function kept a dozen different kinds of counts for everything, and it exploded the size of the atomspace, making it 10x or 20x larger than it needed to be. Curtis discovered this, but didn't know why, and created some out-of-this-world hacks as a result. The fix was "count only what you need to count".)

So, for example:

(define obs-key (Predicate "Counting key"))
(define dist (dist-apart ...))  ; some small integer.
(define word-pair (List (Word "foo") (Word "bar"))) 
(cog-inc-value! word-pair obs-key 1 dist)  ; increment count for distance dist
(cog-inc-value! word-pair obs-key 1 0)  ; increment total count (stored as "distance zero")
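
And then at MI-calculation time, the per-distance counts can be folded back into one distance-weighted count. A hypothetical read-back sketch (weighted-count and max-dist are illustrative names; f is a decreasing weight function like the one in approach B above):

(use-modules (srfi srfi-1))  ; for fold and iota

(define (weighted-count word-pair obs-key f max-dist)
  ; sum f(d) * N_d over d = 1..max-dist, where N_d is the count
  ; stored at index d of the FloatValue under obs-key.
  (define counts (cog-value word-pair obs-key))
  (fold (lambda (d acc) (+ acc (* (f d) (cog-value-ref counts d))))
        0.0
        (iota max-dist 1)))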

glicerico commented 5 years ago

Yeah, this seems great for observing pairs! Thanks @linas

distance calculated in situ, so there's no need to store

Well, but you do need to store how many pairs were seen exactly 3-apart.

I was talking about the actual MST-parsing stage (https://github.com/opencog/atomspace/blob/6103c07bedbe461f77a99ac4443d8cbb3cdf9110/opencog/sheaf/mst-parser.scm#L69) -- are you saying that I should also keep counts for word-pair distances there? This issue is about that part of the process: tweaking the word-instance pair score/weight used to build the parse tree.

linas commented 5 years ago

do you say that I should also

No. I don't know what you might need to keep counts for.

glicerico commented 5 years ago

That's what I thought, good.

glicerico commented 5 years ago

@akolonin, approach B) is implemented in https://github.com/singnet/learn/pull/28. Steps 1) and 2) from your list above are covered there.