singnet / language-learning

OpenCog Unsupervised Language Learning
https://wiki.opencog.org/w/Language_learning
MIT License

Add costs to learned LG rules (for ILE) #183

Open akolonin opened 5 years ago

akolonin commented 5 years ago

Need to add costs to rules, based on statistics, for better parsing that takes these costs into account. First, this has to be done for the GL ILE algorithm; we will see whether it helps, and then it may be extended to the other GL algorithms.

@linas answered the question "Can you point me to an illustrative example of scored disjuncts in the English LG dictionary?": Anything with square brackets, so:

    [A- & B+]0.05   % this has a cost of 0.05
    [A- & B+]       % no number -- default cost of 1.0
    [[A- & B+]]     % two brackets, no number -- default cost of 2.0

By default, the parser rejects disjuncts with a cost of 2.7 or higher. You can change this with an option. Costs are additive, and so costs can be (should be) thought of as -log P -- that is, minus the log of a probability. The lower the cost, the higher the probability of being correct.
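For illustration, here is a hypothetical dictionary fragment using that syntax (the words, connector names and cost values are made up; only the bracket-cost notation is taken from the answer above):

    % two disjuncts for the same words; the rarer one carries the higher cost
    cat dog: [A- & B+]0.05 or [A- & C+]1.3;
    ran: [D- & B+];     % no number -- default cost of 1.0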

The question for @linas: how would we compute the cost of disjuncts during grammar induction -- should we just rely on the number N of appearances of a disjunct in the training corpus, so that the cost would be Cost = 1/N?

linas commented 5 years ago

During the training phase, you can observe/collect the count of how often each disjunct got used. This not only gives you the frequency of how often a disjunct gets used, but it allows you to compute the MI for (word,disjunct) pairs. This is defined as usual: if N(w,d) is the number of times word w was observed (during training) with disjunct d, then

MI(w,d) = log2 ( N(*,*) N(w,d) / ( N(w,*) N(*,d) ) )

The working hypothesis is that:

1) If N(w,d) < K, for maybe K=4 or K=8, then the (w,d) pair is probably garbage (it arose from bad MST parses). Don't use that disjunct; discard it.

2) Large values of MI(word, disjunct) are the best, most accurate. Thus, the link-grammar cost should be minus the MI(w,d), so that the total parse cost is the sum of these ... parse ranking then picks the highest MI (i.e. the lowest cost, due to the minus sign).
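A minimal Python sketch of this recipe (the dict layout, function name and k_min default are assumptions, not the pipeline's actual code): given raw (word, disjunct) counts, drop rare pairs and turn MI into a cost.

    from math import log2

    def disjunct_costs(pair_counts, k_min=4):
        """Return {(word, disjunct): cost} with cost = -MI(w,d);
        pairs observed fewer than k_min times are discarded."""
        n_total = sum(pair_counts.values())            # N(*,*)
        n_word, n_disjunct = {}, {}                    # N(w,*), N(*,d)
        for (w, d), n in pair_counts.items():
            n_word[w] = n_word.get(w, 0) + n
            n_disjunct[d] = n_disjunct.get(d, 0) + n
        costs = {}
        for (w, d), n in pair_counts.items():
            if n < k_min:                              # hypothesis 1: likely garbage from bad MST parses
                continue
            mi = log2(n * n_total / (n_word[w] * n_disjunct[d]))   # MI(w,d)
            costs[(w, d)] = -mi                        # hypothesis 2: cost = -MI(w,d)
        return costs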

linas commented 5 years ago

This question:

should we just rely on N of appearances of a disjunct in the training corpus

Yes ...

so cost would be Cost=1/N?

Probably not. You can try it, I suppose, but standard probability theory suggests that cost=-MI(w,d) will be better.

The only "hard part" that I see is correctly tracking the counts during clustering. That is, if you've got N(wa,d1), N(wa,d2), ... N(wb,d3), ... and you decide to merge word wa with word wb, then you have to sum up the total counts correctly as well. I think the Scheme code for clustering does this correctly, but I also think you might not be using that code.
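A sketch of that bookkeeping, under the same assumed dict layout as above (the actual merging lives in the Scheme clustering code):

    def merge_words(pair_counts, wa, wb, merged):
        """Merge words wa and wb into a single class `merged`,
        summing N(wa,d) + N(wb,d) for every shared disjunct d."""
        out = {}
        for (w, d), n in pair_counts.items():
            key = (merged, d) if w in (wa, wb) else (w, d)
            out[key] = out.get(key, 0) + n
        return out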

There is another hard part. The disjuncts d1, d2, d3... correlate well with word senses, so if one is clever enough, one can distinguish different word senses this way. This should work more or less the same way that the neural-net "King - man + woman = Queen" examples work. The neural nets all use cosine distance (vector dot-products). My working hypothesis is that using information distance will give better results than cosine distance. (I think one can probably use information distance with neural nets, too; you would just have to use a different formula for combining the weights.) Whatever. The word-sense project/conversation should probably be discussed elsewhere.

The "information distance" is defined in one of the PDFs (and in the Scheme source). I don't quite remember it off the top of my head. It's also called the "symmetric MI" in the PDFs -- it's based on the sum

dist(wa, wb) = sum_d  N(wa, d) N(wb,d) / stuff

where wa, wb are two different words, and the sum is over all disjuncts. This distance is, umm, not obvious, until you think about it really hard, after which point you slap your head and say, "oh, it is obvious after all".

akolonin commented 5 years ago

@linas wrote, replying to: "Linas, how would you 'weight the disjuncts'? We know how to weight the words (by frequency) and word pairs (by MI). But how would you weight the disjuncts?"

That is a very good question. There are several (many) different kinds of weighting schemes. I do not know which is best. That is the point where I last left things, half-a-year-ago, now. But first, some theoretical preliminaries.

Given any ordered pair at all -- thing-a and thing-b -- you can compute the MI for the pair. Thing-a does not have to be the same type as thing-b. In this case, the pair of interest is (word, one-of-the-disjuncts-on-that-word). Write it as (w,d) for short. The MI is defined the same as always: MI(w,d) = log2 ( p(w,d) / ( p(w,*) p(*,d) ) ), where p is the frequency of observation: p(w,d) = N(w,d) / N(*,*) as always. N is the observation count and * is the wild-card sum.

The pipeline code already computes this; I'm not sure if you use it or not. It's in the (use-modules (opencog matrix)) module; it computes the MI for pairs-of-anythings in the atomspace. It is generic, in that one can set up thing-a to be some/any collection of atoms in the atomspace, and thing-b can be any other collection of atoms, and it will start with the counts N(thing-a, thing-b) and compute probabilities, marginal probabilities, conditional probabilities, MI, entropies -- "the whole enchilada" of statistics you can do on pairs of things. It's called "matrix" because "pairs of things" looks like an ordinary matrix [M]_ij.

Sounds boring, but here's the kicker: (opencog matrix) is designed to work with extremely sparse matrices, which every other package (e.g. scipy) will choke on. For example: if thing-a = thing-b = words, and there are 100K words, then M_ij potentially has 100K x 100K = 10 giga-entries, which will blow up RAM if you try to store the whole matrix. In practice, 99.99% of them are zero (the observation count N(left-word, right-word) is zero for almost all word pairs). So the atomspace is being used as storage for hyper-sparse matrices, and you can layer the matrix onto the atomspace any way that you want. It's like a linear cross-section through the atomspace. Linear, vector, etc., etc.
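A toy illustration of the same hyper-sparse idea in plain Python (dicts standing in for the atomspace; this is not the (opencog matrix) API): only observed pairs are stored, and the wild-card marginals are kept incrementally, so nothing of size 100K x 100K ever exists.

    from collections import defaultdict

    class SparsePairs:
        """Keep N(a,b) only for observed pairs, plus the wild-card marginals."""
        def __init__(self):
            self.n = defaultdict(int)       # N(a,b), only for pairs actually seen
            self.left = defaultdict(int)    # N(a,*)
            self.right = defaultdict(int)   # N(*,b)
            self.total = 0                  # N(*,*)

        def observe(self, a, b, count=1):
            self.n[(a, b)] += count
            self.left[a] += count
            self.right[b] += count
            self.total += count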

OK, so... the existing language pipeline computes MI(w,d) already, and given a word and a disjunct on that word, you can just look it up. ... But if you are clustering words into a cluster, then the current code does not recompute MI(g,d) for some word-group ("grammatical class") g. Or maybe it does recompute it, but it might be incomplete or untested, or different because maybe your code is different. For the moment, let me ignore clustering...

So, for link-grammar, just take -MI(w,d) and make that the link-grammar "cost". Minus sign because larger-MI==better.

How well will that work? I dunno. This is new territory to me. Ben long insisted on "surprisingness" as a better number to work with. I have not implemented surprisingness in the matrix code; nothing computes it yet. Besides using MI, one can invent other things. I strongly believe that MI is the correct choice, but do not have any concrete proof.

If you do have grammatical clusters g, then perhaps one should use MI(w,g) + MI(g,d), or maybe just use MI(g,d) by itself. Likewise, if the disjunct d is the result of collapsing together a bunch of single-word disjuncts, maybe you should add MI(disjunct-class, single-disjunct) to the cost. I dunno. I was half-way through these experiments when Ben re-assigned me, so this is all new territory.

akolonin commented 5 years ago

As discussed with @alexei-gl :

Input parses (a toy Russian corpus; in order: "mom washed the frame", "mom washed Masha", "mom dug a pit", "mom hit the frame", "mom hit Masha"):

мама мыла раму

мама мыла машу

мама рыла яму

мама била раму

мама била машу

=>1. space formation (using word <-> proto-disjunct matrix)=>

                  мама- & раму+   мама- & машу+   мама- & яму+   мыла+ ...
мыла                    1               1               0          0
рыла                    0               0               1          0
била                    1               1               0          0
мама                    0               0               0          1

=>2. clustering=>

                  мама- & раму+   мама- & машу+   мама- & яму+   мыла+ ...
C1: мыла+била           1               1               0          0
C2: рыла                0               0               1          0
C3: мама                0               0               0          1

=>3. grammar induction=>

                  C1- & C4+       C1- & C5+       C1- & C6+      C3+ ...
C1: мыла+била           1(2?)           1(2?)           0          0
C2: рыла                0               0               1          0
C3: мама                0               0               0          1

C4: раму
C5: машу
C6: яму
C7: била
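To make step 1 concrete, a toy Python sketch of how the word <-> proto-disjunct counts above could be accumulated (the input format is an assumption: each parse given as a list of (left_word, right_word) links). The matrices above record presence (0/1); this sketch keeps raw counts instead, as item A of the plan below proposes:

    from collections import defaultdict

    # Each parse given as a list of (left_word, right_word) links, in sentence order.
    parses = [
        [("мама", "мыла"), ("мыла", "раму")],   # мама мыла раму
        [("мама", "мыла"), ("мыла", "машу")],   # мама мыла машу
        [("мама", "рыла"), ("рыла", "яму")],    # мама рыла яму
        [("мама", "била"), ("била", "раму")],   # мама била раму
        [("мама", "била"), ("била", "машу")],   # мама била машу
    ]

    counts = defaultdict(int)                   # {(word, proto-disjunct): count}
    for links in parses:
        lefts, rights = defaultdict(list), defaultdict(list)
        for lw, rw in links:
            rights[lw].append(rw + "+")         # connector to the word on the right
            lefts[rw].append(lw + "-")          # connector to the word on the left
        for word in set(lefts) | set(rights):
            disjunct = " & ".join(lefts[word] + rights[word])
            counts[(word, disjunct)] += 1

    # e.g. counts[("мыла", "мама- & раму+")] == 1, counts[("мама", "мыла+")] == 2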

See also:

#207

https://en.wikipedia.org/wiki/Formal_concept_analysis

TODO ITEMS:

A) Change =>1. space formation=> so that counts are accumulated for each word-disjunct pair, AND make sure the code is not broken and the tests PASS (may need a config option "account-parse-counts=True/False")

B) See if clustering in the new space (with account-parse-counts=True) makes better grammars for POCE/CDS/GC because of better clustering

C) Implement MPDC (minimum proto-disjunct count), so that disjuncts appearing fewer than MPDC times are excluded from the word-proto-disjunct matrix and not used in clustering and grammar induction, and see if it can improve clustering for POCE/CDS/GC

D) Implement MDC (minimum disjunct count), so that disjuncts appearing fewer than MDC times are excluded from the word-disjunct matrix (after clustering) and not used in grammar induction, and see if it can improve clustering for POCE/CDS/GC

E) Implement MWPDC (minimum word proto-disjunct count), so that word-proto-disjunct pairs (matrix cells) appearing fewer than MWPDC times are set to nulls in the word-proto-disjunct matrix and not used in clustering and grammar induction, and see if it can improve clustering for POCE/CDS/GC

F) Implement MWDC (minimum category disjunct count), so that word-disjunct pairs (matrix cells) appearing fewer than MWDC times are set to nulls in the category-disjunct matrix and not used in grammar induction, and see if it can improve clustering for POCE/CDS/GC

G) Assign costs to disjuncts in the LG dictionary file according to the values in the category-disjunct matrix cells, given the ideas from https://github.com/singnet/language-learning/issues/183 AND (!!!) having some more ideas on how the counts in word-proto-disjunct pairs (matrix cells) are transformed into values in category-disjunct pairs (matrix cells)

akolonin commented 5 years ago

@OlegBaskov wrote:

Items A-E of your bold plan were implemented in October 2018. You can just rename parameters if this seems crucial for grammar induction improvement: C. MPDC - 'min_link_count', E. MWPC - 'min_co-occurrence_count' (letters refer to your plan items; the existing parameters are described in the https://github.com/singnet/language-learning/tree/master/src/grammar_learner README). Item F of your plan, concerning "MCDC (minimum category disjunct count)" ~ filtering disjuncts in grammar rules based on their occurrence frequency ~ proved counter-productive, as it ruined smaller clusters of rare words and/or had no effect on larger clusters. The 'min_disjunct_count' parameter was replaced by 'max_disjuncts' (maximum number of disjuncts in a cluster) - https://github.com/singnet/language-learning/blob/601e7bc7f97a0b6c1f713f8108fc6e81d492e921/src/grammar_learner/grammar_inducer.py#L89. This parameter was not documented, as you did not want any filtering... IMHO none of the above is necessary for item "G) Assign costs to disjuncts in the LG dictionary file...". P.S. As usual, I advise reading and understanding the code before any discussion and planning; otherwise you will waste time returning to the plans after reading the code and before applying changes. I would not advise changing the code before understanding it...

@akolonin commented: Note, "min_link_count" does not seem to be the same as "min_disjunct_count"; they seem to be two different things.

alexei-gl commented 5 years ago

Disjunct costs (both reverse count and mutual information) were added to grammar-learner (PR #258).

The mutual information formula was simplified. @akolonin, @linas, please correct me if I am wrong about MI:

    MI(c, d) = log( N(c, d) / (N(c) * N(d)) )

where:

N(c, d) = sum([N(w1, d), N(w2, d),...N(wn, d)]) - number of times disjunct d appears with the words of the cluster c

N(c) - number of times the cluster words appear with any disjuncts

N(d) - number of times disjunct d appears with any word

According to the LG dictionary rule strategy, each word can be included in one rule only. In the current GL implementation each rule corresponds to one of the clusters. Each connector is formed from two cluster names followed by either a plus or a minus sign. So a disjunct may appear in one rule only. In that case N(c, d) = N(c), and for this particular case the original formula becomes:

    MI(c, d) = log( 1 / N(d) )

The only thing that may make a difference is disambiguation. The current GL code adds suffixes to disambiguated word instances, if the same word appears in the corpus with different meanings. In that case the general formula makes sense (not implemented so far).

The code was tested on the CDS corpus using the ALE algorithm. Simplified MI gives the best results of the three:

COST FUNC        PA       F1
no_cost_added    99.96%   0.9789
reverse_count    99.96%   0.9839
mi_simplified    99.96%   0.9850

More results can be found here: http://langlearn.singularitynet.io/data/aglushchenko_parses/GL-DJCT-WEIGHT-2019-09-26/

akolonin commented 5 years ago

That means that:
1) Use of costs provides little improvement on the CDS corpus (edited)
2) The improvement is small
3) More study is required using the GC corpus
4) More study is required involving "link clustering", so that a smaller number of "link types" and disjuncts made of "typed links" may be used across a larger number of "categories" (involving FCA-like clustering in the words/categories <-> links/disjuncts space)