singnet / language-learning

OpenCog Unsupervised Language Learning
https://wiki.opencog.org/w/Language_learning
MIT License

Grammar Learner: Identical Lexical Entries (ILE) algorithm based on multi-germ/single-disjunct entries #207

Open akolonin opened 5 years ago

akolonin commented 5 years ago

Problem: Currently, the Identical Lexical Entries (ILE) algorithm builds single-germ/multi-disjunct lexical entries (LE) first, and then aggregates identical ones based on unique combinations of disjuncts. This leads to a situation where rarely occurring words with a single disjunct, or just a few of them, are aggregated into clusters (which may be semi-random, due to the low frequency of the words).

Possible Solution: Support an alternative version of the ILE algorithm that builds multi-germ/single-disjunct LE first, and then aggregates identical ones sharing unique combinations of germs.
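For illustration, here is a minimal sketch of the two aggregation directions, assuming the parse data comes in as a flat list of (germ, disjunct) observations; the function names are hypothetical and this is not the code in clustering.py:

```python
from collections import defaultdict

def ile_by_disjunct_sets(pairs):
    """Current ILE: build single-germ/multi-disjunct entries, then merge
    germs whose sets of disjuncts are identical."""
    germ2djs = defaultdict(set)
    for germ, dj in pairs:
        germ2djs[germ].add(dj)
    clusters = defaultdict(set)
    for germ, djs in germ2djs.items():
        clusters[frozenset(djs)].add(germ)   # identical disjunct sets -> one cluster
    return list(clusters.values())

def ile_by_germ_sets(pairs):
    """Proposed ILE: build multi-germ/single-disjunct entries, then merge
    disjuncts whose sets of germs are identical; those germ sets are the clusters."""
    dj2germs = defaultdict(set)
    for germ, dj in pairs:
        dj2germs[dj].add(germ)
    return [set(germs) for germs in {frozenset(g) for g in dj2germs.values()}]
```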

Possible Extension: If no unique combinations of germs are found, the aggregation criteria may be relaxed by merging LE that share at least a given percentage of germs (some threshold value around 80%), with the non-overlapping words taken out into separate clusters. For example, given LE1(a b c d e) and LE2(b c d e f), aggregation would produce the aggregated LE3(b c d e) and two remainders LE4(a) and LE5(f).
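A minimal sketch of that relaxed criterion follows; the exact overlap measure (here the overlap divided by the size of the smaller germ set) is an assumption, since the text only asks for "at least a given percentage of germs, around 80%":

```python
def merge_with_threshold(germs1, germs2, threshold=0.8):
    """Merge two germ sets if they overlap enough; return the shared core
    plus the non-overlapping remainders as separate clusters."""
    overlap = germs1 & germs2
    ratio = len(overlap) / min(len(germs1), len(germs2))  # assumed overlap measure
    if ratio < threshold:
        return None                                       # not similar enough to merge
    remainders = [s - overlap for s in (germs1, germs2) if s - overlap]
    return overlap, remainders

# The example from the text:
# merge_with_threshold({'a','b','c','d','e'}, {'b','c','d','e','f'})
# -> ({'b','c','d','e'}, [{'a'}, {'f'}])
```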

Action: Need to implement all the options above and explore how they work for the Gutenberg Children corpus with MWC=1, 2, 6.

Background: findings described by @OlegBaskov below:

  1. Explored why "ILE" clustering provides only 1-word clusters for the "Gutenberg Children books" corpus filtered with min_word_count > 1 -- [issue https://github.com/OlegBaskov/language-learning/issues/73]. Highlights:

The unfiltered "Gutenberg Children Books" corpus "LG-E-noQuotes" dataset contains 22067 words, of which 8614 are observed only once in the whole (unfiltered) corpus, 3271 are observed twice, and 10182 are observed more than twice (46% of the total corpus).

ILE ("Identical Lexical Entries") clustering provides 18,939 grammar rules (clusters) containing 3,789 unique words, 3,416 (90%) of which are observed only once in the corpus.

Therefore "ILE cluster formation takes place only with MWC=1": more frequent words are associated with unique sets of disjuncts and are not "identical lexical entries": https://docs.google.com/spreadsheets/d/1TPbtGrqZ7saUHhOIi5yYmQ9c-cvVlAGqY14ATMPVCq4/edit#gid=624274537&range=C175

Details in Jupyter notebook static html copy: http://langlearn.singularitynet.io/data/clustering_2019/html/ILE-clustering-research-GCB-LG-E-noQuotes-2019-04-19.html

linas commented 5 years ago

ILE

I still don't understand what ILE is. Can you illustrate it with examples?

Until I understand what ILE is, I can't really comment.

dataset contains 22067 words, of which:

8614 words are observed only once in the whole (unfiltered) corpus, 3271 words are observed twice, 10182 words are observed more than twice (46% of the total corpus).

This is more or less exactly what a Zipf distribution looks like. This is normal and expected. If you wanted to be scientific, you would write: "the dataset contains 22067 words, which follow a perfect Zipf distribution, deviating by X%". I do not know how to measure X right now; I've never seen such a thing before, but maybe someone invented it? If you've never created that graph before, it's a worthwhile exercise.
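One possible way to put a number on that deviation (purely illustrative, not something the project prescribes) is to fit the rank-frequency curve in log-log space and report the residual:

```python
import numpy as np

def zipf_deviation(word_counts):
    """Fit log(frequency) ~ log(rank) by least squares; Zipf predicts a
    slope near -1. Returns the fitted slope and an RMS deviation in percent."""
    freqs = np.sort(np.asarray(word_counts, dtype=float))[::-1]
    ranks = np.arange(1, len(freqs) + 1)
    log_r, log_f = np.log(ranks), np.log(freqs)
    slope, intercept = np.polyfit(log_r, log_f, 1)
    fitted = slope * log_r + intercept
    rms = np.sqrt(np.mean((log_f - fitted) ** 2))
    return slope, 100.0 * rms / np.abs(log_f).mean()
```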

18,278 "clusters" are single-word and 661 clusters contain 2 or more words, but: 587 clusters consist of words observed only once in the input corpus, 74 clusters consist of words observed more than once, and 2 clusters consist of words observed more than 3x.

This seems very very wrong. This is not the shape of natural language.

akolonin commented 5 years ago

@linas , for more details on the current implementation of ILE, see https://github.com/singnet/language-learning/blob/master/src/grammar_learner/clustering.py#L276 lines 276-308

akolonin commented 5 years ago

@linas wrote: My gut intuition is that the most interesting numbers would be this:

MWC(GT) MSL(GT) PA      F1
5       2        
5       3        
5       4        
5       5        
5       10       
5       15       
5       25       

because I think that "5" gets you over the hump for central-limit. But, per earlier conversation: the disjuncts need to be weighted by something, as otherwise, you will get accuracy more-or-less exactly equal to MST accuracy. Without weighting, you cannot improve on MST. The weighting is super-important to do, and discovering the best weighting scheme is one major task (is it MI, surprisingness, something else?)
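For reference, the textbook pointwise mutual information of a (word, disjunct) pair is one concrete instance of the candidate weightings named above (whether this is the right weighting is exactly the open question):

    MI(w, d) = log2( p(w, d) / ( p(w) * p(d) ) )

where the probabilities are estimated from the observation counts N(w, d), N(w), N(d) and the total count N.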

Re: I just thought that "Currently, the Identical Lexical Entries (ILE) algorithm builds single-germ/multi-disjunct lexical entries (LE) first, and then aggregates identical ones based on unique combinations of disjuncts" is sufficient.

OK, so, by "lexical entry", I guess you mean "a single word-disjunct pair", where the disjunct connectors have not been clustered? So, yes, if they are identical, then yes, you should add together the observation counts. (It's important to keep track of observation counts; this is needed for computing MI.)
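A hedged sketch of that bookkeeping (the names are illustrative, not the repo's API): identical (word, disjunct) pairs collapse into one entry whose observation counts add, and those counts are exactly what a fractional MI estimate needs later:

```python
from collections import Counter
from math import log2

def merge_counts(observations):
    """observations: iterable of (word, disjunct) pairs, one per occurrence.
    Identical pairs collapse into a single entry with summed counts."""
    return Counter(observations)

def pair_mi(counts, word, disjunct):
    """Pointwise MI of an observed (word, disjunct) pair from merged counts."""
    total = sum(counts.values())
    n_wd = counts[(word, disjunct)]
    n_w = sum(c for (w, _), c in counts.items() if w == word)
    n_d = sum(c for (_, d), c in counts.items() if d == disjunct)
    return log2(n_wd * total / (n_w * n_d))
```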

Note that, in principle, a "lexical entry" could also be a (grammatical-class, disjunct) pair, or it could be a (word, disjunct-class) pair, or a (grammatical-class, disjunct-class) pair, where "grammatical-class" is a cluster, and "disjunct-class" is a disjunct with connectors to that class (instead of connectors to individual words). And please note: what I meant by "disjunct class" might not be the same thing as what you think it means, and so, without a lot of extra explanation, it gets confusing again.

At any rate, if you keep the clusters and aggregates in the atomspace, then the "matrix" code can compute MI's for them all. Else, you have to redesign that from scratch.

Side note: one reason I wanted everything in the atomspace, was so that I could apply the same class of algos -- computing MI, joining collections of atoms into networks, MST-like, then clustering, then recomputing MI again, etc. and leveraging that to obtain synonyms, word-senses, synonymous phrases, pronoun referents, etc. all without having to have a total redesign. To basically treat networks generically, not just networks of words, but networks of anythings, expressed as atoms.