piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

`Phrases` greedy detection combines the 1st, not necessarily highest-scoring, overlapping bigrams/multigrams #1719


init-random commented 7 years ago

Description

See the reproduction below: the most common phrase in the corpus is not detected as a bigram by Phrases.

Steps/Code/Corpus to Reproduce

from gensim.models.phrases import Phrases

s = ['i went to toys r us',
     'we are at babies r us',
     'where is babies r us',
     'do you mean toys r us']
s = [_.split() for _ in s]

phrases = Phrases(s, min_count=1, threshold=2)
list(phrases.export_phrases(s))

Actual Results

# [(b'toys r', 3.625),
#  (b'babies r', 3.625),
#  (b'babies r', 3.625),
#  (b'toys r', 3.625)]

Expected Results

Given that "r us" is the most frequent bigram in these sentences, I would expect it to be among the extracted phrases. Issue #794 mentions that overlapping bigrams are intentionally avoided. Depending on the use case, overlapping bigrams may not be a problem if all of them pass the specified threshold. Maybe consider an overlapping_phrases flag?
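For reference, recomputing the scores by hand suggests "r us" is in fact the stronger pair. This is only a rough sketch: I'm assuming the default scorer is the word2vec-style (count(a, b) - min_count) / (count(a) * count(b)) * vocab_size, with vocab_size counting both unigram and bigram keys; the numbers below happen to match the 3.625 reported above.

from collections import Counter

sentences = [line.split() for line in [
    'i went to toys r us',
    'we are at babies r us',
    'where is babies r us',
    'do you mean toys r us']]

unigrams = Counter(tok for sent in sentences for tok in sent)
bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
vocab_size = len(unigrams) + len(bigrams)  # 15 + 14 = 29 (assumed definition)
min_count = 1

def score(a, b):
    return (bigrams[(a, b)] - min_count) / (unigrams[a] * unigrams[b]) * vocab_size

print(score('toys', 'r'))  # 3.625, same value export_phrases() reports above
print(score('r', 'us'))    # 5.4375, higher, yet "r us" is never exported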

Versions

Windows-7-6.1.7601-SP1
Python 3.6.1 |Anaconda custom (64-bit)| (default, May 11 2017, 13:25:24) [MSC v.1900 64 bit (AMD64)]
NumPy 1.13.3
SciPy 1.0.0
gensim 2.3.0
FAST_VERSION 0

menshikh-iv commented 7 years ago

Thanks for the report @init-random; the problem reproduces with the latest version of gensim.

piskvorky commented 7 years ago

IMO Phrases needs a proper rewrite, both for performance and for cleaning up the logic and the API. This is also related to the Bounter optimizations.

gojomo commented 6 years ago

Phrase detection is 'greedy': if a token is consumed by an earlier-in-the-text phrase that meets the current threshold, it won't be available for a later phrase, even if the later phrase is 'stronger' (i.e. would be found even at a tougher threshold). As far as I know, the phrase detection in Google's word2vec.c release, which I believe inspired this code, behaves the same way.
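Roughly, the greedy pass looks like this (a simplified sketch for illustration only, not the actual analyze_sentence() code; score is any pairwise scoring function):

def greedy_phrases(sentence, score, threshold, delimiter='_'):
    # Walk left to right; whenever an adjacent pair clears the threshold,
    # join it and skip both tokens, so neither can join a later pair.
    out, i = [], 0
    while i < len(sentence):
        if i + 1 < len(sentence) and score(sentence[i], sentence[i + 1]) > threshold:
            out.append(sentence[i] + delimiter + sentence[i + 1])
            i += 2
        else:
            out.append(sentence[i])
            i += 1
    return out

# With the corpus above, 'toys r' clears the threshold first and consumes 'r',
# so 'r us' is never considered -> ['do', 'you', 'mean', 'toys_r', 'us']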

init-random commented 6 years ago

@gojomo thank you for the explanation. Yes, that does seem to be the case.

>> cat in.txt 
i went to toys r us
we are at babies r us
where is babies r us
do you mean toys r us
>> ./word2phrase -train in.txt -output phrases.txt -threshold 1 -min-count 1
>> cat phrases.txt 
 i went to toys_r us
 we are at babies_r us
 where is babies_r us
 do you mean toys_r us

My underlying assumption in using this was that it looks at a sliding window of bigrams, as described in figure 5.1 of the Chris Manning book https://nlp.stanford.edu/fsnlp/promo/colloc.pdf. The bounter bigram-count examples also use a sliding window.
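By "sliding window" I just mean tallying every consecutive pair, so overlapping bigrams are all counted; a minimal sketch with a plain Counter (not the bounter example verbatim):

from collections import Counter

def count_all_bigrams(sentences):
    # Count every consecutive pair, including overlapping ones like
    # ('toys', 'r') and ('r', 'us') from the same sentence.
    counts = Counter()
    for sent in sentences:
        counts.update(zip(sent, sent[1:]))
    return counts

count_all_bigrams(s).most_common(3)  # s is the tokenized corpus from the report above
# e.g. [(('r', 'us'), 4), (('toys', 'r'), 2), (('babies', 'r'), 2)]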

It seems like the greedy approach might be an engineering/optimization decision, which would make sense given that word2phrase is targeted at billions of words.

If this is intended behavior, it might be nice to have an option to count/consider all bigrams.

gojomo commented 6 years ago

Picking the stronger pairing does intuitively seem the better policy. The lookahead overhead shouldn't be too much, though you could imagine/contrive situations where all of a text's pairs would need to be scored before a single phrase promotion occurs (e.g. every pair is over the threshold, but each is slightly stronger than the one before it, through to the end of the text).

I wonder if, in such a case, iterative promotion to even-longer phrases could be checked immediately, perhaps via a merged/flattened frequencies dictionary, instead of the current model where it happens in a completely separate pass. (That is, if a b c d e becomes a b c d_e, the potential to phraseify (c, d_e) is then checked right away, from the same frequency info, rather than in some other pass using a separate frequency dictionary. In that case (c, d_e) has a fair chance of outscoring (b, c), which it doesn't get in two separate layered passes.)
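A non-greedy variant could score all adjacent pairs up front and then promote them strongest-first; a sketch only, assuming a pairwise score() function, and not a concrete proposal for the analyze_sentence() API:

def strongest_first_phrases(sentence, score, threshold, delimiter='_'):
    # Score every adjacent pair, then promote strongest-first, skipping any
    # pair whose tokens were already consumed by a stronger neighbour.
    pairs = [(score(a, b), i) for i, (a, b) in enumerate(zip(sentence, sentence[1:]))]
    consumed, joined = set(), set()
    for sc, i in sorted(pairs, reverse=True):
        if sc <= threshold:
            break
        if i in consumed or i + 1 in consumed:
            continue
        joined.add(i)
        consumed.update((i, i + 1))
    out, i = [], 0
    while i < len(sentence):
        if i in joined:
            out.append(sentence[i] + delimiter + sentence[i + 1])
            i += 2
        else:
            out.append(sentence[i])
            i += 1
    return out

# For 'do you mean toys r us' this yields ['do', 'you', 'mean', 'toys', 'r_us']
# instead of the greedy ['do', 'you', 'mean', 'toys_r', 'us'].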

init-random commented 6 years ago

I don't see why you couldn't, but I can think of a few questions:

gojomo commented 2 years ago

On duplicate #3367, @chaturv3di asks:

Is there value in implementing this, optimised [prefers-higher-scoring-bigram] version of analyze_sentence()?

[Update] PS: Would it be okay to (eventually) raise a PR for this?

If the alternate behavior doesn't cost too much more, and the case for its superiority is strong, a PR would be welcome. (As I mentioned in an earlier comment, this does intuitively seem the better policy.)

If the cost is noticeable – or it's determined that retaining compatibility with the original behavior (from the Google papers & word2vec.c release that were the model for this functionality) is important – then we might want to include an option to keep the older behavior.