init-random opened this issue 7 years ago
Thanks for the report @init-random; your problem reproduces with the latest version of gensim.
IMO `Phrases` needs a proper rewrite (both for performance and for cleaning up the logic and the API). Also related to the Bounter optimizations.
Phrase detection is 'greedy'; if a token is consumed by an earlier-in-the-text phrase meeting the current threshold, it won't be available for a later phrase, even if the later phrase is 'stronger' (would be found even at a tougher threshold). As far as I know the phrases-detection in Google's word2vec.c release, which I believe inspired this code, behaves the same way.
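To make the greedy behavior concrete, here's a minimal sketch of that left-to-right joining. The `scores` dict, the scoring values, and the helper name are made-up illustrations, not gensim's actual internals:

```python
# Sketch of greedy, left-to-right phrase joining as in word2phrase-style
# detection: once a token is consumed by an earlier bigram, it is
# unavailable for a later (possibly stronger) one.

def greedy_join(tokens, scores, threshold):
    out = []
    i = 0
    while i < len(tokens) - 1:
        bigram = (tokens[i], tokens[i + 1])
        if scores.get(bigram, 0) > threshold:
            out.append("_".join(bigram))
            i += 2  # both tokens consumed
        else:
            out.append(tokens[i])
            i += 1
    if i == len(tokens) - 1:
        out.append(tokens[-1])
    return out

# "toys r" passes the threshold first, so "r us" never gets a chance,
# even though it scores higher.
scores = {("toys", "r"): 2.0, ("r", "us"): 5.0}
print(greedy_join("i went to toys r us".split(), scores, 1.0))
# → ['i', 'went', 'to', 'toys_r', 'us']
```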
@gojomo thank you for the explanation. Yes, that does seem to be the case.
```
>> cat in.txt
i went to toys r us
we are at babies r us
where is babies r us
do you mean toys r us
>> ./word2phrase -train in.txt -output phrases.txt -threshold 1 -min-count 1
>> cat phrases.txt
i went to toys_r us
we are at babies_r us
where is babies_r us
do you mean toys_r us
```
I guess my underlying assumption in using this was that it was looking at a sliding window of bigrams, as described in figure 5.1 of the Chris Manning book: https://nlp.stanford.edu/fsnlp/promo/colloc.pdf. Also, the bounter bigram-count examples use a sliding window.
It seems like the greedy approach might be an engineering/optimization decision, which would make sense given that word2phrase is targeted at billions of words.
If this is a feature, it might be nice to have an option to count/consider all bigrams?
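For comparison, a plain sliding-window count over the same four sentences (the Manning-style approach, where every overlapping bigram is counted) does rank `r us` first. A small sketch, not gensim code:

```python
from collections import Counter

# Count every overlapping bigram with a sliding window, rather than
# consuming tokens greedily.
sentences = [
    "i went to toys r us",
    "we are at babies r us",
    "where is babies r us",
    "do you mean toys r us",
]
counts = Counter()
for sent in sentences:
    tokens = sent.split()
    counts.update(zip(tokens, tokens[1:]))  # all adjacent pairs

print(counts.most_common(1))
# → [(('r', 'us'), 4)]
```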
Picking the stronger pairing does intuitively seem the better policy. The lookahead overhead shouldn't be too much, though you could imagine/contrive situations where the whole text's pairs would need to be scored before a single phrase-promotion occurs. (E.g.: every pair is over the threshold, but each slightly stronger than the one before, through to the end of the text.)
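A rough sketch of that strongest-pair-first alternative: score every adjacent pair up front, then promote in descending score order, skipping pairs whose tokens were already consumed. The scores and function name are hypothetical, not gensim's actual scorer:

```python
def strongest_first_join(tokens, scores, threshold):
    # Score all adjacent pairs, strongest first.
    pairs = sorted(
        ((scores.get((tokens[i], tokens[i + 1]), 0), i)
         for i in range(len(tokens) - 1)),
        reverse=True,
    )
    consumed, joined_at = set(), set()
    for score, i in pairs:
        if score <= threshold:
            break  # remaining pairs are all below threshold
        if i in consumed or i + 1 in consumed:
            continue  # a stronger neighbour already took a token
        consumed.update((i, i + 1))
        joined_at.add(i)
    out, i = [], 0
    while i < len(tokens):
        if i in joined_at:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

# Now the stronger "r us" wins over "toys r".
scores = {("toys", "r"): 2.0, ("r", "us"): 5.0}
print(strongest_first_join("i went to toys r us".split(), scores, 1.0))
# → ['i', 'went', 'to', 'toys', 'r_us']
```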
I wonder if in such a case iterative promotion to even-longer phrases could be checked immediately, perhaps via a merged/flattened frequencies-dictionary, instead of the current model of it happening in a completely different iteration. (That is, if `a b c d e` becomes `a b c d_e`, the potential to phraseify (`c`, `d_e`) is then checked right away, from the same frequency info – rather than in some other pass using a separate frequency dictionary. And in that case (`c`, `d_e`) has a fair chance at outscoring (`b`, `c`), which it doesn't get a chance to do in two separate layered passes.)
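As a toy illustration of the merged-dictionary idea: unigrams and already-promoted phrases share one frequency namespace, so a freshly created phrase can be scored against its neighbours in the same pass. All the numbers and the scoring function below are hypothetical:

```python
# Hypothetical merged frequencies: "d_e" was just promoted and sits
# alongside the unigrams it would pair with.
freq = {"b": 10, "c": 10, "d_e": 8}
pair_freq = {("b", "c"): 3, ("c", "d_e"): 8}

def score(pair):
    # Toy PMI-style ratio; gensim's real scorer also factors in
    # corpus size and min_count.
    return pair_freq.get(pair, 0) / (freq[pair[0]] * freq[pair[1]])

# Checked in the same pass, (c, d_e) can outscore (b, c).
print(score(("c", "d_e")) > score(("b", "c")))
# → True
```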
I don't see why you couldn't, but I can think of a few questions:

- If `a b` is frequent but `b c` is not, then `a b c` may not be frequent (depending on threshold).
- If `a b` and `b c` both pass the threshold, but `a b c` does not, would the higher-scoring bigram get exported?

On duplicate #3367, @chaturv3di asks:
> Is there value in implementing this, optimised [prefers-higher-scoring-bigram] version of `analyze_sentence()`?
>
> [Update] PS: Would it be okay to (eventually) raise a PR for this?
If the alternate behavior doesn't cost too much more, and the case for its superiority is strong, a PR would be welcome. (As I mentioned in an earlier comment, this does seem intuitively the better policy.)
If the cost is noticeable – or it's determined that retaining compatibility with the original behavior (from the Google papers & `word2vec.c` release that were the model for this functionality) is important – then we might want to include an option to keep the older behavior.
Description
See below, but the most common phrase is not being detected as a bigram by `Phrases`.

Steps/Code/Corpus to Reproduce
Actual Results
Expected Results
Given that the phrase "r us" is the most common in the sentences, I would expect it to be extracted as a common phrase. There is mention in #794 of an intent to avoid overlapping bigrams. Depending on the use case, overlapping bigrams may not be a problem if all of them pass the specified threshold. Maybe consider an `overlapping_phrases` flag?
Versions
```
Windows-7-6.1.7601-SP1
Python 3.6.1 |Anaconda custom (64-bit)| (default, May 11 2017, 13:25:24) [MSC v.1900 64 bit (AMD64)]
NumPy 1.13.3
SciPy 1.0.0
gensim 2.3.0
FAST_VERSION 0
```