When computing phrases, it would be desirable for Phrases.analyze_sentence() to look ahead and prefer the phrase with the higher score. Let's say we have bigrams ('a_b', 0.25) and ('b_c', 0.45) and a sentence s = ['a', 'b', 'c']. At present, the return value of phrases.analyze_sentence(s) is going to be:
[('a_b', 0.25),
('c', None)]
However, since the second bigram has a higher score, it makes sense for the return value to be:
[('a', None),
('b_c', 0.45)]
Is there value in implementing this optimised version of analyze_sentence()?
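To make the idea concrete, here is a minimal sketch of the look-ahead in plain Python. It does not use gensim: the function name analyze_with_lookahead and the bigram_scores dict (mapping token pairs to scores) are hypothetical stand-ins for whatever the real model stores, and the one-token look-ahead is only one possible strategy, not necessarily the best one.

```python
def analyze_with_lookahead(sentence, bigram_scores, delimiter="_"):
    """Return (token_or_phrase, score) pairs, preferring the higher-scoring
    of two overlapping bigrams.

    bigram_scores is a hypothetical dict {(word_a, word_b): score}; it is an
    assumption for this sketch, not gensim's internal representation.
    """
    result = []
    i = 0
    n = len(sentence)
    while i < n:
        # Score of the bigram starting at the current token, if any.
        current = bigram_scores.get(tuple(sentence[i:i + 2])) if i + 1 < n else None
        # Score of the overlapping bigram starting one token later, if any.
        following = bigram_scores.get(tuple(sentence[i + 1:i + 3])) if i + 2 < n else None

        if current is not None and (following is None or current >= following):
            # The current bigram wins (or has no competitor): emit it as a phrase.
            result.append((delimiter.join(sentence[i:i + 2]), current))
            i += 2
        else:
            # Either no bigram starts here, or the next one scores higher:
            # emit the single token and let the next iteration pick up the bigram.
            result.append((sentence[i], None))
            i += 1
    return result


scores = {("a", "b"): 0.25, ("b", "c"): 0.45}
print(analyze_with_lookahead(["a", "b", "c"], scores))
# [('a', None), ('b_c', 0.45)]
```

Note that this only compares each bigram with its immediate successor, so it is not guaranteed to maximise the total score over longer chains of overlapping bigrams.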
[Update] PS: Would it be okay to (eventually) raise a PR for this?
This is essentially a duplicate of #1719, which includes some discussion (including considerations when multiple Phrases models are stacked). I'll update its title to make it easier to find.