piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

Phrases.analyze_sentence() performs a greedy search for phrases #3367

Open chaturv3di opened 2 years ago

chaturv3di commented 2 years ago

Problem description

When computing phrases, it would be desirable for Phrases.analyze_sentence() to look ahead and prefer phrases with higher scores. Say we have the bigrams ('a_b', 0.25) and ('b_c', 0.45) and a sentence s = ['a', 'b', 'c']. At present, phrases.analyze_sentence(s) returns:

[('a_b', 0.25),
 ('c', None)]

although, since the second bigram has a higher score, it makes sense for the return value to be:

[('a', None),
 ('b_c', 0.45)]

Is there value in implementing this optimised version of analyze_sentence()?
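To make the difference concrete, here is a minimal sketch in plain Python. It does not call gensim: the `scores` dict is a hypothetical stand-in for bigrams that already passed the model's threshold, and `greedy` and `lookahead` are illustrative names, not gensim functions. The look-ahead variant only peeks one bigram ahead, so it is a local improvement rather than a globally optimal segmentation.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical stand-in for bigrams that passed the scoring threshold.
Scores = Dict[Tuple[str, ...], float]
Result = List[Tuple[str, Optional[float]]]


def greedy(sentence: List[str], scores: Scores, delimiter: str = "_") -> Result:
    """Left to right: merge the first bigram that has a score, then skip past it."""
    out: Result = []
    i = 0
    while i < len(sentence):
        pair = tuple(sentence[i:i + 2])
        if pair in scores:
            out.append((delimiter.join(pair), scores[pair]))
            i += 2
        else:
            out.append((sentence[i], None))
            i += 1
    return out


def lookahead(sentence: List[str], scores: Scores, delimiter: str = "_") -> Result:
    """Merge the current bigram only if the overlapping next bigram does not score higher."""
    out: Result = []
    i = 0
    while i < len(sentence):
        pair = tuple(sentence[i:i + 2])
        here = scores.get(pair)
        # Score of the bigram that starts one token later (None if absent).
        nxt = scores.get(tuple(sentence[i + 1:i + 3]))
        if here is not None and (nxt is None or here >= nxt):
            out.append((delimiter.join(pair), here))
            i += 2
        else:
            out.append((sentence[i], None))
            i += 1
    return out


scores = {("a", "b"): 0.25, ("b", "c"): 0.45}
s = ["a", "b", "c"]
print(greedy(s, scores))     # [('a_b', 0.25), ('c', None)]
print(lookahead(s, scores))  # [('a', None), ('b_c', 0.45)]
```

With the example above, `greedy` reproduces the current behaviour and `lookahead` produces the result I would expect.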

[Update] PS: Would it be okay to (eventually) raise a PR for this?

Versions

Linux-4.19.0-21-cloud-amd64-x86_64-with-debian-10.12
Python 3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 06:08:53) 
[GCC 9.4.0]
Bits 64
NumPy 1.19.5
SciPy 1.7.3
gensim 4.2.0
FAST_VERSION 0

gojomo commented 2 years ago

This is essentially a duplicate of #1719, which includes some discussion (including considerations when multiple Phrases models are stacked). I'll update its title to make it easier to find.