piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

Phraser max NPMI score > 1 #3042

Open joachimdb opened 3 years ago

joachimdb commented 3 years ago

Problem description

I trained an NPMI phraser on the latest Wikipedia dump. My understanding is that scores should be <= 1.0, but I get a higher score.

Steps/code/corpus to reproduce

from gensim.corpora import WikiCorpus
from gensim.models import Phrases
from gensim.models.phrases import Phraser

wiki_corpus = WikiCorpus("enwiki-latest-pages-articles-multistream.xml.bz2", dictionary={})

ENGLISH_CONNECTOR_WORDS = frozenset(
    " a an the "  # articles; we never care about these in MWEs
    " for of with without at from to in on by "  # prepositions; incomplete on purpose, to minimize FNs
    " and or "  # conjunctions; incomplete on purpose, to minimize FNs
    .split()
)

phrases = Phrases(wiki_corpus.get_texts(), scoring='npmi', threshold=0.75, min_count=5, common_terms=ENGLISH_CONNECTOR_WORDS, max_vocab_size=80000000)
phraser = Phraser(phrases)

Then:

In[2]: max(phraser.phrasegrams.values())
Out[2]: 1.2003355030351979

Versions

Linux-3.10.0-1160.6.1.el7.x86_64-x86_64-with-centos-7.9.2009-Core
Python 3.7.9 (default, Aug 31 2020, 12:42:55)
[GCC 7.3.0]
Bits 64
NumPy 1.19.2
gensim 3.8.0
FAST_VERSION 1

piskvorky commented 3 years ago

Yeah, that's weird. AFAIR the NPMI scores should be in [-1, 1]. Can you check in Gensim 4.0.0 please? (pip install -U gensim)

Could you inspect what the underlying words and word counts are for the affected bigram? Maybe that will shed some light and help us debug. Thanks.
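To make the counts check concrete, here is a minimal sketch of the NPMI computation in plain Python. It is not Gensim's code: the helper name `npmi` and all the counts are made up for illustration. It only shows that a score above 1 requires the bigram count to exceed what the two unigram counts allow, which is what to look for when inspecting the affected bigram.

```python
import math


def npmi(count_a, count_b, count_ab, total):
    """Hypothetical helper: NPMI from raw counts, with p(x) = count(x) / total."""
    p_a = count_a / total
    p_b = count_b / total
    p_ab = count_ab / total
    # Pointwise mutual information, normalized by -ln p(a, b).
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)


# Consistent counts: the bigram cannot occur more often than either word.
perfect = npmi(count_a=10, count_b=10, count_ab=10, total=1000)        # words always co-occur
independent = npmi(count_a=100, count_b=100, count_ab=10, total=1000)  # p(a, b) = p(a) * p(b)

# Inconsistent counts (illustrative only): the bigram count is larger than
# both unigram counts, so p(a, b)**2 > p(a) * p(b) and the score exceeds 1.
broken = npmi(count_a=10, count_b=10, count_ab=20, total=1000)

print(perfect, independent, broken)
```

With consistent counts the result stays within [-1, 1]. A value like 1.2 would mean the stored count for the bigram is higher than the geometric mean of the two word counts, so comparing those three numbers for the top-scoring bigram should narrow things down.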

piskvorky commented 3 years ago

Also, looking at the npmi docs, I don't understand why the formula talks about prop (?), but then refers to prob on the same line. Weird too.

EDIT: that formula seems to have been introduced in https://github.com/RaRe-Technologies/gensim/commit/5677ab300e4e3dc4645806762b693902b97c13c3#diff-b792e36e52289f193a1ef84cc9f58884b95dc1a29bdb21ad8f7769daf0a3dbb0R670 . I'm leaning toward a simple typo – reviews were more lax at that time than they are now.
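
For comparison, this is the standard NPMI definition written with a single symbol throughout. It is the textbook form, not a quote of the Gensim docstring; the symbol N (total number of tokens in the corpus) is an assumption about how the probabilities are estimated.

```latex
\mathrm{NPMI}(a, b)
  = \frac{\ln \dfrac{p(a, b)}{p(a)\, p(b)}}{-\ln p(a, b)},
\qquad
p(w) = \frac{\mathrm{count}(w)}{N}

% Since -\ln p(a, b) > 0, the score exceeds 1 exactly when
% p(a, b)^2 > p(a)\, p(b), which cannot happen if
% p(a, b) \le \min\bigl(p(a), p(b)\bigr).
```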