Open joachimdb opened 3 years ago
Yeah that's weird. AFAIR the NPMI scores should be in <-1, 1>. Can you check in Gensim 4.0.0 please? (`pip install -U gensim`)
Could you inspect what the underlying words and word counts are, for the affected bigram? Maybe that will shed some light, help us debug. Thanks.
Also, looking at the npmi docs, I don't understand why the formula talks about `prop` (?), but then refers to `prob` on the same line. Weird too.
EDIT: that formula seems to have been introduced in https://github.com/RaRe-Technologies/gensim/commit/5677ab300e4e3dc4645806762b693902b97c13c3#diff-b792e36e52289f193a1ef84cc9f58884b95dc1a29bdb21ad8f7769daf0a3dbb0R670 . I'm leaning toward a simple typo – reviews were more lax at that time than they are now.
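For reference while debugging, here is a minimal sketch of the textbook NPMI definition computed from raw counts. It is not Gensim's implementation; the function name `npmi` and its parameters are hypothetical, chosen only for illustration. With consistent counts (the bigram never occurs more often than either of its words) the score stays within <-1, 1>. The last call feeds in deliberately inconsistent counts to show one way a score above 1 can arise arithmetically; whether that is what happens in this report is an open question, not a diagnosis.

```python
import math


def npmi(count_a, count_b, count_ab, total):
    """Textbook NPMI from raw counts (illustrative helper, not Gensim's API).

    count_a, count_b: occurrences of each word
    count_ab: occurrences of the bigram
    total: corpus size used to turn counts into probabilities
    """
    if not 0 < count_ab < total:
        raise ValueError("count_ab must be strictly between 0 and total")
    p_a = count_a / total
    p_b = count_b / total
    p_ab = count_ab / total
    pmi = math.log(p_ab / (p_a * p_b))
    # Dividing by -log(p_ab) normalizes PMI into [-1, 1] for consistent counts.
    return pmi / -math.log(p_ab)


# The two words always occur together: score is at the upper bound (about 1).
always_together = npmi(count_a=50, count_b=50, count_ab=50, total=1000)

# The words co-occur exactly as often as chance predicts: score is about 0.
independent = npmi(count_a=100, count_b=100, count_ab=10, total=1000)

# Inconsistent input: the bigram is counted more often than its own words.
# This is an assumed scenario, used only to show the bound can be broken.
inconsistent = npmi(count_a=10, count_b=10, count_ab=50, total=1000)

print(always_together, independent, inconsistent)
```

Comparing the three printed values against the counts of the affected bigram may help narrow down whether the counts themselves are inconsistent.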
Problem description
I trained an NPMI phraser on the latest Wikipedia dump. It is my understanding that scores should be <= 1.0, but I get a higher score.
Steps/code/corpus to reproduce
Then:
Versions