piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

HdpModel inference doesn't sum to 1 for each document #2849

Open Davide95 opened 4 years ago

Davide95 commented 4 years ago

Problem description

Inference should give you a probability distribution drawn from a Dirichlet. Instead it gives you a series of numbers that don't add up to 1.

Steps/code/corpus to reproduce

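import numpy as np
from gensim.matutils import corpus2dense
from gensim.models import HdpModel
# corpus, vocab (the id2word mapping) and bow_data are defined elsewhere
hdp = HdpModel(corpus, vocab)
doctopic = corpus2dense(hdp[corpus], num_terms=hdp.m_T,
                        num_docs=bow_data.shape[0],
                        dtype=np.float32)
print(np.sum(doctopic[:, 0]))  # topic weights of the first document; expected 1.0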
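
The same check can be run on every document without the dense conversion. This is a minimal sketch, assuming the transformed corpus yields one list of (topic_id, weight) pairs per document. The helper name unnormalized_docs and the sample values are made up for illustration and are not gensim output.

```python
import math


def unnormalized_docs(doc_topics, tol=1e-6):
    """Return (index, total) for every document whose topic weights do not sum to 1."""
    bad = []
    for i, doc in enumerate(doc_topics):
        total = sum(weight for _topic_id, weight in doc)
        if not math.isclose(total, 1.0, abs_tol=tol):
            bad.append((i, total))
    return bad


# Made-up stand-in for hdp[corpus]: one list of (topic_id, weight) pairs per document.
sample = [
    [(0, 0.7), (1, 0.2), (2, 0.1)],  # weights add up to 1
    [(0, 0.6), (3, 0.3)],            # weights add up to roughly 0.9
]
print(unnormalized_docs(sample))
```

With a real model, passing hdp[corpus] instead of sample should list the documents affected by the problem.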

Versions

>>> import sys; print("Python", sys.version)
Python 3.7.6 (default, Jan  8 2020, 19:59:22) 
[GCC 7.3.0]
>>> import numpy; print("NumPy", numpy.__version__)
NumPy 1.18.1
>>> import scipy; print("SciPy", scipy.__version__)
SciPy 1.4.1
>>> import gensim; print("gensim", gensim.__version__)
gensim 3.8.0
>>> from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)
FAST_VERSION 1

Davide95 commented 4 years ago

Edit: the __getitem__ method has an undocumented eps parameter that should NEVER be there, because in Python the subscript operator passes only one argument to __getitem__. Also, eps has no effect when more than one document is passed, since it is not forwarded to the _apply function.
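
To illustrate the first point, here is a small sketch with a hypothetical class (Demo is not gensim code, and its eps default of 0.01 is chosen only for the example). It shows that the subscript syntax cannot set an extra __getitem__ parameter:

```python
class Demo:
    """Hypothetical class, not gensim code: __getitem__ with an extra parameter."""

    def __getitem__(self, key, eps=0.01):
        return key, eps


d = Demo()
print(d[5])                   # (5, 0.01): eps keeps its default
print(d[5, 0.5])              # ((5, 0.5), 0.01): the tuple becomes key, eps is untouched
print(d.__getitem__(5, 0.5))  # (5, 0.5): only an explicit method call can set eps
```

Anything written inside the brackets is packed into the single key argument, so a caller using model[doc] has no way to choose eps.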