piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

Zero probabilities in LDA model #2418

Open piskvorky opened 5 years ago

piskvorky commented 5 years ago

Problem description

A user reported "empty" topics (all probabilities zero) during LdaModel training: https://groups.google.com/forum/#!topic/gensim/LuPD2VSouSQ

Apparently some of the recent optimizations in #1656 (and maybe elsewhere?) introduced numeric instabilities.

Steps/code/corpus to reproduce

Unknown. Probably related to large data size: a large vocabulary in combination with a large number of topics, leading to float32 under/overflows.

User reported that changing the dtype back to float64 helped and the "empty topics" problem went away.
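
For reference, a minimal sketch of that workaround, using LdaModel's dtype parameter; corpus and dictionary are placeholders here, not the reporter's actual data:

    import numpy as np
    from gensim.models import LdaModel

    # Sketch of the reported workaround: train in float64 rather than the
    # default float32. `corpus` and `dictionary` are placeholders.
    lda = LdaModel(corpus=corpus,
                   id2word=dictionary,
                   num_topics=1000,
                   dtype=np.float64)  # reportedly made the "empty topics" go away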

Versions

Please provide the output of:

import platform; print(platform.platform())
import sys; print("Python", sys.version)
import numpy; print("NumPy", numpy.__version__)
import scipy; print("SciPy", scipy.__version__)
import gensim; print("gensim", gensim.__version__)
from gensim.models import word2vec; print("FAST_VERSION", word2vec.FAST_VERSION)
enys commented 5 years ago

Hi @piskvorky,

Apparently it was one of my team members. Please find attached the output:

Python 3.6.8 (default, Mar 15 2019, 14:14:12)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import platform; print(platform.platform())
Linux-4.4.0-130-generic-x86_64-with-debian-stretch-sid
>>> import sys; print("Python", sys.version)
Python 3.6.8 (default, Mar 15 2019, 14:14:12)
[GCC 5.4.0 20160609]
>>> import numpy; print("NumPy", numpy.__version__)
NumPy 1.16.2
>>> import scipy; print("SciPy", scipy.__version__)
SciPy 1.2.1
>>> import gensim; print("gensim", gensim.__version__)
gensim 3.7.1
enys commented 5 years ago

It is indeed a combination of large vocabulary + many topics: runs with 500 and 1000 topics suffer the problem. Our dictionary size is > 300K. We also use online updates in chunks of 100K documents, with a target total corpus size of 50M.
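
To illustrate the suspected failure mode at this scale (a sketch only, not the actual code path inside LdaModel): at vocabulary sizes like these, per-term probabilities are small enough that even a short product of them underflows float32.

    import numpy as np

    # With a ~500K-term dictionary, per-term probabilities are on the order
    # of 2e-06. A product of just eight such factors (~2.6e-46) falls below
    # float32's smallest subnormal (~1.4e-45) and silently becomes zero.
    p = np.float32(1.0 / 500_000)
    print(p ** 8)              # 0.0 in float32
    print(np.float64(p) ** 8)  # ~2.6e-46 in float64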

horpto commented 5 years ago

Hi @enys, can you share a minimal dataset that reproduces the problem?

enys commented 5 years ago

Quick answer: no. The dictionary contains 519K entries and the corpus is built from precalculated bags-of-words. I will paste my parameters. I could try to build a random corpus/dict if there is a high probability that the issue is due to cardinality. I have a run computing over the weekend.

horpto commented 5 years ago

Quite a stupid question: how can a topic's probabilities be all zeros if show_topics normalizes each topic row? https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/ldamodel.py#L1163
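
For context, the normalization at that line boils down to dividing each topic row by its sum, which is also why a row of exact zeros could not render as 0.000 (a sketch, not the verbatim gensim code):

    import numpy as np

    # Dividing an all-zero row by its sum yields 0/0 = NaN, not 0.000.
    # So the displayed values are presumably tiny but nonzero, rounded away.
    topic = np.zeros(5, dtype=np.float32)
    with np.errstate(invalid='ignore'):
        print(topic / topic.sum())  # [nan nan nan nan nan]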

enys commented 5 years ago

Hi @horpto, sorry for the late reply. "Fully" might be a slight overstatement; however, it renders:

[(178, '0.000*"giitane" + 0.000*"giljan" + 0.000*"gianyar" + 0.000*"gibiloba" + 0.000*"giesteira" + 0.000*"giitana" + 0.000*"ghuysen" + 0.000*"gileras" + 0.000*"giocarsela" + 0.000*"giltmagazine"'),

(299, '0.000*"giitane" + 0.000*"giljan" + 0.000*"gianyar" + 0.000*"gibiloba" + 0.000*"giesteira" + 0.000*"giitana" + 0.000*"ghuysen" + 0.000*"gileras" + 0.000*"giocarsela" + 0.000*"giltmagazine"'),

(281, '0.000*"giitane" + 0.000*"giljan" + 0.000*"gianyar" + 0.000*"gibiloba" + 0.000*"giesteira" + 0.000*"giitana" + 0.000*"ghuysen" + 0.000*"gileras" + 0.000*"giocarsela" + 0.000*"giltmagazine"'),

(208, '0.000*"giitane" + 0.000*"giljan" + 0.000*"gianyar" + 0.000*"gibiloba" + 0.000*"giesteira" + 0.000*"giitana" + 0.000*"ghuysen" + 0.000*"gileras" + 0.000*"giocarsela" + 0.000*"giltmagazine"'),

(485, '0.000*"giitane" + 0.000*"giljan" + 0.000*"gianyar" + 0.000*"gibiloba" + 0.000*"giesteira" + 0.000*"giitana" + 0.000*"ghuysen" + 0.000*"gileras" + 0.000*"giocarsela" + 0.000*"giltmagazine"'),

(72, '0.000*"giitane" + 0.000*"giljan" + 0.000*"gianyar" + 0.000*"gibiloba" + 0.000*"giesteira" + 0.000*"giitana" + 0.000*"ghuysen" + 0.000*"gileras" + 0.000*"giocarsela" + 0.000*"giltmagazine"'),

(65, '0.000*"giitane" + 0.000*"giljan" + 0.000*"gianyar" + 0.000*"gibiloba" + 0.000*"giesteira" + 0.000*"giitana" + 0.000*"ghuysen" + 0.000*"gileras" + 0.000*"giocarsela" + 0.000*"giltmagazine"'),

(332, '0.000*"giitane" + 0.000*"giljan" + 0.000*"gianyar" + 0.000*"gibiloba" + 0.000*"giesteira" + 0.000*"giitana" + 0.000*"ghuysen" + 0.000*"gileras" + 0.000*"giocarsela" + 0.000*"giltmagazine"'),

(267, '0.000*"giitane" + 0.000*"giljan" + 0.000*"gianyar" + 0.000*"gibiloba" + 0.000*"giesteira" + 0.000*"giitana" + 0.000*"ghuysen" + 0.000*"gileras" + 0.000*"giocarsela" + 0.000*"giltmagazine"'),

(75, '0.000*"giitane" + 0.000*"giljan" + 0.000*"gianyar" + 0.000*"gibiloba" + 0.000*"giesteira" + 0.000*"giitana" + 0.000*"ghuysen" + 0.000*"gileras" + 0.000*"giocarsela" + 0.000*"giltmagazine"'),
horpto commented 5 years ago

@enys, sorry for the late response. Lda.show_topics shows only the first 10 words of each topic by default, and moreover it rounds topic probabilities to 3 digits after the decimal point. Can you show topics with formatted=False and a fairly large num_words parameter?
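
Concretely, something along these lines, assuming lda is the trained model:

    # Show every topic, more words per topic, and unformatted
    # (word, probability) pairs instead of strings rounded to 3 digits.
    topics = lda.show_topics(num_topics=-1, num_words=50, formatted=False)
    for topic_id, terms in topics:
        print(topic_id, terms[:5])  # raw, unrounded probabilities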

piskvorky commented 4 years ago

Ping @enys: are you able to share a reproducible example? We'd like to get to the bottom of this.

davidalbertonogueira commented 4 years ago

I have this issue (zero probabilities for words in show_topics) only when using gensim.models.LdaMulticore. The output of gensim.models.ldamodel.LdaModel is as expected.
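
A minimal sketch of how one might isolate the difference, assuming placeholder corpus and id2word objects (the reporter's data is proprietary):

    from gensim.models import LdaModel, LdaMulticore

    # Train both variants on the same data; `corpus` and `id2word` are
    # placeholders for the reporter's proprietary inputs.
    single = LdaModel(corpus=corpus, id2word=id2word, num_topics=10)
    multi = LdaMulticore(corpus=corpus, id2word=id2word, num_topics=10, workers=4)

    # Compare the raw topic-term matrices for exact zeros.
    print("LdaModel zeros:    ", (single.get_topics() == 0).any())
    print("LdaMulticore zeros:", (multi.get_topics() == 0).any())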

piskvorky commented 4 years ago

@davidalbertonogueira Same comments as above apply.

davidalbertonogueira commented 4 years ago

I'm sorry, but my current dataset is proprietary. I reckon I could try to create a small example that generates the same error, but I would have to do it with publicly available online data, and therefore there's no point in doing that myself.

I share the dimensions in case it helps someone trying to replicate the error: len(gensim_corpus) = 109,000; len(gensim_dictionary) = 13,989; n_topics = 10.

piskvorky commented 4 years ago

@davidalbertonogueira that seems different from the issue reported here, which had a huge (500K) vocabulary and lots of topics (1000). In your case, you have only a 14K vocab + 10 topics. Likely unrelated; a separate issue.

davidalbertonogueira commented 4 years ago

Should I open a new issue then? @piskvorky

piskvorky commented 4 years ago

Only if you're able to include the reproducing example :) Otherwise there isn't much we'll be able to do anyway. Thanks.

SphtKr commented 3 years ago

I have far less experience than the other reporters (i.e. it could be something I'm doing wrong), but I'm seeing the same thing: one or more topics with near-zero probabilities, and the terms are usually alphabetically contiguous. My corpus is derived from the Yelp Dataset Challenge, licensed for academic use. I may be able to share the contents, but I'm unsure; I'll have to read the license closely. However, my corpus is also very small and I'm using a small number of topics (10-100), so again, it could be something naive I'm doing.

My code looks like this. The very low max_df is on purpose, as I was trying to get rid of lots of irrelevant features cheaply. If nothing else looks stupid, I can try to contribute a reproduction.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from gensim import matutils, models

    # Very low max_df on purpose: cheaply drop frequent, irrelevant features.
    vectorizer = TfidfVectorizer(max_df=0.1, max_features=numfeatures,
                                 min_df=2, stop_words='english',
                                 use_idf=True, ngram_range=(1, 3))

    X = vectorizer.fit_transform(text)

    # Map feature indices to words for gensim's id2word.
    id2words = {i: word for i, word in enumerate(vectorizer.get_feature_names())}

    # Documents are rows of X, hence documents_columns=False.
    corpus = matutils.Sparse2Corpus(X, documents_columns=False)

    lda = models.ldamodel.LdaModel(corpus=corpus,
                                   id2word=id2words,
                                   num_topics=10,
                                   update_every=1,
                                   chunksize=100,
                                   passes=5,
                                   alpha='auto',
                                   per_word_topics=True)

The latest run covers 74,310 documents with 100,000 features.

Then I dump the topics to a text file (among other things), and my "empty" topic looks like this:

Topic: 4
chino : 1e-05
pei wei : 1e-05
taguara : 1e-05
gelato spot : 1e-05
jade : 1e-05
la taguara : 1e-05
wei : 1e-05
place week : 9.999999e-06
place way priced : 9.999999e-06
place welcoming : 9.999999e-06
place welcome : 9.999999e-06
place weird : 9.999999e-06
place weeks : 9.999999e-06
place weekend : 9.999999e-06
place went dinner : 9.999999e-06
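
Worth noting: 1e-05 is exactly 1/100000, i.e. a uniform distribution over the 100,000 features, which hints the topic was never meaningfully updated. A sketch of how one might check the raw matrix, assuming lda is the model from the snippet above:

    import numpy as np

    # `lda` is assumed to be the trained model from the snippet above.
    topic_term = lda.get_topics()        # shape: (num_topics, vocab_size)
    uniform = 1.0 / topic_term.shape[1]
    near_uniform = np.isclose(topic_term, uniform, rtol=1e-3).all(axis=1)
    print("near-uniform topics:", np.flatnonzero(near_uniform))
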
SphtKr commented 3 years ago

Here's the requested version output; sorry, I missed that earlier:

macOS-10.16-x86_64-i386-64bit
Python 3.8.6 (default, Nov 11 2020, 13:20:43) 
[Clang 12.0.0 (clang-1200.0.32.21)]
NumPy 1.19.4
SciPy 1.6.3
gensim 4.0.1
FAST_VERSION 0