piskvorky opened this issue 5 years ago
Hi @piskvorky,
Apparently it was one of my team members. Please find attached the output:
Python 3.6.8 (default, Mar 15 2019, 14:14:12)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import platform; print(platform.platform())
Linux-4.4.0-130-generic-x86_64-with-debian-stretch-sid
>>> import sys; print("Python", sys.version)
Python 3.6.8 (default, Mar 15 2019, 14:14:12)
[GCC 5.4.0 20160609]
>>> import numpy; print("NumPy", numpy.__version__)
NumPy 1.16.2
>>> import scipy; print("SciPy", scipy.__version__)
SciPy 1.2.1
>>> import gensim; print("gensim", gensim.__version__)
gensim 3.7.1
It does indeed occur with a combination of large vocabulary + many topics. Runs with 500 and 1000 topics suffer from the problem. Our dictionary size is > 300K. We also use online updates in chunks of 100K documents, with a target total corpus size of 50M.
Hi @enys, can you share a minimal dataset that reproduces the problem?
Quick answer: no. The dictionary contains 519K entries and the corpus is built from precalculated bags-of-words. I will paste my parameters. I could try to build a random corpus/dictionary if there is a high probability that the issue is due to cardinality. I have a run computing over the weekend.
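For anyone trying to replicate at this scale, a configuration matching the dimensions described above would look roughly like the sketch below. The values are illustrative only, matching the reported dimensions rather than the actual parameters, and corpus/dictionary stand in for whatever objects the pipeline builds:

from gensim.models import LdaModel

# Illustrative only: dimensions match those described above
# (~519K-word dictionary, 500-1000 topics, 100K-document chunks).
lda = LdaModel(corpus=corpus,         # streamed bag-of-words corpus
               id2word=dictionary,    # ~519K entries
               num_topics=1000,
               chunksize=100_000,     # online updates in 100K-doc chunks
               update_every=1)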
A possibly stupid question: how can a topic's probabilities be all zeros if show_topics normalizes the topic row? https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/ldamodel.py#L1163
Hi @horpto, sorry for the late reply. "Fully zeros" might be a slight overstatement; however, it renders as:
[(178, '0.000*"giitane" + 0.000*"giljan" + 0.000*"gianyar" + 0.000*"gibiloba" + 0.000*"giesteira" + 0.000*"giitana" + 0.000*"ghuysen" + 0.000*"gileras" + 0.000*"giocarsela" + 0.000*"giltmagazine"'),
(299, '0.000*"giitane" + 0.000*"giljan" + 0.000*"gianyar" + 0.000*"gibiloba" + 0.000*"giesteira" + 0.000*"giitana" + 0.000*"ghuysen" + 0.000*"gileras" + 0.000*"giocarsela" + 0.000*"giltmagazine"'),
(281, '0.000*"giitane" + 0.000*"giljan" + 0.000*"gianyar" + 0.000*"gibiloba" + 0.000*"giesteira" + 0.000*"giitana" + 0.000*"ghuysen" + 0.000*"gileras" + 0.000*"giocarsela" + 0.000*"giltmagazine"'),
(208, '0.000*"giitane" + 0.000*"giljan" + 0.000*"gianyar" + 0.000*"gibiloba" + 0.000*"giesteira" + 0.000*"giitana" + 0.000*"ghuysen" + 0.000*"gileras" + 0.000*"giocarsela" + 0.000*"giltmagazine"'),
(485, '0.000*"giitane" + 0.000*"giljan" + 0.000*"gianyar" + 0.000*"gibiloba" + 0.000*"giesteira" + 0.000*"giitana" + 0.000*"ghuysen" + 0.000*"gileras" + 0.000*"giocarsela" + 0.000*"giltmagazine"'),
(72, '0.000*"giitane" + 0.000*"giljan" + 0.000*"gianyar" + 0.000*"gibiloba" + 0.000*"giesteira" + 0.000*"giitana" + 0.000*"ghuysen" + 0.000*"gileras" + 0.000*"giocarsela" + 0.000*"giltmagazine"'),
(65, '0.000*"giitane" + 0.000*"giljan" + 0.000*"gianyar" + 0.000*"gibiloba" + 0.000*"giesteira" + 0.000*"giitana" + 0.000*"ghuysen" + 0.000*"gileras" + 0.000*"giocarsela" + 0.000*"giltmagazine"'),
(332, '0.000*"giitane" + 0.000*"giljan" + 0.000*"gianyar" + 0.000*"gibiloba" + 0.000*"giesteira" + 0.000*"giitana" + 0.000*"ghuysen" + 0.000*"gileras" + 0.000*"giocarsela" + 0.000*"giltmagazine"'),
(267, '0.000*"giitane" + 0.000*"giljan" + 0.000*"gianyar" + 0.000*"gibiloba" + 0.000*"giesteira" + 0.000*"giitana" + 0.000*"ghuysen" + 0.000*"gileras" + 0.000*"giocarsela" + 0.000*"giltmagazine"'),
(75, '0.000*"giitane" + 0.000*"giljan" + 0.000*"gianyar" + 0.000*"gibiloba" + 0.000*"giesteira" + 0.000*"giitana" + 0.000*"ghuysen" + 0.000*"gileras" + 0.000*"giocarsela" + 0.000*"giltmagazine"'),
@enys, sorry for the late response. Lda.show_topics shows only the first 10 words of each topic by default and, moreover, rounds topic probabilities to 3 digits after the decimal point. Can you show the topics with formatted=False and a fairly large num_words parameter?
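To illustrate the rounding point: with a ~519K-word dictionary, even a perfectly normalized, near-uniform topic row has per-word probabilities around 1/519000 ≈ 2e-6, which 3-digit formatting prints as 0.000. A minimal sketch, with the vocabulary size taken from the figures above:

import numpy as np

vocab_size = 519_000                  # dictionary size reported above
topic = np.full(vocab_size, 1.0, dtype=np.float32)
topic /= topic.sum()                  # same per-row normalization show_topics applies
print(topic.sum())                    # ~1.0: the row is a valid distribution
print("%.3f" % topic[0])              # '0.000': every word renders as zero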
Ping @enys, are you able to share a reproducible example? We'd like to get to the bottom of this.
I have this issue (zero probabilities for words in show_topics) only when using gensim.models.LdaMulticore. Output of gensim.models.ldamodel.LdaModel is as expected.
@davidalbertonogueira Same comments as above apply.
I'm sorry, but my current dataset is proprietary. I reckon I could try to create a small example that generates the same error, but I would have to do it with online, publicly available data, and therefore there's no point in doing that myself.
I'll share the dimensions in case it helps someone trying to replicate the error: len(gensim_corpus) = 109,000; len(gensim_dictionary) = 13,989; n_topics = 10.
@davidalbertonogueira that seems different from the issue reported here, which had a huge (500k) vocabulary and lots of topics (1000). In your case, you have only 14k vocab + 10 topics. Likely unrelated, a separate issue.
Should I open a new issue then? @piskvorky
Only if you're able to include the reproducing example :) Otherwise there isn't much we'll be able to do anyway. Thanks.
I have far less experience than the other reporters (i.e. it could be something I'm doing wrong), but I'm seeing the same thing: one or more topics with near-zero probabilities, where the terms are usually alphabetically contiguous. My corpus is derived from the Yelp Dataset Challenge, licensed for academic use. I may be able to share the contents, but I'm unsure; I'll have to read the license closely. However, my corpus is also very small and I'm using a small number of topics (10-100), so again, it could be something naive I'm doing.
My code looks like this. The very low max_df is on purpose, as I was trying to get rid of lots of irrelevant features cheaply. If nothing else looks stupid, I can try to contribute a reproduction.
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim import matutils, models

# `text` is the list of raw documents; `numfeatures` was 100000 in the run below.
vectorizer = TfidfVectorizer(max_df=0.1, max_features=numfeatures,
                             min_df=2, stop_words='english',
                             use_idf=True, ngram_range=(1, 3))
X = vectorizer.fit_transform(text)

# Map feature indices to words for gensim.
id2words = {i: word for i, word in enumerate(vectorizer.get_feature_names())}

# Convert the sklearn document-term matrix (documents in rows) to a gensim corpus.
corpus = matutils.Sparse2Corpus(X, documents_columns=False)

lda = models.ldamodel.LdaModel(corpus=corpus,
                               id2word=id2words,
                               num_topics=10,
                               update_every=1,
                               chunksize=100,
                               passes=5,
                               alpha='auto',
                               per_word_topics=True)
The latest run covers 74,310 documents with 100,000 features.
Then I dump the topics to a text file (among other things) and my "empty" topic looks like this:
Topic: 4
chino : 1e-05
pei wei : 1e-05
taguara : 1e-05
gelato spot : 1e-05
jade : 1e-05
la taguara : 1e-05
wei : 1e-05
place week : 9.999999e-06
place way priced : 9.999999e-06
place welcoming : 9.999999e-06
place welcome : 9.999999e-06
place weird : 9.999999e-06
place weeks : 9.999999e-06
place weekend : 9.999999e-06
place went dinner : 9.999999e-06
Here's the requested version output, sorry, missed that:
macOS-10.16-x86_64-i386-64bit
Python 3.8.6 (default, Nov 11 2020, 13:20:43)
[Clang 12.0.0 (clang-1200.0.32.21)]
NumPy 1.19.4
SciPy 1.6.3
gensim 4.0.1
FAST_VERSION 0
Problem description
A user reported "empty" topics (all probabilities zero), during LdaModel training: https://groups.google.com/forum/#!topic/gensim/LuPD2VSouSQ
Apparently some of the recent optimizations in #1656 (and maybe elsewhere?) introduced numeric instabilities.
Steps/code/corpus to reproduce
Unknown. Probably related to large data size: a large vocabulary in combination with a large number of topics, leading to float32 under/overflows.
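A toy sketch of the suspected failure mode, using the ~519K vocabulary reported above (an illustration of the numerics, not the exact code path):

import numpy as np

# gensim's LDA E-step exponentiates expected log-probabilities; with a
# ~519K-word vocabulary, one word already carries log p ~ -13.2, so sums
# over a handful of words drop below what float32's exp() can represent.
log_p = np.log(1.0 / 519_000)           # ~ -13.16 per word
print(np.exp(np.float32(8 * log_p)))    # 0.0 -- underflows in float32
print(np.exp(np.float64(8 * log_p)))    # ~2e-46 -- still representable in float64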
User reported that changing the dtype back to float64 helped and the "empty topics" problem went away.
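A sketch of that workaround (corpus and dictionary stand in for whatever objects your pipeline builds; LdaModel does accept a dtype argument):

import numpy as np
from gensim.models import LdaModel

# Train with 64-bit floats instead of the float32 default.
lda = LdaModel(corpus=corpus, id2word=dictionary,
               num_topics=1000, dtype=np.float64)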
Versions
Please provide the output of:
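# The same commands whose output appears earlier in this thread; the
# FAST_VERSION line is inferred from the "FAST_VERSION 0" output above.
import platform; print(platform.platform())
import sys; print("Python", sys.version)
import numpy; print("NumPy", numpy.__version__)
import scipy; print("SciPy", scipy.__version__)
import gensim; print("gensim", gensim.__version__)
from gensim.models import word2vec; print("FAST_VERSION", word2vec.FAST_VERSION)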