eltarrk opened this issue 5 years ago
@Kaotic-Kiwi Can you please explain what happened here?
I am facing the exact same issue while using the LdaMallet model. Could you please share the solution?
My function creates multiple models and stores their coherence scores in a list:
import gensim
from gensim.models import CoherenceModel

def compute_coherence_score(dictionary, corpus, texts, limit, start, step):
    """Compute the coherence score for different numbers of topics."""
    coherence_scores, model_list = [], []
    for num_topics in range(start, limit, step):
        # mallet_path is assumed to be defined globally
        model = gensim.models.wrappers.LdaMallet(mallet_path=mallet_path, corpus=corpus, id2word=dictionary, num_topics=num_topics)
        model_list.append(model)
        coherencescore = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_scores.append(coherencescore.get_coherence())
    return model_list, coherence_scores
Function call:
model_list, coherence_scores = compute_coherence_score(dictionary=id2word, texts=data_words, corpus=corpus, limit=100, start=50, step=10)
print(model_list)
print(coherence_scores)
Error message: the resulting coherence_scores list is [nan, nan, nan, nan] (all NaN values), accompanied by these warnings:
/usr/local/lib/python3.6/dist-packages/gensim/topic_coherence/direct_confirmation_measure.py:204: RuntimeWarning: divide by zero encountered in double_scalars
m_lr_i = np.log(numerator / denominator)
/usr/local/lib/python3.6/dist-packages/gensim/topic_coherence/indirect_confirmation_measure.py:323: RuntimeWarning: invalid value encountered in double_scalars
return cv1.T.dot(cv2)[0, 0] / (_magnitude(cv1) * _magnitude(cv2))
@Kaotic-Kiwi @mpenkov Can you please let us know if there is a solution for this?
I'm getting a similar error using the default gensim LDA implementation.
I did notice that certain combinations of the number of topics (in LDA) and topn (in CoherenceModel) let the coherence calculation go through; for example, with 30 topics and topn=2 the calculation completes.
Any thoughts? Perhaps this is a numerical stability issue?
PS: interestingly, with the other window-based methods 'c_uci' and 'c_npmi' I get inf instead of nan.
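For reference, a minimal sketch of the combination mentioned above (corpus, dictionary and texts are assumed to be already built; this is only the setting that happened to complete, not a fix):

import gensim
from gensim.models import CoherenceModel

lda = gensim.models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=30)
# topn controls how many top words per topic enter the coherence calculation;
# with topn=2 the c_v computation went through, as described above.
cm = CoherenceModel(model=lda, texts=texts, dictionary=dictionary, coherence='c_v', topn=2)
print(cm.get_coherence())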
I am getting exactly the same error as @Kaotic-Kiwi when I try to calculate the coherence with c_v. Could somebody please help us or reopen the issue?
I'm wondering whether this comes from EPSILON being added to the numerator rather than to the denominator, at lines 202-203 in topic_coherence/direct_confirmation_measure.py:

numerator = (co_occur_count / num_docs) + EPSILON
denominator = (w_prime_count / num_docs) * (w_star_count / num_docs)
m_lr_i = np.log(numerator / denominator)
Adding +EPSILON to the denominator removes the warning and the NaN coherence result for me.
[EDIT]: it does remove the first warning, but not the second (RuntimeWarning: invalid value encountered in double_scalars) - I'll look into this.
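As a rough, untested sketch of that change (a standalone rewrite of the computation, not the actual gensim patch; the EPSILON value here is a placeholder):

import numpy as np

EPSILON = 1e-12  # placeholder; gensim defines its own EPSILON constant

def log_ratio(co_occur_count, w_prime_count, w_star_count, num_docs):
    # Same computation as in direct_confirmation_measure.py, but with EPSILON
    # also added to the denominator so a zero marginal probability no longer
    # triggers the divide-by-zero warning.
    numerator = (co_occur_count / num_docs) + EPSILON
    denominator = (w_prime_count / num_docs) * (w_star_count / num_docs) + EPSILON
    return np.log(numerator / denominator)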
@HaukeT @aschoenauer-sebag the topic_coherence package is a contributed module and its quality may be iffy.
If you're able to fix the issue and open a clean, clear PR, that'd be great.
Hello, I wrote different code and it seems to do the job for me. Apparently, using the corpus parameter instead of the dictionary parameter doesn't create any errors. I think coherence='c_v' doesn't like being called with the dictionary parameter; I don't quite understand why.
import gensim

def LdaPipeline(train_set, test_set, k):
    dictionary = gensim.corpora.Dictionary(train_set)
    corpus_train = [dictionary.doc2bow(doc) for doc in train_set]
    corpus_test = [dictionary.doc2bow(doc) for doc in test_set]
    # LDA
    lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus_train, id2word=dictionary, num_topics=k, passes=30, alpha='auto')
    # Perplexity
    perplexity = lda_model.log_perplexity(corpus_test)
    # Coherence
    coherence_model = gensim.models.coherencemodel.CoherenceModel(model=lda_model, corpus=corpus_train, texts=train_set, coherence='c_v')
    coherence = coherence_model.get_coherence()
    return [perplexity, coherence]
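Hypothetical usage, assuming train_docs and test_docs are lists of tokenized documents:

perplexity, coherence = LdaPipeline(train_docs, test_docs, k=20)
print(perplexity, coherence)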
Thank you for reopening the issue and for the replies. The workaround from @Kaotic-Kiwi to use only the corpus parameter and avoid the dictionary parameter did not work for my data. I will try to find an error in my data.
In my case, this error happens when I try to pass my prior eta to the model. My eta is a numpy.ndarray with shape (num_topics, num_terms). I initialize eta with the value 1/(num_topics) and transfer some prior values to the top-n rows.
For example, with 3 topics where the first row is my prior:
[[18, 63, 52, 5, 0, 145], [1/3, 1/3, 1/3, 1/3, 1/3, 1/3], [1/3, 1/3, 1/3, 1/3, 1/3, 1/3]]
The more prior rows I transfer, the more often the topic coherence calculation returns nan (e.g. 30 topics, transferring 10).
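A minimal sketch of how such an eta could be built and passed (corpus and dictionary are assumed to exist; the counts in the first row are just the example prior from above, and num_terms must match the dictionary size):

import numpy as np
import gensim

num_topics, num_terms = 3, 6
# Uniform prior of 1/num_topics everywhere...
eta = np.full((num_topics, num_terms), 1.0 / num_topics)
# ...then overwrite the first row with the prior counts for topic 0.
eta[0] = [18, 63, 52, 5, 0, 145]

lda = gensim.models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics, eta=eta)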
It does not work for me either. I am using LdaMallet, and using corpus instead of the dictionary parameter as per @Kaotic-Kiwi's advice did not solve the issue, unfortunately.
I get this error when switching to the corpus parameter:
text_analysis.py in _ids_to_words(ids, dictionary)
55
56 """
---> 57 if not dictionary.id2token: # may not be initialized in the standard gensim.corpora.Dictionary
58 setattr(dictionary, 'id2token', {v: k for k, v in dictionary.token2id.items()})
59
AttributeError: 'dict' object has no attribute 'id2token'
Using u_mass solves the issue, although it is a different metric.
coherencemodel = CoherenceModel(model=model, texts=docs, corpus=corpus, coherence='u_mass')
@kdubovikov What is the full traceback?
I wonder if the dictionary in the code you show is allowed to be a plain dict, or whether it must be a gensim.corpora.Dictionary.
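If it does have to be a gensim Dictionary, here is a rough sketch of building one instead of passing a plain dict (docs is assumed to be the tokenized texts and model the trained topic model):

from gensim.corpora import Dictionary
from gensim.models import CoherenceModel

# CoherenceModel's text analysis expects a gensim.corpora.Dictionary
# (with token2id/id2token), not a plain Python dict.
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]
cm = CoherenceModel(model=model, texts=docs, dictionary=dictionary, coherence='c_v')
print(cm.get_coherence())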
Has anyone found a solution to this problem? I'm still in the dark here. I'm using gensim version 3.8.3. When calculating the coherence value over the training data, everything works fine. When calculating the coherence value over the test data, it returns nan for about 50% of the topics, while the other topics are calculated properly.
In my case, the error was caused by certain topic words not appearing in the test dataset. There was no error after removing those words from the topic words.
Wouldn't that create unrepresentative coherence scores? @RayLei
In my case, the error was caused by several empty documents in the texts dataset (the texts parameter). After cleaning up the texts dataset and rebuilding the coherence model, get_coherence() finally returned a coherence score.
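A minimal sketch of that cleanup (assuming texts is a list of token lists and model/dictionary already exist):

from gensim.models import CoherenceModel

# Drop empty documents before building the coherence model; null entries in
# texts were what produced the NaN coherence in my case.
texts = [doc for doc in texts if doc]
cm = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
print(cm.get_coherence())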
Problem description
For my internship, I'm trying to evaluate the quality of different LDA models using both perplexity and coherence. I used a loop and generated each model. However, I encounter a problem when the loop reaches parameter_value=15: the coherence score is stored as nan for the models with 15, 20, 25 and 30 topics. I tried fixing this by changing the parameters in LdaModel(), but it only makes the warning appear for later models; instead of getting the warning at parameter_value=15, I get it at parameter_value=30.
Can someone please help me?
Problem encountered: warning
Steps/code
Versions