piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

Coherence RuntimeWarnings: divide by zero encountered in double_scalars AND invalid value encountered in double_scalars #2463

Open eltarrk opened 5 years ago

eltarrk commented 5 years ago

Problem description

For my internship, I'm trying to evaluate the quality of different LDA models using both perplexity and coherence. I used a loop to generate each model. However, I run into a problem when the loop reaches parameter_value=15: the coherence score is stored as nan for the models with 15, 20, 25 and 30 topics. I tried fixing this by changing the parameters of .LdaModel(), but that only pushes the warning to later models. Instead of getting the warning for parameter_value=15, I get it for parameter_value=30.

Can someone please help me?

Problem encountered: warning

starting pass for parameter_value = 30.000
Elapsed time: 1.6870347789972584
Perplexity score: -13.63168019880968
C:\Users\straw\Anaconda3\lib\site-packages\gensim\topic_coherence\direct_confirmation_measure.py:204: RuntimeWarning: divide by zero encountered in double_scalars
  m_lr_i = np.log(numerator / denominator)
C:\Users\straw\Anaconda3\lib\site-packages\gensim\topic_coherence\indirect_confirmation_measure.py:323: RuntimeWarning: invalid value encountered in double_scalars
  return cv1.T.dot(cv2)[0, 0] / (_magnitude(cv1) * _magnitude(cv2))
Coherence Score: nan

Steps/code


import timeit
from collections import defaultdict

import gensim

grid_flt = defaultdict(list)

# num topics
parameter_list = [2, 5, 10, 15, 20, 25, 30]

for parameter_value in parameter_list:
    print("starting pass for parameter_value = %.3f" % parameter_value)
    start_time = timeit.default_timer()
    # run model
    ldamodel_train_flt = gensim.models.ldamodel.LdaModel(
        corpus=doc_term_matrix_train_flt,
        id2word=dictionary_train_flt,
        num_topics=parameter_value,
        passes=25,
        per_word_topics=True,
    )

    # show elapsed time for model
    elapsed = timeit.default_timer() - start_time
    print("Elapsed time: %s" % elapsed)

    # Compute perplexity
    perplex = ldamodel_train_flt.log_perplexity(doc_term_matrix_test_flt)
    print("Perplexity score: %s" % perplex)
    grid_flt[parameter_value].append(perplex)

    # Compute Coherence Score
    coherence_model_lda = gensim.models.coherencemodel.CoherenceModel(
        model=ldamodel_train_flt,
        texts=list_of_docs_flt_test,
        dictionary=dictionary_train_flt,
        coherence='c_v',
    )
    coherence_lda = coherence_model_lda.get_coherence()
    print("Coherence Score: %s" % coherence_lda)
    grid_flt[parameter_value].append(coherence_lda)

Versions

Windows-10-10.0.17134-SP0
Python 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)]
NumPy 1.16.2
SciPy 1.1.0
gensim 3.7.2
FAST_VERSION -1
mpenkov commented 5 years ago

@Kaotic-Kiwi Can you please explain what happened here?

harshshah-work commented 4 years ago

I am facing the exact same issue while using LdaMallet. Could you please provide a solution?

My function, which creates multiple models and stores their coherence values in a list:

import gensim
from gensim.models import CoherenceModel

def compute_coherence_score(dictionary, corpus, texts, limit, start, step):
    """Compute coherence scores for different numbers of topics."""
    coherence_scores, model_list = [], []
    for num_topics in range(start, limit, step):
        # mallet_path and id2word are defined elsewhere in the notebook
        model = gensim.models.wrappers.LdaMallet(mallet_path=mallet_path, corpus=corpus,
                                                 id2word=id2word, num_topics=num_topics)
        model_list.append(model)
        coherencescore = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_scores.append(coherencescore.get_coherence())

    return model_list, coherence_scores

Function Call

model_list, coherence_scores = compute_coherence_score(dictionary=id2word, texts=data_words, corpus=corpus, limit=100, start=50, step=10)
print(model_list)
print(coherence_scores)

Error message: the resulting coherence_scores is [nan, nan, nan, nan] (all NaN values):

/usr/local/lib/python3.6/dist-packages/gensim/topic_coherence/direct_confirmation_measure.py:204: RuntimeWarning: divide by zero encountered in double_scalars
  m_lr_i = np.log(numerator / denominator)
/usr/local/lib/python3.6/dist-packages/gensim/topic_coherence/indirect_confirmation_measure.py:323: RuntimeWarning: invalid value encountered in double_scalars
  return cv1.T.dot(cv2)[0, 0] / (_magnitude(cv1) * _magnitude(cv2))

@Kaotic-Kiwi @mpenkov Can you please let us know if there is a solution for this?

NickRothbacher commented 4 years ago

I'm getting a similar error using the default gensim LDA implementation.

I did notice that certain combinations of the number of topics (in LDA) and topn (in CoherenceModel) let the coherence calculation go through; for example, with 30 topics and topn=2 I make it through the calculation.

Any thoughts? Perhaps this is a numerical stability issue?

PS: interestingly, with the other window-based methods 'c_uci' and 'c_npmi' I get inf instead of nan.
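
For reference, a minimal sketch of the topn workaround described above (topn is an actual CoherenceModel parameter, default 20; lda_model, texts and dictionary stand in for your own objects):

import gensim

# Restricting the coherence pipeline to the 2 strongest words per topic
# was what let the calculation go through in the comment above.
cm = gensim.models.coherencemodel.CoherenceModel(
    model=lda_model, texts=texts, dictionary=dictionary,
    coherence='c_v', topn=2)
print(cm.get_coherence())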

ContainerEnjoyer commented 4 years ago

I am getting exactly the same error as @Kaotic-Kiwi when I try to calculate the coherence with c_v. Could somebody please help us or reopen the issue?

aschoenauer-sebag commented 4 years ago

I'm wondering if this comes from adding EPSILON to the numerator rather than the denominator at ll. 202-203 of topic_coherence/direct_confirmation_measure.py:

numerator = (co_occur_count / num_docs) + EPSILON
denominator = (w_prime_count / num_docs) * (w_star_count / num_docs)
m_lr_i = np.log(numerator / denominator)

Adding +EPSILON to the denominator removes the warning and the NaN coherence result for me. [EDIT]: it removes the first warning, but not the second (RuntimeWarning: invalid value encountered in double_scalars); I'll look into this.
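
For illustration, a minimal sketch of that proposed change (a simplified stand-in for gensim's function, not the actual patch):

import numpy as np

EPSILON = 1e-12  # small constant, as used in gensim's direct_confirmation_measure

def log_ratio_measure(co_occur_count, w_prime_count, w_star_count, num_docs):
    # If a topic word never occurs in the reference texts, w_prime_count or
    # w_star_count is 0, so the division by a zero denominator raises the
    # divide-by-zero warning and yields inf, which later turns into NaN in
    # the indirect confirmation measure. Adding EPSILON keeps it positive.
    numerator = (co_occur_count / num_docs) + EPSILON
    denominator = (w_prime_count / num_docs) * (w_star_count / num_docs) + EPSILON
    return np.log(numerator / denominator)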

piskvorky commented 4 years ago

@HaukeT @aschoenauer-sebag the topic_coherence package is a contributed module and its quality may be iffy.

If you're able to fix the issue and open a clean, clear PR, that'd be great.

eltarrk commented 4 years ago

Hello, I wrote different code and it seems to do the job for me. Apparently, using the corpus parameter instead of the dictionary parameter doesn't produce any errors. I think coherence='c_v' doesn't like being called with the dictionary parameter; I don't quite understand why.

import gensim

def LdaPipeline(train_set, test_set, k):
    dictionary = gensim.corpora.Dictionary(train_set)
    corpus_train = [dictionary.doc2bow(doc) for doc in train_set]
    corpus_test = [dictionary.doc2bow(doc) for doc in test_set]
    # LDA
    lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus_train, id2word=dictionary,
                                                num_topics=k, passes=30, alpha='auto')
    # Perplexity
    perplexity = lda_model.log_perplexity(corpus_test)
    # Coherence
    coherence_model = gensim.models.coherencemodel.CoherenceModel(model=lda_model, corpus=corpus_train,
                                                                  texts=train_set, coherence='c_v')
    coherence = coherence_model.get_coherence()
    return [perplexity, coherence]
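
Example call, with hypothetical data (train_docs and test_docs being lists of tokenized documents):

perplexity, coherence = LdaPipeline(train_docs, test_docs, k=15)
print(perplexity, coherence)
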
ContainerEnjoyer commented 4 years ago

Thank you for reopening the issue and for the replies. The workaround from @Kaotic-Kiwi to only use the corpus parameter and avoid the dictionary parameter did not work for my data. I will try to find an error in my data.

rocknamx8 commented 3 years ago

In my case, this error happens when I pass my own prior eta to the model. My eta is a numpy.ndarray with shape (num_topics, num_terms). I initialize eta with the value 1/(num_topics) and then transfer some prior counts into the top-n rows, e.g. with 3 topics and my prior in the first row:

[[18, 63, 52, 5, 0, 145],
 [1/3, 1/3, 1/3, 1/3, 1/3, 1/3],
 [1/3, 1/3, 1/3, 1/3, 1/3, 1/3]]

The more prior rows I transfer, the more often the topic coherence comes out as nan in the calculation (e.g. 30 topics, 10 rows transferred).
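
A minimal sketch reconstructing that setup (corpus and dictionary are assumed to exist already); note the 0 entry in the prior row, which may be what ends up feeding zero counts into the coherence ratios:

import numpy as np
import gensim

num_topics, num_terms = 3, 6
eta = np.full((num_topics, num_terms), 1.0 / num_topics)  # uniform 1/3 baseline
eta[0] = [18, 63, 52, 5, 0, 145]  # first row holds the custom prior

lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary,
                                      num_topics=num_topics, eta=eta)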

kdubovikov commented 3 years ago

Does not work for me either. I am using LdaMallet, and using the corpus parameter instead of the dictionary parameter as per @Kaotic-Kiwi's advice did not help solve the issue, unfortunately.

I get this error when switching to the corpus parameter:

text_analysis.py in _ids_to_words(ids, dictionary)
     55 
     56     """
---> 57     if not dictionary.id2token:  # may not be initialized in the standard gensim.corpora.Dictionary
     58         setattr(dictionary, 'id2token', {v: k for k, v in dictionary.token2id.items()})
     59 

AttributeError: 'dict' object has no attribute 'id2token'

Using u_mass solves the issue, although this is a different metric.

coherencemodel = CoherenceModel(model=model, texts=docs, corpus=corpus, coherence='u_mass')
piskvorky commented 3 years ago

@kdubovikov What is the full traceback?

I wonder if the dictionary in the code you show is allowed to be a plain dict, or whether it must be a gensim.corpora.Dictionary.
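
If it must be a gensim.corpora.Dictionary, a minimal sketch of the fix would look like this (an assumption based on the traceback, not a confirmed solution; docs and model stand in for your own objects):

from gensim.corpora import Dictionary
from gensim.models import CoherenceModel

dictionary = Dictionary(docs)  # docs: list of tokenized documents; has token2id/id2token
cm = CoherenceModel(model=model, texts=docs, dictionary=dictionary, coherence='c_v')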

job-almekinders commented 3 years ago

Has anyone found a solution to this problem? I'm still in the dark here, using gensim version 3.8.3. When calculating the coherence value over the training data it all works fine, but over the test data it gives nan as output for about 50% of the topics, while the other topics are calculated properly.

RayLei commented 3 years ago

In my case, the error is caused by certain topic words not appearing in the test dataset. There is no error after removing those words from the topic words.
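
A hypothetical diagnostic along those lines (lda_model and test_texts stand in for your own objects): list the top topic words that never occur in the test texts.

test_vocab = {token for doc in test_texts for token in doc}
for topic_id in range(lda_model.num_topics):
    # show_topic returns (word, probability) pairs for the topic's top words
    missing = [word for word, _ in lda_model.show_topic(topic_id, topn=20)
               if word not in test_vocab]
    if missing:
        print(topic_id, missing)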

job-almekinders commented 3 years ago

Wouldn't that create unrepresentative coherence scores? @RayLei

ekopermonojati commented 2 years ago

In my case, the error was caused by several null text documents (the texts parameter). So I cleaned up the texts dataset and rebuilt the coherence model, and get_coherence() finally returned a coherence score.
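
A minimal sketch of that cleanup, assuming texts is a list of token lists that may contain None or empty entries:

texts = [doc for doc in texts if doc]  # drop None and empty documents
coherence_model = gensim.models.coherencemodel.CoherenceModel(
    model=lda_model, texts=texts, dictionary=dictionary, coherence='c_v')
print(coherence_model.get_coherence())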