piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

Dictionary Error #2507

Closed: perambulate closed this issue 5 years ago

perambulate commented 5 years ago

Problem description

I'm getting an error that didn't occur before. I want to compute the coherence score for each topic I've gathered from several LDA models.

Steps/code/corpus to reproduce

import pandas as pd
from gensim.corpora import Dictionary

data = pd.read_csv("hasil_praproses_2.csv")
texts = [doc.split() for doc in data['finals'].values.tolist()]

dictionary = Dictionary(texts)
dictionary.filter_extremes(no_below=5, no_above=0.9)
dictionary.compactify()
dictionary.save("dictionary_unvised.dict")
dictionary = Dictionary.load("dictionary_unvised.dict")
corpus = [dictionary.doc2bow(text) for text in texts]

from gensim.models.ldamodel import LdaModel
from gensim.models.coherencemodel import CoherenceModel

import logging
logging.basicConfig(format='%(levelname)s : %(message)s', level=logging.INFO)
logging.root.level = logging.INFO

passes = 20
iterations = 800
n_topics = [5,6,7]

topics_list = []  # top words of every topic, grouped per model

for n in n_topics:
    lda = LdaModel(corpus,
                   num_topics=n,
                   id2word=dictionary,
                   passes=passes,
                   iterations=iterations,
                   random_state=7,
                   alpha='auto',
                   eta='auto')

    topic_bow = lda.show_topics(num_topics=n, num_words = 20, formatted=False)
    top = [[w[0] for w in topic] for num, topic in topic_bow]
    topics_list.append(top)

flat_topics = [topic for topics in topics_list for topic in topics]

cm = CoherenceModel(topics=flat_topics[0], corpus=corpus, dictionary=dictionary, coherence='u_mass')
cm.get_coherence()

An example of one of the topics:

flat_topics[0]
['traffic', 'improve', 'good', 'work', 'growth', 'mopac', 'infrastructure', 'downtown', 'issue', 'public_transportation', 'affordable_housing', 'problem', 'traffic_flow', 'plan', 'traffic_congestion', 'planning', 'fix_traffic', 'transportation', 'major', 'business']

And I keep getting this error:

KeyError                                  Traceback (most recent call last)
~/anaconda3/lib/python3.6/site-packages/gensim/models/coherencemodel.py in _ensure_elements_are_ids(self, topic)
    342         try:
--> 343             return np.array([self.dictionary.token2id[token] for token in topic])
    344         except KeyError:  # might be a list of token ids already, but let's verify all in dict

~/anaconda3/lib/python3.6/site-packages/gensim/models/coherencemodel.py in <listcomp>(.0)
    342         try:
--> 343             return np.array([self.dictionary.token2id[token] for token in topic])
    344         except KeyError:  # might be a list of token ids already, but let's verify all in dict

KeyError: 't'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-30-8aee651e4ed2> in <module>()
      2                       corpus=corpus,
      3                       dictionary=dictionary,
----> 4                       coherence='u_mass')
      5 coba.get_coherence()

~/anaconda3/lib/python3.6/site-packages/gensim/models/coherencemodel.py in __init__(self, model, topics, texts, corpus, dictionary, window_size, keyed_vectors, coherence, topn, processes)
    215         self._accumulator = None
    216         self._topics = None
--> 217         self.topics = topics
    218 
    219         self.processes = processes if processes >= 1 else max(1, mp.cpu_count() - 1)

~/anaconda3/lib/python3.6/site-packages/gensim/models/coherencemodel.py in topics(self, topics)
    323             new_topics = []
    324             for topic in topics:
--> 325                 topic_token_ids = self._ensure_elements_are_ids(topic)
    326                 new_topics.append(topic_token_ids)
    327 

~/anaconda3/lib/python3.6/site-packages/gensim/models/coherencemodel.py in _ensure_elements_are_ids(self, topic)
    343             return np.array([self.dictionary.token2id[token] for token in topic])
    344         except KeyError:  # might be a list of token ids already, but let's verify all in dict
--> 345             topic = [self.dictionary.id2token[_id] for _id in topic]
    346             return np.array([self.dictionary.token2id[token] for token in topic])
    347 

~/anaconda3/lib/python3.6/site-packages/gensim/models/coherencemodel.py in <listcomp>(.0)
    343             return np.array([self.dictionary.token2id[token] for token in topic])
    344         except KeyError:  # might be a list of token ids already, but let's verify all in dict
--> 345             topic = [self.dictionary.id2token[_id] for _id in topic]
    346             return np.array([self.dictionary.token2id[token] for token in topic])
    347 

KeyError: 't'

Versions

Please provide the output of:

Linux-4.15.0-1032-gcp-x86_64-with-debian-stretch-sid
Python 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56) 
[GCC 7.2.0]
NumPy 1.14.3
SciPy 1.1.0
gensim 3.4.0
FAST_VERSION 1

timpowellgit commented 5 years ago

Shouldn't this call receive flat_topics (all of the topics) rather than only the first topic?

cm = CoherenceModel(topics=flat_topics[0], corpus=corpus, dictionary=dictionary, coherence='u_mass')

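Otherwise, to score just that one topic, try wrapping it in a list: CoherenceModel(topics=[flat_topics[0]], corpus=corpus, dictionary=dictionary, coherence='u_mass')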
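
For what it's worth, the KeyError: 't' fits this explanation: the traceback shows each element of topics being iterated token by token, so when topics is a single list of words, each word is treated as a topic and its characters are looked up as tokens ('t' being the first character of 'traffic'). Below is a minimal sketch of that iteration pattern. It is a simplified stand-in, not gensim's actual code; tokens_to_ids and the toy token2id mapping are hypothetical names used only for illustration.

```python
# Simplified stand-in for the token lookup seen in the traceback.
# This is NOT gensim's code, only the same nested iteration pattern.
def tokens_to_ids(topics, token2id):
    """Map each topic (an iterable of tokens) to a list of token ids."""
    return [[token2id[token] for token in topic] for topic in topics]


# Hypothetical toy vocabulary standing in for dictionary.token2id.
token2id = {"traffic": 0, "improve": 1, "good": 2}

# Plays the role of flat_topics[0]: one topic, i.e. a flat list of words.
single_topic = ["traffic", "improve", "good"]

# Passing the single topic directly: every word is treated as a topic,
# so its characters are looked up as tokens and the first lookup fails.
try:
    tokens_to_ids(single_topic, token2id)
    failed_key = None
except KeyError as err:
    failed_key = err.args[0]

# Wrapping the topic in a list gives the expected list-of-topics shape.
wrapped_ids = tokens_to_ids([single_topic], token2id)

print(failed_key)   # t
print(wrapped_ids)  # [[0, 1, 2]]
```

If that is what is happening here, passing topics=flat_topics (or topics=[flat_topics[0]] for a single topic) should avoid the character-level lookup.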

mpenkov commented 5 years ago

@desthalia Could you please use a public corpus to reproduce the problem? Without access to your data, we cannot reproduce your problem on our side.

mpenkov commented 5 years ago

Closing due to inactivity.