Closed by adamdavidconn 4 years ago
Hello,
Yes, we use the coherence score from Eq. 1 in the paper "Optimizing Semantic Coherence in Topic Models."
For the original paper, I wrote up a separate function for calculating topic coherence (which, I believe, wasn't very efficient), but we never incorporated it into the main code here since we usually use the total correlation, .tcs, to measure topic quality.
If you'd like, you could put together a pull request adding a topic coherence function so that it can be computed directly from a CorEx model.
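Until then, here's a rough sketch of what such a function could look like (this is not the code I used for the paper; doc_word and top_word_idx are just placeholder names for a binary document-term matrix and the column indices of one topic's top words, ranked by importance):

import numpy as np

def eq1_coherence(doc_word, top_word_idx):
    """Sketch of the per-topic coherence from Eq. 1 of Mimno et al. (2011)."""
    # doc_word: binary (n_docs, n_words) array, 1 where the word occurs in the document
    # top_word_idx: column indices of one topic's top-M words, highest ranked first
    score = 0.0
    for m in range(1, len(top_word_idx)):
        for l in range(m):
            # D(v_m, v_l): docs containing both words; D(v_l): docs containing v_l
            co_docs = np.sum(doc_word[:, top_word_idx[m]] * doc_word[:, top_word_idx[l]])
            docs_l = np.sum(doc_word[:, top_word_idx[l]])  # nonzero as long as the word appears at all
            score += np.log((co_docs + 1) / docs_l)
    return score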
What is an acceptable tcs score/value?
I am getting a coherence score using Gensim's CoherenceModel with the following code. It assumes you have documents as your list of documents, each document being a list of tokens, and a trained CorEx model called corex_model. CoherenceModel has options for several of the coherence measures from the paper "Exploring the Space of Topic Coherence Measures"; hope this is helpful!
from gensim.models.coherencemodel import CoherenceModel
from gensim import corpora
from tqdm import tqdm

# Creating the term dictionary, where every unique term is assigned an index
dictionary = corpora.Dictionary(documents)
# Creating corpus using dictionary prepared above
corpus = [dictionary.doc2bow(doc) for doc in tqdm(documents)]
# Get top words for each topic from the trained corex model
topics = corex_model.get_topics(n_words=100)
corex_topic_words = [[word for word, tc in topic] for topic in topics]
# Get coherence score
cm_corex = CoherenceModel(topics=corex_topic_words, texts=documents, corpus=corpus, dictionary=dictionary, coherence='c_v')
cm_corex.get_coherence()
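If you want one of the other coherence measures from that paper, only the coherence argument changes. As a quick sketch using the same topics, corpus, and dictionary as above (note that 'u_mass' is computed from the corpus and dictionary rather than the raw texts):

# Same inputs as above; only the measure changes
cm_umass = CoherenceModel(topics=corex_topic_words, corpus=corpus, dictionary=dictionary, coherence='u_mass')
cm_npmi = CoherenceModel(topics=corex_topic_words, texts=documents, dictionary=dictionary, coherence='c_npmi')
print(cm_umass.get_coherence(), cm_npmi.get_coherence())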
@kandloic Apologies for the late reply, it's been a hectic few months. We usually interpret the TCs relative to one another rather than as absolute scores. Topics with a higher TC will "explain" more about the collection of documents. If you keep adding topics and the overall TC (the sum of all the per-topic TCs) isn't increasing by much, then the topics you already have probably explain most of the word relationships in your documents.
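In code, that heuristic could look something like the sketch below (doc_word and words are placeholders for your own document-term matrix and vocabulary); you'd stop adding topics roughly where the total TC levels off:

from corextopic import corextopic as ct

# doc_word: binary document-term matrix; words: the matching vocabulary (placeholders)
for n in [5, 10, 20, 40, 80]:
    model = ct.Corex(n_hidden=n, seed=42)
    model.fit(doc_word, words=words)
    # model.tc is the total correlation explained, i.e. the sum of the per-topic TCs in model.tcs
    print(f"n_hidden={n}: total TC = {model.tc:.2f}")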
Thanks @felixcs1 for providing that example of calculating coherence via Gensim! We'll probably point people to that for now rather than implementing it within CorEx directly.
The answer above no longer works as written. Here's a solution that isn't in a fully general form, but it can be adapted.
from corextopic import corextopic as ct
from sklearn.feature_extraction.text import CountVectorizer
def get_corex_model(list_of_str, text_prep, num_topics=10):
    """
    Train a CorEx topic model on a list of strings.

    Args:
    - list_of_str (list): List of strings to train the model on.
    - text_prep (function): Text preprocessing function.
    - num_topics (int): Number of topics to extract (default is 10).

    Returns:
    - topic_model: Trained CorEx topic model.
    - vectorizer: Fitted CountVectorizer used for text preprocessing.
    - corex_topic_words (list): List of lists containing top words for each topic.
    - preprocessed_tweets (list): List of preprocessed tweet texts.
    """
    # Keep only English language tweets
    tweet_texts_filtered = filter_lang(list_of_str)

    # Preprocessing step
    preprocessed_tweets = list(set([text_prep(text) for text in tweet_texts_filtered]))

    # Convert preprocessed text to a document-term matrix
    vectorizer = CountVectorizer(max_features=10000, max_df=0.95, min_df=2, stop_words='english')
    doc_term_matrix = vectorizer.fit_transform(preprocessed_tweets)

    # Train the CorEx model
    topic_model = ct.Corex(n_hidden=num_topics, seed=42)
    topic_model.fit(doc_term_matrix)

    # Get and print the topics
    topics = topic_model.get_topics()
    feature_names = vectorizer.get_feature_names_out()  # Get feature names from vectorizer
    corex_topic_words = []
    for i, topic in enumerate(topics):
        words, _, _ = zip(*topic)
        topic_words = [feature_names[idx] for idx in words]  # Convert token IDs to words
        corex_topic_words.append(topic_words)
        print(f"Topic {i+1}: {' '.join(topic_words)}")

    return topic_model, vectorizer, corex_topic_words, preprocessed_tweets
from gensim.models.coherencemodel import CoherenceModel
from gensim import corpora
# get_replaced_tweets() returns a list of strings
# preprocess_text() returns a list of str
# filter_lang() returns a list of strings
def get_coherence_score(topic_model, vectorizer, corex_topic_words, preprocessed_tweets):
    """
    Calculate coherence score for a CorEx topic model.

    Args:
    - topic_model: Trained CorEx topic model.
    - vectorizer: Fitted CountVectorizer used for text preprocessing.
    - corex_topic_words (list): List of lists containing top words for each topic.
    - preprocessed_tweets (list): List of preprocessed tweet texts.

    Returns:
    - coherence_score: Coherence score for the given topic model.
    """
    # Tokenize the text
    preprocessed_tweets_tokens = [text.split() for text in preprocessed_tweets]

    # Creating the term dictionary, where every unique term is assigned an index
    dictionary = corpora.Dictionary(preprocessed_tweets_tokens)

    # Creating corpus using dictionary prepared above
    corpus = [dictionary.doc2bow(doc) for doc in preprocessed_tweets_tokens]

    # Get coherence score
    cm_corex = CoherenceModel(topics=corex_topic_words, texts=preprocessed_tweets_tokens, corpus=corpus, dictionary=dictionary, coherence='c_v')
    return cm_corex.get_coherence()
get_coherence_score(*get_corex_model(get_replaced_tweets(), preprocess_text_m2))
Hi,
Thank you for the great package.
I noticed in your paper that you measure the coherence scores of the CorEx outputs (https://www.aclweb.org/anthology/Q17-1037.pdf).
However, I do not see a method in the class to output the coherence values. Could you point me in the right direction?
Thanks in advance!
Adam