gregversteeg / corex_topic

Hierarchical unsupervised and semi-supervised topic models for sparse count data with CorEx
Apache License 2.0

Coherence Scores #36

Closed: adamdavidconn closed this issue 4 years ago

adamdavidconn commented 4 years ago

Hi,

Thank you for the great package.

I noticed in your paper that you measure the coherence scores of CorEx outputs (https://www.aclweb.org/anthology/Q17-1037.pdf).

However, in the class I do not see a method to output the coherence values. Could you point me in the right direction?

Thanks in advance!

Adam

ryanjgallagher commented 4 years ago

Hello,

Yes, we use the coherence score from Eq. 1 in the paper "Optimizing Semantic Coherence in Topic Models."

For the original paper, I wrote a separate function for calculating topic coherence (which I believe wasn't that efficient), but we never incorporated it into the main code here, since we usually use the total correlation (.tcs) to measure topic quality.

If you'd like, you could put together a pull request for adding a topic coherence function to the code so that it could be done with the CorEx model.
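For reference, Eq. 1 of that paper scores an ordered list of top words v_1, ..., v_M by summing log((D(v_m, v_l) + 1) / D(v_l)) over all pairs with l < m, where D(...) counts the documents containing the given word(s). The following is a minimal pure-Python sketch of that formula, not the function used for the original paper. The name umass_coherence and the input format (documents as lists of tokens) are assumptions made for illustration.

```python
import math

def umass_coherence(topic_words, documents):
    """Coherence of one topic: sum over word pairs of
    log((co-document frequency + 1) / document frequency of the earlier word).

    topic_words: top words of the topic, ordered from most to least relevant.
    documents:   list of documents, each a list of tokens (assumed format).
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain every word in `words`
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    score = 0.0
    for m in range(1, len(topic_words)):
        for l in range(m):
            denom = doc_freq(topic_words[l])
            if denom == 0:
                raise ValueError(f"{topic_words[l]!r} does not occur in any document")
            score += math.log((doc_freq(topic_words[m], topic_words[l]) + 1) / denom)
    return score
```

Higher (less negative) values mean the top words co-occur in documents more often. The nested loop is quadratic in the number of top words and rescans the corpus for every pair, so this is only suitable for small experiments.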

kandloic commented 4 years ago

What is an acceptable tcs score/value?

felixcs1 commented 4 years ago

I am getting coherence scores using Gensim's CoherenceModel with the following code. It assumes you have documents, a list of documents where each document is a list of tokens, and a trained CorEx model called corex_model. CoherenceModel offers several coherence measures from the paper "Exploring the Space of Topic Coherence Measures". Hope this is helpful!

from gensim.models.coherencemodel import CoherenceModel
from gensim import corpora
from tqdm import tqdm  # progress bar used when building the corpus below

# Creating the term dictionary, where every unique term is assigned an index
dictionary = corpora.Dictionary(documents)

# Creating corpus using dictionary prepared above
corpus = [dictionary.doc2bow(doc) for doc in tqdm(documents)]

# Get top words for each topic from the trained corex model
topics = corex_model.get_topics(n_words=100)
corex_topic_words = [[word for word, tc in topic] for topic in topics]

# Get coherence score
cm_corex = CoherenceModel(topics=corex_topic_words, texts=documents, corpus=corpus, dictionary=dictionary, coherence='c_v')
cm_corex.get_coherence()

ryanjgallagher commented 4 years ago

@kandloic Apologies for the late reply, it's been a hectic few months. We usually interpret the TCs relative to one another rather than as absolute scores. Topics with higher TC "explain" more about the collection of documents. If you keep adding topics and the overall TC (the sum of all the TCs) isn't increasing much, then the topics you already have probably explain most of the word relations in your documents.
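One way to act on that heuristic is to fit models with an increasing number of topics, record each model's overall TC (the tc attribute in corextopic, i.e. the sum of tcs), and stop once the gain from adding topics becomes small. The sketch below only covers the comparison step. The function name pick_num_topics and the 5% threshold are assumptions, and the TC values in the usage example are made up purely to show the mechanics.

```python
def pick_num_topics(total_tc_by_k, min_relative_gain=0.05):
    """Return the largest topic count reached before overall TC plateaus.

    total_tc_by_k:     dict mapping number of topics -> overall TC of that model.
    min_relative_gain: stop when TC grows by less than this fraction
                       (an arbitrary illustrative default).
    """
    ks = sorted(total_tc_by_k)
    best = ks[0]
    for prev, cur in zip(ks, ks[1:]):
        prev_tc = total_tc_by_k[prev]
        gain = total_tc_by_k[cur] - prev_tc
        if prev_tc > 0 and gain / prev_tc < min_relative_gain:
            break  # adding topics no longer increases overall TC much
        best = cur
    return best

# Hypothetical TC values, for illustration only
example_tcs = {5: 10.0, 10: 16.0, 15: 18.0, 20: 18.2}
chosen = pick_num_topics(example_tcs)
```

With these made-up numbers the step from 15 to 20 topics adds about 1% to the overall TC, so the function settles on 15.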

@felixcs1 Thanks for providing that example of calculating coherence via Gensim! We'll probably point people to that for now rather than implementing it within CorEx directly.

blehman commented 7 months ago

The answer here no longer works. Here's a solution that's not in a general form but can be altered.

from corextopic import corextopic as ct
from sklearn.feature_extraction.text import CountVectorizer

def get_corex_model(list_of_str, text_prep, num_topics=10):
    """
    Train a CorEx topic model on a list of strings.

    Args:
    - list_of_str (list): List of strings to train the model on.
    - text_prep (function): Text preprocessing function.
    - num_topics (int): Number of topics to extract (default is 10).

    Returns:
    - topic_model: Trained CorEx topic model.
    - vectorizer: Fitted CountVectorizer used for text preprocessing.
    - corex_topic_words (list): List of lists containing top words for each topic.
    - preprocessed_tweets (list): List of preprocessed tweet texts.
    """
    # Keep only English language tweets 
    tweet_texts_filtered = filter_lang(list_of_str)

    # Preprocessing step 
    preprocessed_tweets = list(set([text_prep(text) for text in tweet_texts_filtered]))

    # Convert preprocessed text to a document-term matrix
    vectorizer = CountVectorizer(max_features=10000, max_df=0.95, min_df=2, stop_words='english')
    doc_term_matrix = vectorizer.fit_transform(preprocessed_tweets)

    # Train the CorEx model
    topic_model = ct.Corex(n_hidden=num_topics, seed=42)
    topic_model.fit(doc_term_matrix)

    # Get and print the topics
    topics = topic_model.get_topics()
    feature_names = vectorizer.get_feature_names_out()  # Get feature names from vectorizer
    corex_topic_words = []
    for i, topic in enumerate(topics):
        words, _, _ = zip(*topic)
        topic_words = [feature_names[idx] for idx in words]  # Convert token IDs to words
        corex_topic_words.append(topic_words)
        print(f"Topic {i+1}: {' '.join(topic_words)}")

    return topic_model, vectorizer, corex_topic_words, preprocessed_tweets
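The loop above unpacks three values per topic entry and maps the first one through the vectorizer's feature names, whereas the earlier answer in this thread unpacked two values and treated the first as a word. Assuming that difference in tuple shape is what broke the earlier snippet, a small helper that only looks at the first field copes with both. The helper name topic_words_from_tuples is hypothetical and this is a sketch, not part of corextopic.

```python
from numbers import Integral

def topic_words_from_tuples(topics, feature_names=None):
    """Return one list of words per topic, ignoring any extra tuple fields.

    topics:        output of get_topics(), assumed to be a list of topics,
                   each a list of tuples whose first element is a word or
                   a column index.
    feature_names: optional sequence mapping column index -> word.
    """
    all_words = []
    for topic in topics:
        words = []
        for entry in topic:
            first = entry[0]  # word or column index; remaining fields are ignored
            if isinstance(first, Integral) and feature_names is not None:
                first = feature_names[first]  # translate index to word
            words.append(str(first))
        all_words.append(words)
    return all_words
```

Passing words=feature_names to fit, if your corextopic version supports it, should make get_topics return words directly, in which case feature_names can be omitted here.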

from gensim.models.coherencemodel import CoherenceModel
from gensim import corpora

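# Helper functions assumed to be defined elsewhere (not shown here):
# get_replaced_tweets() returns a list of strings
# preprocess_text_m2() takes one string and returns the preprocessed string
# filter_lang() returns a list of strings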

def get_coherence_score(topic_model, vectorizer, corex_topic_words, preprocessed_tweets):
    """
    Calculate coherence score for a CorEx topic model.

    Args:
    - topic_model: Trained CorEx topic model.
    - vectorizer: Fitted CountVectorizer used for text preprocessing.
    - corex_topic_words (list): List of lists containing top words for each topic.
    - preprocessed_tweets (list): List of preprocessed tweet texts.

    Returns:
    - coherence_score: Coherence score for the given topic model.
    """
    # Tokenize the text 
    preprocessed_tweets_tokens = [text.split() for text in preprocessed_tweets]

    # Creating the term dictionary, where every unique term is assigned an index
    dictionary = corpora.Dictionary(preprocessed_tweets_tokens)

    # Creating corpus using dictionary prepared above
    corpus = [dictionary.doc2bow(doc) for doc in preprocessed_tweets_tokens]

    # Get coherence score
    cm_corex = CoherenceModel(topics=corex_topic_words, texts=preprocessed_tweets_tokens, corpus=corpus, dictionary=dictionary, coherence='c_v')

    return cm_corex.get_coherence()

get_coherence_score(*get_corex_model(get_replaced_tweets(), preprocess_text_m2))
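One thing to watch for in the code above: the topic words come from CountVectorizer, which lowercases text and keeps only alphanumeric tokens of two or more characters by default, while the Gensim dictionary is built from text.split(), which keeps case and attached punctuation. Topic words that are absent from the dictionary may be dropped or raise an error, depending on the Gensim version. The check below is a minimal sketch for spotting such mismatches before computing coherence; the function name missing_topic_words is hypothetical.

```python
def missing_topic_words(topic_words, tokenized_docs):
    """For each topic, list the words that never appear in the tokenized documents.

    topic_words:    list of topics, each a list of words.
    tokenized_docs: list of documents, each a list of tokens
                    (the same tokens used to build the Gensim dictionary).
    """
    vocabulary = {token for doc in tokenized_docs for token in doc}
    return [[word for word in topic if word not in vocabulary]
            for topic in topic_words]

# Whitespace splitting keeps "Great" and "coffee!" as-is, so the
# lowercased, punctuation-free topic words are not found.
example_missing = missing_topic_words(
    [["great", "coffee"]],
    [["Great", "coffee!"], ["more", "text"]],
)
```

If the check reports missing words, one option is to tokenize with vectorizer.build_analyzer() instead of text.split(), so that both sides share the same tokenization.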