juliasilge / juliasilge.com

My blog, built with blogdown and Hugo :link:
https://juliasilge.com/

Training, evaluating, and interpreting topic models | Julia Silge #13

Open utterances-bot opened 3 years ago

utterances-bot commented 3 years ago

Training, evaluating, and interpreting topic models | Julia Silge

At the beginning of this year, I wrote a blog post about how to get started with the stm and tidytext packages for topic modeling. I have been doing more topic modeling in various projects, so I wanted to share some workflows I have found useful for…

https://juliasilge.com/blog/evaluating-stm/

Roozbeh-you commented 3 years ago

Can you please explain more about making NLP parallel? I have a hard time understanding the part where you use the furrr package. I am trying to mimic what you've done, but at the same time I want to have a good understanding of the concept. Your clarification is appreciated.

juliasilge commented 3 years ago

The furrr package lets you use parallel processing wherever you would have used map() before, so replacing map() with future_map(). In this case, the different topic models with different numbers of topics K are independent of each other and can each be trained in parallel.
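
For a toy illustration of the difference (slow_square() here is just a hypothetical stand-in for an expensive computation like fitting an stm model):

library(purrr)
library(furrr)
plan(multisession, workers = 4)  # start 4 background R sessions

slow_square <- function(x) {
  Sys.sleep(1)  # pretend this is an expensive model fit
  x^2
}

map(1:4, slow_square)         # sequential: about 4 seconds
future_map(1:4, slow_square)  # parallel: about 1 second on 4 workers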

waldeinsamkeite commented 3 years ago

Hi Julia! Thanks for your super helpful post!

I'm working with the stm package and trying to figure out how to do the findThoughts function. Is there any "tidy" way to find the documents most relevant to each topic? Thanks!

juliasilge commented 3 years ago

Yes @waldeinsamkeite, the findThoughts() function returns the top documents ranked by the topic's document-topic probabilities, so that's the same as this dataframe. You can go straight to the probabilities and do exactly the ranking and filtering you want yourself using tidy data principles. If you want to apply other conditions like in findThoughts(), IMO that is even easier using this approach.
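
For example, a minimal sketch assuming topic_model is the fitted stm model from the post:

library(tidytext)
library(dplyr)

td_gamma <- tidy(topic_model, matrix = "gamma")

# top 5 documents per topic by gamma, analogous to findThoughts()
td_gamma %>%
  group_by(topic) %>%
  slice_max(gamma, n = 5) %>%
  ungroup()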

waldeinsamkeite commented 3 years ago

Hi @juliasilge I've been trying so many methods to figure that out. But I didn't realize findThoughts() and tidy(model, matrix = "gamma") are doing the same thing, except that findThoughts() returns the quote directly. Thanks so much!!!

aridf commented 3 years ago

This post has been incredibly helpful.

annalundsoe commented 2 years ago

Hi @juliasilge Thank you for a great post. I am interested in comparing topic prevalence for two groups (like Barberá et al., 2019, fig. 1, but they are using LDA). So, two plots like your 'Top 20 topics by prevalence' plot, but the metadata from the topic model are lost when I run tidy(STM, matrix = "gamma").

Do you know of a neat way to keep the metadata (or just one column of metadata) so I can use it to make two separate plots showing the topic prevalence? Thanks!

juliasilge commented 2 years ago

@annalundsoe You'll want to join the dataframe that has the per-document-per-topic probabilities to any document-level information you have. This is one of the huge benefits of using tidy data principles; you have your data in a flexible format that is amenable to further analysis. The gamma dataframe has your document IDs, so you can join back up with any other dataframe that has document information.
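
For example, a sketch where doc_info stands in for whatever dataframe holds your document-level metadata, including a hypothetical group column for the two groups you want to compare:

library(dplyr)
library(tidytext)

td_gamma <- tidy(topic_model, matrix = "gamma",
                 document_names = rownames(hacker_news_sparse))

td_gamma %>%
  left_join(doc_info, by = "document") %>%
  group_by(group, topic) %>%
  summarise(prevalence = mean(gamma), .groups = "drop")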

Edited to add a few more resources. You might want to look into estimateEffect() from the stm package (which does have a tidy() method), or using document-level covariates in your stm model.

annalundsoe commented 2 years ago

Oh great thanks!

Yes, I've been using estimateEffect() + summary(), although I'm unsure how to interpret the output! How do you interpret negative coefficient values for the model?

I've tried to look at the vignette and ?estimateEffect, but either it is not there or I am not fully grasping it.

juliasilge commented 2 years ago

@annalundsoe It's like coefficients in a linear model, basically, so the effects are positive or negative with each covariate. Check out this worked example.
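
A sketch of what that might look like, assuming a 20-topic model and a covariates dataframe (one row per document) with a gender column:

library(stm)
library(tidytext)

prep <- estimateEffect(1:20 ~ gender, topic_model, metadata = covariates)
summary(prep, topics = 1)

# the tidy() method returns the same estimates as a dataframe
tidy(prep)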

annalundsoe commented 2 years ago

@juliasilge Yes, but more intuitively, let's say the covariate is gender: what does -1.2 mean? That a topic is 120 percent less prevalent in male authors' texts?

marcburri commented 2 years ago

Hi, great post! Could you explain what the diagnostics "lower bound" tells us about the various numbers of topics?

juliasilge commented 2 years ago

That lower bound is about whether or how quickly the model converged. If you want to really get into the nitty gritty of this (complex model), I recommend reading the paper for it, which goes into those kinds of details.

mubagriyanik commented 2 years ago

Hi Julia, thank you so much for this great post!

Although in the comments above you mentioned that the gamma matrix gives results similar to findThoughts(), I want to get the top documents related to each topic so I can learn more about those topics, because in my research I am planning to rename the topics with new labels related to my research area. How can I get some number of documents for a specific topic in my topic model? Thank you! Mu

juliasilge commented 2 years ago

@muharrembb Sounds like you want the highest-probability documents for each topic, which is, like you mentioned, the gamma matrix. You can use arrange() to sort that dataframe and find the documents with high probabilities for each topic.
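
For example, something like this (with topic 3 as an arbitrary choice, and td_gamma coming from tidy(topic_model, matrix = "gamma")):

library(dplyr)

td_gamma %>%
  filter(topic == 3) %>%
  arrange(desc(gamma)) %>%
  head(10)  # the ten highest-probability documents for topic 3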

MLDavies commented 2 years ago

Great stuff. How about topics over time? Have you built a tutorial that trains a model with topics that can vary over time, or with some continuous or categorical characteristic?

juliasilge commented 2 years ago

@MLDavies I don't have a full-fledged tutorial on this, but you can check out:

IcarusAE commented 2 years ago

Hi Julia, in case you read this: as always, I am delighted to read this tutorial. Unfortunately, I wasn't able to run the many-models stm. I saw a Stack Overflow question (https://cutt.ly/8UQxiSv) describing exactly the same issue, but even the answer someone presented there does not work. Would it be possible to update the material presented? Second, how long is your many-models stm expected to run, and how long did it take on your laptop? Best regards, Holger

juliasilge commented 2 years ago

@IcarusAE Let me see if I can put together a more updated blog post with some new data sometime soon. I have worked with the future package, and if I remember correctly, what you are seeing is a warning, not an error. I believe it should still run OK, but there is uncertainty about whether random numbers have been generated correctly.

It's been a while since I wrote this blog post but it does take quite a while to train all the models on such a large corpus. Depending on your computer, cores, etc, it could well take something like 10-20 minutes, or maybe more.

IcarusAE commented 2 years ago

Dear Julia, thank you for the reply. In the meantime, I found out that the code needs a further argument (.options = furrr_options(seed = TRUE)) that ensures correct random number generation. I have not yet tried that with your Hacker News example. However, after spending the last week reading and playing around with every piece of code you have presented anywhere :) I integrated all the pieces and adapted the different approaches to your Spice Girls example. I created a markdown doc (see link) that starts with your blog post and adds the other pieces (adding predictors when estimating the initial topic model, testing the model across the range of K's, and adding some other helpful things). Perhaps someone here finds that helpful.

In case there are errors (e.g., in my interpretation), I would welcome feedback (as I said, I began one week ago).

All the best, Holger (@HolgerSteinmetz on Twitter)

https://htmlpreview.github.io/?https://github.com/IcarusAE/Mixed-open/blob/master/Topic-modeling.html

louislegum commented 2 years ago

Hi Julia, Thanks for this incredible resource!
I was wondering: when creating the "many_models" object, do we include the entire code for our stm model, including metadata covariates?

all the best, Louis

juliasilge commented 2 years ago

@louislegum Yes, if you wanted to have a more complex model, you would do something like future_map(K, ~stm(hacker_news_sparse, K = ., prevalence = ~s(Year), data = covariates)).
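
A fuller sketch, assuming the hacker_news_sparse matrix and a covariates dataframe with a Year column (one row per document) as in the post:

library(tidyverse)
library(furrr)
library(stm)

plan(multisession)

many_models <- tibble(K = c(20, 40, 60, 80, 100)) %>%
  mutate(topic_model = future_map(
    K,
    ~ stm(hacker_news_sparse, K = .,
          prevalence = ~ s(Year), data = covariates,
          verbose = FALSE),
    .options = furrr_options(seed = TRUE)  # reproducible parallel RNG
  ))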

tynnie commented 2 years ago

Hi Julia, Thank you for sharing this great post!

Can you explain more about why you take gamma values to inspect the topic proportions?

I mean, the stm topic modeling result provides theta values for each document, and we can find out which topic dominates a document by comparing theta values. Then we can calculate the document-topic proportions. (reference)

I have compared the topic proportions calculated from gamma values and from theta values; though the distribution of the topics was almost the same, there were some slight differences (as in this table).

I really want to know what causes these differences and figure out what to use when talking about document-topic proportions. I’d appreciate it if you could share your thoughts about this issue.

Thanks!

juliasilge commented 2 years ago

@tynnie My understanding is that this is just a difference in nomenclature. The gamma and theta matrices are the same, just two names for the same thing. You can see how I documented this here, and how this is implemented in code here.

tynnie commented 2 years ago

Many thanks! These documents are really helpful.

sheingate commented 2 years ago

Great post, thank you! Is there a way I can examine the coherence and exclusivity of specific topics, for instance by generating a table rather than the figure?

juliasilge commented 2 years ago

@sheingate I believe so, yes. Take a look at the k_result object like what I show in this post; it has nested exclusivity and semantic coherence for each model.
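
For example, a sketch assuming k_result has nested exclusivity and semantic_coherence list columns as in the post:

library(tidyverse)

k_result %>%
  select(K, exclusivity, semantic_coherence) %>%
  unnest(cols = c(exclusivity, semantic_coherence)) %>%
  group_by(K) %>%
  mutate(topic = row_number()) %>%  # one row per topic within each model
  ungroup()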

oguzozbay commented 1 year ago

Hi, and thank you very much for the useful explanations.

I have a question about the comment below:

juliasilge commented [on May 22, 2022]

https://github.com/juliasilge/juliasilge.com/issues/13#issuecomment-1133810069

future_map(K, ~stm(hacker_news_sparse, K = ., prevalence = ~s(Year), data = covariates))

Is it possible to add docvars to hacker_news_sparse, something like below, instead of using data = covariates?

hacker_news_sparse$covariates <- Covariate_data_frame$covariates # assigning a column as a docvar

juliasilge commented 1 year ago

@oguzozbay No, I don't believe so. The data passed in as the first argument (documents) is only meant to be the text data, not other covariates.

SarahRWarren commented 1 year ago

Hi, Julia! Thanks for a super helpful post. I used this code successfully about a year ago to estimate a topic model on some forum posts. After some updates to the furrr package, the code no longer runs and, when it doesn't throw an error, it crashes my computer. Are you aware of a workaround?

(Note: same thing is happening to colleagues using a similar procedure on different datasets.)

The multiprocess strategy was phased out, so I switched to multisession. I also switched from K = c(20, 40, 60, 80, 100) to a full range K = 5:100. This has made it so that the code runs, but it also crashes my computer.

df <- read_rds("data/df.Rds")

tidy_forum <- df %>%
  unnest_tokens(word, text_noquote, token = "tweets") %>%
  anti_join(get_stopwords()) %>%
  filter(!str_detect(word, "[0-9]+")) %>%
  add_count(word)
#we use "tweets" here because it's a good workhorse token for processing all kinds of forum data

forum_sparse <- tidy_forum %>%
  count(id, word) %>%
  cast_sparse(id, word, n)

#breaking point
plan(multisession)
many_models <- tibble(K = 5:100) %>%
  mutate(topic_model = future_map(K, ~stm(forum_sparse, K=K), 
                                  seed = TRUE))

juliasilge commented 1 year ago

@SarahRWarren It looks like you need to make a few changes to your code based on updates to future and furrr:

library(tidyverse)
library(tidytext)
library(furrr)
#> Loading required package: future
library(stm)
#> stm v1.3.6 successfully loaded. See ?stm for help. 
#>  Papers, resources, and other materials at structuraltopicmodel.com

tidy_austen <- 
    janeaustenr::austen_books() %>%
    mutate(id = row_number()) %>%
    unnest_tokens(word, text) %>%
    anti_join(get_stopwords()) %>%
    add_count(word) %>%
    filter(n > 10)
#> Joining with `by = join_by(word)`

sparse_austen <- 
    tidy_austen %>%
    count(id, word, name = "word_count") %>%
    cast_sparse(id, word, word_count)

plan(multisession, workers = 3)
tibble(K = c(4, 8, 12)) %>%
    mutate(topic_model = future_map(
        ## notice that it is `K = .` here:
        K, ~ stm(sparse_austen, K = ., verbose = FALSE), 
        ## new way to pass seed arg:
        .options = furrr_options(seed = TRUE)
    ))
#> # A tibble: 3 × 2
#>       K topic_model
#>   <dbl> <list>     
#> 1     4 <STM>      
#> 2     8 <STM>      
#> 3    12 <STM>

Created on 2023-04-25 with reprex v2.0.2

If that doesn't fix it, I am guessing your problem may be because of differences in how memory is managed with the different non-sequential evaluation strategies. Are you running out of memory? Are you on Windows? Probably not, if you were using multiprocess before; I believe that multicore would be the same as what you used before. You can read more about these differences here:

You may see a warning about using multicore in RStudio, but you can turn the warning off if you're comfortable with the tradeoffs there.
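
I believe something like this works (parallelly.fork.enable is an option from the parallelly package, which future builds on):

library(future)

# acknowledge the risks of forking inside RStudio and silence the warning
options(parallelly.fork.enable = TRUE)
plan(multicore, workers = 4)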