This is a collection of resources I have found on topic models, with an R-flavored focus. A topic model is a type of generative model used to "discover" the latent topics that compose a corpus, or collection of documents. Topic modeling is typically applied to text documents, but it can be used with other modes of data, for example to generate captions for images.
This is my rundown of the minimal readings, websites, videos, and scripts the reader needs to become familiar with topic modeling. The list is ordered by what I believe will be of greatest use and offers a mix of introduction, theory, application, and interpretation. As you learn more about topic modeling, the other sections will become more useful.
Papadimitriou, Raghavan, Tamaki, & Vempala (1997) first introduced the notion of topic modeling in "Latent Semantic Indexing: A Probabilistic Analysis". Thomas Hofmann (1999) developed "Probabilistic Latent Semantic Indexing". Blei, Ng, & Jordan (2003) proposed latent Dirichlet allocation (LDA) as a means of modeling documents with multiple topics, but LDA assumes the topics are uncorrelated. Blei & Lafferty (2007) proposed the correlated topic model (CTM), extending LDA to allow for correlations between topics. Roberts, Stewart, Tingley, & Airoldi (2013) proposed the Structural Topic Model (STM), which allows meta-data to be included in the modeling process.
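To make the "generative" part concrete, here is a minimal base R sketch of the story LDA tells about how one document comes to be: draw topic-word distributions, draw a topic mixture for the document, then draw a topic and a word for each position. The vocabulary, the number of topics, the Dirichlet parameters, and the helper `rdirichlet1()` are all made up for illustration; this is not code from any of the packages discussed here.

```r
# Minimal sketch of LDA's generative story (illustrative values only).
set.seed(1)

# Hypothetical helper: one Dirichlet draw via normalised Gamma variates.
rdirichlet1 <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha, rate = 1)
  g / sum(g)
}

vocab    <- c("tax", "jobs", "war", "peace", "health", "care")  # made-up vocabulary
n_topics <- 2
doc_len  <- 20

# Each topic is a distribution over the vocabulary (rows of beta sum to 1).
beta <- t(replicate(n_topics, rdirichlet1(rep(0.5, length(vocab)))))
colnames(beta) <- vocab

# Each document gets its own mixture over topics.
theta <- rdirichlet1(rep(0.5, n_topics))

# For every word position: pick a topic, then pick a word from that topic.
z <- sample(n_topics, doc_len, replace = TRUE, prob = theta)
w <- vapply(z, function(k) sample(vocab, 1, prob = beta[k, ]), character(1))

# Word counts for the simulated document
table(factor(w, levels = vocab))
```

Fitting a topic model runs this story in reverse: only the words `w` are observed, and `beta` and `theta` are what the model tries to recover.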
Marwick, B. 2013. Discovery of Emergent Issues and Controversies in Anthropology Using Text Mining, Topic Modeling, and Social Network Analysis of Microblog Content. In Yanchang Zhao, Yonghua Cen (eds) Data Mining Applications with R. Elsevier. p. 63-93
Newman, D.J. & Block, S. (2006). Probabilistic topic decomposition of an eighteenth-century American newspaper. Journal of the American Society for Information Science and Technology. 57(6), 753-767. <doi:10.1002/asi.v57:6>
Package | Functionality | Pluses | Author | R Language Interface |
---|---|---|---|---|
lda* | Collapsed Gibbs for LDA | Graphing utilities | Chang | R |
topicmodels | LDA and CTM | Follows Blei's implementation; great vignette; takes DTM | Grün & Hornik | C |
stm | Model w/ meta-data | Great documentation; nice visualization | Roberts, Stewart, & Tingley | C |
LDAvis | Interactive visualization | Aids in model interpretation | Sievert & Shirley | R + Shiny |
mallet** | LDA | MALLET is well known | Mimno | Java |
*StackExchange discussion of lda vs. topicmodels

**Setting Up MALLET
The .R script for this demonstration can be downloaded from scripts/Example_topic_model_analysis.R
if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh("trinker/gofastr")
pacman::p_load(tm, topicmodels, dplyr, tidyr, igraph, devtools, LDAvis, ggplot2)
## Source topicmodels2LDAvis & optimal_k functions
invisible(lapply(
    file.path(
        "https://raw.githubusercontent.com/trinker/topicmodels_learning/master/functions",
        c("topicmodels2LDAvis.R", "optimal_k.R")
    ),
    devtools::source_url
))
## SHA-1 hash of file is 5ac52af21ce36dfe8f529b4fe77568ced9307cf0
## SHA-1 hash of file is 7f0ab64a94948c8b60ba29dddf799e3f6c423435
data(presidential_debates_2012)
stops <- c(
    tm::stopwords("english"),
    tm::stopwords("SMART"),
    "governor", "president", "mister", "obama", "romney"
) %>%
    gofastr::prep_stopwords()
doc_term_mat <- presidential_debates_2012 %>%
    with(gofastr::q_dtm_stem(dialogue, paste(person, time, sep = "_"))) %>%
    gofastr::remove_stopwords(stops, stem = TRUE) %>%
    gofastr::filter_tf_idf() %>%
    gofastr::filter_documents()
control <- list(burnin = 500, iter = 1000, keep = 100, seed = 2500)
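If it helps to picture what `doc_term_mat` holds, here is a toy base R sketch of a document-term matrix: one row per document (here, a person and time pair), one column per term, and counts in the cells. The three sentences and the stopword list are invented, and the sketch skips stemming and tf-idf filtering, so it is only a stand-in for what the gofastr pipeline produces, not a description of how gofastr works internally.

```r
# Toy documents keyed the same way as the demo (person_time); text is made up.
docs <- c(
  OBAMA_time1  = "we will create jobs and cut taxes",
  ROMNEY_time1 = "jobs and taxes matter and energy matters",
  OBAMA_time2  = "energy policy and health care"
)
toy_stops <- c("we", "will", "and")  # hypothetical stopword list

# Lowercase, split on non-letters, and drop stopwords and empty strings.
tokens <- lapply(strsplit(tolower(docs), "[^a-z]+"), function(x) {
  x[nzchar(x) & !x %in% toy_stops]
})
names(tokens) <- names(docs)

# Count every vocabulary term in every document.
vocab <- sort(unique(unlist(tokens)))
dtm <- t(vapply(
  tokens,
  function(x) as.integer(table(factor(x, levels = vocab))),
  integer(length(vocab))
))
colnames(dtm) <- vocab

dtm
```

Rows that end up empty after filtering carry no information for the model, which is presumably why the pipeline above finishes with `filter_documents()`.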
The plot below shows the harmonic mean of the log likelihoods against k (number of topics).
(k <- optimal_k(doc_term_mat, 40, control = control))
##
## Grab a cup of coffee this could take a while...
## 10 of 40 iterations (Current: 08:54:32; Elapsed: .2 mins)
## 20 of 40 iterations (Current: 08:55:07; Elapsed: .8 mins; Remaining: ~2.3 mins)
## 30 of 40 iterations (Current: 08:56:03; Elapsed: 1.7 mins; Remaining: ~1.3 mins)
## 40 of 40 iterations (Current: 08:57:30; Elapsed: 3.2 mins; Remaining: ~0 mins)
## Optimal number of topics = 20
It appears the optimal number of topics is ~k = 20.
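For readers curious about the quantity on the vertical axis, the following is a rough base R sketch of a harmonic mean computed from kept log-likelihoods, using the log-sum-exp trick so that very negative values do not underflow. The function name `log_harmonic_mean()` and the log-likelihood values are hypothetical, and the sourced `optimal_k()` may compute its score differently; treat this as an illustration of the idea only.

```r
# Log of the harmonic mean of likelihoods, given their logs.
# HM = n / sum(1 / L_i), so log(HM) = log(n) - logsumexp(-log L_i).
log_harmonic_mean <- function(log_lik) {
  neg <- -log_lik
  m <- max(neg)  # shift by the max for numerical stability
  log(length(log_lik)) - (m + log(sum(exp(neg - m))))
}

# Made-up kept log-likelihoods for three candidate numbers of topics
kept <- list(
  `5`  = c(-10510, -10495, -10502),
  `10` = c(-10210, -10190, -10201),
  `15` = c(-10260, -10240, -10255)
)

# Score every candidate and keep the k with the largest value.
scores <- vapply(kept, log_harmonic_mean, numeric(1))
best_k <- as.integer(names(scores)[which.max(scores)])
best_k
```

The harmonic mean is dominated by the smallest likelihoods in the sample, so a candidate k scores well only if none of its kept draws is very poor.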
control[["seed"]] <- 100
lda_model <- topicmodels::LDA(doc_term_mat, k = as.numeric(k), method = "Gibbs",
    control = control)
topics <- topicmodels::posterior(lda_model, doc_term_mat)[["topics"]]
topic_dat <- dplyr::add_rownames(as.data.frame(topics), "Person_Time")
colnames(topic_dat)[-1] <- apply(terms(lda_model, 10), 2, paste, collapse = ", ")
tidyr::gather(topic_dat, Topic, Proportion, -c(Person_Time)) %>%
    tidyr::separate(Person_Time, c("Person", "Time"), sep = "_") %>%
    dplyr::mutate(Person = factor(Person,
        levels = c("OBAMA", "ROMNEY", "LEHRER", "SCHIEFFER", "CROWLEY", "QUESTION"))
    ) %>%
    ggplot2::ggplot(ggplot2::aes(weight = Proportion, x = Topic, fill = Topic)) +
        ggplot2::geom_bar() +
        ggplot2::coord_flip() +
        ggplot2::facet_grid(Person ~ Time) +
        ggplot2::guides(fill = "none") +
        ## coord_flip() draws the y aesthetic (summed proportions) on the horizontal axis
        ggplot2::ylab("Proportion")
heatmap(topics, scale = "none")
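A quick numeric companion to the heatmap is to ask which topic dominates each document. The sketch below does this on a small invented matrix shaped like `topics` (documents in rows, topics in columns, rows summing to 1); the row names and proportions are hypothetical, and the same lines should work on the real `topics` matrix if it has that shape.

```r
# Invented document-by-topic proportions (each row sums to 1).
topics_toy <- matrix(
  c(0.70, 0.20, 0.10,
    0.15, 0.25, 0.60,
    0.30, 0.50, 0.20),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("OBAMA_time 1", "ROMNEY_time 1", "LEHRER_time 1"),
    c("1", "2", "3")
  )
)

# For every document, record the topic with the largest proportion.
dominant <- data.frame(
  document   = rownames(topics_toy),
  topic      = colnames(topics_toy)[max.col(topics_toy, ties.method = "first")],
  proportion = unname(apply(topics_toy, 1, max)),
  stringsAsFactors = FALSE
)

dominant
```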
post <- topicmodels::posterior(lda_model)
cor_mat <- cor(t(post[["terms"]]))
cor_mat[ cor_mat < .05 ] <- 0
diag(cor_mat) <- 0
graph <- graph_from_adjacency_matrix(cor_mat, weighted = TRUE, mode = "lower")
graph <- delete_edges(graph, E(graph)[weight < 0.05])
E(graph)$edge.width <- E(graph)$weight*20
V(graph)$label <- paste("Topic", V(graph))
V(graph)$size <- colSums(post[["topics"]]) * 15
par(mar=c(0, 0, 3, 0))
set.seed(110)
plot.igraph(graph, edge.width = E(graph)$edge.width,
    edge.color = "orange", vertex.color = "orange",
    vertex.frame.color = NA, vertex.label.color = "grey30")
title("Strength Between Topics Based On Word Probabilities", cex.main=.8)
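The edges in this network come from correlating the topics' word probabilities and discarding weak or negative correlations. Here is a small base R sketch of just that step on an invented 3-topic, 5-term matrix; `beta_toy` is hypothetical and stands in for `post[["terms"]]`, and the 0.05 cutoff mirrors the one used above.

```r
# Invented topic-by-term probabilities (each row sums to 1).
beta_toy <- rbind(
  c(0.40, 0.30, 0.20, 0.05, 0.05),
  c(0.35, 0.35, 0.15, 0.10, 0.05),
  c(0.05, 0.05, 0.10, 0.40, 0.40)
)

# Correlate topics across terms, then zero out weak links and the diagonal.
cor_toy <- cor(t(beta_toy))
cor_toy[cor_toy < 0.05] <- 0
diag(cor_toy) <- 0

# List the surviving edges once each by reading the lower triangle.
idx <- which(lower.tri(cor_toy) & cor_toy > 0, arr.ind = TRUE)
edge_df <- data.frame(
  from   = idx[, "col"],
  to     = idx[, "row"],
  weight = cor_toy[idx]
)

edge_df
```

In this toy case only topics 1 and 2 stay connected, because topic 3 puts its probability on different terms and correlates negatively with both.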
minval <- .1
topic_mat <- topicmodels::posterior(lda_model)[["topics"]]
graph <- graph_from_incidence_matrix(topic_mat, weighted=TRUE)
graph <- delete_edges(graph, E(graph)[weight < minval])
E(graph)$edge.width <- E(graph)$weight*17
E(graph)$color <- "blue"
V(graph)$color <- ifelse(grepl("^\\d+$", V(graph)$name), "grey75", "orange")
V(graph)$frame.color <- NA
V(graph)$label <- ifelse(grepl("^\\d+$", V(graph)$name), paste("topic", V(graph)$name), gsub("_", "\n", V(graph)$name))
V(graph)$size <- c(rep(10, nrow(topic_mat)), colSums(topic_mat) * 20)
V(graph)$label.color <- ifelse(grepl("^\\d+$", V(graph)$name), "red", "grey30")
par(mar=c(0, 0, 3, 0))
set.seed(369)
plot.igraph(graph, edge.width = E(graph)$edge.width,
    vertex.color = adjustcolor(V(graph)$color, alpha.f = .4))
title("Topic & Document Relationships", cex.main=.8)
The output from LDAvis is not easily embedded within an R markdown document; however, the reader may see the results here.
lda_model %>%
topicmodels2LDAvis() %>%
LDAvis::serVis()
## Create the DocumentTermMatrix for New Data
doc_term_mat2 <- partial_republican_debates_2015 %>%
    with(gofastr::q_dtm_stem(dialogue, paste(person, location, sep = "_"))) %>%
    gofastr::remove_stopwords(stops, stem = TRUE) %>%
    gofastr::filter_tf_idf() %>%
    gofastr::filter_documents()
## Update Control List
control2 <- control
control2[["estimate.beta"]] <- FALSE
## Run the Model for New Data
lda_model2 <- topicmodels::LDA(doc_term_mat2, k = k, model = lda_model,
    control = list(seed = 100, estimate.beta = FALSE))
## Plot the Topics Per Person & Location for New Data
topics2 <- topicmodels::posterior(lda_model2, doc_term_mat2)[["topics"]]
topic_dat2 <- dplyr::add_rownames(as.data.frame(topics2), "Person_Location")
colnames(topic_dat2)[-1] <- apply(terms(lda_model2, 10), 2, paste, collapse = ", ")
tidyr::gather(topic_dat2, Topic, Proportion, -c(Person_Location)) %>%
    tidyr::separate(Person_Location, c("Person", "Location"), sep = "_") %>%
    ggplot2::ggplot(ggplot2::aes(weight = Proportion, x = Topic, fill = Topic)) +
        ggplot2::geom_bar() +
        ggplot2::coord_flip() +
        ggplot2::facet_grid(Person ~ Location) +
        ggplot2::guides(fill = "none") +
        ## coord_flip() draws the y aesthetic (summed proportions) on the horizontal axis
        ggplot2::ylab("Proportion")
## LDAvis of Model for New Data
lda_model2 %>%
topicmodels2LDAvis() %>%
LDAvis::serVis()
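The key idea in fitting the new data is that the topics' word distributions stay fixed (`estimate.beta = FALSE`) and only each new document's topic mixture is estimated. The sketch below illustrates that idea with a simple EM "fold-in" update in base R. To be clear, this is a swapped-in stand-in technique and not the inference topicmodels actually runs; `fold_in()`, `beta_fixed`, and the counts are all hypothetical.

```r
# Estimate one document's topic mixture while the topic-word matrix stays fixed.
# counts: term counts for the new document (length V)
# beta:   fixed topic-by-term probabilities (K x V, rows sum to 1)
fold_in <- function(counts, beta, n_iter = 50) {
  seen   <- counts > 0                  # only observed terms carry information
  counts <- counts[seen]
  beta   <- beta[, seen, drop = FALSE]
  theta  <- rep(1 / nrow(beta), nrow(beta))  # start from a uniform mixture

  for (i in seq_len(n_iter)) {
    joint <- theta * beta                            # theta_k * beta_kv
    resp  <- sweep(joint, 2, colSums(joint), "/")    # P(topic | term)
    theta <- as.vector(resp %*% counts) / sum(counts)
  }
  theta
}

# Two made-up topics over four terms; they share no terms.
beta_fixed <- rbind(
  c(0.5, 0.5, 0.0, 0.0),
  c(0.0, 0.0, 0.5, 0.5)
)

# A new document with 6 tokens from topic 1's terms and 2 from topic 2's.
theta_new <- fold_in(c(3, 3, 1, 1), beta_fixed)
theta_new
```

Because the two toy topics do not overlap, the estimated mixture is just the share of tokens each topic can explain. The sketch assumes every observed term has non-zero probability under at least one topic.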
You are welcome to: