juliasilge / juliasilge.com

My blog, built with blogdown and Hugo
https://juliasilge.com/

Topic modeling for #TidyTuesday Spice Girls lyrics | Julia Silge #58

Open utterances-bot opened 2 years ago

utterances-bot commented 2 years ago

Topic modeling for #TidyTuesday Spice Girls lyrics | Julia Silge

Learn how to train, explore, and understand an unsupervised topic model for text data.

https://juliasilge.com/blog/spice-girls/

cr1bt commented 2 years ago

Hi Julia, thanks for this post! I'm an undergraduate student doing some self-learning, so forgive me if I've missed something obvious.

At the end where we've done regression, it appears to only be comparing the albums Spice and Spiceworld. I can't see comparisons for the album Forever, which is included in our original dataset.

Am I missing something?

juliasilge commented 2 years ago

@cr1bt When you fit a linear regression with a factor predictor, you get out coefficients for that predictor that are with respect to a reference level. It is most commonly the first level alphabetically, unless you do something special to the variable ahead of time. Check out this section of our book and this SO answer for some more in-depth explanation.
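To make the reference level concrete, here is a minimal sketch in base R. The data frame and its numeric values are made up for illustration and are not the data from the blog post; only the three album names come from this thread.

```r
# Toy data: the numeric values are invented purely for illustration
toy <- data.frame(
  album = factor(rep(c("Spice", "Spiceworld", "Forever"), each = 4)),
  prevalence = c(0.10, 0.12, 0.11, 0.13,
                 0.20, 0.22, 0.21, 0.23,
                 0.30, 0.32, 0.31, 0.33)
)

# factor() sorts levels alphabetically, so "Forever" becomes the reference level
levels(toy$album)

fit <- lm(prevalence ~ album, data = toy)

# The intercept is the mean for "Forever"; the other two coefficients are
# differences from "Forever", which is why it gets no coefficient of its own
coef(fit)

# To compare against a different album, change the reference level and refit
toy$album <- relevel(toy$album, ref = "Spice")
coef(lm(prevalence ~ album, data = toy))
```

In this sketch, "Forever" is still in the model: it is the baseline that the `albumSpice` and `albumSpiceworld` coefficients are measured against.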

gunnergalactico commented 2 years ago

Hi Julia, this is more of a general question. I was trying to search for another blog post covering one of your analyses, but it seems the search option is no longer available after the redesign. My apologies if I missed it. I tried it on desktop as well.

Thanks.

juliasilge commented 2 years ago

@gunnergalactico I did recently move my blog away from the Academic Hugo theme, which had support for a search bar, to the Apero Hugo theme, which does not yet. I'll look into how to support that! In the meantime, you can search a single site like mine from Google, like:

site:juliasilge.com rpart

JoshuaSteele commented 2 years ago

Hi Julia, I was able to reproduce everything following your examples and wanted to try to produce similar analyses with a database of Taylor Swift's song lyrics for fun. I can reproduce everything until the estimateEffect function, at which point I get an error stating

Error in UseMethod("asSTMCorpus") : no applicable method for 'asSTMCorpus' applied to an object of class "c('tbl_df', 'tbl', 'data.frame')"

I assume it is referring to the line "tidy_lyrics %>% distinct(Title, Album) %>% arrange(Title)" (different column names for this T Swift dataset), which is the 3rd argument in the estimateEffect function.

The thing is, in your example, the result of tidy_lyrics %>% distinct(song_name, album_name) %>% arrange(song_name) is of class c('tbl_df', 'tbl', 'data.frame'), isn't it? I'm not sure what's different.

I know this isn't anywhere near a reprex but thought maybe you might have an idea what kind of thing would cause the estimateEffect function to produce that error.

juliasilge commented 2 years ago

@JoshuaSteele Hmmmm, nothing comes to mind immediately for this. I'd look carefully at the arguments you are passing in to estimateEffect() and make sure they don't have any problems/unexpected characteristics.

JoshuaSteele commented 2 years ago

Ah, well I'll keep trying my other debugging methods then. Thanks for the response!

JoshuaSteele commented 2 years ago

I am so dumb. I was using the %>% operator instead of the assignment <- for the effects <- estimateEffect line. I realized it as I was combing through the estimateEffect parameters. One small typo.

juliasilge commented 2 years ago

@JoshuaSteele Typos strike again! 😭

xinzhuohkust commented 1 year ago

Hi, Julia. I am a big fan of your blog. Thank you so much for sharing. I have a question about applying the stm package. I have a news database covering one month, during which a policy shock occurred. Is it possible to combine difference-in-differences (DID) and structural topic models?

juliasilge commented 1 year ago

@xinzhuohkust Yep, I believe so! If I understand it correctly, the typical way to model DID is to use an interaction term, and the stm package allows you to build a topic model with interaction terms. I recommend you check out the stm paper!
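As a rough illustration of the interaction-term idea, here is a sketch using plain `lm()` on simulated data. Everything in it (the variable names `treated`, `post`, and `y`, and all the numbers) is hypothetical; it shows the generic DID setup, not an stm model.

```r
set.seed(123)

# Simulated 2x2 design: every number here is made up for illustration
n <- 50
did <- expand.grid(unit = seq_len(n), treated = c(0, 1), post = c(0, 1))
did$y <- 1 + 0.5 * did$treated + 0.3 * did$post +
  2 * did$treated * did$post + rnorm(nrow(did), sd = 0.1)

# `treated * post` expands to treated + post + treated:post
fit <- lm(y ~ treated * post, data = did)

# The interaction coefficient is the difference-in-differences estimate
did_lm <- unname(coef(fit)["treated:post"])

# The same quantity computed by hand from the four cell means
cell <- tapply(did$y, list(did$treated, did$post), mean)
did_manual <- (cell["1", "1"] - cell["1", "0"]) - (cell["0", "1"] - cell["0", "0"])

all.equal(did_lm, unname(did_manual))
```

With a topic model, the analogous step would be to put a `treated * post` style term on the right-hand side of the formula given to estimateEffect(); check the stm documentation for how interactions are handled there before relying on this.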

xinzhuohkust commented 1 year ago

Thank you so much for your quick response!

Sorry for bothering you again.

I have fitted a topic model with 80 topics using the stm package. I am using the document-topic distributions as outcome variables and running a regression with the lm or plm function rather than estimateEffect.

as_tibble(topic_model$theta) %>% set_names(nm = sprintf("topic%s", 1:80)) %>% add_column(covariates) %>% lm(topic75 ~ democracy + day + country_name, .) %>% summary()

The regression result is different from: estimateEffect(c(75) ~ democracy + day + country_name, topic_model, meta = covariates)

I was wondering if you could tell me whether I am doing the right thing?

Stay safe and be well!

juliasilge commented 1 year ago

@xinzhuohkust Those are two different models, and I think I would probably use estimateEffect() in most situations, rather than specifying such models using lm(). One significant difference is how estimateEffect() incorporates the uncertainty (from the topic model) in the outcome. Check out the detailed documentation at ?estimateEffect (especially the Details) and the info on estimateEffect() in the stm vignette/paper.

gcm31 commented 1 year ago

Hi! Thank you so much, Julia, for your videos and tutorials. I am applying your tutorial to trace changes in topics over time in journal articles. I have five decades of articles (more than 9 million tokens), and I am treating each decade as the equivalent of an "album" and each document as the equivalent of a song in your example. My model has a k of 25.

The problem is that when I run estimateEffect, I get this error: Error in qr.lm(thetasims[, k], qx) : number of covariate observations does not match number of docs. My code is: estimateEffect(1:25 ~ decade, topic_model_corpus, total_corpus %>% distinct(id, decade) %>% arrange(id))

I am using as metadata a tidy data frame with this structure:

    # A tibble: 9,134,631 × 3
       decade  id             word
     1 decade1 1969-crxwx.txt introduction
     2 decade1 1969-crxwx.txt considerable
     3 decade1 1969-crxwx.txt studies
     4 decade1 1969-crxwx.txt conducted
     5 decade1 1969-crxwx.txt nature
     6 decade1 1969-crxwx.txt memory
     7 decade1 1969-crxwx.txt half
     8 decade1 1969-crxwx.txt century
     9 decade1 1969-crxwx.txt research
    10 decade1 1969-crxwx.txt seldom
    # … with 9,134,621 more rows
    # ℹ Use `print(n = ...)` to see more rows

Do you know what I can do to fix this? I appreciate any insight!

juliasilge commented 1 year ago

@gcm31 It's hard to know for sure without access to your data, but the error message "number of covariate observations does not match number of docs" indicates that what you are passing in as covariates doesn't have the same number of documents as what was in your model. Can you create a reprex (a minimal reproducible example) for this? The goal of a reprex is to make it easier for people to recreate your problem so that they can understand it and/or fix it. Once you have a reprex, I recommend posting on RStudio Community, which is a great forum for getting help with these kinds of modeling questions. Thanks! 🙌

gcm31 commented 1 year ago

Thank you so much for your answer, Julia! I think I found the problem. When I built the sparse matrix, I filtered for tokens used more than 5 times. I will test whether that is the problem; if not, I'll post the question on RStudio Community. Thanks!!
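For anyone hitting the same error, here is a small base R sketch of how a count filter could produce this kind of mismatch. The toy tokens, the threshold of 1, and the object names are all invented for illustration, and the thread does not confirm that this was the actual cause.

```r
# Toy tokens: ids, words, and decades are invented for illustration
tokens <- data.frame(
  id     = c("a.txt", "a.txt", "a.txt", "b.txt", "b.txt", "c.txt"),
  word   = c("memory", "memory", "study", "memory", "memory", "nature"),
  decade = c("decade1", "decade1", "decade1", "decade1", "decade1", "decade2")
)

# Count each word per document, then keep only counts above a threshold
counts <- aggregate(n ~ id + word, data = transform(tokens, n = 1), FUN = sum)
kept <- counts[counts$n > 1, ]

# Documents that survive the filter (these would become the matrix rows)
docs_in_model <- sort(unique(kept$id))

# Metadata built from the unfiltered tokens still lists every document
meta_all <- unique(tokens[, c("id", "decade")])
meta_all <- meta_all[order(meta_all$id), ]

nrow(meta_all)          # 3 documents in the metadata
length(docs_in_model)   # 2 documents left for the model

# One possible fix: restrict the metadata to the surviving documents
meta_model <- meta_all[meta_all$id %in% docs_in_model, ]
stopifnot(nrow(meta_model) == length(docs_in_model))
```

Here "c.txt" loses its only token to the filter and drops out of the model input while staying in the metadata. With real data, the row names of the sparse matrix passed to stm() would presumably play the role of `docs_in_model`, and the metadata rows should also be in the same order as those documents.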