juliasilge / juliasilge.com

My blog, built with blogdown and Hugo
https://juliasilge.com/

The game is afoot! Topic modeling of Sherlock Holmes stories | Julia Silge #17

Open utterances-bot opened 3 years ago

utterances-bot commented 3 years ago

The game is afoot! Topic modeling of Sherlock Holmes stories | Julia Silge

In a recent release of tidytext, we added tidiers and support for building Structural Topic Models from the stm package. This is my current favorite implementation of topic modeling in R, so let’s walk through an example of how to get started with this kind of modeling, using The Adventures of Sherlock Holmes.

https://juliasilge.com/blog/sherlock-holmes-stm/

juliasilge commented 3 years ago

@Suania8 you have probably seen this kind of analysis, where you look at correlation networks of words as a whole. One idea I have, if you are interested in combining that with topic modeling, is to "assign" each word to the topic it has the highest probability of being generated from and then use that as the color of points in a network diagram.

A challenge with this is that topic modeling models documents as a mixture of topics and topics as a mixture of words, so words can be generated by multiple topics.
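A minimal sketch of that "assign each word to one topic" step, written in plain Python rather than with tidytext or stm. The `beta` table, its words, and its probabilities are made up for illustration, and `assign_words_to_topics` is a hypothetical helper. It compares beta across topics directly and ignores how prevalent each topic is, which is a simplification.

```python
# Hypothetical per-topic word probabilities (beta); all values are made up.
beta = {
    1: {"holmes": 0.30, "watson": 0.25, "door": 0.05, "lady": 0.02},
    2: {"holmes": 0.10, "watson": 0.05, "door": 0.40, "lady": 0.03},
    3: {"holmes": 0.05, "watson": 0.02, "door": 0.10, "lady": 0.50},
}


def assign_words_to_topics(beta):
    """Map each word to (topic, probability) for the topic where its beta is largest."""
    best = {}
    for topic, word_probs in beta.items():
        for word, prob in word_probs.items():
            # Keep the topic with the highest probability seen so far for this word.
            if word not in best or prob > best[word][1]:
                best[word] = (topic, prob)
    return best


word_topic = assign_words_to_topics(beta)
```

The resulting topic label per word could then be used as the node color in a network diagram. Keeping the winning probability alongside the label makes it easy to spot words whose assignment is weak because several topics generate them.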

BehnamCA commented 2 years ago

@juliasilge Thank you so much for sharing this work. I applied it to my dataset. Can you please advise how I can get more information about each topic, such as the number of words in each topic or all the words in each topic?

juliasilge commented 2 years ago

@BehnamCA I suggest you check out this chapter as well as these sections for more details on how to get out this kind of information.

BehnamCA commented 2 years ago

@juliasilge thanks so much for the quick reply. I will study the mentioned materials. Much appreciated!

Suania8 commented 2 years ago

hi Julia! I have a question. Once the topics have been extracted, how can they be linked to the entire dataframe for further analysis?

-- Suania Acampa PhD Fellow in Statistic and Social Sciences University of Naples, Federico II

juliasilge commented 2 years ago

@Suania8 I suggest you check out this chapter as well as these sections for more details on how to join the per-topic info to the per-document dataframe.
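As a rough illustration of that kind of join, here is a plain-Python sketch, not tidytext code. The document IDs, the gamma values, and the `attach_top_topic` helper are all hypothetical. It assumes one row of per-document metadata and a long table of (document, topic, gamma) rows, and keeps the highest-gamma topic for each document.

```python
# Hypothetical per-document metadata, one row per document.
documents = [
    {"document": "story_1", "title": "A Scandal in Bohemia"},
    {"document": "story_2", "title": "The Red-Headed League"},
]

# Hypothetical per-document-per-topic proportions (gamma); values are made up.
gamma = [
    ("story_1", 1, 0.8), ("story_1", 2, 0.2),
    ("story_2", 1, 0.3), ("story_2", 2, 0.7),
]


def attach_top_topic(documents, gamma):
    """Join each document row with its highest-gamma topic, matching on document ID."""
    top = {}
    for doc_id, topic, g in gamma:
        if doc_id not in top or g > top[doc_id][1]:
            top[doc_id] = (topic, g)
    joined = []
    for row in documents:
        # Documents with no gamma rows get None for both new fields.
        topic, g = top.get(row["document"], (None, None))
        joined.append({**row, "top_topic": topic, "gamma": g})
    return joined


joined = attach_top_topic(documents, gamma)
```

Keeping only the top topic is a design choice for the sketch. Since each document is a mixture of topics, joining the full set of gamma rows instead preserves more information.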

ghkoo commented 2 years ago

Hi Julia, thank you very much for this tutorial and your video. They are super helpful. I have a question about adding a predictor to my model. I was able to identify 7 topics in 4000+ news stories from 8 news organizations. Now I want to divide the 8 media organizations into two groups, four liberal and four conservative, and compare the frequency of each topic between the two groups. (For example, I want to see whether Topic 1 was used more often in the liberal media than in the conservative media.) Do you have any ideas about adding a predictor to my model to compare liberal vs. conservative outlets?

juliasilge commented 2 years ago

@kgh21 You have several options for this kind of question. You can check out:

ghkoo commented 2 years ago

Thank you for the information!

rrefining commented 1 year ago

Hi Julia, thanks so much for this tutorial. I was wondering if I could create a plot that shows the highest-FREX words for each topic instead of the highest-probability words. Could you give me some advice on this? Thanks in advance!

juliasilge commented 1 year ago

@rrefining We don't have direct support in tidytext for getting out this info right now, but you can do some munging yourself as I outline here. This involves computing the high FREX words yourself via stm::calcfrex(), and then doing some transformation so that you can visualize the results.
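For context, FREX combines how frequent a word is within a topic with how exclusive it is to that topic. A sketch of the usual form, as a weighted harmonic mean, is below. The symbols $F$, $E$, and $w$ are notation chosen here, and which of the two terms the weight attaches to should be checked against the stm documentation.

$$
\mathrm{FREX}_{k,v} = \left( \frac{w}{F_{k,v}} + \frac{1 - w}{E_{k,v}} \right)^{-1}
$$

Here $F_{k,v}$ is a rank-based (empirical CDF) score of the probability $\beta_{k,v}$ of word $v$ within topic $k$, and $E_{k,v}$ is the same kind of score for the word's exclusivity, $\beta_{k,v} / \sum_{j} \beta_{j,v}$. With $w = 0.5$ the two terms are weighted equally, so the placement of the weight does not matter.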

capplestein commented 1 year ago

I'm wondering if you have any guidance on evaluating stm models with different predictors. I did a manual holdout of some documents (especially because there was a grouping structure to my data, so I stratified the holdouts) and then used fitNewDocuments to estimate thetas and phis. However, I have no idea how to compare different models to each other based on this output. I'd really like to do something like evaluate the heldout likelihood, but I don't know how to do that for previously unseen documents. Any suggestions would be much appreciated.

juliasilge commented 1 year ago

@capplestein I haven't done this myself, but can you manually make something like the output of stm::make.heldout() with your stratified resampling and then use stm::eval.heldout()? You might take a look at this blog post, which goes into more detail on evaluating topic models.

capplestein commented 1 year ago

I could try this... my understanding was that eval.heldout was based on document completion, so it probably wouldn't work for previously unseen documents. Maybe I just need to try to figure out something custom. I basically want to figure out how to calculate likelihood from the output of fitNewDocuments.
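One possible custom calculation, sketched in plain Python and not part of stm's API: given a document's estimated topic proportions (theta) and the model's topic-word probabilities (beta), each token of word v has probability sum over k of theta[k] * beta[k][v], and the per-token log likelihood averages the log of that over all held-out tokens. The function names and the tiny example data are hypothetical. Note that if theta was estimated from the same words being scored, the result will be optimistic, which is the problem document completion is meant to avoid.

```python
import math


def doc_log_likelihood(counts, theta, beta):
    """Log likelihood of one document's word counts under a fixed topic mixture.

    counts: {word: count} for the document
    theta:  list of K topic proportions for the document
    beta:   list of K dicts mapping word -> probability within that topic
    Every word must have nonzero probability in at least one topic with
    nonzero theta; otherwise math.log raises ValueError.
    """
    total = 0.0
    for word, n in counts.items():
        p = sum(t * b.get(word, 0.0) for t, b in zip(theta, beta))
        total += n * math.log(p)
    return total


def per_token_log_likelihood(docs, thetas, beta):
    """Average log likelihood per token across a set of documents."""
    ll = sum(doc_log_likelihood(c, t, beta) for c, t in zip(docs, thetas))
    n_tokens = sum(sum(c.values()) for c in docs)
    return ll / n_tokens


# Made-up example: two topics, a two-word vocabulary, one held-out document.
beta = [{"a": 0.5, "b": 0.5}, {"a": 0.1, "b": 0.9}]
docs = [{"a": 1, "b": 1}]
thetas = [[0.5, 0.5]]
score = per_token_log_likelihood(docs, thetas, beta)
```

Computing the same score for each candidate model on the same held-out documents gives one number per model to compare, with higher (less negative) values indicating a better fit.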