Open utterances-bot opened 3 years ago
@Suania8 you have probably seen this kind of analysis, where you look at correlation networks of words as a whole. One idea I have, if you are interested in combining that with topic modeling, is to "assign" each word to the topic it has the highest probability of being generated from and then use that as the color of points in a network diagram.
A challenge with this is that topic modeling models documents as a mixture of topics and topics as a mixture of words, so words can be generated by multiple topics.
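One way to do the "assign each word to its most probable topic" step is with the tidied beta matrix. This is a sketch, not tested against a specific dataset; `topic_model` is assumed to be your fitted model, and the resulting `topic` column can drive point color in a network diagram (e.g. via ggraph):

```r
library(tidytext)
library(dplyr)

# beta gives per-topic word probabilities: one row per topic-term pair
word_topics <- tidy(topic_model, matrix = "beta") %>%
  group_by(term) %>%
  slice_max(beta, n = 1) %>%  # keep, for each word, the topic with highest beta
  ungroup() %>%
  select(term, topic)

# join `word_topics` onto your network's node list, then map
# `factor(topic)` to color, e.g. geom_node_point(aes(color = factor(topic)))
```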
@juliasilge thank you so much for sharing this work. I applied it to my dataset. Can you please advise how I can get more information about each topic, such as the number of words in each topic, or all the words in each topic?
@BehnamCA I suggest you check out this chapter as well as these sections for more details on how to get out this kind of information.
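As a quick sketch of what those chapters cover (assuming `topic_model` is your fitted model): in a topic model every word has some probability in every topic, so "words in a topic" usually means the top words ranked by beta.

```r
library(tidytext)
library(dplyr)

td_beta <- tidy(topic_model, matrix = "beta")

# top 10 words in each topic, ranked by per-topic probability
td_beta %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  ungroup() %>%
  arrange(topic, -beta)
```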
@juliasilge thanks so much for the quick reply. I will study the mentioned materials. Much appreciated!
hi Julia! I have a question. Once the topics have been extracted, how can they be linked to the entire dataframe for further analysis?
-- Suania Acampa, PhD Fellow in Statistics and Social Sciences, University of Naples Federico II
@Suania8 I suggest you check out this chapter as well as these sections for more details on how to join the per-topic info to the per-document dataframe.
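The gamma matrix is the usual route for this. A sketch (untested, assuming `topic_model` is the fitted model and the rows of `my_df` are in the same order as the documents you fit on):

```r
library(tidytext)
library(dplyr)

# one row per document-topic pair: document (index), topic, gamma
td_gamma <- tidy(topic_model, matrix = "gamma")

# keep each document's dominant topic
doc_topics <- td_gamma %>%
  group_by(document) %>%
  slice_max(gamma, n = 1) %>%
  ungroup()

# join the dominant topic back onto the original dataframe
my_df %>%
  mutate(document = row_number()) %>%
  left_join(doc_topics, by = "document")
```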
Hi Julia, thank you very much for this tutorial and your video. They are super helpful. I have a question about adding a predictor to my model. I was able to identify 7 topics in 4000+ news stories from 8 news organizations. Now I want to divide the 8 organizations into two groups, four liberal and four conservative, and see the frequency of each topic in each group. (For example, I want to see whether Topic 1 was used more often in the liberal media than in the conservative media.) Do you have any ideas about adding a predictor to my model to compare the liberal vs. conservative ones?
@kgh21 You have several options for this kind of question. You can check out:
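One common approach with stm (a sketch, assuming a document-level covariate `leaning` with values "liberal"/"conservative" stored in a `covariates` dataframe alongside your processed `documents` and `vocab`) is to fit topic prevalence as a function of the covariate and then estimate its effect per topic:

```r
library(stm)

# fit with a prevalence covariate
topic_model <- stm(documents, vocab, K = 7,
                   prevalence = ~ leaning,
                   data = covariates)

# estimate how prevalence of each of the 7 topics varies with `leaning`
effects <- estimateEffect(1:7 ~ leaning, topic_model,
                          metadata = covariates)
summary(effects)

# difference in topic prevalence between the two groups:
# plot(effects, covariate = "leaning", method = "difference",
#      cov.value1 = "liberal", cov.value2 = "conservative")
```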
Thank you for the information!
Hi Julia, thanks so much for this tutorial. I was wondering if I could create a plot that shows FREX words for each topic instead of the highest-probability words. Could you give me some advice on this? Thanks in advance!
@rrefining We don't have direct support in tidytext for getting out this info right now, but you can do some munging yourself as I outline here. This involves computing the high-FREX words yourself via stm::calcfrex(), and then doing some transformation so that you can visualize the results.
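A rough sketch of that munging (untested; check the stm docs for your version, since calcfrex() is a lower-level function). It operates on the log-beta matrix and, per the stm source, returns vocabulary indices ranked by FREX for each topic:

```r
library(stm)

logbeta <- topic_model$beta$logbeta[[1]]
frex_index <- calcfrex(logbeta, w = 0.5)  # w trades off frequency vs. exclusivity

# map the top 10 FREX-ranked indices per topic back to the vocabulary
vocab <- topic_model$vocab
top_frex <- apply(frex_index[1:10, , drop = FALSE], 2,
                  function(idx) vocab[idx])
```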
I'm wondering if you have any guidance on evaluating stm models with different predictors. I did a manual holdout of some documents (especially because there was a grouping structure to my data, so I stratified the holdouts) and then used fitNewDocuments to estimate thetas and phis. However, I have no idea how to compare different models to each other based on this output. I'd really like to do something like evaluate the heldout likelihood, but I don't know how to do that for previously unseen documents. Any suggestions would be much appreciated.
@capplestein I haven't done this myself, but can you manually make something like the output of stm::make.heldout() with your stratified resampling and then use stm::eval.heldout()? You might take a look at this blog post, which goes into more detail on evaluating topic models.
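For reference, the standard (non-stratified) version of that workflow looks roughly like this; for a stratified split you would construct the same list structure (`$documents`, `$vocab`, `$missing`) by hand from your own resampling:

```r
library(stm)

# make.heldout() censors some words within randomly sampled documents
heldout <- make.heldout(documents, vocab)

topic_model <- stm(heldout$documents, heldout$vocab, K = 7)

# expected per-word heldout log-likelihood, comparable across models
eval.heldout(topic_model, heldout$missing)
```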
I could try this... my understanding was that eval.heldout is based on document completion, so it probably wouldn't work for previously unseen documents. Maybe I just need to figure out something custom. Basically, I want to figure out how to calculate the likelihood from the output of fitNewDocuments.
The game is afoot! Topic modeling of Sherlock Holmes stories | Julia Silge
In a recent release of tidytext, we added tidiers and support for building Structural Topic Models from the stm package. This is my current favorite implementation of topic modeling in R, so let’s walk through an example of how to get started with this kind of modeling, using The Adventures of Sherlock Holmes.
https://juliasilge.com/blog/sherlock-holmes-stm/