Open tmozgach opened 6 years ago
Not important: An explanation and example of topic modeling with R
http://brazenly.blogspot.ca/2016/05/r-text-classification-and-topic_1.html
Main guide that I followed: https://github.com/abhijeet3922/Topic-Modelling-on-Wiki-corpus
DONE: It has visualization stuff: https://github.com/shichaoji/easyLDA
Tips and tuning parameters: https://markroxor.github.io/gensim/static/notebooks/lda_training_tips.html
Based on the convergence plot (convergence_liklihood.pdf), we need 900-1000 iterations.
I produced the graph above by following: How to monitor convergence of Gensim LDA model? https://stackoverflow.com/questions/37570696/how-to-monitor-convergence-of-gensim-lda-model
Select the iterations and passes parameters of the LDA model:
I suggest the following way to choose iterations and passes. First, enable logging (as described in many Gensim tutorials), and set eval_every = 1 in LdaModel. When training the model look for a line in the log that looks something like this:
2016-06-21 15:40:06,753 - gensim.models.ldamodel - DEBUG - 68/1566 documents converged within 400 iterations
If you set passes = 20 you will see this line 20 times. Make sure that by the final passes, most of the documents have converged. So you want to choose both passes and iterations to be high enough for this to happen.
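The check described above can be scripted instead of read by eye. The following is a minimal sketch, assuming the training log keeps the format of the line quoted above; the function name convergence_ratios and the second sample line are made up for illustration and are not Gensim output from this project.

```python
import re

# Pattern modeled on the log line quoted above:
# "... 68/1566 documents converged within 400 iterations"
CONVERGED = re.compile(r"(\d+)/(\d+) documents converged within (\d+) iterations")


def convergence_ratios(log_lines):
    """Return the fraction of converged documents for each matching log line."""
    ratios = []
    for line in log_lines:
        match = CONVERGED.search(line)
        if match:
            converged = int(match.group(1))
            total = int(match.group(2))
            ratios.append(converged / total)
    return ratios


sample_log = [
    # Line copied from the thread.
    "2016-06-21 15:40:06,753 - gensim.models.ldamodel - DEBUG - "
    "68/1566 documents converged within 400 iterations",
    # Hypothetical later line, invented only to show a well-converged pass.
    "2016-06-21 15:52:41,010 - gensim.models.ldamodel - DEBUG - "
    "1490/1566 documents converged within 400 iterations",
]

ratios = convergence_ratios(sample_log)
for ratio in ratios:
    print(f"{ratio:.1%} of documents converged")
```

If the ratios for the final passes stay low, that suggests raising iterations, passes, or both, as the quoted advice says.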
@neowangkkk First result from Cedar, for Title and Post only (no COMMENTS), 10 topics. It took 8 hours. Download all files and open the HTML file: [Uploading cedar_data_10topics.zip…]() How to interpret the visualization (see the last section): https://nlpforhackers.io/topic-modeling/
@neowangkkk 20 Topics: 20Topics.zip
Working on 40 topics...
@neowangkkk 40 Topics: 40Topics.zip
Working on 30
@neowangkkk 30Topics.zip
@neowangkkk 10, 20, 30, 40 Topics for Titles, Posts and their COMMENTS (10to40TopicsForALL.zip): https://drive.google.com/drive/folders/1S8iAnnyjH4NxZpXmkX0ppN9c_Hm6XJLQ?usp=sharing
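# Point R at the folder that holds the Reddit data
setwd("/Users/Tao/Dropbox/Data/Reddit_data/")
# Install tidyverse once, then load it (it provides read_csv)
install.packages("tidyverse")
library(tidyverse)
data <- read_csv(file = "data_full.csv")
summary(data)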
@neowangkkk New result on your friend's data: https://drive.google.com/drive/folders/1S8iAnnyjH4NxZpXmkX0ppN9c_Hm6XJLQ?usp=sharing NewAllData.zip
ideas = idea, product = products, cars = car, entrepreneurs = entrepreneur
doesnt didnt theyre isnt business work start
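A minimal sketch of how the two lists above could be applied to tokenized text. It assumes the pairs mean "merge the plural into the singular" and that the second list holds words to drop; both readings are my assumption, and the names MERGE, DROP, and normalize are made up for illustration.

```python
# Assumed merges, read from the pairs listed above (plural -> singular).
MERGE = {
    "ideas": "idea",
    "products": "product",
    "cars": "car",
    "entrepreneurs": "entrepreneur",
}

# Words from the second list above, assumed to be removal candidates.
DROP = {"doesnt", "didnt", "theyre", "isnt", "business", "work", "start"}


def normalize(tokens):
    """Map merged forms to a single token, then drop the unwanted words."""
    merged = (MERGE.get(token, token) for token in tokens)
    return [token for token in merged if token not in DROP]


tokens = "my ideas for cars didnt work".split()
result = normalize(tokens)
print(result)  # ['my', 'idea', 'for', 'car']
```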
@neowangkkk https://drive.google.com/drive/folders/1S8iAnnyjH4NxZpXmkX0ppN9c_Hm6XJLQ 10-30.zip
I am not sure that it was a good idea to exclude so many words. It seems to influence and change the topics. I got rid of the following words (some of them are repeated, which is fine and does not affect anything):
awesome cant though theyre yeah around try enough keep way start work busines isnt theyre didnt doesnt i\'ve you\'re that\'s what\'s let\'s i\'d you\'ll aren\'t \"the i\'ll we\'re wont 009 don\'t it\'s nbsp i\'m get make like would want dont\' use one need know good take thank say also see really could much something ive well give first even great things come thats sure help youre lot someone ask best many question etc better still put might actually let love may tell every maybe always never probably anything cant\' doesnt\' ill already able anyone since another theres everything without didn\'t isn\'t youll\' per else ive get would like want hey might may without also make want put etc actually else far definitely youll\' didnt\' isnt\' theres since able maybe without may suggestedsort never isredditmediadomain userreports far appreciate next think know need look please one null take dont dont\' want\' could able ask well best someone sure lot thank also anyone really something give years use make all ago people know many call include part find become
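The repeats in the list are harmless if the words are loaded into a set before filtering. A minimal sketch, using only the first few words of the list above; the function name remove_stopwords and the sample sentence are made up for illustration.

```python
# First words of the exclusion list above, copied as-is ("theyre" appears twice).
excerpt = (
    "awesome cant though theyre yeah around try enough keep way "
    "start work busines isnt theyre didnt doesnt"
)

words = excerpt.split()
stopwords = frozenset(words)  # a set stores each word once, so repeats vanish


def remove_stopwords(tokens, stopwords=stopwords):
    """Keep only tokens whose lowercase form is not in the stopword set."""
    return [token for token in tokens if token.lower() not in stopwords]


print(len(words), len(stopwords))  # 17 16
filtered = remove_stopwords("Yeah it didnt work though".split())
print(filtered)  # ['it']
```

Comparing len(words) with len(stopwords) also gives a quick count of how many entries in the list are duplicates.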
You could start with topic modeling first. Dr. Yang was using LDA along with five or six other methods, such as SVM. I think you can easily find guides online for doing it with R or Python. This is quite mature now.