3. Topic Modeling

tmozgach commented 6 years ago

You could start with topic modeling. Dr. Yang was using LDA alongside five or six other methods, such as SVM. I think you can easily find guides online for doing it with R or Python. The approach is quite mature now.

tmozgach commented 6 years ago

Not important: An explanation and example of topic modeling with R

http://brazenly.blogspot.ca/2016/05/r-text-classification-and-topic_1.html

tmozgach commented 6 years ago

Main guide that I followed: https://github.com/abhijeet3922/Topic-Modelling-on-Wiki-corpus

tmozgach commented 6 years ago

DONE: It has visualization stuff: https://github.com/shichaoji/easyLDA

tmozgach commented 6 years ago

Tips and tuning parameters: https://markroxor.github.io/gensim/static/notebooks/lda_training_tips.html

tmozgach commented 6 years ago

Based on the attached plot (convergence_liklihood.pdf), we need 900-1000 iterations.

I produced the plot using the approach from: How to monitor convergence of Gensim LDA model? https://stackoverflow.com/questions/37570696/how-to-monitor-convergence-of-gensim-lda-model

Selecting the iterations and passes parameters of the LDA model:

I suggest the following way to choose iterations and passes. First, enable logging (as described in many Gensim tutorials), and set eval_every = 1 in LdaModel. When training the model look for a line in the log that looks something like this:

2016-06-21 15:40:06,753 - gensim.models.ldamodel - DEBUG - 68/1566 documents converged within 400 iterations

If you set passes = 20 you will see this line 20 times. Make sure that by the final passes, most of the documents have converged. So you want to choose both passes and iterations to be high enough for this to happen.
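To check the "most of the documents have converged" condition without reading the log by eye, the convergence lines can be parsed. This is only a minimal sketch: the regular expression is modeled on the single log line quoted above (the exact format may differ between Gensim versions), and `convergence_ratios` is a hypothetical helper name, not part of Gensim.

```python
import re

# Pattern modeled on the log line quoted above; the exact wording may
# differ between Gensim versions, so treat it as an assumption.
CONVERGED = re.compile(r"(\d+)/(\d+) documents converged within (\d+) iterations")


def convergence_ratios(log_lines):
    """Return the fraction of converged documents for each matching log line."""
    ratios = []
    for line in log_lines:
        match = CONVERGED.search(line)
        if match:
            converged, total = int(match.group(1)), int(match.group(2))
            if total:  # skip lines that report zero documents
                ratios.append(converged / total)
    return ratios


# The sample is the log line from the thread; a real run would read the log file.
sample_log = [
    "2016-06-21 15:40:06,753 - gensim.models.ldamodel - DEBUG - "
    "68/1566 documents converged within 400 iterations",
]
ratios = convergence_ratios(sample_log)
print([round(r, 3) for r in ratios])
```

If the ratios in the final passes stay well below 1.0, that suggests raising `iterations`, `passes`, or both.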

tmozgach commented 6 years ago

@neowangkkk First result from Cedar, for Title and Post only (no COMMENTS), 10 topics. It took 8 hours. Download all files and open the HTML file: [Uploading cedar_data_10topics.zip…]()

How to interpret the visualization (last section): https://nlpforhackers.io/topic-modeling/

tmozgach commented 6 years ago

@neowangkkk 20 Topics: 20Topics.zip

Working on 40 topics...

tmozgach commented 6 years ago

@neowangkkk 40 Topics: 40Topics.zip

Working on 30

tmozgach commented 6 years ago

@neowangkkk 30Topics.zip

tmozgach commented 6 years ago

@neowangkkk 10, 20, 30, 40 Topics for Titles, Posts and their COMMENTS (10to40TopicsForALL.zip): https://drive.google.com/drive/folders/1S8iAnnyjH4NxZpXmkX0ppN9c_Hm6XJLQ?usp=sharing

neowangkkk commented 6 years ago

R code for importing data:

# Set your working directory
setwd("/Users/Tao/Dropbox/Data/Reddit_data/")

# Install and load the tidyverse package
install.packages("tidyverse")
library(tidyverse)

# Read the file using read_csv
data <- read_csv(file = "data_full.csv")

# Check summary statistics
summary(data)

tmozgach commented 6 years ago

@neowangkkk New result on your friend's data: https://drive.google.com/drive/folders/1S8iAnnyjH4NxZpXmkX0ppN9c_Hm6XJLQ?usp=sharing NewAllData.zip

neowangkkk commented 6 years ago

can you please replace these:

ideas = idea
product = products
cars = car
entrepreneurs = entrepreneur

can you please remove these:

doesnt, didnt, theyre, isnt, business, work, start
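The request above could be applied to a tokenized document roughly as follows. This is a sketch under assumptions: the mapping direction (the word on the left is rewritten to the word on the right) is one reading of the request, and `normalize_tokens` is a hypothetical helper, not the code actually used for the results in this thread.

```python
# Assumption: in "a = b", the word a is rewritten to b.
REPLACEMENTS = {
    "ideas": "idea",
    "product": "products",
    "cars": "car",
    "entrepreneurs": "entrepreneur",
}
REMOVALS = {"doesnt", "didnt", "theyre", "isnt", "business", "work", "start"}


def normalize_tokens(tokens):
    """Apply the replacements first, then drop the words listed for removal."""
    replaced = (REPLACEMENTS.get(token, token) for token in tokens)
    return [token for token in replaced if token not in REMOVALS]


print(normalize_tokens(["ideas", "for", "cars", "didnt", "work"]))
```

Replacing before removing means a word that is rewritten into a removed word would also be dropped; with the lists above the order makes no difference.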

tmozgach commented 6 years ago

@neowangkkk https://drive.google.com/drive/folders/1S8iAnnyjH4NxZpXmkX0ppN9c_Hm6XJLQ 10-30.zip

I am not sure that it was a good idea to exclude so many words. It seems to influence and change the topics. I got rid of the following words (it is OK that some of them are repeated; it doesn't affect anything):

awesome cant though theyre yeah around try enough keep way start work busines isnt theyre didnt doesnt i\'ve you\'re that\'s what\'s let\'s i\'d you\'ll aren\'t \"the i\'ll we\'re wont 009 don\'t it\'s nbsp i\'m get make like would want dont\' use one need know good take thank say also see really could much something ive well give first even great things come thats sure help youre lot someone ask best many question etc better still put might actually let love may tell every maybe always never probably anything cant\' doesnt\' ill already able anyone since another theres everything without didn\'t isn\'t youll\' per else ive get would like want hey might may without also make want put etc actually else far definitely youll\' didnt\' isnt\' theres since able maybe without may suggestedsort never isredditmediadomain userreports far appreciate next think know need look please one null take dont dont\' want\' could able ask well best someone sure lot thank also anyone really something give years use make all ago people know many call include part find become
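On the point that repeated entries are harmless: if the stop words are kept in a Python set (an assumption, since the thread does not show the actual filtering code), duplicates collapse automatically. The sketch below uses only the first few words of the list above, copied verbatim, and `remove_stopwords` is a hypothetical helper.

```python
# Short excerpt of the list above, copied verbatim; "theyre" appears twice.
raw_stopwords = (
    "awesome cant though theyre yeah around try enough keep way "
    "start work busines isnt theyre"
)

# A set stores each word once, so the repeat has no effect on filtering.
stopwords = set(raw_stopwords.split())


def remove_stopwords(tokens):
    """Keep only the tokens that are not in the stop-word set."""
    return [token for token in tokens if token not in stopwords]


print(len(raw_stopwords.split()), len(stopwords))
print(remove_stopwords(["theyre", "selling", "awesome", "products"]))
```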

tmozgach / ent_ob

3. Topic Modeling #7
