tmozgach / ent_ob

Entrepreneur’s online behavior

3. Topic Modeling #7

Open tmozgach opened 6 years ago

tmozgach commented 6 years ago

You could start with topic modeling. Dr. Yang was using LDA along with five or six other methods, such as SVM. I think you can easily find guides online for doing it with R or Python. This is quite mature now.

tmozgach commented 6 years ago

Not important: An explanation and example of topic modeling with R

tmozgach commented 6 years ago

Main guide that I followed:

tmozgach commented 6 years ago

DONE: It has visualization stuff:

tmozgach commented 6 years ago

Tips and tuning parameters:

tmozgach commented 6 years ago

Based on the following plot (convergence_liklihood.pdf), we need 900-1000 iterations.

I produced the graph above using: How to monitor convergence of Gensim LDA model?

Selecting the iterations and passes parameters of the LDA model:

I suggest the following way to choose iterations and passes. First, enable logging (as described in many Gensim tutorials), and set eval_every = 1 in LdaModel. When training the model look for a line in the log that looks something like this:

2016-06-21 15:40:06,753 - gensim.models.ldamodel - DEBUG - 68/1566 documents converged within 400 iterations

If you set passes = 20 you will see this line 20 times. Make sure that by the final passes, most of the documents have converged. So you want to choose both passes and iterations to be high enough for this to happen.
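The check described above can be automated by scanning the training log for those lines. Below is a minimal sketch that assumes the log format is exactly the one quoted above and that logging was enabled at DEBUG level (for example with `logging.basicConfig(level=logging.DEBUG)`). The function name `convergence_by_line` and the second sample line are made up for illustration.

```python
import re

# Matches the part of the log line quoted above, e.g.
# "68/1566 documents converged within 400 iterations".
CONVERGED = re.compile(r"(\d+)/(\d+) documents converged within (\d+) iterations")


def convergence_by_line(log_lines):
    """Return the fraction of converged documents for each matching log line."""
    fractions = []
    for line in log_lines:
        match = CONVERGED.search(line)
        if match:
            converged = int(match.group(1))
            total = int(match.group(2))
            fractions.append(converged / total)
    return fractions


sample_log = [
    # Line copied from the quoted log output.
    "2016-06-21 15:40:06,753 - gensim.models.ldamodel - DEBUG - "
    "68/1566 documents converged within 400 iterations",
    # Hypothetical unrelated line, only to show that non-matching lines are skipped.
    "some other log message",
]

fractions = convergence_by_line(sample_log)
print(fractions)
```

If the fractions in the last lines of a real log stay well below 1.0, that suggests raising passes or iterations, as described above.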
tmozgach commented 6 years ago

@neowangkkk First result from Cedar, for Title and Post only (no COMMENTS), 10 topics. It took 8 hours. Download all files and run the HTML file: [Uploading…]()

How to interpret the visualization (last section):

tmozgach commented 6 years ago

@neowangkkk 20 Topics:

Working on 40 topics...

tmozgach commented 6 years ago

@neowangkkk 40 Topics:

Working on 30

tmozgach commented 6 years ago


tmozgach commented 6 years ago

@neowangkkk 10, 20, 30, 40 Topics for Titles, Posts and their COMMENTS (

neowangkkk commented 6 years ago

R code for importing data:

# set your working directory

# install the tidyverse package
install.packages("tidyverse")
library(tidyverse)

# read the file using read_csv
data <- read_csv(file = "data_full.csv")

# check summary stats


tmozgach commented 6 years ago

@neowangkkk New result on your friend's data:

neowangkkk commented 6 years ago

can you please replace these:

ideas = idea
product = products
cars = car
entrepreneurs = entrepreneur

can you please remove these:

doesnt didnt theyre isnt business work start
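A request like this could be applied to tokenized text roughly as follows. This is only a sketch, not the code actually used in this project: the names `REPLACEMENTS`, `REMOVE`, and `normalize` are hypothetical, and it assumes every pair above means "merge the plural into the singular form".

```python
# Assumption: each requested pair merges the plural onto the singular form.
REPLACEMENTS = {
    "ideas": "idea",
    "products": "product",
    "cars": "car",
    "entrepreneurs": "entrepreneur",
}

# Words requested for removal.
REMOVE = {"doesnt", "didnt", "theyre", "isnt", "business", "work", "start"}


def normalize(tokens):
    """Map each token to its replacement, then drop the unwanted words."""
    result = []
    for token in tokens:
        token = REPLACEMENTS.get(token, token)
        if token not in REMOVE:
            result.append(token)
    return result


print(normalize(["ideas", "business", "cars", "doesnt", "startup"]))
```

Replacing before removing means a word is dropped only in its final, merged form.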

tmozgach commented 6 years ago


I am not sure that it was a good idea to exclude a lot of words. It seems that it influences and changes the topics. I got rid of the following words (it is OK that some of them are repeated; it doesn't affect anything):

awesome cant though theyre yeah around try enough keep way start work busines isnt theyre didnt doesnt i\'ve you\'re that\'s what\'s let\'s i\'d you\'ll aren\'t \"the i\'ll we\'re wont 009 don\'t it\'s nbsp i\'m get make like would want dont\' use one need know good take thank say also see really could much something ive well give first even great things come thats sure help youre lot someone ask best many question etc better still put might actually let love may tell every maybe always never probably anything cant\' doesnt\' ill already able anyone since another theres everything without didn\'t isn\'t youll\' per else ive get would like want hey might may without also make want put etc actually else far definitely youll\' didnt\' isnt\' theres since able maybe without may suggestedsort never isredditmediadomain userreports far appreciate next think know need look please one null take dont dont\' want\' could able ask well best someone sure lot thank also anyone really something give years use make all ago people know many call include part find become
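To illustrate why repeated entries are harmless, here is a small sketch that assumes the list is turned into a set before filtering. The excerpt is a handful of words from the list above, and `remove_stopwords` is a hypothetical helper, not the project's actual code.

```python
# Short excerpt of the list above; "theyre" appears twice on purpose.
extra_stopwords = "awesome cant though theyre yeah theyre didnt doesnt".split()

# A set keeps one copy of each word, so duplicates have no effect.
stopword_set = set(extra_stopwords)


def remove_stopwords(tokens, stopwords):
    """Keep only the tokens that are not in the stopword collection."""
    return [token for token in tokens if token not in stopwords]


print(len(extra_stopwords), len(stopword_set))
print(remove_stopwords(["awesome", "idea", "theyre", "market"], stopword_set))
```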