kwartler / text_mining

This repo contains data from Ted Kwartler's "Text Mining in Practice With R" book.

LDA and tf-idf document term matrix #1

Closed TheOne000 closed 6 years ago

TheOne000 commented 7 years ago

Dear Ted,

Question: can we input a tf-idf document-term matrix into Latent Dirichlet Allocation (LDA)? If yes, how?

It does not work in my case; the LDA function requires a 'term frequency' document-term matrix.

Thank you. (I have kept the question as concise as possible, so if you need more details I can add them.)

##########################################################################
                           TF-IDF Document matrix construction
##########################################################################    

> DTM_tfidf <- DocumentTermMatrix(corpora,
+   control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE)))
> str(DTM_tfidf)
List of 6
$ i       : int [1:4466] 1 1 1 1 1 1 1 1 1 1 ...
$ j       : int [1:4466] 6 10 22 26 28 36 39 41 47 48 ...
$ v       : num [1:4466] 6 2.09 1.05 3.19 2.19 ...
$ nrow    : int 64
$ ncol    : int 297
$ dimnames:List of 2
  ..$ Docs : chr [1:64] "1" "2" "3" "4" ...
  ..$ Terms: chr [1:297] "accommod" "account" "achiev" "act" ...
- attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
- attr(*, "weighting")= chr [1:2] "term frequency - inverse document 
frequency" "tf-idf"

##########################################################################
                           LDA section
##########################################################################

> LDA_results <- LDA(DTM_tfidf, k, method = "Gibbs",
+   control = list(nstart = nstart, seed = seed, best = best,
+                  burnin = burnin, iter = iter, thin = thin))

##########################################################################
                           Error messages
##########################################################################
  Error in LDA(DTM_tfidf, k, method = "Gibbs", control = list(nstart = 
  nstart,  : 
  The DocumentTermMatrix needs to have a term frequency weighting
kwartler commented 7 years ago

Try this stackoverflow explanation (https://stackoverflow.com/questions/33770287/documenttermmatrix-needs-to-have-a-term-frequency-weighting-error) with a workaround. I have never done it myself. Apparently, LDA requires TF, not tf-idf, because it is modeling distributions over raw term counts.
I wouldn't recommend using LDA this way. I suppose you could do some data wrangling to get it into a usable format for LDA, but the authors of LDA clearly want TF. What exactly are you trying to accomplish?
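For reference, a minimal sketch of what LDA() does accept: a DTM built with the default term-frequency weighting (weightTf). This uses the crude corpus shipped with tm and the topicmodels package; k = 4 and the control values are arbitrary assumptions for illustration, not values from the thread.

```r
library(tm)
library(topicmodels)
data("crude")

# LDA needs raw term counts, so build the DTM with the default
# term-frequency weighting (weightTf), not weightTfIdf
dtm_tf <- DocumentTermMatrix(crude)

# This runs without the "needs to have a term frequency weighting" error
lda_fit <- LDA(dtm_tf, k = 4, method = "Gibbs",
               control = list(seed = 1234, burnin = 100, iter = 500))
terms(lda_fit, 5)  # top 5 terms per topic
```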

TheOne000 commented 7 years ago

Hi Ted

My plan is to use TF-IDF as a tool to take some terms out of the corpus after the analytical pre-processing. As you know, words (that are not in a list of stop words) with high frequency do not always contribute meaningful information to the document.

'Term frequency' shows only how frequently the terms appear in the document, but TF-IDF weights these terms by 'rarity'. I would like to clean the corpus in this fashion before applying LDA to it.

Thank you for your elaboration,
Sapphasak


kwartler commented 6 years ago

Was giving this some thought, and I think you could build a tf-idf DTM, then apply a heuristic to identify the low-quality terms.

library(tm)
data("crude")

# Build a DTM with (unnormalized) tf-idf weights and stop words removed
dtm <- DocumentTermMatrix(crude,
                          control = list(weighting =
                                           function(x)
                                             weightTfIdf(x, normalize = FALSE),
                                         stopwords = TRUE))

# Sum each term's tf-idf weight across all documents, then rank terms
dtmM     <- as.matrix(dtm)
tfScores <- colSums(dtmM)
tfScores <- data.frame(term = names(tfScores), tfScoring = tfScores)
tfScores <- tfScores[order(tfScores$tfScoring), ]

# Then perform a subset based on deciling, or another heuristic, for example
drops <- subset(tfScores$term, tfScores$tfScoring <= 5) # or change to 0 etc.
drops <- as.character(drops)

drops is a vector of terms that can be concatenated to the stop words list. The example above has no corpus-cleaning functions applied, so you would have to do that beforehand. Then you would have a tf-idf-informed corpus for LDA.